You can simulate it with a bunch of parallel arrays, one array for each field of the "struct". Wrapping up such a construct in a flyweight object (I described this in a comment to the blog post) can then get much of the ease of use back (though you can't override operator== in Java).
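Roughly, the flyweight idea looks like this (a sketch with made-up names, not the exact code from that blog comment):

    // One array per field; a Student is just an index into them.
    final class StudentTable {
        final String[] names;
        final float[] grades;

        StudentTable(int capacity) {
            names = new String[capacity];
            grades = new float[capacity];
        }

        // Flyweight view over row i; it holds no data of its own.
        final class Student {
            final int i;
            Student(int i) { this.i = i; }
            String name()  { return names[i]; }
            float  grade() { return grades[i]; }
            // No operator overloading in Java, but equals() can compare indices
            // (ignoring which table the other flyweight belongs to, for brevity).
            @Override public boolean equals(Object o) {
                return o instanceof Student && ((Student) o).i == i;
            }
            @Override public int hashCode() { return i; }
        }

        Student get(int i) { return new Student(i); }
    }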



This is a good point, but I'd wager parallel arrays are slower because the components of a "struct" don't have spatial locality in memory -- so you get poor cache locality. Of course, this isn't an issue if you're primarily passing the flyweight objects around and not accessing all the fields.


> I'd wager parallel arrays are slower because the components of a "struct" don't have spatial locality in memory

How much would you wager? I'm willing to take your money :)

> Of course, this isn't an issue if you're primarily passing the flyweight objects around and not accessing all the fields.

If you examine codebases, you see that this is actually the prevalent use case for objects in general.

It turns out that in practice, parallel arrays are often significantly faster, because more often than not you _do_ access a lot of (memory consecutive) records, and you _don't_ access all the fields.

Assume e.g. your structure has student names and numeric grades. For display purposes, you are likely to read both name and grade.

However, for just about every other purpose, you are likely to scan/process grades without ever looking at the name (e.g., to compute grade statistics; or to sort by grade). And in this case, you actually have much _better_ cache locality than you would have otherwise -- and also much better memory usage compared to individual allocation of objects.

Furthermore, if you take this approach, you find that your functions become type-concrete but use-generic, reducing your code footprint (both LOC and I-cache), because e.g. instead of "calcAverageGrade(student[] x)", you can just use a routine called "calcAverage(float[] x)" and call it on the grade vector.
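Something like this (made-up names, just to illustrate):

    class GradeStats {
        // Type-concrete but use-generic: works on any float column.
        static float calcAverage(float[] xs) {
            float sum = 0f;
            for (float x : xs) sum += x;
            return xs.length == 0 ? 0f : sum / xs.length;
        }

        public static void main(String[] args) {
            // Parallel arrays: one column per field of the "struct".
            String[] names  = { "Ann", "Bob", "Cho" };
            float[]  grades = { 3.7f, 2.9f, 3.3f };
            // The scan touches only the grades column; names never enter the cache.
            System.out.println(calcAverage(grades));
        }
    }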

APL, J and K take this approach to its logical conclusion. The result is programs that are 10-1000 times shorter than the idiomatically comparable Python, Java, C#, C++ etc., and that have runtimes that compare favorably with C and C++.


> It turns out that in practice, parallel arrays are often significantly faster, because more often than not you _do_ access a lot of (memory consecutive) records, and you _don't_ access all the fields.

This doesn't hold if you're doing random access (which it sounds like they are, given that they're passing around lists of indices). In the random-access scenario, parallel arrays will require roughly O(M) times as many trips to main memory as the struct array, where M is the number of fields accessed.


O(L/M) times as many trips, where L is the <average cache line size> divided by <average field size>. With 64 byte cache lines and 8 byte fields, this is at most 8, even if you have 200 fields in your object. In my experience, when I was still doubting this myself, I was unable to ever measure more than 20% slowdown on the least favorable (for columnar storage) workload, and I was able to measure about 200% speedup on a favorable workload.
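If you want to see it yourself, a crude test along these lines is enough to show the shape of it (illustrative only -- use JMH for anything you'd actually trust; field counts and sizes here are arbitrary):

    import java.util.Random;

    class SoAvsAoS {
        static final int N = 1_000_000;

        // "Row-major" record. Note that in Java a Rec[] is an array of
        // references, not a packed struct array -- which is the whole
        // point of this thread.
        static final class Rec { float a, b, c, d, e, f, g, h; }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            Rec[] rows = new Rec[N];
            float[] colA = new float[N];        // column-major: one array per field
            for (int i = 0; i < N; i++) {
                rows[i] = new Rec();
                rows[i].a = colA[i] = rnd.nextFloat();
            }
            int[] idx = new int[N];
            for (int i = 0; i < N; i++) idx[i] = rnd.nextInt(N);   // random access

            long t0 = System.nanoTime();
            float s1 = 0f;
            for (int i : idx) s1 += rows[i].a;  // touch 1 field of 8, via objects
            long t1 = System.nanoTime();
            float s2 = 0f;
            for (int i : idx) s2 += colA[i];    // same accesses, via the column
            long t2 = System.nanoTime();

            System.out.printf("objects: %d ms, column: %d ms (checksums %f %f)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
        }
    }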

Note that even if they do pass around a list of indices, accesses are often (in practice) batched - because records that arrived together tend to be needed together. And the indices are often sorted (because of how they were generated, or because they are easy to sort) which then produces a very predictable access pattern for memory and disk.

We can theorize about probable and improbable access patterns until tomorrow. My experience, and that of everyone I know who has used column-major order, is that in practice it almost always wins against row-major ("common") ordering, whether on disk or in memory.

The fastest databases around (kdb+, Vertica) use column-major order. That should tell you something (about disk, but also about memory).


> O(L/M) times as many trips

Sorry, I'm really confused. Is this supposed to be "O(M/L)"?

The note about clumped records is valid. That is often the case. Without knowing more about their workload, it's difficult to say whether they have that behavior, but it's quite likely.

As for the fastest DBs, they use column order for a number of reasons. The big thing is that their access patterns are rarely random. They're optimized for scanning on keys. DBs like Fastbit can be extremely fast for whole-table scans. They aren't so fast if you need to find a specific entry. For maximum speed there, you need a dedicated index, and then it doesn't really matter whether the data is stored row-major or column-major.


> Sorry, I'm really confused. Is this supposed to be "O(M/L)"?

No. As soon as you need to read more than <cache line bytes>, it no longer matters that the fields are near each other. So, e.g., if your record is 128 bytes but your cache line is 32 bytes, you have 4 cache misses per record. If you use only 4 fields from the record, you have 4 cache misses per record -- so, we're on parity.

As M increases beyond the cache line size, the advantage (compared to column major) no longer increases. So it is at most a constant in practice.

> The big thing is that their access patterns are rarely random.

That's true for almost all programs, almost all of the time, actually -- which is why I was willing to take the other side of your original wager.

> They aren't so fast if you need to find a specific entry. For maximum speed there, you need a dedicated index, and then it doesn't really matter whether the data is stored row-major or column-major.

And again, that's true for many workloads as well.

The main advantage of row-major is simpler appending; if you write a lot and read little, row-major is faster; this is usually the case for e.g. audit logs.

In most typical cases, column major is equal or better.


> No. As soon as you need to read more than <cache line bytes>, it no longer matters that the fields are near each other. So, e.g., if your record is 128 bytes but your cache line is 32 bytes, you have 4 cache misses per record. If you use only 4 fields from the record, you have 4 cache misses per record -- so, we're on parity. As M increases beyond the cache line size, the advantage (compared to column major) no longer increases. So it is at most a constant in practice.

Huh? Why would you have 4 cache misses for a 128-byte record? If you need to access 4 fields and they all fall onto the same cache line, you get only one cache miss. If your fields are spread far apart, you might have 4 cache misses, but that isn't a given. With parallel arrays, the fields will be in separate arrays and necessitate 4 cache misses.

> That's true for almost all programs, almost all of the time, actually -- which is why I was willing to be the other side of your original wager.

I didn't make the original wager. :)


> Why would you have 4 cache misses for a 128-byte record?

I wrote earlier "if your cache line is 32 bytes".

If you need all the fields - you will have 4 cache misses. If you need only some of the fields, and they all fall into 1 cache line, then -- as you said, you will only have one cache miss.

My point (poorly worded, upon rereading, but I didn't have much time, dayjob and all...) was that the potential advantage of record-major does NOT increase beyond what is packed into one cache line -- so it is a small constant, rather than deserving of O() notation.

> I didn't make the original wager. :)

Ah, apologies. (But maybe you'd still let me take your money? :) )


> My point (poorly worded, upon rereading, but I didn't have much time, dayjob and all...) was that the potential advantage of record-major does NOT increase beyond what is packed into one cache line -- so it is a small constant, rather than deserving of O() notation.

I see what you mean, now. You're right that it's a constant multiplier.



