C++ encapsulation for Data-Oriented Design (bannalia.blogspot.com)
44 points by joaquintides on Aug 31, 2015 | 18 comments



This is awful. Some reasons, in no particular order:

- Straight up, this isn't data-oriented. You're dealing with one object at a time.

- This code is vastly harder to understand in its entirety, much less extend, than the data-oriented approach. This is template craziness to its fullest degree, and what it's doing is going to be unfamiliar to the vast majority of programmers. Hell, I've read the article and have been staring at it for 15 minutes now, and I'm still not fully sure what it does. Admittedly, I'm a bit rusty with my templates, but that's a very bad sign for maintainability.

- Memory access patterns aren't the only reason to do DOD; a big one is being able to easily SIMD-optimize your code. This defeats that.

- If you're not going to SIMD-optimize your code, then storing everything as SoA is often not a great idea (you want to group the things you touch at the same time, so that you have fewer cache misses).

- You can still have encapsulation with data-oriented code. It's just encapsulated at a different granularity: you encapsulate access to the data arrays. A sketch of what I mean is below.
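(Not from the article; a minimal sketch of that last point, with all names hypothetical: the arrays stay private, and the public API is bulk operations rather than per-object accessors.)

```cpp
#include <cstddef>
#include <vector>

// Hypothetical component manager: SoA storage is private, but the hot
// path is a bulk method over whole arrays, so access stays linear and
// the loop is trivial for the compiler to vectorize.
class particle_system
{
public:
    std::size_t create(float x, float dx)
    {
        xs_.push_back(x);
        dxs_.push_back(dx);
        return xs_.size() - 1;
    }

    // Bulk API: one call per frame, not one call per particle.
    void integrate()
    {
        for(std::size_t k = 0; k < xs_.size(); ++k)
            xs_[k] += dxs_[k];
    }

    float position(std::size_t id) const { return xs_[id]; }

private:
    std::vector<float> xs_, dxs_;  // never handed out directly
};
```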


I think the idea here is just to put names onto traditional SoA. Thus, instead of doing `foo_xs[k] += foo_dxs[k]`, one does `foo.x += foo.dx` and lets it all magically resolve itself.

In theory, this should help the compiler automatically vectorize (compared to AoS), but even if it doesn't, it still gives you a slightly higher-level interface for loading your values for manual SIMD.
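(A rough sketch of that reading, not the article's actual template machinery; `particles` and `particle_ref` are made-up names.)

```cpp
#include <cstddef>
#include <vector>

// SoA container whose operator[] returns a proxy with named members,
// so user code can write p.x += p.dx while the data stays in separate
// arrays.
struct particle_ref
{
    float& x;
    float& dx;
};

class particles
{
public:
    explicit particles(std::size_t n) : xs_(n), dxs_(n) {}

    particle_ref operator[](std::size_t k) { return {xs_[k], dxs_[k]}; }
    std::size_t  size() const              { return xs_.size(); }

private:
    std::vector<float> xs_, dxs_;
};

void step(particles& ps)
{
    for(std::size_t k = 0; k < ps.size(); ++k)
    {
        auto p = ps[k];   // proxy holding references, not a data copy
        p.x += p.dx;      // reads like AoS, stores like SoA
    }
}
```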

Note that at the end, the poster says

> Traversal is at the core of DOD, as the paradigm is oriented towards handling large numbers of like objects. In a later entry we will extend this framework to provide for easy object traversal and measure the resulting performance as compared with OOP.

I also agree that recursive templates are a pain, but "modern" C++ kind'a expects you to use them anyway. As recursive templates go, this one isn't all that bad to use.


I suspect it's far more likely for `foo_xs[k] += foo_dxs[k]` to be auto-vec'd than for `foo.x += foo.dx` via templates to be auto-vec'd. The first is very straightforward and obvious, with a clean auto-vec route; the latter is not.


I've obviously not benchmarked, but from what I can see, all the complexity is in the templates, which only exist at compile time. It should all dissolve before the optimizer gets to peek at it.



I would worry that trying to shoehorn the code back into an entity-at-a-time OO-style interface still encourages poor access patterns. I much prefer the bitsquid approach which treats component managers as the objects and provides nice bulk-access apis for each:

http://bitsquid.blogspot.com/2014/08/building-data-oriented-...

http://bitsquid.blogspot.com/2014/09/building-data-oriented-...

http://bitsquid.blogspot.com/2014/10/building-data-oriented-...

http://bitsquid.blogspot.com/2014/10/building-data-oriented-...


The best approach, IMO, would be an IDE which keeps the whole object-oriented approach as a purely developer-view-only set of restrictions, and allows adapting the underlying data structure to the algorithmically hottest spots in the code, while keeping only the component interfaces from OO.

Thanks for the Bitsquid links.


If you have a few hours, have a look at Jon Blow's new language, Jai.

Most of the content is in video form, so it's not easy to grok, but there is essentially a way to switch from Array of Structs to Struct of Arrays with a single keyword. This gets you many of the benefits of Data-Oriented Programming with minimal effort.
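(You can approximate the single-keyword switch in C++ with a layout policy parameter. This is just an illustrative sketch of the idea, not Jai's actual mechanism, and all names are made up.)

```cpp
#include <cstddef>

enum class layout { aos, soa };

// Primary template left undefined; the two specializations pick the
// memory layout. Client code only changes the template argument.
template<layout L, std::size_t N> struct points;

template<std::size_t N>
struct points<layout::aos, N>          // array of structs
{
    struct point { float x, y; };
    point data[N];

    float& x(std::size_t i) { return data[i].x; }
    float& y(std::size_t i) { return data[i].y; }
};

template<std::size_t N>
struct points<layout::soa, N>          // struct of arrays
{
    float xs[N], ys[N];

    float& x(std::size_t i) { return xs[i]; }
    float& y(std::size_t i) { return ys[i]; }
};

// Algorithms are written once against the x()/y() accessors:
template<layout L, std::size_t N>
void translate(points<L, N>& ps, float dx, float dy)
{
    for(std::size_t i = 0; i < N; ++i)
    {
        ps.x(i) += dx;
        ps.y(i) += dy;
    }
}
```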


I really hate private, protected, sealed and such. At least a few times I've found myself in a situation where I needed to use some library in ways the author hadn't anticipated, or to create a workaround for a bug there.

I had to recompile it and duplicate large parts of the library code just because somebody put private, protected or sealed in a few places for no apparent reason.

In JS, keeping classes only as local variables, not exposed in any way to the user of the library, has the same effect. Please don't do it. If you want me not to shoot myself in the foot, then create a good API; don't just try to hide the gun in the closet. I'll find it anyway, and the more time I spend looking for it, the less time I'll have to understand how to operate it safely.


It'd be nice to see some benchmarks...


We're at a crossroads in CPU design. The hardware guys implement heuristics to make ordinary code work "pretty good", with cache lines and multiple layered caches and register calculus and stalling pipelines.

Then compiler folks reverse-engineer all that to get "really good" performance. But they can't emit one version of the code that's optimal for every model of CPU, because this stuff changes with each release.

How about the CPU leaves these decisions up to the compiler? Flatten and simplify, and allow compiler writers to know exactly what's going on, and even to control it!


That is not at all how compiler and CPU engineers work together.

Compilers are not reverse-engineering the CPU, and different CPUs do not behave so radically differently as to make this non-portable.

The biggest issue by far is the languages themselves. Java, for example, is inherently anti-CPU-friendly by design. You can't get flattened objects; almost everything is done via reference instead of value, etc. This was done for other tradeoffs, but the majority of languages in use were not built for raw CPU throughput.
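(For contrast, here's the difference in C++ terms; Java's object model more or less forces the second, pointer-chasing layout. Illustrative only.)

```cpp
#include <memory>
#include <vector>

struct vec3 { float x, y, z; };

// Flattened values: one contiguous block, so a linear walk touches
// consecutive cache lines.
std::vector<vec3> flat(1000);

// Java-style references: each element is a separate heap allocation
// reached through a pointer, so the same walk hops around the heap.
std::vector<std::unique_ptr<vec3>> boxed;

float sum_flat()
{
    float s = 0;
    for(const vec3& v : flat) s += v.x;    // prefetch-friendly
    return s;
}

float sum_boxed()
{
    float s = 0;
    for(const auto& p : boxed) s += p->x;  // one dependent load per element
    return s;
}

int main()
{
    for(int i = 0; i < 1000; ++i)
        boxed.push_back(std::make_unique<vec3>());
    return static_cast<int>(sum_flat() + sum_boxed());
}
```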


I'm not sure what you're suggesting. Software prefetching? Software branch prediction? Those sound like pretty poor ideas if that is indeed what you mean - a lot of time will be spent emulating something that hardware does really well.


E.g. the Mill architecture, which has no register-collision issues at all, at the cost of the compiler keeping track of the address of results as they pass through the mill.

And hardware doesn't do it well at all, at least not in the super-optimization cases discussed in the OP. They're working hard to reverse-engineer around the hardware, which is at least as hard as the compiler programming caches and pipelines explicitly.


Mill has two issues:

1) It doesn't exist. Theory is irrelevant, build an actual chip and run some actual software on it and then let's talk.

2) VLIW of 33 operations? Ain't no way in hell they are keeping that fed. Not happening. AMD couldn't keep a VLIW of 5 fed on their GPUs. They dropped to VLIW4, then switched to a RISC SIMD approach. And that's with a compiler that only has to deal with a very constrained grammar that almost never encounters branches. Which is about as far from general purpose workloads as you can get.


Issue #1 is a big deal, but they've never said otherwise.

Issue #2 is less of a reason to worry. The 33 operations is an upper limit, and will likely be a lot lower on most implementations. They also don't expect, nor need, it to be saturated.

Further, their instruction set is much more amenable to parallelism than normal instruction sets because:

* Their instructions have special values (None and NaR) that enable extra parallelism.

* Their instructions avoid side effects that prevent parallelism. This includes not just condition codes, but effectively the whole belt mechanism.

* Their instructions can move data across themselves horizontally (across certain phase boundaries), so certain data dependencies are allowed inside an instruction. In essence, you're running at three to five times the clock rate, but each clock is only allowed to run a given subset of operations.

There's an example they give where a whole iteration of a vectorized strcpy is done in one instruction, with no set-up or tear-down needed. This is of course a best-case scenario, in that every technique they have is in use simultaneously, but all of the techniques are generally applicable and (relatively speaking) simple for compilers to use.



> At the cost of the compiler keeping track of the address of results as they pass through the mill.

The compiler already does this. Traditional architectures then require you to use these results to build a register mapping, which is then basically (kind of) undone in hardware with register renaming.

The Mill (kind'a) just doesn't do these two steps.

> And hardware doesn't do it well at all

It really does, though. Compiler-aided prefetch, even if magically perfect, wouldn't even help here, since the problem is in the data layout.




