Data-Oriented Design (2018) (dataorienteddesign.com)
389 points by DeathArrow on July 3, 2023 | 133 comments



Some of the best advice I ever got for writing composable, high-performance code was “work on structs of arrays, not arrays of structs”. I hear many echoes of that advice in this text. Turns out that entity-component architectures work well in line-of-business applications too, not just games.

Alas, many developers in enterprise are rusted onto a record-keeping CRUD model and struggle to think in columns rather than rows. The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally. Yet once you realise how readily polymorphic this is, you may start wanting to use such approaches to data for everything. Rich new opportunities then arise from cross-pollinating component data. Some may question why it is structurally permissible that, say, a network interface can have a birthday, or why an invoice has an IPv6 address, why my cat is in the DHCP pool, whilst limegreen is deleted and $5 on Tuesdays. This of course is half the fun.
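
A minimal sketch of the shape I mean (table and column names invented; SQLite only so it runs anywhere):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE post      (id INTEGER PRIMARY KEY, title TEXT);
      -- instead of a boolean column on post, "published" is its own
      -- component table; any entity's id can be inserted into it
      CREATE TABLE published (entity_id INTEGER PRIMARY KEY);
  """)
  db.execute("INSERT INTO post VALUES (1, 'Hello'), (2, 'Draft')")
  db.execute("INSERT INTO published VALUES (1)")  # publish post 1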

I don’t accept that it’s wholly incompatible with OO, though, a thesis you’ll see dotted around the place. I’ve even taken this approach with Ruby, using Active Record for persistence; not normally a domain where the words “high performance” are bandied about. That worked particularly well because Ruby’s object system, being more Smalltalk-ish than C++/Java-ish, strongly favours composition over inheritance.


> I don’t accept that it’s wholly incompatible with OO

It's not incompatible with the mechanics of OO, but it does require that programmers change how they approach problems. For instance, a common way to write code in an OO language is to focus solely on the thing you want to think about (a user, a blog post, a money transaction, what have you), to implement it in isolation from everything else, to hide all of its data, and then to think about what methods need to be exposed to be useful to other parts of the system. The idea of encapsulation is quite strong.

In DOD, it is more common for data related to different domains to be accessible, letting the subsystems pick and choose what they need to do their work. Nothing about Java or Ruby would prevent this, but programmers definitely have mental barriers.


> high-performance code was “work on structs of arrays, not arrays of structs”

Wikipedia's article "Array of Structures (AoS) and Structure of Arrays (SoA)" explains the trade-off between performance (SoA) and intuitiveness/language support (AoS): https://en.wikipedia.org/wiki/AoS_and_SoA

They also get into software support for SoA, including data frames as implemented by R, Python's pandas package, and Julia's DataFrames.jl package, which let you access SoA like AoS.
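
To make the two layouts concrete, a toy sketch in Python (field names invented; CPython mutes the cache benefit, but the shape of the idea is the same):

  from dataclasses import dataclass

  # Array of structs: one record object per particle
  @dataclass
  class Particle:
      x: float
      y: float
      mass: float

  aos = [Particle(1.0, 2.0, 0.5), Particle(3.0, 4.0, 0.7)]

  # Struct of arrays: one parallel "column" per field
  soa = {"x": [1.0, 3.0], "y": [2.0, 4.0], "mass": [0.5, 0.7]}

  # the same question asked of each layout:
  total_aos = sum(p.mass for p in aos)  # hops between objects
  total_soa = sum(soa["mass"])          # scans one contiguous list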


I’d push back on this by noting that the Pandas and Polars etc. of the world could be a lot better if they supported generic types. The fact that the DataFrame is not connected to your structs/types is a big problem, because everyone everywhere has to write at least two functions, and probably more like 14, to enable basic CRUD operations on structs of arrays correctly given structs.

For example, you need a StructOfArraysToArrayOfStructs function and an ArrayOfStructsToStructOfArrays function, plus StructToRow and RowToStruct, at least. Making sure those work for all the various types of things in a programming language like Python is no easy feat. Not having those functions built in means everyone has to make sure those functions work on their own.
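
For one hypothetical Row type, that round-trip pair might look like the sketch below; the hard part real libraries face is doing this generically for arbitrary types:

  from dataclasses import dataclass, fields, astuple

  @dataclass
  class Row:
      x: float
      y: float

  def aos_to_soa(rows):
      # ArrayOfStructsToStructOfArrays
      names = [f.name for f in fields(Row)]
      columns = {n: [] for n in names}
      for r in rows:
          for n, v in zip(names, astuple(r)):
              columns[n].append(v)
      return columns

  def soa_to_aos(columns):
      # StructOfArraysToArrayOfStructs
      names = [f.name for f in fields(Row)]
      return [Row(*vals) for vals in zip(*(columns[n] for n in names))]

  rows = [Row(1.0, 2.0), Row(3.0, 4.0)]
  assert soa_to_aos(aos_to_soa(rows)) == rows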


I'd also mention that Clojure has a very performant dataframe equivalent, 'tech.ml.dataset'.

Maybe I'm wrong, but AoS vs. SoA seems like an old C false dichotomy that's already been effectively resolved.

If you need to choose between the two, then the answer is probably neither - use an appropriate table data structure.

There is an extra layer of abstraction and software, but I don't really see any downsides. I'd love to hear some arguments against


The downside is losing the conceptual understanding of why one or the other organization is the fastest for the data in question. From the POV of a general table data structure, how does it decide which kind of memory organization is the most performant?

Looking at the examples referenced, it looks like sugar over SoA, so things look comfortably normal; but SoA isn't necessarily the most performant layout in memory, depending on how things are being accessed.


Isn't SoA basically always more performant though? You're either traversing individual "columns" and leveraging cache locality, or you're not and then you can traverse multiple columns in parallel to rebuild the structs on the fly.

I could only see this degrading or being suboptimal if you have a very small struct that still fits entirely in cache (like an array of 3D coordinates or something).


If you’re always accessing all the members, then no. SoA shines when you want to access a different set of a few members of a struct in different contexts. Then there’s also the option of building temporary arrays of just the data you need and spinning through them, which can be faster for some things, for example rendering pipelines. Even more so if you cache the results and can use them next frame, thanks to temporal locality. There’s no silver bullet; you’ve got to sit down, look at your access patterns, and profile.
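
A quick way to see this on your own machine, as a rough numpy sketch (field names invented, numbers will vary; the point is only that the winner flips with how many fields the loop touches):

  import numpy as np
  from timeit import timeit

  n = 1_000_000
  # AoS: one 32-byte record per entity (numpy structured array)
  aos = np.zeros(n, dtype=[(f, "f4") for f in
                           ("x", "y", "z", "hp", "mana", "xp", "gold", "lvl")])
  # SoA: just the one column the loop actually needs, contiguous
  xs = np.zeros(n, dtype="f4")

  # summing one field out of eight: the AoS pass drags ~32 MB through
  # the cache, the SoA pass reads ~4 MB
  print(timeit(lambda: aos["x"].sum(), number=20))
  print(timeit(lambda: xs.sum(), number=20))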


SoA is optimal only if you are iterating sequentially over many elements on a subset of the columns.

Really the right layout is access dependent.


In the data side of the world, "structs of arrays" translates to column-based indexes, i.e. Snowflake and OLAP. "Arrays of structs" translates to relational databases with their page/row-based indexes.

FWIW, I am a big fan of Snowflake and think it will eat everyone else's lunch. I also find it amusing that Snowflake "supports" foreign keys but doesn't enforce them. In other words, Snowflake is as "nosql" as I care to go.


> "supports" foreign keys but doesn't enforce them

As in, you can have a sort of 'link' to another table, which not only might be null, but might be a value that doesn't exist over there?

What does the 'support' add over just having a column in which you tell yourself (but not your DBMS) you're storing values which correspond to keys in another table?

(You can take it further - s/keys/values in a column/ - many to one relationships without an intermediary table! Amazing! ...more 'NoSQL' than I care to go I think, for almost anything.)


https://docs.snowflake.com/en/sql-reference/constraints-over...

Note Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.


I don't understand what it means (and that page doesn't explain) for a 'constraint' not to be enforced? Do you somehow get some sort of warning, but not an error, i.e. the insert/update is allowed to work, everything proceeds as normal, but it's there to check on and correct if you want?


Or an in-memory "data frame" like in R, Python's Pandas, and Polars.


As I see it, there are two types of DOD. One is the one you've mentioned, where you “work on structs of arrays, not arrays of structs”.

The other means just giving up on encapsulation: separating the data from the methods that work on it, thinking of the whole app in terms of how the data flows through it, and modelling it so everything is easy to understand and change. For added correctness, you can use immutable data structures and pure functions.
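
A small sketch of that second flavour, with Invoice as an invented example:

  from dataclasses import dataclass, replace

  # plain, visible data...
  @dataclass(frozen=True)
  class Invoice:
      total_cents: int
      paid_cents: int

  # ...and pure functions kept apart from the data they transform
  def outstanding(inv: Invoice) -> int:
      return inv.total_cents - inv.paid_cents

  def record_payment(inv: Invoice, cents: int) -> Invoice:
      return replace(inv, paid_cents=inv.paid_cents + cents)

  inv = record_payment(Invoice(10_000, 0), 2_500)
  assert outstanding(inv) == 7_500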


Isn't this the very definition of functional programming?


No. Functional programming is programming with functions as values. It is not incompatible with data hiding (for example through modules).


"Programming with functions as values" is merely one of the side effects, a feature, of writing your programs functionally. It does not define the paradigm.


This is exactly right. Good summary.


I've tried to push entity-component systems (ECS) for non-game applications. A financial company in London took that advice to manage the complexity of their system, since it was such a good fit.

For those who are curious, here's a very brief introduction to ECS: https://dev.to/ovid/the-unknown-design-pattern-1l64
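
The core of the idea fits in a few lines. A deliberately naive Python sketch (not how you'd build a real engine):

  from itertools import count

  # entities are just ids; components are per-type tables keyed by id;
  # systems are plain functions that join the tables they care about
  new_id = count()
  positions  = {}  # id -> (x, y)
  velocities = {}  # id -> (dx, dy)

  def spawn(x, y, dx=0.0, dy=0.0):
      e = next(new_id)
      positions[e] = (x, y)
      if dx or dy:
          velocities[e] = (dx, dy)
      return e

  def movement_system(dt):
      # only entities holding BOTH components take part
      for e, (dx, dy) in velocities.items():
          x, y = positions[e]
          positions[e] = (x + dx * dt, y + dy * dt)

  spawn(0.0, 0.0, dx=1.0)  # a moving entity
  spawn(5.0, 5.0)          # scenery: position only
  movement_system(1 / 60)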


Would you mind elaborating a bit on your system? I tried to do something similar for an algo trading system, but in the end I didn't have enough entities to justify the approach.


I think there's a widespread misconception about the reason ECS exists and what problems it solves. It's a design for a general-purpose game engine where it solves the problem of having some conception of the game world and the things that make it up without knowing in advance what those things will be. This is a common issue with loads of different approaches in game engines.

ECS, contrary to popular imagination, doesn't have to be implemented with data-oriented principles, and there are lots of implementations, particularly in dynamic languages, that use the same design without the performance boost you might get if you did follow them. Once you go down the data-oriented rabbit hole, you quickly discover it also considerably complicates the original design.

If you have a specific thing in mind then use the data-oriented principles on that and skip building something general purpose. This works for games as well as business applications. It'll be much simpler and easier to make performant.


"I don’t accept that it’s wholly incompatible with OO, though, a thesis you’ll see dotted around the place."

Agreed. Arrays of Structs is not the part that is really incompatible.

The part that clashes a bit with traditional OOP is ECS, where data and code are meant to be kept separate. But of course I'm talking about very traditional OOP. It is entirely possible to use an OOP language + classes to implement ECS. It's just not gonna be "traditional".

EDIT: To quote your other reply: "records don’t have to identify strongly with a class hierarchy".


Not sure what you all mean by OOP to be honest. Is this a case of OOP actually meaning "good programming", then morphing into the most fashionable (and hopefully best) practices of the day?

It wouldn't be the first time… https://loup-vaillant.fr/articles/deaths-of-oop


Well, I made a point of emphasizing the word traditional. By "traditional" I mean OOP that follows the popular practices and pedagogy.

Using ECS means totally eschewing things like encapsulation and inheritance. You stop grouping data and methods together. You basically have "free records" and "free procedures" (or "free methods" if the language requires classes, Kingdom of Nouns style). ECS is a procedural/functional paradigm.

You can argue that this is not OOP anymore, and guess what: I'll probably support you on that! :)


Working with JavaScript has caused me to ask this question: "When should I create a Class, and when a Function?"

This is not a trivial question because class-instances are basically collections of Functions, and Functions can return class-instances. So a trivial answer would be: "It does not matter, you can do everything with both or either one".

No, you shouldn't do everything with both, you should do some specific types of things with classes and other types of things with functions. But what?

I've come to this Rule-of-Thumb: Use classes only to represent and encapsulate data, use functions to process and transform class-instances.

In practice this means that in most cases classes should not have methods which take arguments. Exceptions: You can have setters. You can have a new() method which creates a new instance with data that differs from the data of the recipient. But it should be all about data-creation and extraction.

Classes are still truly useful over plain records, because a class abstracts over what data is stored, and what are the names by which it is accessed, and whether the methods return a stored field or a calculated one.

The idea that "Everything is an instance of a Class", as in Smalltalk, is a bit flawed, or at least too trivial as advice. It is of course also true that (even in JavaScript) Functions are objects. And in Smalltalk you can use BlockClosures, which really are "functions".

This division of design makes it easier for me to think about the structure of the whole application. I no longer have classes for everything. Instead I have classes which represent and provide access to and creation of data, and a set of functions which do the processing of those class-instances.

Why is this good? Because classes are complicated to start with, and easily become more so: they can have many methods, both static and instance methods; methods can be local or inherited; and you can even use 'super' to add to the confusion. So if you can find a way to force your classes to be as simple as possible, do that. Don't make them do complicated things, since they are complicated already.

Dividing the design into two parts, classes + functions, divides the complexity of the whole app into two parts, each of which contains about 1/2 of the complexity of the whole app. The complexity is now data-complexity PLUS function-complexity whereas with use-classes-for-everything the app-complexity would be more like data-complexity TIMES function complexity (I conjecture).

Keep data (= classes) and functions separate. Divide and conquer complexity.
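
In code, my rule of thumb looks roughly like this (sketched in Python rather than JavaScript for brevity; Order is an invented example):

  from dataclasses import dataclass

  # the class only represents data, abstracting over how it's stored
  @dataclass
  class Order:
      unit_price_cents: int
      quantity: int

      @property
      def total_cents(self):
          # a "calculated field": callers can't tell it isn't stored
          return self.unit_price_cents * self.quantity

  # free functions do the processing and transformation
  def apply_discount(order, percent):
      cheaper = order.unit_price_cents * (100 - percent) // 100
      return Order(cheaper, order.quantity)

  assert apply_discount(Order(1000, 3), 10).total_cents == 2700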


Oh, yes. I totally agree with that philosophy.

The part about classes only acting on themselves resonates heavily with me. To me it's the same with components in frontend apps: the more independent they are, the better.

I'm probably never going to be copy-pasting my classes and components into other apps, but a good class/component should be copy-paste-able without needing many dependencies. This avoids the "banana gorilla jungle problem", as named by Joe Armstrong.

"Free Functions" also help with keeping those classes even more self-contained and decoupled: you can use Function types for callbacks and things like that, instead of having to couple the class to some other class or interface. The only thing the class has to worry about is that the function conforms to the type.

-

"Keep data (= classes) and functions separate. Divide and conquer complexity."

Yep, 100% agree. Amen to that.


Right. Functions and data cannot be decoupled, because functions must know what properties of the data they need to access. So functions by necessity depend on data. BUT data, if it is really "data", should NOT depend on functions. So we have reduced what would be a bi-directional dependency problem to a one-directional one.

In a typical OOP design there would be no clear rules as to who can depend on whom and who can't.


> you may start wanting to use such approaches to data for everything

Please, don't. Arrays of structs is more natural for a reason; SoA is not universally faster, and it has some performance problems of its own. If you really care about performance, AoSoA may well be your best bet. Again, please don't use it for everything.

> a dhcp pool full of cats is half the fun

Replacing the weird code that the last guy wrote for his own amusement is not fun.


> The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally.

It doesn't come naturally because now you need a JOIN in your SQL just to fetch what was before a column. Or two queries instead of one.

Not to mention having closely related data spread in different tables increases cognitive load.

You just added a layer of indirection for what gain precisely?


Reduced column bloat, easy reusability, easy reporting and data extraction, ease of bulk transformation (especially stream processing), but all this is essentially repeating the book, which I recommend as an excellent read.

Contrary to the claim above, there is no additional layer of indirection simply from rotating one’s record-keeping perspective by ninety degrees. I have to reject outright the suggestion that joins are unnatural in a relational database. The opposite is true: there’s nothing more natural to a relational database than a join, and that’s even despite SQL barely paying lip service to Codd’s relational algebra. Heck, using a join instead of a predicate may even be faster, not least because decomposing entity data into recomposable per-component tables means less data to scan. And if it’s instead of an indexed query, well, an index is a relation too, so in terms of schema complexity that’s a wash. As for cognitive burden, how about the nails-down-a-blackboard dissonance of dealing with tables that grow ever further rightwards with each new product manager.

None of this is to downplay the sheer productive rapidity of letting your template generator spit out a bog-standard CRUD app, it’s more to undermine and challenge assumptions about what deserves to be the conventional default, and to suggest that we can safely and reasonably change perspective to allow a little programmer happiness in.
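
To make that concrete, a sketch with hypothetical post/published tables (SQLite for illustration), where the join stands in for the boolean predicate:

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE post      (id INTEGER PRIMARY KEY, title TEXT);
      CREATE TABLE published (entity_id INTEGER PRIMARY KEY);
      INSERT INTO post VALUES (1, 'Hello'), (2, 'Draft');
      INSERT INTO published VALUES (1);
  """)
  # the join version; the conventional shape would be
  #   SELECT id, title FROM post WHERE published = 1
  live = db.execute("""
      SELECT post.id, post.title
      FROM post JOIN published ON published.entity_id = post.id
  """).fetchall()
  assert live == [(1, "Hello")]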


> Reduced column bloat, easy reusability, easy reporting and data extraction, ease of bulk transformation (especially stream processing)

Respectfully disagree. For one, reporting is easier when you don't have to JOIN more tables, not harder. Same for data extraction. And how can bulk transformation be harder just because a column is where it would intuitively reside and not in another table who knows where?

> I have to reject outright the suggestion that joins are unnatural in a relational database.

I'm afraid you misunderstood. It's unnatural to me, the developer, not to the database, to spread closely related columns across different tables.

And "tables that grow indefinitely" is doing a lot of heavy lifting for those arguments. Most tables don't grow indefinitely.

Like I said, smells like premature optimization. I optimize for humans first.


I think they're arguing that you get a performance gain. Personally, the systems I deal with wouldn't gain from such an optimisation, as there is too little data. So I'll stick to the simplest approach.


I agree with you. Splitting db tables can be a net positive in very specific cases but it smells like premature optimization.


This architecture completely ruins your write performance. There's a reason running two databases, one OLTP and one OLAP, is the norm.


Very insightful. I never thought about it like this, but it's common practice in embedded systems to approach problems in this way. One module may provide a statically-sized pool of like objects, and other modules extend behavior in relation to those objects by carrying buffers of pointers and/or indexes into that pool. It might feel inefficient due to the storage of so many pointers/indexes, but you're optimizing the size of the largest thing: the object pool itself.


Wouldn’t structs of arrays affect cache locality adversely if, say, you’re reading and processing the data in sequential order? You’d have to jump between memory addresses that are pretty far apart to read in a single record / set of fields, and potentially trigger cache misses, since they’re likely spread out across different memory pages and you end up filling your L1/L2/L3 cache.


It depends on the access pattern. If you are performing an operation where you need to access every field of each object, then yes, you'd be better off with an array of structs. But at least in games, it seems more common that you'd need to, e.g., get just the x-y coords of every monster, in which case struct-of-arrays is better.


>The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally. Yet once you realise how readily polymorphic this is, you may start wanting to use such approaches to data for everything.

Wow thank you for this, it's a heuristic I didn't think about but is really powerful.

I need to think through some of the implications, cause I think there are some risks to this approach - namely co-mingling production and non-production data in the same infra. That means that there are data at rest and in transport that are following the same data pathways but have different production criticality. It puts a lot of risk on the filter working perfectly, rather than not even being available on the same infra.

Just spitballing:

Is the ostensible "prod_live_bool" flag manually set? If not then a bug in any automation would certainly cause a nasty data exposure issue

Are you doing column level security tokens? Wouldn't that need some kind of intra-table RBAC? In other words, it seems like if you want any RBAC within your table, then you have to bring it with you every time from the beginning because you have no idea what level of data sensitivity you'll have eventually, and refactoring an increasingly expanding table to inherit RBAC later ---- omg I can't even think of the amount of work that would be. O(n^2) level of manual work??

Does it lead to a canonical "live or not" lookup table/dict at scale?

I think the data security risks would prevent me from using this design pattern for critical applications in MOST cases, but I do love some of the patterns here and will be exploring this in the future for sure!


I interpreted `published` along the lines of blog posts, in which an article can be either a draft (visible only to the author) or published (visible to the world). This seems different from dev/prod separation, where having separate databases altogether makes sense.

I understood the column-first approach more as an alternative to putting all columns for an entity in one table, especially when rows often don't populate every column. From that perspective, what's being described is a strong separation of concerns; applying this to dev/prod would be a weakening of this separation, and so probably not what is desired.


I don't think you're talking about the same thing as the parent comment.


> I don’t accept that it’s wholly incompatible with OO

I don't think it is, not even in Java. If you use its primitive array and value types, you can easily wrap this in a data wrapper class with functional interfaces, which then can be used quite elegantly in the rest of your code in an OO way - you just don't box all the records into objects.


Agreed. Honestly I think the hardest part here is programmer mindset - when you suggest that identity is fluid and records don’t have to correspond to a class hierarchy some folks get a panicked look going on, like you just claimed to eat their pets


Yep. Funny story: this was actually coding-interview homework, for a question involving on the order of millions of data rows from a file. The exercise had a 10s timing requirement, so I decided to do it data-centric from the start. The interviewers found my style 'non-idiomatic', but probably never thought about why my version ran in a fraction of a second.


Did you pass the interview?


I did and got the offer. In the end it didn't work out however (better offer elsewhere).


> Alas, many developers in enterprise are rusted onto a record-keeping CRUD model and struggle to think in columns rather than rows. The idea of inserting an entity id into a “published” table, instead of setting a boolean “published” field to true, doesn’t always come naturally. Yet once you realise how readily polymorphic this is, you may start wanting to use such approaches to data for everything

I don't really understand what's polymorphic about this, or even beneficial. It seems like every time I've had a boolean column in a long-standing application, it eventually needed to turn into something else.


> Is your data layout defined by a single interpretation from a single point of view?

I think this might be the most important question at technology selection and architecture time. Answering it usually requires talking to the business and customers.

If you are certain there is exactly 1 valid "view" of the data that will be used throughout, then perhaps enshrining it in code makes sense. If you are even a tiny bit uncertain of this, a relational-style model probably works better. SQL is the end game for most businesses once they realize the game theory around this one...

I am curious what HN thinks are the major reasons why everyone seems to have moved away from one big SQL database. From my perspective, yeah, we have "web scale" edge cases that threaten vertical scalability on writes, but most businesses will never touch this, including members of the F100.


At a previous F100 company -- a tech company whose products are widely used, we'll say -- we received guidance that RDBMS was verboten except with explicit approval. This had nothing to do with the best ways to model a given dataset, or achieving the best performance, and everything to do with schema flexibility and a history of outages caused by fucking up schema migrations. These problems weren't occurring in our NoSQL designs, and whatever benefits SQL databases offered didn't counter the huge benefits we gained from NoSQL's lack of rigid schema.

Of course, bad uses of key-value stores can have massive performance impacts, and huge monetary costs when leveraging cloud platforms like DynamoDB -- I've seen a lot of cases where people didn't properly structure their data for DDB, and ended up performing loads of scans and sending costs through the roof.


I read that as "We don't want things to crash immediately when the data model changes. We want things to keep chugging along until the last possible moment, when we will realize we've been silently corrupting everything"


> schema flexibility

If the business is of the notion that the schema is "flexible", then it is probably time to bring all of the MBAs into a conference room and have a come-to-jesus conversation about the limitations of information theory and human suffering.

At a certain point, when someone says "Widget", everyone in the organization needs to be on the same page. This goes well beyond any specific technology.


And then there are silent query failures. Want to change “name”: “John Smith” to “name”: {“first”: “John”, “last”: “Smith”}? Easy! No schema migration!

But you have to modify all your queries to support both old and new formats, or stop the world and change all the data (after modifying all your code, including dynamically generated queries).

And if you don’t, your queries fail silently.
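
The resulting shim, sketched against that example document shape:

  def full_name(doc):
      name = doc["name"]
      if isinstance(name, str):                 # old documents
          return name
      return f'{name["first"]} {name["last"]}'  # new documents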


> and everything to do with schema flexibility and a history of outages caused by fucking up schema migrations. These problems weren't occurring in our NoSQL designs, and whatever benefits SQL databases offered didn't counter the huge benefits we gained from NoSQL's lack of rigid schema

Yikes


"Data consistency issues? We'll catch those during integration"


Hybrid solutions are possible; e.g. JSONB in Postgres, where you can still index and join with decent performance.


Sounds like a company whose end users are mostly not their customers.


> I am curious what HN thinks are the major reasons why everyone seems to have moved away from one big SQL database.

We haven't moved away from it, but we have run into a certain class of problems that seem related to the one-big-SQL-database architecture. We're a really old enterprise, with a lot of non-technical people having created a lot of technical solutions back in the day that have become calcified and therefore have to keep existing. One of the things we have is 5 levels of SQL data transformations from the operational database (the one that actually has applications) into different generations of data model, as the "type" of business we did changed.

The problem is that as we accumulate ever more layers, we keep building on the layers before. The application that was built 10 years ago on abstraction layer 2 now needs some data from layer 4; let's create a new script that loops that data back into the previous layer and keep going. Eventually we've ended up with a huge number of interdependent tables that all load data from other tables/views in weird and unintuitive ways, and the project to sort out the mess was deemed too expensive and postponed until the 2030s.

I think it's understandable that people see those problems and consider how we could have avoided them. Unfortunately, for reasons I don't fully grasp, it seems impossible to apply anything to software engineers that requires discipline, and we have to somehow make it impossible to create the spaghetti. That's where separation comes in. If you can't read the data from some other service, then it's impossible to create a spaghetti mess that kills velocity for both parties.

The vertical separation of applications becomes a software solution to the people problem of poor engineering discipline in enterprises.


I can relate to this. Where I work, we have similar issues: the data has grown beyond what the original database schema was designed around. Ours is older than 10 years, though; it is 27 years old (originally from 1996).

We originally had a (in my opinion) pretty nicely designed schema and set of tables to represent the data, but over time, as more data has been added, additional tables have grown in a hodge-podge fashion. We now have a Frankenstein's monster of a database. Similar to your situation, there are all sorts of weird interdependencies across various tables.

An additional complication is that we have a lot of legacy programs/apps that read/write to the database, so changing types, field widths, etc. is basically impossible, because you would have to recompile legacy code, rewrite messaging queues, and so on.

The modern solution to this problem seems to be something called a "data lake", which ingests data from the source DB but also applies transformations to it. I don't really understand how it works; it's something like a SQL view, but it also keeps the underlying data accessible. I don't know all the details, but it has a lot of traction with our IT people at the moment.


> why everyone seems to have moved away from one big SQL database

Hacker News is probably not representative of the whole tech ecosystem. I think a majority of applications still use one big SQL database.

I recently released an open-source framework [1] that is entirely based on Data-Oriented Design. I have received a lot of comments from people for whom it was the right design. Having all your data in the same place makes so many things easier!

[1] https://sql.ophir.dev


Web technologies are also not representative of the whole tech ecosystem. I can’t fit an SQL database on my microcontroller.


> I can’t fit an SQL database on my microcontroller.

Perhaps not SQL Server or Oracle, but unless we are talking about a very limited device there are likely options.


What about SQLite?


> I am curious what HN thinks are the major reasons why everyone seems to have moved away from one big SQL database

For the places I worked:

1. We transitioned to microservices

2. Performance: one big database slows things down

3. Ops/maintenance is very hard in a huge DB

4. In a huge DB there can be a lot of junk no one uses; no one remembers why it is there, but no one is certain whether that junk is still needed

5. We had different optimization strategies for reads and writes

6. Teams need to have ownership of databases/data stores so we can move fast instead of waiting for DBAs to reply to tickets.


> 4. In a huge DB there can be a lot of junk no one uses; no one remembers why it is there, but no one is certain whether that junk is still needed

Of course no one knows how to even begin to come up with a way of addressing that problem.

So the only viable option is to keep on masking it. And keep propagating the junk data and zombie schemas ever forward.


I'll second @viraptor's and @hops's answers: it is the same cause as the rise of microservices and DevOps adoption, namely easier politics. I worked for a big old company, and most of the problems were political and administrative. One big SQL database is quite efficient, until the entity that owns it disagrees with the new CTO's strategy or with another critical part of the business. Add an incident that shows the low resilience of the model, and it quickly becomes a political headache, even while the technical solution still seems evident to everyone.


> reasons why everyone seems to have moved away from one big SQL database

I'm sure there are going to be other answers for the code side of things, but for ops:

Depends a lot on the size of the service, but in some cases: we got enough data that one big SQL store makes ops hard. (It recently took me 3 days to drop a table in a way that wouldn't affect the users.) And splitting data became easier than before with specialised backends. (A sharded second-layer cache of live data seems way simpler to achieve than, say, two decades ago.)


For me it's parallelisable delivery (or fragility, depending on how you look at it). If a team owns its own data store, it can make whatever changes it needs to and not have to worry about any other part of the software being broken by those changes.


The trade-off being having a separate team trying to integrate all the data back together with ETL.


Well, if you need to integrate it. If you have some hideous dashboards you might need to, it's true, but at that point it's worth investing in a data person whose job it is to keep up with all the breaking data warehouse integrations. They'd have to anyway with any data approach.


For the same reason they went with microservices - it's easier to service the technical boundaries you control, and it is a political solution rather than a technical one.

Getting something past a DBA in change control was hard, but shipping some IaC templates can be done in a sprint!


Because over time the one big relational database turns into a big ball of mud. Change becomes expensive and has a large blast radius.

Contextual business domains should be the foundation of any complex architecture. You reduce complexity and change blast radius and speed up agility and feature adoption.


The entire advice is context-dependent.

Games just happen to have a lot of operations that need column-based access, but that's not true for all domains. When you blindly push game best practices into other domains, you are just making everybody's life hard and most systems worse.


It's not just column-based access. Formatting your data into a struct of arrays exposes opportunities to pack your data more efficiently and greatly reduce your application's memory usage. Boolean struct fields can become bitsets. Nullable struct fields can become sparse (or dense) maps. Pointer/reference struct fields can become arrays of smaller-width integers that index into a pool. And so on. When everything runs on CPUs that frequently stall on memory accesses, the impact of these sorts of changes cannot be overstated: the latency difference between L3 cache and RAM can be on the order of ~10x.
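
A sketch of that kind of packing (Python for brevity, fields invented; in native code you'd reach for machine words and uint16 indices):

  from array import array

  n = 10_000

  # boolean struct field -> one bit per entity in a shared bitset
  alive = bytearray((n + 7) // 8)

  def set_alive(i):
      alive[i >> 3] |= 1 << (i & 7)

  def is_alive(i):
      return bool(alive[i >> 3] & (1 << (i & 7)))

  # pointer/reference field -> 2-byte index into a shared pool,
  # with a sentinel value instead of NULL
  NO_TARGET = 0xFFFF
  target = array("H", [NO_TARGET] * n)

  set_alive(3)
  assert is_alive(3) and not is_alive(4)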


The advice of keeping the data you access frequently contiguous in memory applies to everything on modern hardware. If there is a program where performance is an issue at all, probably this will be one way to make sure performance is good.


Well, if you want to generalize, it's about keeping data with correlated accesses close together and aligned inside memory pages, and failing that, yes, keeping it at least contiguous. It's not exactly about access frequency, except that you want to optimize the things you access more.

Yes, that's generic advice for high-performance applications, at least generic enough to apply to anything close to a normal computer. You will still need further details if you are talking about things like HPC (ironically) or mainframes, but it's general enough that people should do it without qualification.


> Games just happen to have a lot of operations that need column-based access, but that's not true for all domains.

This was not at all obvious when ECS first came around. It took a lot of time to convince people away from the OOP way.


I feel like this is increasingly the only way to write high-performance code.

With newer hardware, the only thing that's expected to scale is logic density. SRAM (and cache sizes) have stopped scaling with the latest lithographies, and RAM bandwidth hasn't really been scaling for quite a while (I'd think it's even possible that per-core bandwidth has been decreasing). Memory access has been the bottleneck for some time.


> Games just happen to have a lot of operations that need column-based access.

And that's not even true for many code areas in typical games, only where there are at least a few thousand 'things' to process (e.g. particle systems or navigation/collision systems).

DOD makes a lot of sense within specific subsystems, but not necessarily in high level gameplay code (outside specific genres at least).


You can see that in Bevy: it feels like they are slowly reinventing a relational database and query language. Every time they discover another limitation of their pure ECS architecture, they add another macguffin to make it work.


It's true: Bevy is quite limited, with only simple parent/child relationships[0], and much of the community is looking for more structured relations.

As it stands, it's pretty common to hold a `HashMap<Index, Entity>` and manually manage the data structure through derefs or some system that keeps it consistent, ideally only using it for lists of entities that remain static, like a tilemap.

[0] https://taintedcoders.com/bevy/hierarchy/


Hrmmm. Not sure this line of thinking makes sense.

Game data access patterns are quite brutal. OOP for games results in extremely inefficient cache use, lots of random access, and lots of pointer chasing.

ECS isn’t a “natural” fit for games. It’s quite difficult and ECS systems are still far from a solved problem.

The two most popular game engines, Unreal and Unity, are decisively non-ECS for almost everything they do.

In any case, I think the underlying principles of DOD apply to all programs. Specific solutions vary, as always.


> The two most popular game engines, Unreal and Unity, are decisively non-ECS for almost everything they do.

Unity is promoting its Data-Oriented Technology Stack (DOTS), which includes ECS. They even made some tools to help translate between GameObject-based workflows and ECS-based workflows.


Although most of the team has been gutted and it's still not in a great state.


> In any case, I think the underlying principles of DOD apply to all programs.

Yeah, I do agree with that.


Databases are another common cause.


It's for high-performance computing with current CPU designs that are dependent on data locality for performance.

I agree that it's a harmful design for business data. Programmers want to push their runtime data model into the database and they have no interest in the operational, maintenance, and performance problems this causes. When someone suggests this kind of thing, I'll ask them "how do we diagnose performance problems with this technology when there are 100,000 concurrent users and millions of data elements?" The rows-and-columns people can answer this question.


> When someone suggests this kind of thing, I'll ask them "how do we diagnose performance problems with this technology when there are 100,000 concurrent users and millions of data elements?"

I don't understand; the exact same performance diagnostics work in both cases. Why is this different? There's nothing intrinsically less performant about this approach. You really think your checkerboard tables and long lists of columns with names like "VALUE12" and "VALUE13" and multiple different kinds of key/value pairs you jammed in there for different clients -- you think those are better performance!?

> 100,000 concurrent users

Do you actually have 100,000 concurrent users? Really? You don't, do you? You just kinda hope you will eventually. And again: this approach is not worse for that.

> millions of data elements

This is absolute peanuts for any modern database system. It's weird that this is your extreme example.


But it is true, much, much more often than you probably realize. Just look at your tables and think about how repetitive they are. The reason you can't come up with a lot of "column-based" (as you put it, which is still narrow-minded IMO) operations is that you've never looked for them before. Of course you haven't: you've been stuck in the traditional mode where such things are basically impossible.

Do most of your tables have Name / Description type fields? Here's some "column-based operations": Allow translation of everything in your database. Generate natural-sounding text in a report, inserting these names and descriptions (from multiple different tables, of course). Free-text search of all your important database concepts. Detect similar names to the one the user is wanting to add, to prevent duplication. Clean whitespace. That's five off the top of my head.

Do most of your tables have Archived / Status / Soft-delete type fields? Allow a user to archive a record. Choose whether to include archived records in a query or not. Delete archived records after X days.

Do most of your tables have Comments fields? Allow multiple comments. Track who made a comment and when. Track responses to comments.

Do most of your tables track who last modified the record? Track all modifications. Show a list of recent modifications to any records.

The list goes on and on and on. You call these "column-based operations", which again, is short-sighted. They're more like "concern-based operations". And it turns out everything is a cross-cutting concern. You're shooting this idea down without nearly understanding it.


For those who disagree, please give an example of a domain that is devoid of important "concern-based" operations.


[flagged]


> Imagine being in such a state of mind that you read the above comment and downvote it!

  Please don't comment about the voting on comments. It never
  does any good, and it makes boring reading.
source: https://news.ycombinator.com/newsguidelines.html


I don't see that as disallowing "for those who disagree with me: what do you disagree with, exactly?" Isn't that a reasonable question?

Besides, isn't downvoting someone for nonsensical, petty reasons (and leaving no response as to why, of course) at least as harmful to the overall discussion as referring to those downvotes?


> "for those who disagree with me: what do you disagree with, exactly?" Isn't that a reasonable question?

It's a hopeless one. People who downvote generally think you're not worth actually answering. Also note that heated language often gets automatically downvoted. And with few exceptions ("eating babies is bad"), one-sided opinions tend to be less popular than anything that appears "balanced".

It's especially tough if your one-sided opinion is attacking a popular practice.


I have not down-voted any of your submissions. Instead, I replied with the above to help explain why you were experiencing what you have.

> Besides, isn't downvoting someone for nonsensical, petty reasons (and leaving no response as to why, of course) at least as harmful to the overall discussion as referring to those downvotes?

I would imagine so. However, the culture here AFAIK is to down/up vote as one sees fit and not provide an explanation of same.

HTH


Mike Acton's talk "Data-Oriented Design and C++" from CppCon 2014 is the best programming talk ever given in my opinion. A must watch:

https://youtu.be/rX0ItVEVjHc


It's fantastic, and also my favorite. And for those who might not know, he was, in my eyes, the one who really mainstreamed data-oriented design and ECS architecture.

He was also previously leading the charge on Unity DOTS, though unfortunately Unity seems to be in a tailspin at the moment. The work on DOTS is solid, if incomplete.


Beat me to it, was about to post this talk.

For those reading, check the video out if you want to get a gist of how world-class performance is implemented.

Most of my career has been writing web apps, and this talk showed me why someone would use C.


Andrew Kelley gave an informative and entertaining talk on how DOD had inspired a lot of his work on the Zig compiler: https://vimeo.com/649009599


Thanks for sharing this! Hands down the most practical talk I have seen about DOD. That made so many things click for me.


Even beginners can learn to program in a data oriented way from the beginning.

Two books that teach this style of programming to beginners are:

1. How to Design Programs - https://htdp.org/

2. A Data-Centric Introduction to Computing - https://dcic-world.org/


Those are not this.


Can you elaborate a bit more on the reason?


The books you presented are, roughly speaking, introductions to programming with a focus on data science, functional programming, and common structures/ideas used in those which are, in other texts, not usually considered introductory material. "Data" in this sense means, like, collected facts about the world and how to model them.

Data-oriented design is a particular way of designing your programs where you focus on efficiently laying out your "data" - in a different sense, meaning "whatever it is I've got in storage" - within that storage - to compute with it as fast as possible.

The industry-standard tools used for the first thing are often built with techniques from the second thing, but that's not relevant to the pedagogical framing. The tools they are teaching (Scheme and Pyret) actually make it very hard to play with low-level data layout details. And the emphasis in these texts on "real [as in, world] data" is in direct contradiction to the DOD axiom that "data is not the problem domain... The data-oriented design approach doesn't build the real-world problem into the code."

A rule of thumb: Is anyone talking about GPUs, SIMD, or CPU cache sizes? If not, you're looking at something about data modeling or data science, not data orientation.

And this, sorry, is all super fucking obvious if you actually read the intro to all three things.


I think this may be closer to what your original comment meant: https://blog.klipse.tech/dop/2022/06/22/principles-of-dop.ht...

I got confused by the terms (Data Oriented Design and Data Oriented Programming) and watched Mike Acton's talk by mistake https://www.youtube.com/watch?v=rX0ItVEVjHc (and what a lucky mistake)



Thanks for the link.

Found this interesting and against common advice.

"The bane of many projects, and the cause of their lateness, has been the insistence on not doing optimisation prematurely. The reason optimisation at late stages is so difficult is that many pieces of software are built up with instances of objects everywhere, even when not needed."

There are definitely applications where performance is of primary concern (maybe only a few) and others where it is not. In apps where it is, this makes me think that maybe premature optimization is okay? Am I reading that right?

There's also this book, called Data-Oriented Programming: https://www.manning.com/books/data-oriented-programming.

Are both these concepts the same?


No, DOP is basically just functional programming. There are some overlaps (e.g. separating code from data), but they're not related.


Data-Oriented Design is more beginner-friendly, because it does not deal with people and business, but only with the purity of data modelling.

When I was young, my first step in a new project was painting the Entity Relationship Model. That gave me the foundation for everything else.

Nowadays, I try to understand the problems and the domain, work on capabilities and how to group/box them before I start doing data models.


I like the content of your comment. Everyone with experience recognizes that our love of data and programming often gets sideswiped by business needs. I think, though, that this article tries to say that if we focus on gathering data needs from the beginning, it might make the business-needs conversation moot.


I completely agree with that. I have found that after a DDD phase we have a suitable entity model, and afterwards the conversations with business and UX are very often very interesting.


One key concept when I use DoD is to not abstract away the data. Less is more.

But quickly reading the intro text, I found it doing the opposite: it talks too much and abstracts away the key concepts. Am I the only one who finds it a bit ironic that it doesn't drink its own wine?


Pretty great intro paragraph. The eloquent writing and interesting ideas motivate me to keep reading:

> Data is all we have. Data is what we need to transform in order to create a user experience. Data is what we load when we open a document. Data is the graphics on the screen, the pulses from the buttons on your gamepad, the cause of your speakers producing waves in the air, the method by which you level up and how the bad guy knew where you were so as to shoot at you. Data is how long the dynamite took to explode and how many rings you dropped when you fell on the spikes. It is the current position and velocity of every particle in the beautiful scene that ended the game which was loaded off the disc and into your life via transformations by machinery driven by decoded instructions themselves ordered by assemblers instructed by compilers fed with source-code.


I see these ideas in a lot of data-oriented-design literature and it always struck me as needlessly reductive. Maybe it's useful to set the scene and provide a "cold shower" to get you out of an OOP-abstraction state of mind. But aside from that it seems about as useful as an engineer saying "look around! everything is made of atoms! engineering is fundamentally about moving atoms!" Which is not wrong, just not really going to help actually do any engineering.


I love this book and have been very influenced by it.

However, it should definitely be called: Data-Oriented Design FOR GAME DEVELOPMENT.


Are you suggesting the advice in there isn't applicable in other domains? Or just that it uses games as an example?


The second, more or less.

The advice is certainly applicable in other domains; however, the explanations and examples used in the book are heavily focused on software that executes repetitive code in a frequency-controlled loop, looping over arrays of arrays. This also applies to simulation software.

There are some key characteristics of this kind of software that may or may not be present in other domains. Once you are deep inside game development, these characteristics are like the oxygen you breathe and become intrinsically ingrained in how you think. This is the case with this book and its author: I have the impression he thought the gamedev-focused examples were general when they were actually specific.

As I said, I was really positively influenced by this book, and I tend to go back to it from time to time. Always worth it. Just be aware of what it is focused on.


I like Data-Oriented Design, but beware of one thing: if you organise your data like a database, you'll eventually be writing a database management system, unless you can use a framework like one of the many Entity-Component-System ones.


Every time I hear about Data-Driven/Oriented Design, I remember a paper from an OOP course I had to read in university; it used Data-Driven Design as an example of how not to do things.

The paper in question is this from 1989: https://dl.acm.org/doi/pdf/10.1145/74877.74885

It highlights that:

"Even though the goal of data-driven design is to encapsulate data and algorithms, it inherently violates that encapsulation by making the structure of an object part of the definition of the object. This in turn leads to the definition of operations that reflect that structure (because they were designed with the structure in mind). Attempts to change the structure of an object transparently are destined to fail because other classes rely on that structure. This is the antithesis of encapsulation."

It then goes on to show that Responsibility-Driven Design takes a better approach.

What we mean by Data-Driven Design has come a long way since then, though, and isn't comparable to those days.

I find it a bit amusing that Data-Driven Design used to be an insult of sorts, as if you didn't know how to do things the proper OOP way.


You do realize that DoD has nothing to do with data-driven design, right?

Data-oriented design is about structuring the data according to how it is processed and to the hardware. This is the opposite of object-oriented design, which models the data around your mental model.

For example, in DoD you could design a map with the keys together in one array and the values in another. This would make it much faster to iterate over the keys while searching, because of cache memory.

While in object-oriented design you would store an array of pairs.

Both approaches can still be data-driven.

Edit: would -> could
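
A toy illustration of that layout (keys invented):

  # the "map" as two parallel arrays: a key search only touches
  # the compact key column
  keys   = ["ammo", "hp", "mana"]
  values = [12, 85, 40]

  def get(key):
      return values[keys.index(key)]

  assert get("hp") == 85
  # vs. the object-ish layout: pairs = [("ammo", 12), ("hp", 85), ...]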


While I agree with your point, the map example is not entirely well represented in your comment. The idea is not to just store keys and values in separate arrays; the idea is to look at your use case and model your data after the transformation you need. So if your access pattern calls for storing keys and values as pairs, then do that; if you do a lot of searches through keys, then store them in separate arrays.

The point of DoD is to look at the data you have and the data you need it transformed into and then structure your data after.


There are also multiple definitions of "data-driven" floating around depending on qualifier/context. An OLAP RDBMS, for example, will certainly be written in a data-oriented way, and also have almost fully data-driven behavior, but will certainly not have data-driven (in the sense of that paper) design.


The example given as "data driven design" in this paper resembles neither modern data-driven programming nor data-oriented design.

I don't quite understand why the authors thought of this as a good example of "data driven design".

I know that functions in a data driven program tend to be very generic so the conclusion you cite is very much a mismatch from my experience.


Best SWE book I've ever read and I don't even work on something video game adjacent any more


It was a good read, a really good one.

However, for my liking, the book is too centered around game engine design and a little bit too practical. But hey, the author did explore it all the way down to the bottom; I almost envy him for being able to explore it to that depth. It would be such a great joy to have firsthand experience with topics like this.

Personally, I'm more interested in generalized architecture, which can be applied to applications. The priority is certainly NOT performance, but the goal is still not very clear. I just vaguely guess it's possibly a way for better composable software.


How useful and practical is DoD in the ML and AI field for doing parallel (matrix) computing at scale, where performance is crucial as well, but not as crucial as in gaming (where milliseconds matter)? DoD is a new paradigm for me.



Looked at it for 10 seconds and immediately found random 0s at the end of the document or unreplaced text like “Noel Llopis in his September 2009 article[#!NoelDOD!#]”


The intro explains that this is a free version of a book, containing most of the content except for a few sections.

The book was converted automatically from its native format into HTML so that a large portion could be made available for free, and this automated conversion process might have some minor issues, like the one you described.


Did anyone here by any chance create an epub out of this?

If not, any recommendations on utilities to convert several linked html files into a single epub?


I think Standard Ebooks has a command-line utility that they use for producing their ebooks, which you might be able to use to produce an ebook from a bunch of HTML files. Ultimately an epub is a zip of HTML anyway, I think.


I'll give this a try, thank you.


> [Abstraction heavy paradigms] structure the code around the description of the problem domain

My experience of Domain-Driven Design has been that it is extremely effective for driving conversations about the domain throughout the product life-cycle, but it produces frustrating codebases that are poorly attuned to the world outside the running process. Domain-Driven Design codebases want to be self-contained universes and treat external systems as details. OO design paradigms in general seem to have little respect for messages exchanged with external systems.

This wasn't true back when JavaBeans, object databases, and other distributed object systems were expected to take over the world, but the failure of those technologies shrank the OO world from distributed systems to isolated programs. These days, the messages exchanged with external systems are just data, without behavior. The marriage of data and behavior can only exist inside a single process. So object-oriented design turned inward and concentrated on the creation of rich inner worlds.

I think this is backwards, or at least incomplete. As Rich Hickey says, effective programs need to be situated in the real world. They are not ethereal abstract models. They have concrete functions, inputs and outputs. Having rich internal abstractions that mimic some aspects of reality is a means to an end, entirely subordinate to the purpose of executing interactions with other computing systems. Data-driven design embraces this reality. By treating data as essential, and behavior as something to be added when it is needed, it allows inputs and outputs to be first-class citizens.

> The data-oriented design approach doesn't build the real-world problem into the code. This could be seen as a failing of the data-oriented approach by veteran object-oriented developers, as examples of the success of object-oriented design come from being able to bring the human concepts to the machine

I think this perfectly sums up the confusion that OO modeling creates. You use code to write programs, or services, or cloud functions, things like that. The real-world problem that a program, service, or cloud function solves is interacting in a certain way with other programs, services, cloud functions.

The real-world problems that OO modeling paradigms want you to focus on are the domain problems. This is vitally important for designing products: systems of programs that solve real human problems. If you are designing a system for managing medical records in a hospital, you need to model doctors, nurses, patients, labs, radiology images, patient stays, all those real-world things. However, when you are designing a piece of software to do one thing within that medical records system, the "real-world problem" your code is solving is limited to its role within the larger system.

Data-oriented design is a natural mental fit for writing programs or services that play a limited part in larger systems, which is always what you are doing when you are writing code. Object-oriented design wants to take on the complexity of the whole real world, which is the right perspective for product design and architecture, not for writing code.


> My experience of Domain-Driven Design has been that it is extremely effective for driving conversations about the domain throughout the product life-cycle, but it produces frustrating codebases that are poorly attuned to the world outside the running process.

The DDD blue book by Eric Evans consists of two parts: Strategic Design and Tactical Patterns. Is it accurate to summarize your comment as saying that the strategic design works very well, but most of the information out there leads one to learn the tactical patterns in a particular OOP context, which isn't a universal approach, and in many cases shouldn't be the way to elaborate findings from Strategic Design?

Note that searching the web for "DDD" yields mostly OOP-related tactical patterns guidance, and this is why I think many people are so sceptical about DDD. It is the strategic parts where most of the (low-hanging fruit) value is.

Other, non-OOP 'tactical' DDD guidance (functional, functional-reactive, actor-driven, etc.) is harder to come by.


In my copy of the blue book, "Strategic Design" is Part IV, and there is no "Tactical Patterns" part or chapter. To be honest, I've only quickly read through Part IV, cherry-picking a couple of concepts, because it concentrates on techniques for dealing with very large domain models and/or very large organizations.

I think the good and bad are interleaved together throughout the book. Part I Chapter 2, "Communication and the Use of Language," is the one chapter I wish everybody I work with would read and digest. I think establishing a consistent language about the domain that is shared across functional groups is critical, so the domain terminology used in source code and comments is consistent with the terminology engineers use when talking with product and customer support. Part I Chapter 3 contains the clearest statement of (what I think is) the core error: "Tightly relating the code to an underlying model [which context makes clear is the domain model] gives the code meaning and makes the model relevant." The rest of the book is like that for me, alternating between vigorously nodding my head and yelling "WHY WOULD YOU TELL PEOPLE THAT," sometimes within the same chapter.

I suspect overall there's a communication issue with the book, where he focuses entirely on the themes of DDD and only occasionally gives lip service to other aspects of design. For example, when he says that module names and structures should reflect domain concepts, I think, "Yeah, that's really nice to the extent you can accomplish that, but you also want your module names and structures to reflect the logical structure of your program, so somebody can look at it and see how it works." You have these two aspects that should ideally both be legible in the code, but the DDD book doesn't show much respect for competing design priorities. In fact, it often warns against the dangers of being led astray by other design perspectives, such as architectural ones. Like so many other OO methodologies, the overwhelming thrust of the book is that if you attend to what it's teaching you, everything else will take care of itself.

A priest once told me, a good priest will tell you when you need a lawyer, a good lawyer will tell you when you need a therapist, a good therapist will tell you when you need a doctor, and a good doctor will tell you when you need a priest. Be careful with an expert who always tells you that their expert perspective is what you need. The DDD book is definitely that kind of expert you have to be careful with.



