This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is for legal reasons), especially if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the lesser of the two evils.
>when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag
I disagree, modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this is really an issue, you can remove these records in a transform layer before it hits your analytics team, e.g. in your data warehouse.
>soft deletes introduce yet another place where different users can make different assumptions
Again, you can transform these records out.
Most order forms are snapshots of data at the instant of their lodgement, since they are sales contracts. It is a rookie blunder to link them relationally to master data for products and PII &c.
The record of an order is not intrinsically PII and thereby subject to rights of erasure. It may well be equally unlawful in some jurisdictions to irrevocably destroy it entirely, it being necessary for accounting or tax audit, or even simply for mundane follow-up processes, such as returns, that arise from actionable consumer rights. Ergo, such documents must fundamentally survive the erasure/redaction of any PII they do include.
I would say the structure of the records themselves can stay, but not the data itself.
If you have a user table, maybe you can just randomly hash the sensitive user data but keep the record.
Not 100% sure about this though, since you could probably derive the user with statistics like: if it's known that 1 person gets a specific disease every 10 years and you have an obfuscated record of a person connected with that disease, it's fairly straightforward to derive who that person is just through that connection.
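A minimal sketch of the hashing idea above, assuming a Postgres users table (the table and column names here are made up, and md5 is purely illustrative - a keyed hash or random token is safer in practice):
-- pseudonymise the user but keep the row, so orders still point at a valid id
UPDATE users
SET email = md5(email),
    full_name = NULL,
    phone = NULL,
    deleted_at = now()
WHERE id = 12345;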
How are you going to respond to a warranty claim if you've wrongly deleted the order data when a subject requests that you delete all PII you have about them?
I'm not sure how this changes anything. Is your PII in the order forms/sales contracts mentioned above? If yes, you'll have to delete those as well anyway, right? If it's not, the order forms/sales contracts themselves don't have to be linked to something that may potentially get deleted.
Please note that by "immutable" I don't mean that data won't get deleted eventually, just that it won't be deleted until nothing needs it anymore (and until then won't be mutated either), so basically the same thing that languages like Haskell mean by "immutable". Then, once you don't need it (= it's not observable anymore), it could perhaps get archived or erased, whatever you prefer.
I'm not quite sure what "master data" means in English (a non-native language to me) but Wikipedia tells me that it's "data about the business entities that provide context for business transactions" (and lists examples that sound relevant for this situation to me). Based on that I'm inclined to think that this would qualify.
Hard deletes are also awful from the perspective of data preservation. For example, when youtube removes a video they also delete all the metadata or any indication that it ever existed. Countless people have lost what they thought was a secure record of at least the title of songs or videos they saved to a playlist.
There is also a more sinister side, which is that the ability to hard delete something forever means that bad actors can fabricate old "deleted" documents and accuse someone of having created and then deleted them.
I do think that hard deletes may sometimes be required to comply with legal requirements (e.g. complete expungement of personal information relating to a user). If it is not required by statutory law, sometimes it is written into commercial contracts.
Exactly. I get OP's point (i.e. you can accidentally include soft-deleted records in your results), but for some types of data hard deletes are an absolute no-go anyways, so you just have to live with it.
> if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order
Assuming you have constraints set up correctly (on delete no action or on delete restrict) then how could this ever happen? If you don’t have constraints set up correctly…
Assuming you're deleting the row because it shouldn't be used by read queries, constraints like you described prevent the problem of having orphaned records in the child table but also prevent you from achieving your goal. On delete cascade would allow you to achieve your goal and prevent the orphaned records but could lead to deleting more than intended (especially if the child table is also a parent table referenced by further foreign key constraints, its children could in turn have children, etc). Of course, with no action/restrict you could also manually cascade the delete, but if you actually don't want to delete a child row and there's not an appropriate alternative value for its foreign key then you're in a bit of a pickle.
So if you want to delete a user but keep the records of their orders and still know who made those orders, then some form of soft delete is probably your best option. I believe that's the point rm999 was making (in response to the article asserting that soft deletes are a "data model mistake"). Properly configured constraints can prevent an "oops" but don't really do anything to solve the problem of this sort of delete from some contexts but not others.
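As a rough sketch of that combination (schema invented for illustration), ON DELETE RESTRICT blocks accidental hard deletes while a flag carries the "deleted" state:
CREATE TABLE users (
    id bigserial PRIMARY KEY,
    email text NOT NULL,
    deleted_at timestamptz  -- NULL = active, non-NULL = soft-deleted
);
CREATE TABLE orders (
    id bigserial PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
    total numeric NOT NULL
);
-- "delete" the user without orphaning or cascading away their orders
UPDATE users SET deleted_at = now() WHERE id = 42;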
Foreign key constraints are a waste of effort for most at-scale web apps. They guard against a subset of problems that aren't actually problems (orphaned rows) at a real cost (for the db).
I disagree. Any fairly competent DBA will know how to set up the constraints correctly. It's not rocket science. If you can think logically enough to program, you can think logically enough to set up constraints correctly.
This should be the responsibility of the application developer creating the database schema and queries. A constraint is part of your application logic, not of the administration of the database.
Well, my team has never had an official DBA either. But we do all the tasks of a DBA. I've installed/upgraded, configured, tested backups, and hardened Oracle and Postgres more times than I can remember. We do all our own DML and DDL work. It really isn't hard to do it right.
That would just make the data loss problem worse still. I realise OP just chose an arbitrary example, but if you really are talking about users and orders, and if you delete a user, then really deleting the records for their associated orders is even worse than losing track of who made them.
I realize you might not be familiar with how database constraints work, but an on delete no action would totally prevent the user from being deleted. You literally cannot delete the user and their associated orders will definitely not be deleted.
Hard deletes most likely need to be supported, due to legal or contractual obligations. Designing with this in mind makes everything a lot easier in the long run.
IANAL, but... you might want to revisit that code. Article 17, the right to erasure, is about erasure of personal data, not about making it non-identifiable. Of course they don't define erase or delete :-)
> I don't see why the law should care in any way about a company populating NULL records.
It cares if the existence of this record still leaks private data. This is why talking about generic "records" here is pretty wrong - actual data is not interchangeable "records" where you can just slap on a generic cargo cult policy and think you're done.
Different use-cases require different data handling. Although, I do agree, for most CRUD cases it's enough to NULL out rows.
It's deleted because all connections (and metadata) have been erased.
In a database, if you null all fields but keep the entries and their inter table relationship intact, you still have identifiable data up to a certain point.
Imagine you have data about specific people with a user-country relationship, and you only have one person per country: if you null out a specific user's fields, you'll still be able to work out who that user is just by looking at which country the user was connected to, plus some external information.
This is the classical problem of deriving personal data from statistical reports and why data anonymisation is so complex.
I think the biggest mistake some startups make wrt their data model is not really thinking about it at all. The data model winds up being the byproduct of all the features they've implemented and the framework and the libraries they've used, rather than something that was deliberately designed.
At the other end of the scale is a data model designed for extreme extensibility.
If you ever hear anyone bragging that their data model is entirely metadata driven, and can be used to model anything - without changing the database - that's a huge red flag, as is looking in and seeing tables called "element", "business object" and the like.
Unfortunately, for most serious Enterprise systems, a degree of flexibility is essential. It's being able to pick the right balance between hard coding first class domain objects into the database and allowing for extensibility that IMO marks the truly expert system designer.
One underlying reason for this is that DBMSs have an unnecessary source of complexity: they have a separate Data Manipulation Language (DML) and Data Definition Language (DDL). They really ought to be unified, but few (any?) mainstream SQL databases are homoiconic in this way.
E.g.: It should be possible to take a query definition, request its columns ("schema only" execution), and then insert or merge the columns into a table definition somewhere. Something like:
SELECT SCHEMA( SELECT * FROM "blah" )
INTO "tablename"
When a black market is formed, it's a sign that there is an unmet demand. When you see the exact same "wrong" design pattern turn up over and over, it's a sign that the underlying system isn't meeting the needs of the developers.
The "bad" design that keeps cropping up over and over is the second system effect: Tables, relationships, and columns defined as data in a few simple tables, instead of being defined explicitly in the SQL schema as expected. This is less than optimal for lots of reasons: duplication of metadata, inefficient query plans, no foreign keys, inability to use most kinds of indexes effectively, etc...
However, the need is real: the ability to easily extend or generate table schemas without having to switch languages. You can argue that this is "not that hard", but you'd be absolutely wrong. It's obscenely difficult. I've tried, failed, and have given up. One use-case I had was automatically generating tables for importing data from PowerShell. My goal was to be able to pipe a stream of objects into an "Export-Sql" command and have it automatically generate the table schema on the fly based on the input columns it sees. I even wanted to be able to represent object hierarchies as sets of related tables with parent-child foreign key relationships automatically put in. I got a proof-of-concept working, but there were just too many edge-cases. Things like maximum lengths for key or index columns, maximum row size, inability to switch column types on the fly, etc...
So what do other people do? They also give up and resort to defining a single table that has the columns: "TableName, RowId, ColumnName, ColumnValue" and call it a day. I mean... what other options are there? Bang your head against the wall for months trying to deal with idiotic things like identifier length limits? Correctly escaping arbitrary input strings? Generating hundreds of distinct commands that cannot be parametrised and hope you don't have a SQL injection vulnerability lurking in there somewhere? It's nuts!
Have you seen just how much code it takes to take an arbitrary table, one with dozens of foreign key references, several filtered multi-column indexes, and views that depend on it, and then insert a column in a specific position? You have to drop everything, copy the table, rename, and then recreate everything. Doing this in code would be... I dunno... a few hundred thousand lines? That's absurd. You're not doing it. I'm not doing it. Microsoft did it once for SQL Server Management Studio, and I bet they're not rewriting that code in a hurry!
But go back to the "bad" schema example before: Is it really that bad? What if the database engine had the ability to store common prefixes just once, instead of repeating them for each row? What if the schema looked like this:
Database, Owner, Table, Column, Value
Starting to look familiar? A bit like [server].[database].[dbo].[Table] perhaps, familiar to every user of Microsoft SQL Server?
There's a natural hierarchy for describing the data, that lends itself well to being represented as a B-Tree, along with the data itself! That's what the "bad" schema is doing: it's representing the data with the most natural representation! It's not wrong at all!
The problem is that database engines have all sorts of features and optimisations that we want that aren't directly compatible with the naive implementation of the bad schema. Constraints, foreign keys, efficient storage, indexing, etc...
However, none of these features are fundamentally incompatible with an API that merely "pretends" that the data is stored in a single flat list that can have its schema updated with a simple insert. The database engine could simply factor out the schema part and the data part into different physical storage layouts, with all the usual efficiencies such as not having to repeat column names for every row.
To summarise: we could have our cake and eat it too. Database engines could be developed that use ordinary select/insert/delete/update statements to modify the schema, and have it perform just as efficiently as a database that uses clumsy statements like "alter table add column". In this world, a database schema upgrade could be as simple as a single literal "MERGE" statement!
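A purely hypothetical sketch of that idea (no mainstream engine exposes its catalog as a writable table like this; the catalog.columns table and its columns are invented): adding a column would just be an upsert of a row of metadata.
MERGE INTO catalog.columns AS c
USING (VALUES ('orders', 'discount_pct', 'numeric', true))
    AS v(table_name, column_name, data_type, is_nullable)
ON c.table_name = v.table_name AND c.column_name = v.column_name
WHEN NOT MATCHED THEN
    INSERT (table_name, column_name, data_type, is_nullable)
    VALUES (v.table_name, v.column_name, v.data_type, v.is_nullable);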
While schema creation SQL can be a bit unwieldy, I'm not sure I appreciate which part is the problem or what you're trying to achieve.
Obviously you can SELECT * INTO ... FROM ... if you're just temporarily inserting data.
I'm not sure I see the value in automatically importing arbitrary data into a schemad database object. I think it's too complicated to be carried out by the database and should probably be done by some other piece of software and under human supervision/guidance.
Are you suggesting that INSERT/UPDATE statements also have the ability to modify objects? That sounds like it would add complexity without much gain as opposed to just running an alter table query.
I'm not saying that the process of writing schema modifying queries is painless but I'm not sure I'm convinced that we can have the cake and eat it
For your example I personally would've awk'ed the data into a predefined schema or inserted it as JSON
> Obviously you can SELECT * INTO FROM .. if you're just temporarily inserting data
Okay, bad example. I should have specified: Insert into an existing table, add the missing columns automatically. There are some special-cases where most database engines can automatically generate schemas, and "SELECT... INTO" is one of those few. Usually only allowed with an empty destination table.
> I'm not sure I see the value in automatically importing arbitrary data into a schemad database object.
There are lots of use-cases for this. You might not have them, but other people do all the time. Just about any large-scale enterprise software needs to be extensible in the field, for example. Think the likes of SAP or Siebel. I've seen similar problems crop up in CMDBs, job and ticket management software, etc... Famously, Jira is slow precisely because it is so bad at handling this kind of extensibility efficiently.
> Are you suggesting that INSERT/UPDATE statements also have the ability to modify objects?
Not necessarily, although that could be an option. What I mean is that changes to the table schema should be made using insert/update/delete statements, where the only "data" is things such as the column names, types, constraints, etc...
> That sounds like it would add complexity without much gain as opposed to just running and alter table query.
It would dramatically reduce complexity, removing an entire language from database engines, along with all the associated vendor-specific syntax, quirks, and limitations.
In fact, the same "schema" that is used in the wire protocol could be directly equal to the actual schema, and its update language. So if you get back a query result (with data), and want to create a copy of that schema elsewhere (e.g.: a local cache database), then you just take the "header" from the result set and "insert" it into the destination database schema. No conversion, no escaping, nothing.
Seriously: Try this as an exercise. Use Java or C#, write a select statement from a query (that has join, views, etc...), and then write the code that generates a table to cache this data in a separate, local database instance.
Do it. Sit down and give it a good go. You won't appreciate how hard this is until you do!
Now write the code to update the cache table dynamically if the source table changes. Assume that table has a petabyte of data. (That's why it's a table, it's a cache for something that's far too big to fit in memory!)
Now write the code to do all of the above with foreign constraints.
Good luck!
> I'm not sure I'm convinced that we can have the cake and eat it
There is a reason NoSQL databases exist and are wildly popular. It's not that "schemaless" is truly better so often, it's more that modifying the schema in most DBMS offerings is so fiddly that it's essentially impossible to do programmatically.
> or inserted the data as JSON
NoSQL in a nutshell! You haven't solved the problem, you've given up and resorted to schemaless tables instead, exactly the "bad" example above with all of its limitations and issues.
I've seen some good attempts at solving this problem, but they barely scratch the surface.
> Okay, bad example. I should have specified: Insert into an existing table, add the missing columns automatically. There are some special-cases where most database engines can automatically generate schemas, and "SELECT... INTO" is one of those few. Usually only allowed with an empty destination table.
See, I don't think I would want SQL that modifies the schema programmatically. And imho, that sounds like an invitation for trouble. It can be done of course, but, to me, the value of the relational data model is to enforce consistency through constraints. If I don't care about consistency, I may as well just dump the data as schemaless JSON and then periodically turn that semi-structured data into actual relational data.
Note that I said schemad database object.
Normalizing the data model is too complex to do programmatically and sounds ill advised. If you want the advantage of painless schemaless data, dump it as JSON.
> NoSQL in a nutshell! You haven't solved the problem, you've given up and resorted to schemaless tables instead, exactly the "bad" example above with all of its limitations and issues.
Note that I meant this as a stop-gap solution to build on and turn into schema-controlled data. Preferably iteratively and hopefully not in production.
I've done this when I've had semi-structured data that I wanted to dump into the RDBMS at which point there was no schema. But it's a band-aid. That data is in my experience best turned into schema-controlled data ASAP, otherwise inconsistencies start to creep in. Relations that were not immediately obvious pop up and I don't think there's hope to let the SQL database carry out such a task programmatically.
There are already tools that do a good job of analyzing data, which aid the construction of schemas. There are also tools that make updating them relatively painless, especially if you've already inserted the semi-structured data, as most RDBMSs have decent support for handling JSON these days.
Then again, I'm not sure I'm able to appreciate the language you dream of. To me, it sounds like SQL where ALTER TABLE and UPDATE/INSERT are conjoined into one query. And I guess it would also deal with simple relations (i.e., the ones already present by virtue of arrays, associative arrays etc.). Admittedly, I see little point beyond having the same lines that were in two queries now in one, which I imagine would mostly just slow down the query parser and make things hard for the optimizer. Not that I don't appreciate that it might be convenient, it just doesn't sound like the convenience justifies the added complexity. Besides, with dumb relations, you end up with denormalized data, ripe for inconsistencies - i.e., the whole problem with NoSQL data modelling in the first place.
I wonder if such a job is not better put on something that is not the RDBMS, and in my experience, such tooling exists and already does a decent job without putting an extraneous responsibility on the already very complicated RDBMS.
I don't think I would feel comfortable putting schema generation in the hands of Java/C# applications. Instead I would go the route of dumping the semi-structured data straight into the RDBMS as JSON, if I was unable to predefine the schema. I would do this in a development environment, then I would analyze the data, define the schema to structure it and then move it into production where the data would be queried in regular SQL.
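A sketch of that stop-gap route in Postgres (table and key names invented): land the raw payloads as jsonb first, then promote them into a designed schema once the shape is understood.
CREATE TABLE raw_events (id bigserial PRIMARY KEY, payload jsonb NOT NULL);
-- later, once the schema has been designed:
INSERT INTO orders (order_id, customer_id, total)
SELECT (payload->>'order_id')::bigint,
       (payload->>'customer_id')::bigint,
       (payload->>'total')::numeric
FROM raw_events;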
Speaking entirely from my own experience, schemaless data is nothing but a pain that results in much more work in the long run. For me it has been the data modelling equivalent of peeing your pants: first it's nice and warm, but pretty soon it gets cold and unpleasant.
Much like there's a reason we do not make computers program themselves, giving them the responsibility of designing the data model seems even less wise. It's a task that requires intelligence.
Secret hacker pro-tip: inner platforms and key/value pair representations do not actually improve extensibility. You always have a schema -- you get to decide whether it's explicit or implicit. Their problem isn't that their data model is too extensible: it's that it's plain old bad.
This is the place I work at. The data model was designed with a narrow focus. When that turned out to not be viable, the company moved into an adjacent and much larger market. But the names never changed, and the subtle differences between the two worlds were never addressed. So now our application is full of terminology and restrictions that confuse our customers, and our database doesn't match anyone's mental model of what the application does. It's all workable, but IMO we've paid (and pay) a not-insignificant price in productivity and complexity because we never took the time to fix these things.
At this point a ground-up rebuild is probably going to be no slower than trying to update the existing app. Neither will be cheap.
Whatever you do, make sure to have a plan to deliver it piecemeal in successive releases. I am currently in the process of reworking the data model of a moderately complex application, and it has proven to be very risky and to take an insanely long time to do it all at once.
I hear ya. Both would probably cost about the same today - the rebuild probably more - but it's still cheaper in the long run. Unless the business goes back to what it was, it will keep diverging from the current terminology. No manager wants to hear it, but taking a 3-6 month breather to address tech debt like this is worth its weight in gold.
Practically speaking the data model creates very little value. If your startup is trying to make money, features are more important than design for a good stretch.
There comes a time to refactor and fix your architecture but it's usually not at the beginning.
You can't design a data model if you don't know what you're building. And no startup really knows what they're building.
> Practically speaking the data model creates very little value.
That can be said about any cost centre, but you don’t have to drag managers kicking and screaming to get them to buy fire insurance.
Practically what it does is allow the company to keep up velocity and not be distracted putting out fires everywhere.
Of course building features is the team’s entire reason for existing. But there is no advantage to defer refactoring to some later date. The longer you wait the more painful it gets.
Chances are the time never comes, once progress stalls and the company isn’t out of business yet someone will have the brilliant idea to rewrite everything from scratch, which is just lighting money on fire with extra steps.
I worked at a startup that rewrote major parts of their product three times before the first Series A check cleared (while I worked there at least, I think they had another rewrite before my time). The company is alive and well because they didn't waste time and money on pondering architecture to solve a very difficult class of problems when they didn't know which problems were worth money yet.
They did do some smart things working around such known-unknowns, like operate as a consultancy for several years while building out the tech stack that would ultimately become the product catalog. That way they didn't have enormous risk associated with rewrites since all users were internal and zero projects actually needed feature or ABI compatibility with the stack.
The problems they had when I left were obvious, but the data model wasn't one of them. I'll stand by what I said above - you can't specify a data model for a problem you don't know. And no startup really knows what problems they are going to solve when they start.
"The next series is the time to fix your mistakes from this one".
No matter what, startups break as they grow. You will need to fix things. Just make sure they're not sooo bad that you can't do it in a timely/affordable way.
I'm not aware of a single project, ever, that has gotten their data model right up front and not had to iterate on it countless times as it grew/evolved. Except maybe NASA. Even the best early data models fail after years of updates and evolution.
This is the rationalisation I get every time when I tell companies that their data model is a mess. Never mind that neither I nor the parent said anything about doing it up front.
Of course they have to iterate, the problem is that there is no deliberate effort anywhere, it’s just piling more crap on top of old crap and deluding themselves that they are some kind of lean, agile visionaries because of it.
No one gets it right, and it's just grandstanding to pretend that "deliberate effort" is the distinguishing difference between good or bad data models. Unless you're dealing with highly specialized technical scenarios, most software is written to solve ambiguous and nebulous business problems that even the business doesn't necessarily understand.
> most software is written to solve ambiguous and nebulous business problems that even the business doesn't necessarily understand
Obviously if you're not deliberate about what problem you're trying to solve in the first place, no amount of deliberate effort will produce a good data model. Designing business processes must be done with the same deliberate effort, and also need to be constantly refactored.
Have you worked at a start-up? Or in a field new to you, or new altogether? No amount of well-intended deliberate design will make up for domain expertise. Ever. The best you can do is mitigate as much future refactoring as possible, but even then that's not always the best use of time.
> Designing business processes must be done with the same deliberate effort
That's not always realistic. Businesses processes solve existing problems, sometimes tackle new problems, but just like data models hindsight is 20/20. And just like data models, business processes are constantly evolving and never done.
> Have you worked at a start-up? Or in a field new to you, or new altogether?
Yes, to both questions. And I still stand by everything I said.
In fact both cases made me form this exact opinion. Being all hand-wavy and haphazard about our business processes is precisely and without a doubt what killed it. The second example was from a big regulatory compliance thing that had taken the particular industry by surprise, again because nobody in that industry can be bothered to understand their own business processes and instead cargo-cults everything. Managers and domain experts would throw PowerPoint mockups over the fence and tell the devs to "just make it clickable and put it on the web site". I had to constantly fight them to get them to take their own jobs seriously. I nearly burnt out and had to quit shortly after, but we actually released on time and, though not perfect, came out miles ahead of the competition.
> And just like data models, business processes are constantly evolving and never done.
That makes deliberate design of business process more important, not less. A culture of hand-waviness is precisely why businesses are still caught by surprise by things like regulations that they are given years in advance to implement. If you don't know where you are, you don't know what you're supposed to pivot to.
Yes absolutely. From day 0 everyone should already know how to do data modeling and perform migrations. If you don't and call yourself an engineer, you have no business starting a company.
I think this is largely a consequence of microservices. What is the "data model" here? You're thinking database, to a microservice that's a repository implementation detail.
I don't think it's a consequence of microservices, I think the problem is as old as the programmable computer. I'm sure the old paper and even clay tablet processes had to be constantly refactored too.
And a common argument I hear from microservice fanboys is that their old monoliths were suffering from high coupling and low cohesion and that microservices help with that. But I don't see why they couldn't have just refactored their monolith to address the problems they were having.
I’m not saying micro services are terrible at all. But I think the low coupling leads to a lot of “problems” when you try and re-aggregate data for analytics purposes when fundamentally micro services are meant to enable teams to work on them without worrying about how some other micro service is storing its data. It’s a trade off you make.
> fundamentally micro services are meant to enable teams to work on them without worrying about how some other micro service is storing its data
What exactly is it that microservices bring to the table to achieve that, that doesn't already exist out of the box in every single language in common business use?
It lowers coupling between teams. If my team is responsible for microservice A then I have no dependency on the team working on microservice B. Microservice A writes to its own database. There are no migration dependencies, etc. It's a distributed system. You have to accept that one microservice could be using mongodb, another postgresql, another mssql. You're typically communicating through APIs, one microservice is not reading directly from the database of another microservice. Materializing all this crap when you want to do analytics is where the problem comes in as now you are forced to construct some unified data model.
Yes, but you don't need microservices to do that. You can just lower the coupling between the teams working on your monolith using the means of abstraction of the language is written in.
> If my team is responsible for microservice A then I have no dependency on the team working on microservice B.
How is that different from the team responsible for Java package A having no dependency on the team working on Java package B? Or whatever language you happen to use.
> Microservice A writes to its own database. There are no migration dependencies, etc. It's a distributed system. You have to accept that one microservice could be using mongodb, another postgresql, another mssql.
Nothing is stopping you from connecting to multiple databases from the same monolith either.
>You're typically communicating through APIs, one microservice is not reading directly from the database of another microservice.
You can (and should) build well defined APIs between modules in your monolith too. And it's faster than having to keep serializing and deserializing everything all the time.
>Materializing all this crap when you want to do analytics is where the problem comes in as now you are forced to construct some unified data model.
Why do you think you need to unify your data model in a monolith? If you think it's the wrong strategy for your problem all you have to do is not do it.
> Yes, but you don't need microservices to do that. You can just lower the coupling between the teams working on your monolith using the means of abstraction of the language is written in.
Forcing a common platform across unrelated components because you have decided to make it a monolith is tighter coupling than exists with microservices, so, no, you can't. Choosing a monolith is choosing a higher level of irreducible coupling than choosing independent (not necessarily "micro") services.
Maybe I should have called it "oligoliths". I'm not arguing for forcing the entire business to run in the same address space. For example it makes sense to split OLTP from batch processing, since they have different lifecycles. But I don't think that's what most people mean when they talk about microservices.
Whether parts of your monolith are overly coupled or not is purely a design decision. Pay the same attention to the APIs and areas of responsibility between the parts of the monolith as well as you do the microservices and you'll be no worse off. Devs who just can't help themselves and turn every monolith into a mess of spaghetti don't exactly inspire confidence that they will do better with microservices, especially with the added complexity of building a distributed system.
My rule is "make it easy for us to fix our mistakes".
Even when we've spent a bunch of time planning out data, we still got a lot of things wrong in hindsight. The reality is we didn't know enough about our product direction to make any truly informed decisions.
In general, poor decisions seem to stem from working in ambiguity about product, rather than poor technical decisions.
> we didn't know enough about our product direction to make any truly informed decisions.
Bingo. Very few programmers are working on projects that they have many years of domain expertise in, so the data models we come up with are always going to be limited to our experience in the here and now. It's one thing to organize highly technical code around things like graph algorithms or combining multiple b-trees in a single operation to look up data; it's a whole different thing to tackle line-of-business problems where things are not so well defined (even to the in-house domain experts).
This is a direct result of "move fast and break things", your database schema is the first thing broken and it never recovers.
It's baffling to me that for many companies I've worked for, their data model is basically 100% tech debt that can never be fixed because the cost is too high.
The end of this article is particularly weird. Is it really suggesting that a good general rule is to optimise for business metric queries (which sounds like something that would generally run daily during off peak hours or ad hoc when someone needs the data) over the most commonly run reads/updates (which sounds like something that will happen multiple times per minute for every active user)?
I feel like I'm missing something because that seems insane to me.
To your point, many of these could be addressed by making an analytics database copy of the transactional database, for example scrubbing test data and removing soft deletes in your etl.
From my experience with metabase, this makes it easier to use anyway but it means you have to maintain an etl.
I once saw something a little similar to this, except with one flavor of DB rather than several.
A company you've likely heard of went hard for a certain Java graph database product, due to a combination of an internal advocate who seemed determined to be The GraphDB Guy and an engineering manager who was weirdly susceptible to marketing material. This because some of their data could be represented as graphs, so clearly a graph database is a good idea.
However: the data for most of their products was tiny, rarely written, not even read that much really, even less commonly written concurrently, and was naturally sharded (with hard boundaries) among clients. Their use of that graph database product was plainly contributing to bugginess, operational pain, mediocre performance (it was reasonably fast... as long as you didn't want to both traverse a graph and fetch data related to that graph, then it was laughably slow) and low development velocity on multiple projects.
Meanwhile, the best DB to deliver the features they wanted quickly & with some nice built-in "free" features for them (ability to control access via existing file sharing tools they had, for instance) was probably... SQLite.
Been through this as well. There was one database for relational data, one for logs, one for analytics, one for miscellaneous, one for binaries, one for time-series, one for key-values, one for caches and probably a lot more! Total nightmare.
Nobody fully knew how operations, schemas, indexing or queries in any of them worked. Usually someone had managed to hack something together in a week and then the rest of the team just did minor changes to existing queries. Joining between the databases was also a fun exercise.
I blame it all on docker. It's so easy to just docker-compose run grafana:latest, then dust off your hands and claim you have a database running. Articles on HN about how fancy Netflix's setup is contribute to this too - you don't have the same ops capacity to replicate a FAANG stack.
In the end all of it got replaced with only mongodb and firefighting went down to 0. Everybody in the team knew how to do everything, from new queries to migrations and backup-recovery. It's probably worse in every aspect on each task the specialized databases were solving, but it works good enough and often bringing a really good swiss-army-knife is better than having a caravan of specialized machines which each require special expertise.
> On the flip side, soft deletes require every single read query to exclude deleted records.
You can use partial indexes to only index non-deleted rows. If you are worried about having to remember to exclude deleted rows from queries: Use a view to abstract away the implementation detail from your analytics queries.
You can also use the index to cluster database blocks of the table based on the index (postgres => cluster command). This means all active records will be written to the same database blocks, and all deleted records will be kept in separate database blocks. This can speed up queries that need to access a lot of active records.
This is a good alternative to moving deleted records from an active table to a deleted table.
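A minimal Postgres sketch of the partial-index-plus-view approach described above (names are illustrative):
-- index only the live rows, so the "not deleted" filter stays cheap
CREATE INDEX users_active_idx ON users (id) WHERE deleted_at IS NULL;
-- give analytics a view that already hides soft-deleted rows
CREATE VIEW active_users AS
SELECT * FROM users WHERE deleted_at IS NULL;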
- Having informal metrics and dimension definitions: you throw together something quick and dirty and then realize there's something semantically broken about your data definitions or unevenness. For example your Android app and iOS apps report "countries" differently, or they have meaningfully different notions of "active users"
- Not anticipating backfill/restatement needs. Bugs in logging and analytics stacks happen as much as anywhere else, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.
- Being over-attentive to ratio metrics (CTR, conversion rates) which are typically difficult to diagnose (step 1 figure out whether the numerator or the denominator is the problem). Ratio metrics can be useful to rank N alternatives (eg campaign keywords) but absolute metrics are usually more useful for overall day to day monitoring.
- Overlooking the usefulness of very simple basic alerting. It's common for bugs to cause a metric to go to zero, or to be double counted, or to not be updated with recent data, but often times even these highly obvious problems don't get detected until manual inspection.
> - Not anticipating backfill needs. Bugs in logging and analytics stacks happen, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.
This matches my experience. Building tools that allow you to rebuild some or all of a dataset with minimal headache make any individual task much easier. Both in terms of safety, and in terms of things like branching/dev environments.
> what's the relation to bugs in logging and analytics?
I'm not sure what you mean. Software has bugs, data has bugs, etc. To be able to fix a bug and rerun a solution is important in all areas of software, it has nothing to do with logs or analytics (but data and data model type questions usually are important to those domains).
> also, is there a good resource on how to backfill?
Not really, because "backfill" means something different to everyone that holds data. Starting with what questions to ask, I would ask "What do we do if a lot of our data shows up incorrect" and "What do we do if lots of our data goes missing", and solving problems in an individual data stack that arise from those questions.
As an example, at a previous job our ETL/ELT system was all started with a file showing up in an S3 bucket. The code that ingested the contents of those files occasionally had bugs that required reingesting of all data that was processed by that version of the code. Having tools to identify (at the data level) what data was affected by this bug, and then being able to delete that data from a datastore and reingest only those S3 files with a newer version of the ingestion code made these types of bugs much easier to manage over time.
The reason I asked about resources is because I have data generated by a personal project. The initial data model was sloppy and so now I'm finding myself having to backfill to clean the data and it's rather painful. Though I haven't come across anything that deals with the subject so I'm just winging it on my own
OP used "Log stacks", but "Log Stacks" are just a specific flavor of event-based timeseries/analytics stacks. If you were to build a log-ingest and log-aggregation system, you'd just be building an ETL but with a specific emphasis on logging.
> The reason I asked about resources is because I have data generated by a personal project. The initial data model was sloppy and so now I'm finding myself having to backfill to clean the data and it's rather painful. Though I haven't come across anything that deals with the subject so I'm just winging it on my own
Snowflake's stage system works similarly to what I'm describing. You can use S3 as a stage, and then load data from the stage into a table. If something bad happens to your data _in Snowflake_, you can just reload from the Stage (with an updated INSERT).
For more ad-hoc ELT/ETL systems (ie not just All-In on Snowflake), you'd have to just assess your own tooling and build it yourself. In general, when building ingest systems I try to document whatever I can per-record. Meaning, each record in a store includes what version of what software ingested it, and a reference to the raw-est form of that data possible (ie, a JSON blob of the original event or a S3 URL to that event's backing source). This lets you say "We identified a bug in the ingest layer at version 0.1.1, we need to reingest all that data with 0.2.0", and then easily identify and remove the exact data that encountered that bug (because you recorded 0.1.1 as a part of the record itself), and then build a list of exactly what S3 files need to be reingested by 0.2.0.
If you're comfortable expanding your dataset a bit to include that type of metadata you save yourself a lot of time when bad things happen (which they will). It's always a game of metadata/bloat/compute time vs. savings, though.
edit I'll add, none of this matters if your dataset is small enough to be imported from 0 in almost no time at all. If you could write a small script to just parse every file in S3 and insert it into a database, and the time it would take to finish doesn't upset you, you're totally fine just doing that. What I described above is for when your data becomes so large that reimporting from 0 is basically impossible.
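A sketch of that record-level metadata in SQL terms (the events table and its columns are invented): because each row remembers the ingest version and source file that produced it, a buggy batch can be isolated and re-run precisely.
-- which source files need re-ingesting with the fixed code?
SELECT DISTINCT source_s3_key
FROM events
WHERE ingest_version = '0.1.1';
-- drop only the affected rows before re-processing those files with 0.2.0
DELETE FROM events WHERE ingest_version = '0.1.1';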
For example, your app logs clicks on the "submit" button, but there's a bug in your UX and the button is clickable/tappable multiple times while the form is being processed, instead of being disabled while being processed. Some users are tap-happy and will tap many times thus counting for multiple submissions. If that's how you count actions in your dashboards it will overcount.
In terms of resources, I'm not aware of a one-size-fits-all approach... the most basic would be to define upfront what the playbook is for making backfills, and testing it once in a while if you don't get the natural opportunity to do it.
Okay, I was confused as I thought you were referring to application logging and not logging that occurs in the data layer.
With a normalized and well defined schema, such inconsistent data is impossible. I guess your point is then to have a well defined process on how to go about resolving this when things go awry -- an important point that makes sense.
Regardless of where the logging is, there can be bugs, and there will be bugs given a sufficient amount of time and complexity. It's all about planning for recovery.
If your company has a subscription business model, keep a history of user's subscriptions. They change over time and it is likely you will need to measure popularity and profitability of product offerings over time. Please don't force your analytics team to rely on event logs to reconstruct a subscription history.
This. You want to capture timestamps as users downgrade, upgrade, change quantity, churn, etc. If you have a status field, timestamp the changes to it. This way it’s easy to get the state of the world on any given day, which is a common analysis that’s done to study behavior of cohorts of subscriptions over time.
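One common shape for this (a sketch; the columns are illustrative) is a history table with validity ranges, so "state of the world on day X" becomes a single range query:
CREATE TABLE subscription_history (
    subscription_id bigint NOT NULL,
    plan_id bigint NOT NULL,
    quantity int NOT NULL,
    status text NOT NULL,          -- active / paused / churned ...
    valid_from timestamptz NOT NULL,
    valid_to timestamptz           -- NULL = still the current state
);
-- every subscription's state as of a given day
SELECT * FROM subscription_history
WHERE valid_from <= date '2020-01-01'
  AND (valid_to > date '2020-01-01' OR valid_to IS NULL);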
I first learned what an "audit log" was because I had to use an audit log to figure out the states of record in the database at a time in the past, because some specific pieces of data were being lost in the "soft-update" database setup.
That's a good point. Most subscription service providers, like Stripe, Chargebee, Braintree, etc, use a fairly conventional one-to-many data architecture for Customers and Subscriptions.
Just take care to use the subscription service provider data model how it is intended. It is possible to design your integration in a way that goes against the grain and end up with gaps in your data. For example, by re-using a single subscription instance per customer and changing its properties when the customer down/upgrades rather than creating a new Subscription instance.
I have seen so many people argue against soft deletes over the years. But I have also had so many instances where users 'accidentally' deleted a bunch of items and then call support to ask if there are any backups. And then I'll have to reconstruct the data from yesterday's backup plus today's changes. A soft delete will take care of this.
And no amount of "are you really really really sure you want to delete this?" confirmations are going to fix this. You could require the whole Spongebob Squarepants ravioli ravioli give me the formuoli song and dance and people will still delete hundreds or thousands of records by accident.
>On this case, one way is to make a past_ or deleted_$tablename where you insert the deleted row before deleting it from production table.
The problem with this that it gets really cumbersome if you have a complex system of tables that depend on the main table, you'll end up having to make deleted/archived versions of all those tables. In that case it's easier to have a deleted/archived flag in the main table.
Yeah but it's also not unusual to have shared accounts for manual testing in prod, or to write automated smoke tests that run in prod after a deploy...
I'm not sure how to get around this, actually. Any production service of a certain scale is going to have some amount of fake activity caused by debugging, monitoring, testing, feature demos to clients/investors/internal stakeholders... It seems naive to tell an engineering team "no test accounts in prod ever because it makes analytics harder."
We just have a live clone in Dev, updated monthly, and a dev instance of the front end to use it. Sometimes monthly is too long, so a DBA will run a manual update in off-peak times. Queries that don't write data back can be moved directly to prod, though we also have an ODS with denormalized data for easier creation of reports & analysis. And changes that significantly write back to the DB are moved to test first, then to prod. Sometimes different people have things going on and that requires different timing or a clean copy of dev or test, and we'll temporarily spin up another instance.
To be fair, the above description paints a better picture than we have in reality. There's nuances and edge cases. But prod is kept pretty clean. Most of the problems we have are related to upgrades-- these are enterprise apps that all use Oracle, and the latest updates for one might require a particular version of Oracle, but another app will be in conflict with that version. So a lot of the DBA work involves wrangling support from vendors on how to work around these. You'd think an app using Oracle 12c would run fine if you upgrade to 13c, but no it doesn't.
I would add to their semi-structured data fields section a suggestion to add a version or type key. Otherwise your code consuming those fields may grow over time into a bunch of conditionals to figure out what is in the JSON.
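For example (key and table names invented), writing a version key alongside the payload lets consumers branch or filter on it explicitly instead of sniffing the shape of the JSON:
INSERT INTO events (payload)
VALUES ('{"schema_version": 2, "user_id": 42, "action": "signup"}'::jsonb);
-- consumers can then target a specific shape
SELECT * FROM events WHERE (payload->>'schema_version')::int = 2;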
In my experience I would add: Building systems out of "lego blocks".
It is possible to get all the pieces needed to build a data server for an enterprise pre-built from cloud providers, then plumb them together so they mostly work.
When the heat comes on and people are using it for real and it must scale (even a little) it blows up horribly.
The "lego bricks" save a lot of time and money, and mean that people with only half a clue can build large, impressive-looking systems, but in the end people like me are picking up the pieces.
There are advantages for soft deletes for CRUD architecture, but are there any for CQRS/ES (Event Sourcing)?
I guess if your read model is based on RDBMS then it makes sense, otherwise it depends on the database system in question (i.e. some NoSQL databases like C*[1] and Riak[2] are implementing deletes by writing special tombstone values, which is kind of soft-delete but on the implementation level - but you can't easily restore the data like in case of RDBMS).
> Typically semi-structured data have schemas that are only enforced by convention
Technically, in Postgres you can (kind of) enforce arbitrary schemas for semi-structured data using CHECK constraints. Unfortunately this isn't well-documented and NoSQL DBs often don't support similar mechanisms.
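For instance, a rough Postgres sketch (the table and keys are made up):
ALTER TABLE events
ADD CONSTRAINT payload_shape CHECK (
    payload ? 'user_id'
    AND jsonb_typeof(payload->'user_id') = 'number'
    AND jsonb_typeof(payload->'occurred_at') = 'string'
);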
Seems likely that the enforced schema would then break things when someone updates the "live" schema without updating all of the checks littered through downstream tables...
>The exact definition of what comprises a session typically changes as the app itself changes.
Isn't this an argument for post-hoc reconstruction? You can consistently re-run your analytics. If the definition changes in code, your persisted data becomes inconsistent, no?
> Queries for business metrics are usually scattered, written by many people, and generally much less controlled. So do what you can to make it easy for your business to get the metrics it needs to make better decisions.
A simple but useful thing is setting the database default time zone match the one where most of your team is (instead of UTC). This reduces the chance your metrics are wrong because you forgot to set the time zone when extracting the date of a timestamp.
I cannot overstate how bad this advice is. Everything should be UTC by default. You can explicitly use timestamp with timezones and frankly it's trivial to query something like midnight-to-midnight PST. Your team should learn this as early as possible.
Build tooling around this, warn users, hell, educate them, but don't set up foot-guns like non-UTC.
If I see a timestamp without a timezone, it must always be UTC. To do anything else is to introduce insanity.
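For example, a midnight-to-midnight Pacific-time rollup over UTC-stored timestamptz values is a one-liner in Postgres (table and column names assumed):
SELECT date_trunc('day', created_at AT TIME ZONE 'America/Los_Angeles') AS day_pacific,
       count(*)
FROM orders
GROUP BY 1
ORDER BY 1;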
This. I once joined a company with local timezone per deployment and it was a nightmare. Not only in terms of development and debugging, but even for all the support tools required and the numerous bugs we found.
I insisted that all the tools that were going to be installed under my watch would be UTC, and never experienced any time issue on them.
We went this route and ended up with a db set to pst and some servers based on Chicago time. Endless time bugs. Pick one timezone for everything or just use unix timestamps.
Good list in there. Len Silverston's Data Model Resource books are amazing... especially volume 3. Reading that book and getting to the point where I actually understood the most generalized patterns in it was a total game changer for me.
Doesn’t this mean beta testing in prod? Development, at least everywhere I’ve worked, takes place on a separate db. For instance where I work atm we copy prod data to a staging db every couple of months and develop/test new features there before rolling them out. Any data coming from the beta test, in prod, is not really test data it is prod data and I don’t see why you’d want to remove it.