Whenever I'm adding a new table, I almost always try to add an "internal_notes" text field and a "config" JSONB field, no matter how static we think the data definition is going to be. Because at an early-stage startup, you're going to need to hack some fixes together. And given a choice between needing to QA a database migration vs. just adding {{foo.config.extra_whatever}} in a template, or using it in a single line of logic, the latter is just so much more feasible and easy-to-understand for a same-day bugfix, and you can always move to a dedicated column with a data migration later.
Of course, if you're adding this to a few dozen rows, then you might need to jump into SQL to do this at scale. And then you run into the wonderful situations described in the article. I've had to write the following a great many times:
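A sketch of the kind of statement I mean (the table, column, and key names here are just placeholders):

    UPDATE assets
    SET config = jsonb_set(config, '{extra_whatever}', '"some value"'::jsonb)
    WHERE id = 123;

Or this, which is great if you need to set a top-level key to something static; the || operator just mixes the two together similar to Object.assign:

    UPDATE assets
    SET config = config || '{"extra_whatever": "some value"}'::jsonb
    WHERE id = 123;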
And honestly, the syntax, especially the second one, isn't half bad once you get used to it. That said, the syntax from the article is amazing, and it will go a long way towards people reaching for Postgres even when their data model has a lot of unknowns.
PLEASE don't do this. I've worked on too many systems where these optional keys ended up being relied upon, misspelled, missing, unexpectedly null, and so on, producing a raft of bugs that made customers very unhappy and necessitated a time-consuming "data cleanse".
Use a system which makes it easy to do migrations instead. It will take barely any longer to create that new column and it won't store up trouble in the future.
As an industry, software development needs to get better at cleaning up TODOs. This is especially bad when business needs always win out. Sometimes, technical health needs to take priority and drive down these issues. If there’s confidence in the organization that we can take on some tech debt and pay it down a little bit later, then you can move faster (just like leverage in finance). It’s when the debt you take on is so great, or so well swept under the rug, that it turns into a monster and ruins your day. If tech debt items are allowed to live forever, the org quickly maxes out its credit and can never take on more, which slows delivery.
In the scrum world, prioritizing a feature story over a debt story should require a strong reason with significant debate. Don’t let them live longer than a couple of sprints.
I don't think there is any substitute for experience in this matter. Some hacks are fine to last essentially forever, some should never be allowed to happen and others are probably ok but only if they get unwound in under a week. It's rarely obvious which ones are which until you've felt the pain.
I tend to believe that hacks that undermine the integrity of your data are the absolute worst (which is why my reaction to this was so visceral) but can sense I might just be a product of my environment - different industries and use cases vary quite wildly in their expectations of quality and attitudes to failure.
Agree about Scrum. It creates a very high cost to unwind any tech debt that can't be smuggled into a story.
Same camp here, even more so if this is a startup: these once-optional keys and settings will be ossified into the database, the knowledge of why they existed will be lost (as most startup knowledge is after 5 years), and untangling that mess will take an engineer days to weeks of discovery work into every possible incantation of its usage.
If you give such a powerful feature to every table, you are just shooting yourself in the foot on the whole modeling of your application: there is no static representation of the model; it becomes dynamic, depending on the JSONB configuration.
I totally understand where you're coming from. I think the greatest issue is when there's an overlap between two things:
- the fact that systems like Django actually make it difficult to switch branches from a feature branch with one migration to another on a different migration tree, without fully rolling back everything (including the test data you may be deep in the weeds experimenting on, that may not yet be in fixtures), and
- being in an environment where you require same-day turnaround on things that aren't really bugs but would be seen that way by high-profile clients, and the priority is high enough to disrupt flow and run the risk of those "raft of bugs"
For instance: User A is highly influential, and it would be meaningful to the business to have rapid turnaround on ensuring that Asset B they're looking at gets a specific fix to some data that you know you should eventually make fully customizable at users' discretion, but you don't yet have the right spec in place to build a customization on a per-Asset basis, and every team member is deep in debugging a feature. So you just do asset.config.foo = bar, ensure it's extremely well documented at the site where a column definition would be in your model file, make a simple commit that you know won't affect anything else, take that on as technical debt, and keep yourself from getting too far out of flow.
The "data cleanse" you speak of comes in the form of a regular inventory of live config variables (select the set of keys and distinct values across all configurable tables), and the creation of user stories where necessary.
If there was a way to `git stash` all changes made to a dev database, and pop that afterwards, that would absolutely make it possible to do a more robust workflow. It's ironic that e.g. Postgres MVCC and transactional DDL are so robust, and yet I haven't found a tool that makes it feel nearly as seamless.
Conceptually it seems simpler to add a column to a table than to add a property to some serialized structure. I wonder if it isn't a question of process and tooling encouraging the wrong thing?
At least in other contexts I have seen JSON or XML fields used as a trick to circumvent a heavyweight process for adding new columns. But IMHO the process should be improved rather than circumvented. Sometimes you need to quickly add a column, but adding a column is also a pretty safe operation. Doing the same thing "one level up" does not improve safety or convenience.
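For what it's worth, the "safe operation" in Postgres really is a one-liner (names invented for illustration); adding a nullable column with no default is a cheap, metadata-only change:

    ALTER TABLE assets ADD COLUMN extra_whatever text;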
In more traditional organizations I have seen the DBA act as a gatekeeper, which led to developers inventing all kinds of hacks to get around the barrier. I can understand the thinking which led to this, but it is dysfunctional.
If you don’t know the column will live forever, avoid adding it. Adding columns is easy, migrating and deleting columns is super annoying, even with auto-migration and an ORM, because the risk of data loss is significant.
But isn't this the same if you add some field to a JSON structure? If code depends on it, you can't remove it. The risk of data loss from removing a JSON field would be exactly the same as removing a column.
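To make the symmetry concrete, a sketch with invented names; both statements throw the data away just as irreversibly:

    -- dropping a dedicated column
    ALTER TABLE assets DROP COLUMN extra_whatever;

    -- removing the equivalent key from a JSONB column
    UPDATE assets SET config = config - 'extra_whatever';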
I absolutely agree with you. It also affects how you can scale a team at an early stage. It is super easy to tell a new member to look up the framework-specific documentation.
Personally I wouldn't want to inherit something like this. These two special fields are going to be updated by some people, and not by others. Give it a few years and some staff turnover and there is plenty of opportunity for bitrot and name collisions.
I'm trying to understand this post, so I turned the initial "problematic" SQL into this:
SELECT j->'k' FROM t;
UPDATE t SET j = f(j, '{"k"}', '"v"');
... and the provided solution into this:
SELECT j['k'] FROM t;
UPDATE t SET j['k'] = '"v"';
I think I understand what a binary JSON column might be, but by stripping variables and function names down to the minimum, both of these constructs look a bit heavy on the quotes and brackets (parentheses). There are squiggly brackets and square ones, single and double quotes. A lack of a colon is clearly an oversight.
I think I understand now: the function "jsonb_set" on a '@thing@' is to become a magic result by inference, with enough syntax on a variable.
I think I may have some way to go before I achieve enlightenment 8)
So, the extra quotes may seem like a lot, but I think in practice these are going to be parameterized queries... so something like this (in Python):
cursor.execute(
    # %s is the placeholder style used by psycopg; the JSON payload travels as a string
    """UPDATE t SET j['k'] = %s;""",
    (json.dumps(newvalue),))
For transmitting JSON to the database server, it makes sense that the JSON would be serialized as a string, because that's the only standardized serialization format for JSON anyway. When you use parameterized queries, the parameters don't get quotes.
EDIT: Okay, so I looked up what {'k'} means, i.e. it's an array with one element 'k'. And in the article's function it's the "path" to the key to be set.
Still not sure if ARRAY['k'] is a thing, though. I could only find this sort of syntax described in the context of definition, e.g. ARRAY[4] for an array of length 4.
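For what it's worth, ARRAY['k'] does appear to be a thing: it's the array constructor syntax, and it can be used interchangeably with the '{k}' literal for the path argument, e.g.:

    -- both set key "k" to the JSON string "v"
    SELECT jsonb_set('{"k": "old"}', ARRAY['k'], '"v"');
    SELECT jsonb_set('{"k": "old"}', '{k}', '"v"');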
The semantic difference between `jsonb_set` and the new subscripting syntax reminds me of ES6 arrow function situation: a new and improved syntax built seemingly to replace the old one, except there's a slight semantic difference such that you sometimes still have to fall back to the old syntax for the specific effect. Not complaining, but just an additional quirk to have to teach beginners.
I am really really unconvinced that any teams should be reviewing patches by email like this when we have excellent PR-based workflows and interfaces. I know these people are 1000x better C programmers than me, but I think they are wrong and being unreceptive to improved technologies on this one.
Because the dot is used in table.column notation when the column name is ambiguous (in a join). Overloading it to also denote JSON field values would be confusing, the very thing the change mentioned in the article seeks to avoid.
It's got nothing to do with ambiguity, you're fundamentally thinking about dot notation wrong.
The dot notation is the actual naming convention, not using it is a convenient shortcut.
Many professional programmers would chew you out for not using it; being lazy and leaving it off can easily introduce unintended bugs.
For example, you add a new column to a table, and boom, half your SQL statements now fail because it's got the same name as an existing column on another table.
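A sketch of that failure mode with made-up tables: the unqualified query works until orders also gains a created_at column, at which point Postgres rejects it as ambiguous:

    -- breaks with: column reference "created_at" is ambiguous
    SELECT created_at FROM users JOIN orders ON orders.user_id = users.id;

    -- the qualified version keeps working
    SELECT users.created_at FROM users JOIN orders ON orders.user_id = users.id;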