While Postgres is indeed case-insensitive about keywords, writing queries with keywords in all caps is usually an effort to increase legibility for visual pattern matching. It absolutely isn't needed, but if I'm debugging a query of yours I will send it through my prettifier so that I can breeze through your definitions without getting hung up on minor weird syntax things.
It's like prettification in any other language - visual structures that we can quickly recognize (like consistent indentation levels) make us waste less time on comprehension of the obvious so we can focus on what's important.
The only thing I really object to is "actuallyUsingCaseInIdentifiers". I never want to see columns that require double quotes for me to inspect on the CLI.
I find all caps identifiers wind up just looking like interchangeable blocks, whereas lowercase ones have word shapes. So all caps just slows down my reading.
I feel similarly, and I also have friends with dyslexia who hold even stronger opinions on it. All caps, in addition to reading as "shouting" to my ancient internet-using brain (and thus rude in most cases), creates big, similar rectangular blocks as word shapes and is a big speed bump to reading speed for everyone (whether or not they notice it). For some of my friends with dyslexia, who have a hugely tough time with word shapes at the best of times, all caps can be a hard-stop "cannot read" blocker. They say it is like trying to read a redacted document where someone just made rectangular black marker cross-outs.
Personally, given SQL's intended similarity to English, I find that I like English "sentence case" for it, with the opening keyword starting with a capital letter and nearly every remaining letter lower case (except for proper nouns, the truly case-sensitive parts of SQL like table names). Sentence case has helped me in the past to spot things like missing semicolons, whether in dialects like Postgres' that require them or near keywords like `Merge` that require them, and to visually confirm that the `select` under a `Merge` is intended as a clause rather than the start of a new "sentence".
It's a bit of a "Why not both?" situation, I think? You can have blocks and sentence case:
Select *
from the_table
where some_column = 12
and other_column is not null
order by order_column;
That seems so much more readable to me. As I said, that single capital `S` in the outermost "select" has come in surprisingly handy in my experience when scanning through a collection of statements or a transaction or a stored procedure or even just a statement with a bunch of nested sub-selects. It's an interesting advantage I find over "all lower case" or "all upper case" keywords.
your example is less readable for me. not by a lot, but still.
the other example has the commands, in caps, on the left and the values on the right, in lowercase. your example removes one of those aspects and makes everything lowercase. my brain can ignore all-caps stuff, as these are just commands and the things i actually care about mostly are the values.
but i mean, in the end, it's just preferences. if you write SQL queries, #1 is that you understand them well :)
> your example is less readable for me. not by a lot, but still
Agree, but I wonder how much of that is just the lack of colouring. My brain is suuuuper hard wired to expect all the special keywords to be identified that way as well as by case.
Mostly I'm a caps guy because my ahk scripts expand text like "ssf","w" and "gb"* to be the way I learned to write them at first.
I prefer lower case for my personal legibility reasons, and it seems like a prettifier should be able to adjust to that user's preference. It's not a team sport for me, so I never had a conflict of styles other than converting public code samples to match my way.
I've always found it funny that SQL was designed the way it is to be as close to natural English as possible, but then they went ahead and made everything all-caps
My understanding is that the caps were syntax highlighting on monochrome screens; no longer needed with colour. Can't provide a reference, it's an old memory.
Most of the SQL I write is within a string of another programming language, so it's essentially monochrome unless there's some really fancy syntax highlighting going on.
VS Code does (did?) detect embedded SQL in PHP and correctly colour it, but only if it's on a single line. Any linebreaks and the colour is turned off. Also, if you're using prepared statements and have an @label, and that label is at the end of the string (so immediately followed by a closing quote), the SQL colouring continues into the rest of the PHP beyond the string. So it's important that single-line SQL statements ending in a @label be edited into multi-line statements to turn off the broken SQL colouring. Odd.
PHP strings tend to have better syntax highlighting with here/now docs (i.e. starting with `<<<TOKEN`). I've found SublimeText to have excellent SQL detection when using these tokens to delineate queries (and the syntax lends itself well to block strings anyways).
This is JetBrains' "language injection" feature if you want to look it up. It works with any language the IDE supports, and like a sibling comment mentioned, it does more than syntax highlighting.
If you're working with an established framework and project structure, their IDEs pull that information out of it; otherwise you'll need to at least tell it the dialect, but if you e.g. configure the database as a data source in the IDE, you'll get full schema xref.
Another aside: that is true for a huge range of programming languages as well as things like HTML. I believe it can automatically escape " as \" in strings when those strings are marked to the IDE as HTML.
Same. I am a bit conflicted: I enjoy the SQL and don't really like ORMs, but I hate seeing big blocks of SQL in my code.
So I wrote a thing that lets me use SQL text as a function. It is several sorts of terrible, in that way where you should never write clever code, but I am not really a programmer; most of my code is for myself. I keep using it more and more. I dread the day someone else needs to look at my code.
I find that in the worst cases I can always copy and paste into a quick temporary buffer that is highlighted. I might be doing that naturally anyway if I'm trying to debug it, just to run it in a data IDE of my choice, but sometimes even just using a scratch VS Code "Untitled" file can be useful (its SQL auto-detect is usually good enough, and switching to SQL is easy enough if it doesn't auto-detect).
I think the "color is all we need" idea makes sense in proportion to how many of our tools actually support colorization.
E.g., the last time I used the psql program, I don't think it had colorization of the SQL, despite running in a color-capable terminal emulator.
It probably doesn't help that terminal colors are a bit of a mess. E.g., piping colored output through 'less' can result in some messy control-character rendering rather than having the desired effect.
Like sigils in that regard. Perl-type sigils are extremely nice... if you're editing in Notepad or some ancient vi without syntax highlighting and the ability to ID references on request. Little point to them, if you've got more-capable tools.
I'm curious about this for DuckDB [1]. In the last couple months or so I've been using DuckDB as a one-step solution to all problems I solve. In fact my development environment rarely requires anything other than Python and DuckDB (and some Rust if native code is necessary). DuckDB is an insanely fast and featureful analytic db. It'd be nice to have a linter, formatter etc specifically for DuckDB.
There is sqlfluff etc but I'm curious what people use.
[1] DuckDB's SQL dialect is very close to Postgres; it's compatible in many ways but has some extra QOL features related to analytics, and lacks a few features like `vacuum full`.
i think the IDEA sql prettifier is pretty silly sometimes. it really likes indenting stuff to make sure things are aligned, which often results in dozens of spaces where it insists on aligning WHEN past CASE. I think it would be perfectly reasonable to indent WHEN and ELSE 4 spaces less, for example. Similar things happen with nested conditions like (a or (b and c)) all getting pushed to the right.
Ditto - if I'm throwing out an inspection query just to get a sense of what kind of data is in a column I won't bother with proper readable syntax (i.e. a `select distinct status from widgets`). I only really care about code I'll need to reread.
> writing queries with keywords in all caps is an effort to increase legibility for visual pattern matching
Considering most programming languages do just fine without ALL CAPS KEYWORDS, I'd say it's a strange effort. I wish SQL didn't insist on being different this way.
I agree with you on prettification though. As long as the repository chooses a prettifier, you can view it your way and then commit it their way. So that's my advice: always demand prettification for pull requests.
I don't write or read SQL too often, but I prefer the ALL_CAPS because it usually lets me know if something is part of the SQL syntax itself or a function or whatever, or if it's referencing a table/column/etc.
Obviously not very foolproof, but automated linters/prettifiers like the one in DataGrip do a good job with this for every query I've ever thrown at it.
The all caps syntax also helps queries stand out as distinct from typical source languages. It is often helpful, since SQL tends to end up in all sorts of applications.
I think fighting in PRs over syntax preferences is pretty useless, so dev shops should generally have code style guidelines to keep things consistent. In my company we use all-caps casing since we have momentum in that direction, but I think that decision can be reasonable in either direction as long as it's consistent. It's like tabs vs. spaces: I've worked in companies with both preferences, and I just configure my editor to auto-prettify code coming out and auto-lint code going in, and never worry about it.
I think the increased legibility for visual pattern matching also makes SQL easier to read for many of the 350 million color blind people in the world.
Why don't they deprecate some of these features? If they're such easy stumbling blocks, seems like it makes sense to disable things like table inheritance in new schemas, and require some kind of arcane setting to re-enable them.
e.g. the suggested replacement for timestamp is timestamptz, which has its own problems (notably, it eagerly converts to UTC, which means it cannot account for TZ rule changes between storing the date and reading it). If medium term scheduling across multiple countries is something that needs to work in your app, you're kind of stuck with a column with a timestamp and another column with a timezone.
> stuck with a column with a timestamp and another column with a timezone.
I've been tinkering with a weird hack for this issue which might help or at least offer some inspiration.
It's similar to the foot-gun of "eagerly convert straight to UTC" except you can easily recalculate it later whenever you feel like it. Meanwhile, you get to keep the same performance benefits from all-UTC sorting and diffing.
The trick involves two additional columns along with time_zone and time_stamp, like:
-- Column that exists as a kind of trigger/info
time_recomputed_on TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
-- You might be able to do this with special triggers too
-- This coalesce() is a hack so that the above column changing causes re-generation
estimated_utc TIMESTAMP GENERATED ALWAYS AS (COALESCE(timezone('UTC', timezone(time_zone, time_stamp)), time_recomputed_on)) STORED
PostgreSQL will recalculate estimated_utc whenever any of the other referenced columns change, including the useless dependency on time_recomputed_on. So you can force a recalc with:
UPDATE table_name SET time_recomputed_on = now() WHERE time_recomputed_on < X;
Note that if you want time_recomputed_on to be more truthful, it should probably get updated if/when either time_zone or time_stamp is changed. Otherwise it might show the value as staler than it really is.
timestamptz is no more than a glorified unix timestamp with nice formatting by default. And it's great! It provides easy zone conversions directly in SQL and lets you use your language's actual "date with time zone" type.
`timestamptz` doesn't convert to UTC, it has no timezone - or rather, it gets interpreted as having whatever TZ the session is set to, which could change anytime. Postgres stores the value as a 64 bit microseconds-since-epoch. `timestamp` is the same. It's sad that even the official PG docs get this wrong, and it causes problems downstream like the JDBC driver natively mapping it to OffsetDateTime instead of Instant.
But you're right that timestamptz on postgres is different from timestamptz on oracle, which _does_ store a timezone field.
Here's the PostgreSQL documentation about timestamptz:
> For timestamp with time zone, the internally stored value is always in UTC (Universal Coordinated Time, traditionally known as Greenwich Mean Time, GMT). An input value that has an explicit time zone specified is converted to UTC using the appropriate offset for that time zone. If no time zone is stated in the input string, then it is assumed to be in the time zone indicated by the system's TimeZone parameter, and is converted to UTC using the offset for the timezone zone.
> When a timestamp with time zone value is output, it is always converted from UTC to the current timezone zone, and displayed as local time in that zone. To see the time in another time zone, either change timezone or use the AT TIME ZONE construct (see Section 9.9.4).
To me it seems to state quite clearly that timestamptz is converted on write from the input offset to UTC and on read from UTC to whatever the connection timezone is. Can you elaborate on which part of this is wrong? Or maybe we're talking past each other?
Timestamp is a 64-bit count of microseconds since the epoch. It's a zone-less instant. There's no "UTC" in the data stored. Times are not "converted to UTC" because instants don't have timezones; there's nothing to convert to.
I'm guessing the problem is that someone heard "the epoch is 12am Jan 1 1970 UTC" and thought "we're converting this to UTC". That is false. These are also the epoch:
* 11pm Dec 31 1969 GMT-1
* 1am Jan 1 1970 GMT+1
* 2am Jan 1 1970 GMT+2
You get the picture. There's nothing special about which frame of reference you use. These are all equally valid expressions of the same instant in time.
So somebody wrote "we're converting to UTC" in the postgres documentation. The folks writing the JDBC driver read that and now they think OffsetDateTime is a reasonable mapping and Instant is not. Even though the stored value is an instant. And the only reason all of this works is that everyone in the universe uses UTC as the default session timezone.
To make it extra confusing, Oracle's (and possibly others') TIMESTAMP WITH TIME ZONE actually stores a timezone. [1am Jan 1 1970 GMT+1] <> [2am Jan 1 1970 GMT+2]. So OffsetDateTime makes sense there. And the generic JDBC documentation suggests that OffsetDateTime is the natural mapping for that type.
But Postgres TIMESTAMP WITH TIME ZONE is a totally different type from Oracle TIMESTAMP WITH TIME ZONE. In Postgres, [1am Jan 1 1970 GMT+1] == [2am Jan 1 1970 GMT+2].
You are thinking of UTC offsets as zones here, which is wrong. Yes, you can interpret an offset from the epoch in any utc offset and that's just a constant formatting operation. But interpreting a zoned datetime as an offset against a point in UTC (or UTC+/-X) is not.
You do not confidently know how far away 2025-03-01T00:00:00 America/New_York is from 1970-01-01T00:00:00+0000 until after that time. Even if you decide you're interpreting 1970-01-01T00:00:00+0000 as 1969-12-31T19:00-0500. Postgres assumes that 2025-03-01T00:00:00 America/New_York is the same as 2025-03-01T00:00:00-0500 and calculates the offset to that, but that transformation depends on mutable external state (NY state laws) that could change before that time passes.
If you get news of that updated state before March, you now have no way of applying it, as you have thrown away the information of where that seconds since epoch value came from.
I'm not quite sure what your point is. Postgres doesn't store time zones. "The internally stored value is always in UTC" from the documentation is false. It's not stored in UTC or any other zone. "it is always converted from UTC to the current timezone zone" is also false. It is not stored in UTC.
This is pointless pedantry: Expressing it as the number of seconds since a point that is defined in UTC is a conversion to UTC by anyone else's logic (including, clearly, the documentation writers), even if there's not some bits in the database that say "this is UTC", even if that point can be expressed with various UTC offsets.
The internal representation is just a integer, we all agree on that, this is not some great revelation. The fact that the internal representation is just an integer and the business rules surrounding it say that integer is the time since 1970-01-01T00:00:00Z is in fact the cause of the problem we are discussing here. The internal implementation prevents it being used as a timestamp with time zone, which its name and the ability to accept IATA TZs at the query level both in datetime literals and features like AT TIME ZONE or connection timezones strongly imply that it should be able to do. It also means the type is flawed if used to store future times and expecting to get back what you stored. We complain about behaviours like MySQL's previous silent truncation all the time, documented as they may have been, so "read the source code and you'll see it's doing XYZ" is not relevant to a discussion on if the interface it provides is good or bad.
Nor is the link you shared the full story for the source code, as you'd need to look at the implementation for parsing of datetime literals, conversion to that integer value, the implementation of AT TIME ZONE, etc.
This is not pedantry. It has real-world consequences. Based on the misleading text in the documentation, the folks writing the Postgres JDBC driver made a decision to map TIMESTAMPTZ to OffsetDateTime instead of Instant. Which caused some annoying bugs in my 1.5M line codebase that processes financial transactions. Which is why I'm a bit pissed about all this.
If you walk into a javascript job interview and answer "UTC" to the question "In what timezone is Date.now()?", you would get laughed at. I don't understand why Postgres people get a pass.
If it's an instant, treat it as an instant. And it is an instant.
> note that all OffsetDateTime instances will have be in UTC (have offset 0)
Is this not effectively an Instant? Are you saying that the instant it represents can be straight up wrong? Or are you saying that because it uses OffsetDateTime, issues are being caused based on people assuming that it represents an input offset (when in reality any such information was lost at input time)?
Also that page implies that they did it that way to align with the JDBC spec, rather than your assertions about misleading documentation.
Seeing this topic/documentation gives me a sense of deja vu: I think it's been frustrating and confusing a great many perfectly decent PostgreSQL-using developers for over 20 years now. :P
A lot of these aren't postgres-specific. (null weirdness, index column order, etc.)
For example, how nulls work - especially how they interact with indexes and unique constraints - is also non-intuitive in MySQL.
If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.
> If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.
FWIW, since Postgres 15 you can influence that behaviour with NULLS [NOT] DISTINCT for constraints and unique indexes.
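A minimal sketch of the difference (hypothetical `users` table; the pre-15 behaviour is the NULLS DISTINCT default):

-- Default (NULLS DISTINCT): duplicate (email, NULL) pairs are allowed.
-- With NULLS NOT DISTINCT, the second insert below fails.
CREATE TABLE users (
    email    text NOT NULL,
    username text,
    UNIQUE NULLS NOT DISTINCT (email, username)
);

INSERT INTO users VALUES ('a@example.com', NULL);  -- ok
INSERT INTO users VALUES ('a@example.com', NULL);  -- unique violation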
I didn't know this when I started, so I never vacuumed the reddit databases. Then one day I was forced to, and it took reddit down for almost a day while I waited for it to finish.
Auto vacuuming is enabled now. We did have some near misses due to long running vacuums that barely completed before wraparounds, but got things tuned over time.
I did none of that work but was on the team where it happened.
> Normalize your data unless you have a good reason not to
Ouch. You don't want to just say that and move on.
The author even linked to a page citing 10 different kinds of normalization (11 with the "non-normalized"). Most people don't even know what those are, and have no use for 7 of those. Do not send people on wild-goose chases after those higher normal forms.
The problem with BCNF is generally that you're enforcing a fairly complex and subject-to-change relation at the database level rather than in application logic.
The one exception I'll make from the very start is "add tenant identifier to every row, yes, even if it's linked to another table that has tenant identifiers."
Sure, this means you will have some "unnecessary" `tenant_id` columns in tables where you could get it through a relation, but it saves you from _having_ to include that relation just to limit by tenant, which you are almost guaranteed to want some day. (Especially if you want to enable row-level security¹ someday.)
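A rough sketch of the pattern (all names invented); the child table carries tenant_id even though it's reachable through the parent, which keeps tenant scoping and row-level security policies join-free:

CREATE TABLE orders (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id bigint NOT NULL
);

CREATE TABLE order_lines (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id  bigint NOT NULL REFERENCES orders (id),
    tenant_id bigint NOT NULL  -- "unnecessary", but makes scoping by tenant trivial
);

ALTER TABLE order_lines ENABLE ROW LEVEL SECURITY;

-- Hypothetical policy keyed off a session setting; adapt to however you pass tenant context.
CREATE POLICY tenant_isolation ON order_lines
    USING (tenant_id = current_setting('app.tenant_id')::bigint);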
We do a variant of this across all our tables. If we have a parent-child-child relationship, then all child tables, regardless of depth, will have the parent id.
This way we can load up all the data needed to process an order or invoice etc easily, without a ton of joins.
We don't do multi-tenant, instead separate databases per tenant so far.
How is this working for you so far? Do you ever need to report across the multiple tenants, and how do database migrations go? I'm starting to look into this, and purely for reporting and database migrations I'm leaning towards multi-tenant.
The product I'm working on has not needed to report across multiple tenants.
We do have some customers which have separate daughter companies that might technically be individual tenants, and where we might need to report across those. But in all those cases so far we've been able to host the databases on the same server, so we can easily join tables from the individual databases in the same query.
Database migrations are very smooth, given each tenant has their own database, so we can do it when they're ready instead of needing a service window that fits all our tenants. We have several that are essentially 24/7 operations, and while they do have periods of low activity that fits a service window, they usually don't line up well between them.
Likewise, schema upgrades are trivial. We have written our own little tool that updates the database given a source schema description in XML (we've had it for 15+ years now), and as it does not do destructive updates it is low risk. So schema upgrades are done automatically as part of our automated update deployment (à la Windows Update).
Of course this requires a bit more thought during schema design, and sometimes we do add some manual cleanup or similar of tables/columns to our tool that we know are safe to remove/change.
One upside is that performance is easy to manage. A single tenant will almost never cause performance issues for others than themselves. If they start doing that we can always just trivially move them to their own server.
A downside is that we have a lot of large code lists, and currently these are kept per-database as we've traditionally been deployed on-prem. So we're looking to consolidate those.
We do have another product that I haven't worked on, it's a more traditional SAAS web app thing, and that does have multi-tenant. It's not as business-critical, so the service window constraint isn't an issue there.
Given that it has an order of magnitude more tenants than the application I work on, I think that multi-tenant was a decent choice. However it has also had some complications. I do recall hearing some issues around it being a more critical point of failure, and also that certain larger customers have started to want their own separate database.
I think ideally we would have combined the two approaches. Allow for multi-tenant by having a tenant id everywhere, but also assume different tenants will run in different databases.
Thank you very much for your time, I appreciate it! It definitely seems automated schema updating is necessary if you're doing more than a couple of databases, and you raised many other good points that I hadn't fully considered. I can definitely appreciate larger clients wanting their own dedicated databases, so planning for that initially could be a wise choice.
Thank you again!
Nice one! I'm working on an app that is user/consumer facing but will eventually have a teams/business offering. Everything has a uid with it because users are the core of the app. But yeah, if you're multi-tenant then tenants should be treated like that as well.
In a great number of (probably most) business-to-business software domains, most rows are only relevant to one tenant at a time. It’s a partition/shard key, basically: queries will only uncommonly retrieve rows for more than one tenant at a time.
Hassles emerge when the tenant that owns a row changes (client splits or mergers), but even then this is a common and robust architecture pattern.
The usual advice is to normalize into third normal form. But that's not enough in some cases, too much in others, and the reason it breaks down is performance about as often as it isn't.
Isn't 6NF essentially a flavor of EAV? I think it is.
6NF means having one non-PK column, so that if the value would be NULL then the row need not exist, and so the value column can be NOT NULL.
But when you need all the columns of what you'd think of as the <thing> then you need to go gather them from all those rows in all those tables. It's a bit annoying. On the other hand a proper EAV schema has its positives (though also its negatives).
It's similar, but instead of one table holding lots of attributes, there are separate tables that hold optional fields that might otherwise be null if they were in the main table.
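A quick illustration with made-up tables: 6NF gives each optional attribute its own narrow table, while EAV collapses them into one generic table.

-- 6NF-ish: one table per optional attribute, so the value column can be NOT NULL
CREATE TABLE person           (person_id bigint PRIMARY KEY);
CREATE TABLE person_birthdate (person_id bigint PRIMARY KEY REFERENCES person, birthdate date NOT NULL);
CREATE TABLE person_nickname  (person_id bigint PRIMARY KEY REFERENCES person, nickname text NOT NULL);

-- EAV: one generic table holding every attribute, at the cost of per-attribute typing
CREATE TABLE person_attribute (
    person_id bigint REFERENCES person,
    attribute text,
    value     text,
    PRIMARY KEY (person_id, attribute)
);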
Every time I see a suggestion like this, I wonder what kind of projects people work on that only require a few columns.
In my experience every non-trivial project would need several dozen columns in core tables, all of which could be NULL while the user works on the record during the day.
In our current project we'd have to do several hundred joins to get a single main-table record.
I also wonder why people have such an aversion for NULL. Crappy tools?
No, it's just a dedication to pure relational algebra. Nulls introduce "three-valued logic" into your code/SQL. A nullable boolean column can contain 3 different "values" instead of two: true is not false, true is not null, false is not null (and null is neither true nor false). Or in a nullable numeric field, you have rows where that column is neither greater than zero nor less than zero.
On the other hand, even if you eliminate all nulls, you still have to do some three-valued logic ("A join B on B.is_something", "A join B on not B.is_something" and the "anti-join": "A where no B exists"), but only in the context of joins.
It feels a little like Rust's Option type. If something is "nullable", these are tools that force you to deal with it in some way, so there are fewer surprises.
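A tiny demonstration of the three-valued part, with a throwaway table:

CREATE TABLE t (flag boolean);
INSERT INTO t VALUES (true), (false), (NULL);

SELECT count(*) FROM t WHERE flag;          -- 1
SELECT count(*) FROM t WHERE NOT flag;      -- 1
SELECT count(*) FROM t WHERE flag IS NULL;  -- 1: the row both predicates above silently skip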
Long before databases could even store structured JSON data, junior developers used to bikeshed viciously over the correct degree of normalization.
More experienced developers knew that the correct answer was to duplicate nothing (except for keys obviously) and then to denormalize only with extreme reluctance.
Then databases like mongo came along and encouraged those juniors by giving them something like a database, but where normalization was difficult/irrelevant. The result was a brief flowering of horrible database designs and unmaintainable crap towers.
Now the pendulum has swung back and people have rediscovered the virtues of a normalized database, but JSON columns provide an escape hatch where those bad practices can still flower.
Eh. JSON has its place. I have some stateful data that is fairly transient in nature and which doesn’t really matter all that much if it gets lost / corrupted. It’s the sort of thing I’d throw into Redis if we had Redis in our stack. But the only storage in my project is S3 and Postgres. Postgres allows me to trivially query, update, analyze the usage of the feature, etc. Normalization wouldn’t buy me much, if anything, for my use case, but it would make “save this stuff for later” more of a nuisance (a sync across a bunch of rows vs a single upsert).
That said, I’ve worked on projects that had almost no normalization, and it was pure hell. I’m certainly not arguing against normalizing; just saying that data blobs are useful sometimes.
1. To store JSON. There's a pattern where, when your webserver calls into some third-party API, you store the raw API response in a JSONB column and then process the response from there. This gives you an auditable paper trail if you need to debug issues coming from that 3rd-party API (a rough sketch follows below).
2. To store sum types. SQL not supporting sum types is arguably the biggest deficiency when modelling data in SQL databases. There are several workarounds - one of them being "just chuck it in a JSONB column and validate it in the application" - but none of the workarounds is particularly great.
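For pattern 1, a minimal sketch of what that can look like (table and column names are invented):

-- Audit table for raw third-party API responses
CREATE TABLE api_responses (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    provider    text        NOT NULL,
    received_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb       NOT NULL
);

-- Later, dig individual fields out when debugging
SELECT id, payload->>'status'
FROM api_responses
WHERE provider = 'some_provider'
  AND payload @> '{"status": "failed"}';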
Even if you care about it, you will still often wind up with a junk drawer of JSONB. I don't really see it as a problem unless people are writing bad queries against it instead of lifting values out of the JSONB into their own columns, etc.
Most developers using these kinds of tools these days are actually building their own database management systems, just outsourcing the persistence to another DBMS, so there isn't a strong imperative to think about good design so long as it successfully satisfies the persistence need.
Whether we actually should be building DBMSes on top of DBMSes is questionable, but that is the current state of affairs regardless.
A previous employer thought that SQL databases didn't understand graphs, so they made their own system for serializing/deserializing graphs of objects into Postgres.
They never used queries and instead had their own in-memory operators for traversing the graph, and had to solve problems like deleting an entry and removing all references, and partial graph updates.
This needs a working schema migration process, including the ability to undo a schema change if the new column tanks performance or breaks stuff.
If there are CLI tools involved, you also need to ensure you can handle some downtime, or do a synchronized version update across the company, or support both old and new schemas for a while.
If a database is not part of the team's primary product, all of this could be missing.
this is really nice. i am glad the author put it together. i didn't know the pg docs were 3200 pages long! i have been using it for a while and try to learn as i go. i really do like the docs. and i also like to read articles on various particular subjects as i find a need to.
i think the author might find it helpful to readers to add a note to https://challahscript.com/what_i_wish_someone_told_me_about_... that if someone is selecting for b alone, then an index on columns (b, a) would work fine. i think this is kind of implied when they talk about selecting on a alone, but maybe it wouldn't hurt to be extra explicit.
(i didn't spend much time on the json/jsonb part since i rarely use that stuff)
From my experience with a lot of hilarious SQL stuff I have seen in the wild: it would be a good start to read Codd's paper and try to understand what the relational model is. It's only 11 pages long, and doing that would reduce the suffering in this world.
Nice article! One thing I'd add is that almost all of it applies to other MVCC databases like MySQL too. While some details might be different, it too suffers from long transactions, holds metadata locks during ALTERs, etc. - all the good stuff :).
I've been learning Postgres and SQL on the job for the first time over the last six months - I can confirm I've learnt all of these the hard way!
I'd also recommend reading up on the awesome pg statistics tables, and leveraging them to benchmark things like index performance and macro call speeds.
I had an interesting problem occur with the pg stats. We were migrating and had a version column, i.e. (key, val, version).
We were migrating from version 1 to version 2, double writing into the same table. An index on (key, val, version) was being hit by our reader process using a where clause like key=k and version=1.
When we flipped the reader to read version 2, the latency jumped from 30ms to 11s. Explain showed a sequential scan even though the index could serve the query. I was able to use MATERIALIZED and reorder CTEs to get the planner to do the right thing, but it caused an outage.
We were autovacuuming as well. I ended up deleting the old version and rebuilding the index.
My theory is that because the read load was on 50% of the data, the stats were super skewed.
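For anyone who hasn't seen it, the MATERIALIZED trick looks roughly like this (PostgreSQL 12+, table/column names invented); it fences off the planner so the CTE is computed as written:

WITH filtered AS MATERIALIZED (
    SELECT key, val
    FROM kv_versions
    WHERE version = 2
)
SELECT * FROM filtered WHERE key = 'some-key';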
That nested JSON query operator chains such as json_col->'foo'->'bar'->>'baz' internally return (copy) entire sub-objects at each level and can be much slower than jsonb_path_query(json_col, '$.foo.bar.baz') for large JSONB data
... although I haven't had the chance to verify this myself
I got nerd-sniped on this, because I actually hadn't heard that before and would be horrified if it were true. It took some non-trivial digging to even get down into the "well, what does foo->>'bar' even map to in C?" level. I for sure am not claiming certainty, but based merely on "getIthJsonbValueFromContainer"
<https://sourcegraph.com/github.com/postgres/postgres@REL_17_...> it does seem that they do palloc copies for at least some of the JSONB calls
Postgres does have infrastructure to avoid this in cases where the result is reused, and that's used in other places, e.g. array constructors / accessors. But not for jsonb at the moment.
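If anyone wants to measure this on their own data, a rough way to compare the forms (assuming a jsonb column json_col on a table big_table; both names invented):

-- Chained operators: each -> step yields a full jsonb sub-document
EXPLAIN (ANALYZE, BUFFERS)
SELECT json_col->'foo'->'bar'->>'baz' FROM big_table;

-- Single path extraction
EXPLAIN (ANALYZE, BUFFERS)
SELECT jsonb_path_query_first(json_col, '$.foo.bar.baz') #>> '{}' FROM big_table;

-- Middle ground: one operator call taking the whole path
EXPLAIN (ANALYZE, BUFFERS)
SELECT json_col #>> '{foo,bar,baz}' FROM big_table;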
> It’s possible that adding an index will do nothing
This is one of the more perplexing things to me, where the Postgres ideology is a bit too strong, or at least the way it works is too hard for me to understand (and I've tried - I'm not going to claim I'm a genius, but I'm also not a moron). I hear there may finally be support for some kind of hints in upcoming versions, which would be very welcome to me. I've spent way too much time trying to divine the voodoo of why a slow query is not using indexes when it seems obvious that it should.
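Not a substitute for hints, but when an index gets ignored, my first steps usually look like this (table and column names invented):

-- See what the planner actually chose and what row counts it estimated
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- Stale statistics are a common culprit
ANALYZE orders;

-- For diagnosis only: make sequential scans look very expensive in this session
-- to check whether the index *can* be used at all
SET enable_seqscan = off;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
RESET enable_seqscan;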
Only some of these are really Postgres specific (use "text" / "timestamptz"; make psql more useful; copy to CSV). Most of them apply to relational databases in general (learn how to normalise!; yes, NULL is weird; learn how indexes work!; locks and long-running transactions will bite you; avoid storing and querying JSON blobs of doom). Not that that detracts from the usefulness of this article - pretty much all of them are important things to know when working with Postgres.
Since we are on the topic, and since your article clearly mentions "Normalize your data unless you have a good reason not to", I had to ask. I am trying to build a news aggregator and I have many websites. Each of them has a slightly different format. Even though I use feedparser in Python, it still doesn't change how some of them put HTML text inside content while some of them break it down into a separate media XML attribute, retaining only a basic textual summary inside a summary attribute.

Do you think it makes more sense to store a single RSS item as one column inside Postgres, or should it be stored after parsing it? I can see upsides and downsides to both approaches. Store it as XML and you have the ability to change your processing logic down the line for each stored item, but you lose the flexibility of querying metadata and you also have to parse it on the fly every single time. Store it in multiple columns after processing it and it may require different logic for different websites, plus changing your overall Python processing logic requires a lot of thinking about how it might affect each source. What do you guys recommend?
With the caveat that you probably shouldn't listen to me (or anyone else on here) since you are the only one who knows how much pain each choice will be ...
Given that you are not really dealing with structured data - you've said that different sites have different structures, and I assume that even with processing you may not be able to generate identical metadata structures from each entry - I think I would go for one column of XML, plus maybe another column that holds a parsed data structure representing the result of your processing (basically a cache holding the post-processed version of each site). Hopefully that could be re-evaluated by whatever language (Python?) you are using for your application. That way you don't have to do the full parsing each time you want to examine the entry, but you have access to something that can quickly give you whatever metadata is associated with it, without tying you to the rigid structure of a table-based database.
Once you know what you are really doing with the data, then you could add additional metadata columns that are more rigid, and which can be queried directly in SQL as you identify patterns that are useful for performance.
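Something like this, roughly (all names invented, adjust to taste):

-- One row per feed item: the raw <item> XML plus a cached, post-processed version
CREATE TABLE feed_items (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source       text NOT NULL,                 -- which website/feed it came from
    raw_item     xml  NOT NULL,                 -- original <item>...</item>, reparseable later
    parsed       jsonb,                         -- cache of your Python processing result
    title        text,                          -- promote fields to real columns as they prove useful
    published_at timestamptz,
    fetched_at   timestamptz NOT NULL DEFAULT now()
);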
i am using the feedparser library in python https://github.com/kurtmckee/feedparser/ which basically takes an RSS url and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example look at how https://beincrypto.com/feed/ has a long description (containing actual HTML) inside but this website https://www.coindesk.com/arc/outboundfeeds/rss/ completely cuts the description out. I have about 50 such websites and they all have slight variations. So you are saying that in addition to storing parsed data (title, summary, content, author, pubdate, link, guid) that I currently store, I should also add an xml column and store the raw <item></item> from each url till I get a good hang of how each site differs?
Just shooting without knowing your real needs - take this with a grain of salt.
Store some parsed representation that makes it easier for you to work with (probably normalized). Keep an archive of raw data somewhere. That may be another column, table or even S3 bucket. Don't worry about schema changes but you need to not lose the original data. There are some pitfalls to schema migrations. But the schema should be the representation that works for you _at the moment_, otherwise it'll slow you down.
If you’re going to be parsing it anyways, and there’s the possibility of actually doing something with that parsed info beyond just reprinting it, then storing the post-parse results is probably better. Especially if you’re only going to need a reduced subset of the information and storage matters.
If you’re just reprinting — you’re parsing only for the sake of rendering logic — then storing the parse-result is probably just extra unnecessary work.
Also, if storage isn't a concern, then I like using the database as intermediate storage for the pipeline. Grab the RSS, stuff it in the DB as-is. Take it out of the DB, parse, store the parse results. Etc. You'll have to do this anyways if you're going to end up with a processing queue (and you can use PG as a simple queue: SELECT…FOR UPDATE), but it's nice to have if your pipeline is going to eventually change as well, and you're able to reprocess old items.
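For the queue part, a minimal sketch of the usual pattern (reusing the hypothetical feed_items table from the sketch above); SKIP LOCKED keeps workers from blocking each other:

BEGIN;

-- Claim one unprocessed item, skipping rows another worker already holds
SELECT id, raw_item
FROM feed_items
WHERE parsed IS NULL
ORDER BY fetched_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...parse in the application, then write the result back
-- ($1/$2 stand in for bind parameters from your app code)
UPDATE feed_items SET parsed = $1 WHERE id = $2;

COMMIT;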
> Store it in multiple columns after processing it and it may require different logic for different websites + changing your overall python processing logic requires a lot of thinking on how it might affect some source.
Don’t you need to deal with this problem regardless? Ideally you’ll find a common subset you actually care about and your app-logic will look like
website -> website handler -> extract data subset -> add to common structure -> render common structure
Even if you don’t use a database at all your app-code will need to figure out the data normalization
I'd say I tend to ignore the standard docs because they rarely have examples and rely on the arcane procedure of trying to decipher the super command options with all its "[OR THIS|THAT]".
I assume _someone_ can read this pseudo programming, but it's not me.
I've been bitten by those before because they are not generated from the actual syntax-parsing code, and thus are sometimes out of sync and wrong (or at least misleading).
FWIW, square brackets indicate an optional parameter and the pipe inside separates the available alternatives; but I agree, I don't particularly care for that notation and prefer actual running SQL examples.
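For example, the DROP TABLE synopsis reads `DROP TABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]`, which expands into concrete statements like these (table names are placeholders):

DROP TABLE widgets;
DROP TABLE IF EXISTS widgets;
DROP TABLE widgets, gadgets CASCADE;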
Railroad-esque diagrams can be weird but they say a lot in a very short space, I would highly recommend spending a little extra time working on thinking through them, they are everywhere!
Well, you should!
But, I didn't invent anything there, that's what they are named :)
Syntax diagrams (or railroad diagrams) are a way to represent a context-free grammar. They represent a graphical alternative to Backus–Naur form, EBNF, Augmented Backus–Naur form, and other text-based grammars as metalanguages...
https://en.wikipedia.org/wiki/Syntax_diagram
Sometimes I find it annoying but mostly it works well for me. I've come to embrace the find feature and visually scanning over any parenthetical stuff.
The alternative is they have to break it up into several quasi examples, each with their own optional modifiers.
Unrelated to that issue, the right hand side TOC does not display completely. I am using firefox on windows with a default zoom of 120%. The TOC ends up going below the screen and hiding the last few entries. It also floats above the footer and hides the text there.
If I may suggest, use `position: sticky` instead of `position: fixed` or some equivalent to whatever trick you used.
Watch out: for row/record values, IS NULL is true when all of the row's fields are NULL, and a row that mixes NULL and non-NULL fields tests false for both IS NULL and IS NOT NULL. If you want to know whether the row value itself is null, you want IS [NOT] DISTINCT FROM NULL, full stop.
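A quick demonstration of the difference:

SELECT ROW(1, NULL)    IS NULL;                -- false: fields are mixed
SELECT ROW(1, NULL)    IS NOT NULL;            -- also false!
SELECT ROW(NULL, NULL) IS NULL;                -- true, even though the row value exists
SELECT ROW(1, NULL)    IS DISTINCT FROM NULL;  -- true: the row itself is not null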
On the other hand, it's a mere blog post. You should not be bothered by TCP cost. But reliable data storage and the cost of reliability/restore (in the case of backups) is a concern.
It is mostly a blog post. A use case for a database that holds tables and rows is very rare in the real world. I know no one who uses the contacts app on their mobile phone. Access has been there with checkboxes, select inputs and everything for years. No one uses it.
From reading this, it's clear that the author never sat down and learned to use databases from the ground up. The author started using them and learned as he went, so his "tips" include things you'll be told in the first hour of any course.
This doesn't hold any value for someone who's been using databases for almost any length of time.
People have too much faith in LLMs at the moment. This might be able to give some insights, but if you're trying to LEARN something in detail, such as the pros and cons of a particular custom query on your unique-ish schema, the LLM will be prone to making stuff up by design - it wasn't trained on YOUR data.
It will give some good basic results and probably some functional queries, but it misses the mark with the finer points of optimization etc. If you don't know how to properly build the queries, go learn it properly with confidence that what you're learning is actually reliable.
This is similar to what I've done with my SaaS, where most of the logic lives in Postgres. I went through much of the logic and refactored it with GPT a while back and improved it significantly. It's soooo good with SQL. Like insanely good. And these days I'm using Claude for all things code, which is significantly better than GPT.
Agreed that LLMs are quite good at SQL (and relational algebra). However, ask yourself this: how many programming languages do you know to at least a comfortable level? If N > 1, why not add another (SQL)? It’s not a difficult language, which is in part why LLMs are so good at it – it’s easy to predict the next token, because there are so few of them.
I know SQL pretty well (enough to write complex procedures) and prefer it over doing biz logic in application code (yuck ORMs). But SQL is just such a wonky syntax that I appreciate any help that I can get.
“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.” - Fred Brooks
Yep, but if you think your database schema is "secret sauce", you're fooling yourself. You can almost always guess the database schema if you're given 10 minutes to play with the application.
Generally your schema is your code, which IS your IP - your data is often NOT your IP (and often might just be a list of facts, a thing which is not copyrightable)