This is why I’ll probably just always use a UUIDv7 primary key and a secondary UUIDv4 indexed external identifier… which is extremely close to how I tend to do things today (I’ve been using ULID and UUIDv4)
Lookup speed. Direct mappings are faster. If something needs an external identifier that can be looked up for URL/API queries and the like…
Internal sorts and index-on-write work well with ULID/UUIDv7, but those reveal potentially sensitive timestamp information, so when that's not appropriate a separate column can be used for…
External opaque identifiers, which are UUIDv4; if indexing speed becomes an issue, these can be switched to deferred indexing…
It’s a good balance and everything I’ve ever used supports this (I only needed a pretty strong representation wrapper for ULIDs, since PG doesn’t validate UUID bit-field structures, so I can just misuse UUID columns)
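A minimal sketch of that two-column scheme. The stdlib `uuid` module has historically lacked a UUIDv7 generator (it only appears in very recent Python versions), so this builds one by hand following the RFC 9562 layout; the `Row` dataclass is just an illustrative stand-in for a table row, not anyone's actual schema:

```python
import os
import time
import uuid
from dataclasses import dataclass, field

def uuid7() -> uuid.UUID:
    """RFC 9562 UUIDv7: 48-bit Unix-millisecond timestamp, then random bits."""
    value = (time.time_ns() // 1_000_000) << 80      # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")   # 80 random bits below it
    value = value & ~(0xF << 76) | (0x7 << 76)       # stamp version = 7
    value = value & ~(0x3 << 62) | (0x2 << 62)       # stamp variant = RFC
    return uuid.UUID(int=value)

@dataclass
class Row:
    pk: uuid.UUID = field(default_factory=uuid7)            # internal: sortable, index-friendly
    external_id: uuid.UUID = field(default_factory=uuid.uuid4)  # public: fully opaque
```

Because the timestamp occupies the most significant bits, two `uuid7()` values generated a few milliseconds apart compare in creation order, which is exactly what makes them index-friendly on write.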
It looks like Snowflake IDs have the same information-exposure issues as using UUIDv7 or ULID for a publicly visible identifier.
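The exposure is easy to demonstrate: both formats put a creation timestamp in the high bits, so anyone holding the ID can decode when the row was created. A sketch using the classic Twitter Snowflake layout (41-bit millisecond timestamp above 22 bits of worker/sequence, against Twitter's custom epoch) and the UUIDv7 layout:

```python
import datetime
import uuid

TWITTER_EPOCH_MS = 1288834974657  # Twitter Snowflake's custom epoch (2010-11-04)

def snowflake_created_at(snowflake_id: int) -> datetime.datetime:
    # Shift off the 22 worker/sequence bits; what remains is ms since the epoch.
    ms = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

def uuidv7_created_at(u: uuid.UUID) -> datetime.datetime:
    # The top 48 bits of a UUIDv7 are Unix milliseconds.
    ms = u.int >> 80
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)
```

Neither function needs any key or secret, which is the whole problem with exposing these IDs publicly.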
Purely random identifiers are the recommended primary key for plenty of databases, e.g. Spanner.
Random identifiers spread work evenly between shards of something sharded by keyspace. They also don't get 'hotspotting' on more recent records - recent records frequently get more than their fair share of updates and changes, and database query planners have no knowledge of that.
For large distributed systems like Spanner you want to avoid a hotspot on a single node, as that limits throughput.
However for a single-node system like PostgreSQL you want hot spots because they are cache friendly. (Locks aside)
Basically you want a hotspot that is as hot as a single node can manage, but no hotter. Even for things like Spanner you still want good cache hits (which is why they support table interleaving and other methods of clustering related data) but in general avoiding overloading single nodes is more important (as it hurts latency and efficiency but doesn't limit throughput like a hotspot would).
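The hotspot effect above can be simulated in a few lines. Under range sharding (contiguous slices of the keyspace per shard, roughly how Spanner splits by key range), time-ordered IDs all share the same timestamp prefix and so pile onto one shard, while random IDs spread evenly. This is a toy model with 4 shards, not Spanner's actual split logic:

```python
import random
from collections import Counter

def range_shard(key: int) -> int:
    """Range sharding over a 128-bit keyspace: top 2 bits pick one of 4 shards."""
    return key >> 126

now_ms = 1_700_000_000_000
# Time-ordered IDs (UUIDv7-style): 48-bit ms timestamp prefix, random suffix.
seq_ids = [(now_ms << 80) | random.getrandbits(80) for _ in range(1000)]
# Fully random 128-bit IDs (UUIDv4-style).
rnd_ids = [random.getrandbits(128) for _ in range(1000)]

seq_spread = Counter(map(range_shard, seq_ids))   # Counter({0: 1000}): one hot shard
rnd_spread = Counter(map(range_shard, rnd_ids))   # roughly 250 writes per shard
```

Note that hash sharding (`hash(id) % num_shards`) would spread the sequential IDs just as evenly, which is why time-ordered IDs don't break hash-sharded systems, only range-sharded ones.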
Spanner is also bespoke, and was probably designed with that in mind.
Anything with a clustering index - MySQL (with InnoDB), SQL Server - will absolutely hate the page splits from a random PK.
Postgres also doesn’t like it that much, for a variety of reasons. Tuples live in a heap, but the PK is still a B+tree (ish), so it still suffers from splits. Postgres also uses half of MySQL’s default page size (8 kB vs. InnoDB’s 16 kB), AND its MVCC implementation doesn’t lend itself to easy updates without rewriting the entire page.
Go run benchmarks with some decent scale (tens of millions of rows or more) on any DB you’d like between UUIDv4 and UUIDv7 as the PK, and see how it goes.
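Short of a real benchmark, the page-split behavior itself can be shown with a toy model: a B-tree leaf level as sorted fixed-capacity pages, with the rightmost-split fast path real B-trees use for append-heavy workloads. This is an illustration of the mechanism, not a stand-in for the benchmark suggested above:

```python
import bisect
import random

PAGE_CAPACITY = 64

def load(keys, window=1000):
    """Toy B-tree leaf level: sorted fixed-capacity pages. Overflow splits a
    page in the middle, except at the right edge, where (like real B-trees)
    we split at the insertion point so sequential loads pack pages full."""
    pages = [[]]
    boundaries = []   # first key of pages[1:], used to route each insert
    touched = []      # identity of the page each insert dirtied
    for k in keys:
        i = bisect.bisect_right(boundaries, k)
        page = pages[i]
        bisect.insort(page, k)
        touched.append(id(page))
        if len(page) > PAGE_CAPACITY:
            rightmost_append = i == len(pages) - 1 and page[-1] == k
            cut = len(page) - 1 if rightmost_append else len(page) // 2
            new_page = page[cut:]
            del page[cut:]
            pages.insert(i + 1, new_page)
            boundaries.insert(i, new_page[0])
    fill = len(keys) / (len(pages) * PAGE_CAPACITY)
    return fill, len(set(touched[-window:]))  # fill factor, pages dirtied recently

n = 100_000
seq_fill, seq_dirty = load(range(n))
rnd_fill, rnd_dirty = load(random.sample(range(n), n))
# Sequential keys pack pages nearly full and dirty only the rightmost few;
# random keys leave pages ~2/3 full and dirty pages all over the index.
```

The two numbers map directly onto the costs discussed above: lower fill factor means more pages (and more splits) for the same data, and dirtying pages across the whole index defeats the buffer cache and, in Postgres, inflates WAL via full-page writes.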
Right. In a way, these new sequential UUIDs treat the fact that recent records are often accessed more frequently as something that can be exploited, not as an issue that needs to be ironed out by randomness.
For tables this is not such a problem because of reasonably good locality (rows inserted close in time will end up close in the file too), but for indexes it's very different. In Postgres this is particularly painful for writes, because of the write amplification.
None of this really breaks sharding (IDs are still unique and can be hashed), so no new hotspots. It can't fix the "many rows with the same ID" issue, but neither can any other ID.
If an AES-128-encrypted value is an acceptable external UUID (and likely an acceptable internal one), then you might as well just stick with a faster RNG.
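For context, the encrypt-a-counter idea looks like this: run a sequential internal ID through a keyed 128-bit primitive so the external ID is opaque but deterministic. The stdlib has no AES, so keyed BLAKE2 stands in for the block cipher here, which is an assumption of this sketch (unlike AES, a hash-based PRF is not a permutation, so collisions are merely negligible rather than impossible):

```python
import hashlib
import secrets
import uuid

KEY = secrets.token_bytes(16)  # per-deployment secret

def opaque_id(counter: int) -> uuid.UUID:
    """Derive an opaque, random-looking 128-bit ID from a sequential counter.
    Keyed BLAKE2 stands in for AES-128 here (stdlib has no AES)."""
    digest = hashlib.blake2b(counter.to_bytes(8, "big"),
                             key=KEY, digest_size=16).digest()
    return uuid.UUID(bytes=digest)

def random_id() -> uuid.UUID:
    """The comparison point: just draw 128 random bits."""
    return uuid.UUID(bytes=secrets.token_bytes(16))
```

Both outputs are indistinguishable-from-random to an outside observer (and neither carries an RFC version stamp), which is the parent comment's point: if the encrypted counter is acceptable, the plain RNG output is too, and it skips a key to manage.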