This is why I’ll probably just always use a UUIDv7 primary key and a secondary UUIDv4 indexed external identifier… which is extremely close to how I tend to do things today (I’ve been using ULID and UUIDv4)
Lookup speed. Direct mappings are faster. If something needs an external identifier that can be looked up for URL/API queries and the like…
Internal sorts and index-on-write work well with ULID/UUIDv7, but those reveal potentially sensitive timestamp information, so when that's not appropriate a separate column can be used for…
External opaque identifiers, which are UUIDv4; if indexing speed becomes an issue, these can be switched to deferred indexing…
It’s a good balance and everything I’ve ever used supports this (I only needed a pretty strong representation wrapper for ULIDs, since PG doesn’t validate UUID bit-field structures, so I can just misuse UUID columns)
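A minimal sketch of that two-column scheme. The stdlib `uuid` module has historically lacked a UUIDv7 generator (it only appears in very recent Python versions), so this builds one by hand following the RFC 9562 layout; the `Row` dataclass is just an illustrative stand-in for a table row, not anyone's actual schema:

```python
import os
import time
import uuid
from dataclasses import dataclass, field

def uuid7() -> uuid.UUID:
    """RFC 9562 UUIDv7: 48-bit Unix-millisecond timestamp, then random bits."""
    value = (time.time_ns() // 1_000_000) << 80      # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")   # 80 random bits below it
    value = value & ~(0xF << 76) | (0x7 << 76)       # stamp version = 7
    value = value & ~(0x3 << 62) | (0x2 << 62)       # stamp variant = RFC
    return uuid.UUID(int=value)

@dataclass
class Row:
    pk: uuid.UUID = field(default_factory=uuid7)            # internal: sortable, index-friendly
    external_id: uuid.UUID = field(default_factory=uuid.uuid4)  # public: fully opaque
```

Because the timestamp occupies the most significant bits, two `uuid7()` values generated a few milliseconds apart compare in creation order, which is exactly what makes them index-friendly on write.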
It looks like Snowflake IDs have the same information-exposure issues as using UUIDv7 or ULID for a publicly visible identifier.
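The exposure is easy to demonstrate: both formats put a creation timestamp in the high bits, so anyone holding the ID can decode when the row was created. A sketch using the classic Twitter Snowflake layout (41-bit millisecond timestamp above 22 bits of worker/sequence, against Twitter's custom epoch) and the UUIDv7 layout:

```python
import datetime
import uuid

TWITTER_EPOCH_MS = 1288834974657  # Twitter Snowflake's custom epoch (2010-11-04)

def snowflake_created_at(snowflake_id: int) -> datetime.datetime:
    # Shift off the 22 worker/sequence bits; what remains is ms since the epoch.
    ms = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

def uuidv7_created_at(u: uuid.UUID) -> datetime.datetime:
    # The top 48 bits of a UUIDv7 are Unix milliseconds.
    ms = u.int >> 80
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)
```

Neither function needs any key or secret, which is the whole problem with exposing these IDs publicly.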
Purely random identifiers are the recommended primary key for plenty of databases, e.g. Spanner.
Random identifiers spread work evenly between shards of something sharded by keyspace. They also don't get 'hotspotting' on more recent records - recent records frequently get more than their fair share of updates and changes, and database query planners have no knowledge of that.
For large distributed systems like Spanner you want to avoid a hotspot on a single node, as that limits throughput.
However for a single-node system like PostgreSQL you want hot spots because they are cache friendly. (Locks aside)
Basically you want a hotspot that is as hot as a single node can manage, but no hotter. Even for things like Spanner you still want good cache hits (which is why they support table interleaving and other methods of clustering related data) but in general avoiding overloading single nodes is more important (as it hurts latency and efficiency but doesn't limit throughput like a hotspot would).
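The hotspot effect above can be simulated in a few lines. Under range sharding (contiguous slices of the keyspace per shard, roughly how Spanner splits by key range), time-ordered IDs all share the same timestamp prefix and so pile onto one shard, while random IDs spread evenly. This is a toy model with 4 shards, not Spanner's actual split logic:

```python
import random
from collections import Counter

def range_shard(key: int) -> int:
    """Range sharding over a 128-bit keyspace: top 2 bits pick one of 4 shards."""
    return key >> 126

now_ms = 1_700_000_000_000
# Time-ordered IDs (UUIDv7-style): 48-bit ms timestamp prefix, random suffix.
seq_ids = [(now_ms << 80) | random.getrandbits(80) for _ in range(1000)]
# Fully random 128-bit IDs (UUIDv4-style).
rnd_ids = [random.getrandbits(128) for _ in range(1000)]

seq_spread = Counter(map(range_shard, seq_ids))   # Counter({0: 1000}): one hot shard
rnd_spread = Counter(map(range_shard, rnd_ids))   # roughly 250 writes per shard
```

Note that hash sharding (`hash(id) % num_shards`) would spread the sequential IDs just as evenly, which is why time-ordered IDs don't break hash-sharded systems, only range-sharded ones.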
Spanner is also bespoke, and was probably designed with that in mind.
Anything with a clustering index - MySQL (with InnoDB), SQL Server - will absolutely hate the page splits from a random PK.
Postgres also doesn’t like it that much, for a variety of reasons. Tuples live in a heap, but the PK is still a B+tree (ish), so it still suffers from splits. Postgres also uses half of MySQL’s default page size (8 kB vs. InnoDB’s 16 kB), AND its MVCC implementation doesn’t lend itself to easy updates without rewriting the entire page.
Go run benchmarks with some decent scale (tens of millions of rows or more) on any DB you’d like between UUIDv4 and UUIDv7 as the PK, and see how it goes.
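Short of a real benchmark, the page-split behavior itself can be shown with a toy model: a B-tree leaf level as sorted fixed-capacity pages, with the rightmost-split fast path real B-trees use for append-heavy workloads. This is an illustration of the mechanism, not a stand-in for the benchmark suggested above:

```python
import bisect
import random

PAGE_CAPACITY = 64

def load(keys, window=1000):
    """Toy B-tree leaf level: sorted fixed-capacity pages. Overflow splits a
    page in the middle, except at the right edge, where (like real B-trees)
    we split at the insertion point so sequential loads pack pages full."""
    pages = [[]]
    boundaries = []   # first key of pages[1:], used to route each insert
    touched = []      # identity of the page each insert dirtied
    for k in keys:
        i = bisect.bisect_right(boundaries, k)
        page = pages[i]
        bisect.insort(page, k)
        touched.append(id(page))
        if len(page) > PAGE_CAPACITY:
            rightmost_append = i == len(pages) - 1 and page[-1] == k
            cut = len(page) - 1 if rightmost_append else len(page) // 2
            new_page = page[cut:]
            del page[cut:]
            pages.insert(i + 1, new_page)
            boundaries.insert(i, new_page[0])
    fill = len(keys) / (len(pages) * PAGE_CAPACITY)
    return fill, len(set(touched[-window:]))  # fill factor, pages dirtied recently

n = 100_000
seq_fill, seq_dirty = load(range(n))
rnd_fill, rnd_dirty = load(random.sample(range(n), n))
# Sequential keys pack pages nearly full and dirty only the rightmost few;
# random keys leave pages ~2/3 full and dirty pages all over the index.
```

The two numbers map directly onto the costs discussed above: lower fill factor means more pages (and more splits) for the same data, and dirtying pages across the whole index defeats the buffer cache and, in Postgres, inflates WAL via full-page writes.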
Right. In a way, these new sequential UUIDs treat the fact that recent records are often accessed more frequently as something that can be exploited, not as an issue that needs to be ironed out by randomness.
For tables this is not such a problem because of reasonably good locality (rows inserted close in time will end up close in the file too), but for indexes it's very different. In Postgres this is particularly painful for writes, because of the write amplification.
None of this really breaks sharding (IDs are still unique and can be hashed), so no new hotspots. It can't fix the "many rows with the same ID" issue, but neither can any other ID.
If an AES-128-encrypted value is an acceptable external UUID (and likely an acceptable internal one), then you might as well just stick with a faster RNG.
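For context, the encrypt-a-counter idea looks like this: run a sequential internal ID through a keyed 128-bit primitive so the external ID is opaque but deterministic. The stdlib has no AES, so keyed BLAKE2 stands in for the block cipher here, which is an assumption of this sketch (unlike AES, a hash-based PRF is not a permutation, so collisions are merely negligible rather than impossible):

```python
import hashlib
import secrets
import uuid

KEY = secrets.token_bytes(16)  # per-deployment secret

def opaque_id(counter: int) -> uuid.UUID:
    """Derive an opaque, random-looking 128-bit ID from a sequential counter.
    Keyed BLAKE2 stands in for AES-128 here (stdlib has no AES)."""
    digest = hashlib.blake2b(counter.to_bytes(8, "big"),
                             key=KEY, digest_size=16).digest()
    return uuid.UUID(bytes=digest)

def random_id() -> uuid.UUID:
    """The comparison point: just draw 128 random bits."""
    return uuid.UUID(bytes=secrets.token_bytes(16))
```

Both outputs are indistinguishable-from-random to an outside observer (and neither carries an RFC version stamp), which is the parent comment's point: if the encrypted counter is acceptable, the plain RNG output is too, and it skips a key to manage.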