Given that a UUID identifier fits in a single cipher block, and the whole point is that these are unique by construction (no IV needed so long as that holds true), it seems like a single round of ECB-mode AES-128 would enable quickly converting between internal/external identifiers.
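A minimal sketch of that idea, using only the stdlib: a keyed 4-round Feistel network built from HMAC-SHA256 stands in for AES-128-ECB (a real deployment would use an actual block cipher), and the key is a made-up placeholder.

```python
import hashlib
import hmac
import uuid

KEY = b"demo-secret-key"  # hypothetical key; in practice load from a secret store


def _round(half: int, i: int) -> int:
    # Keyed round function: HMAC-SHA256 truncated to 64 bits.
    msg = bytes([i]) + half.to_bytes(8, "big")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:8], "big")


def encrypt_id(internal: uuid.UUID) -> uuid.UUID:
    """Map an internal id to an opaque external id (stand-in for AES-128-ECB)."""
    l, r = internal.int >> 64, internal.int & ((1 << 64) - 1)
    for i in range(4):  # 4-round balanced Feistel network over the 128-bit block
        l, r = r, l ^ _round(r, i)
    return uuid.UUID(int=(l << 64) | r)


def decrypt_id(external: uuid.UUID) -> uuid.UUID:
    """Invert encrypt_id by running the rounds in reverse."""
    l, r = external.int >> 64, external.int & ((1 << 64) - 1)
    for i in reversed(range(4)):
        l, r = r ^ _round(l, i), l
    return uuid.UUID(int=(l << 64) | r)


internal = uuid.uuid4()          # stands in for a UUIDv7 primary key
external = encrypt_id(internal)  # opaque id handed out in URLs/APIs
assert decrypt_id(external) == internal
```

Because the mapping is a keyed permutation, no lookup table is needed in either direction; the tradeoff, as discussed below, is that the key becomes effectively permanent.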
I like the idea, but I think it's not possible to rotate the key with that approach, without introducing a breaking change. Eternal secrets are usually a very bad idea, because at some point they are going to be leaked.
If you were using something like UUIDv4, you wouldn't be exposing that information at all though, neither in cleartext nor ciphertext. It seems weird to say, "the user ID contains secret information so we encrypt it with an eternal fixed pre-shared key then share the ciphertext with the world", when you could've just said "the user ID contains no secret information".
It feels like the right solution here is to pick between: use UUIDv7 and treat the account creation time as public information, or use an identifier scheme which doesn't contain the account creation time.
That doesn't fix the info leak though, you're still leaking approximately when the account was created. Knowing whether an account was created in 2003 or 2023 is a pretty significant amount of information even if you just know that it was created some time between June 2003 and August 2003.
I mean it's certainly an improvement over telling everyone the millisecond the account was created, but if account creation times are to be considered non-public info, I would probably just not include any version of it in public facing user IDs. And if you do consider approximate account creation times to be public (such as HN, where anyone can see that my account was created July 1, 2015), then adding some fuzz to the timestamp seems to be a good way to avoid certain cryptographic issues.
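The fuzzing idea can be sketched by building the v7-style UUID by hand (stdlib `uuid` has no v7 constructor in older Pythons); the 30-day bucket is an arbitrary choice for illustration.

```python
import os
import time
import uuid


def fuzzy_uuid7(bucket_s: int = 30 * 24 * 3600) -> uuid.UUID:
    """UUIDv7-shaped id whose timestamp is truncated to a coarse bucket."""
    # Round the timestamp down to the bucket (~a month here), so the id only
    # reveals an approximate creation time rather than the exact millisecond.
    ms = (int(time.time()) // bucket_s) * bucket_s * 1000
    b = bytearray(os.urandom(16))
    b[0:6] = ms.to_bytes(6, "big")  # 48-bit millisecond timestamp field
    b[6] = (b[6] & 0x0F) | 0x70     # version 7
    b[8] = (b[8] & 0x3F) | 0x80     # RFC 4122 variant
    return uuid.UUID(bytes=bytes(b))
```

One consequence: ids minted within the same bucket no longer sort by creation order, so the index-locality benefit of UUIDv7 holds between buckets but not within one.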
With massive amounts of data this would negate the performance advantage completely: related data is no longer stored close together, and every time-sequential access is a cache miss.
An incrementing version number would once again leak time information. Even a non-incrementing version number would leak that kind of information: if you know the timestamp of another ID with the same version, you can narrow down when this one was created.
I think there is no good alternative to random external identifiers.
This is why I’ll probably just always use a UUIDv7 primary key and a secondary UUIDv4 indexed external identifier… which is extremely close to how I tend to do things today (I’ve been using ULID and UUIDv4)
Lookup speed. Direct mappings are faster. If something needs an external identifier that can be looked up for URL/API queries and such things…
Internal sorts and index on write with ULID/UUIDv7 which reveal potentially sensitive timestamp information, and so when that’s not appropriate a separate column can be used for…
External opaque identifiers which are UUIDv4, and if indexing speeds become an issue, can be switched to deferred indexing…
It’s a good balance and everything I’ve ever used supports this (I only needed a pretty strong representation wrapper for ULIDs, since PG doesn’t validate UUID bit-field structures, so I can just misuse UUID columns)
It looks like Snowflake has the same information-exposure issues as UUIDv7 or ULID do when used as a publicly visible identifier.
Purely random identifiers are the recommended primary key for plenty of databases, e.g. Spanner.
Random identifiers spread work evenly between shards of something sharded by keyspace. They also don't get 'hotspotting' on more recent records - recent records frequently get more than their fair share of updates and changes, and database query planners have no knowledge of that.
For large distributed systems like Spanner you want to avoid a hotspot on a single node, as that limits throughput.
However for a single-node system like PostgreSQL you want hot spots because they are cache friendly. (Locks aside)
Basically you want a hotspot that is as hot as a single node can manage, but no hotter. Even for things like Spanner you still want good cache hits (which is why they support table interleaving and other methods of clustering related data) but in general avoiding overloading single nodes is more important (as it hurts latency and efficiency but doesn't limit throughput like a hotspot would).
Spanner is also bespoke, and was probably designed with that in mind.
Anything with a clustering index - MySQL (with InnoDB), SQL Server - will absolutely hate the page splits from a random PK.
Postgres also doesn’t like it that much, for a variety of reasons. Tuples are in a heap, but the PK is still a B+tree (ish), so it still suffers from splits. They also use half the default page size of MySQL, AND their MVCC implementation doesn’t lend itself to easy updates without rewriting the entire page.
Go run benchmarks with some decent scale (tens of millions of rows or more) on any DB you’d like between UUIDv4 and UUIDv7 as the PK, and see how it goes.
Right. In a way, these new sequential UUIDs treat the fact that recent records are accessed more frequently as something that can be exploited, not as an issue that needs to be ironed out by randomness.
For tables this is not such a problem because of reasonably good locality (rows inserted close in time will end up close in the file too), but for indexes it's very different. In Postgres this is particularly painful for writes, because of the write amplification.
None of this really breaks sharding (IDs are still unique and can be hashed), so no new hotspots. It can't fix the "many rows with the same ID" issue, but neither can any other ID.
I’m afraid you won’t ever be able to rotate that key, will you? Since its result is used externally as an identifier, you would have to rotate the external identifiers, too.
Assuming you have a table where the identifiers are stored you'd have the internal one (UUIDv7) and the encrypted version from it (external id).
You could rotate encryption keys whenever you want for new external id calculation, so that older external ids won't change (as they are external, they need to stay immutable).
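A sketch of that stored-mapping approach (the key ring, names, and the HMAC-based derivation are illustrative; since the external id is persisted at creation time, the derivation never has to be inverted):

```python
import hashlib
import hmac
import uuid

KEYS = {"k1": b"old-key", "k2": b"new-key"}  # hypothetical key ring
CURRENT = "k2"  # rotate by minting future ids under a new key


def make_external(internal: uuid.UUID) -> uuid.UUID:
    # Keyed derivation (HMAC as a stand-in for encryption); deterministic
    # under a given key, but only ever computed once per row.
    digest = hmac.new(KEYS[CURRENT], internal.bytes, hashlib.sha256).digest()
    return uuid.UUID(bytes=digest[:16])


db = {}  # external -> internal, persisted when the row is created
internal = uuid.uuid4()  # stands in for a UUIDv7 primary key
external = make_external(internal)
db[external] = internal
# Rotating CURRENT later only affects ids minted from then on; existing rows
# keep the immutable external id they were created with.
```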
If you're storing the identifiers then you don't need encryption, you just generate a UUIDv4 and use that as the external identifier. And then we're back at where the blog post started.
That's an interesting idea, how would you deal with the bits in the UUID that are used for the version? Setting them to random bits may cause issues for clients that try to use the identifier in their own database or application, as mentioned in the article.
For your information, yes, you can [1]. For example, if you have a good enough 128-bit block cipher (e.g. AES-128-ECB), start with a 128-bit block where 6 specific bits are fixed and the rest are filled with the plaintext. Repeatedly encrypt the block until those 6 specific bits take the required values again (and do the same thing in reverse for decryption). This works because a good block cipher is also a good pseudorandom permutation, so it should have a small number of extremely long cycles (ideally just one), and each encryption lands on an allowed value with probability 2^-6 on average.
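A sketch of that cycle-walking idea, taking the six fixed bits to be UUIDv4's version nibble and variant bits. A 4-round HMAC-based Feistel permutation stands in for AES-128-ECB, and the key is a placeholder.

```python
import hashlib
import hmac
import uuid

KEY = b"demo-key"  # hypothetical; a real deployment would use AES-128-ECB


def _round(half: int, i: int) -> int:
    msg = bytes([i]) + half.to_bytes(8, "big")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:8], "big")


def perm(n: int) -> int:
    """Stand-in 128-bit block cipher (4-round Feistel)."""
    l, r = n >> 64, n & ((1 << 64) - 1)
    for i in range(4):
        l, r = r, l ^ _round(r, i)
    return (l << 64) | r


def perm_inv(n: int) -> int:
    l, r = n >> 64, n & ((1 << 64) - 1)
    for i in reversed(range(4)):
        l, r = r ^ _round(l, i), l
    return (l << 64) | r


def is_v4_shaped(n: int) -> bool:
    # The 6 fixed bits: version nibble == 0100, variant bits == 10.
    return (n >> 76) & 0xF == 4 and (n >> 62) & 0x3 == 0b10


def cw_encrypt(u: uuid.UUID) -> uuid.UUID:
    n = perm(u.int)
    while not is_v4_shaped(n):  # expected ~64 iterations (2**-6 hit rate)
        n = perm(n)
    return uuid.UUID(int=n)


def cw_decrypt(u: uuid.UUID) -> uuid.UUID:
    n = perm_inv(u.int)
    while not is_v4_shaped(n):
        n = perm_inv(n)
    return uuid.UUID(int=n)
```

The input must itself be v4-shaped (e.g. a real UUIDv4... or a UUIDv7 with its version nibble rewritten), since decryption walks backwards and stops at the first v4-shaped value it meets.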
Cycle-walking FPE has a (nearly) unbounded worst-case latency.
Breaking the 122 bits into two 61-bit halves and using AES as the round function for a Feistel cipher gives you a constant 3-encryption latency instead of the expected 64-encryption average latency of the cycle-walking format preserving encryption.
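That construction can be sketched as follows; HMAC-SHA256 truncated and masked to 61 bits stands in for the AES round function, the key is a placeholder, and the three rounds match the comment's constant-work claim.

```python
import hashlib
import hmac
import uuid

KEY = b"demo-key"  # hypothetical; the original idea uses AES as the round function
M61 = (1 << 61) - 1


def _round(half: int, i: int) -> int:
    # Keyed round function truncated/masked to 61 bits.
    msg = bytes([i]) + half.to_bytes(8, "big")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & M61


def _split(n: int):
    # Pull out the 122 free bits; keep the version nibble and variant bits aside.
    hi48 = n >> 80
    ver = (n >> 76) & 0xF
    mid12 = (n >> 64) & 0xFFF
    var = (n >> 62) & 0x3
    low62 = n & ((1 << 62) - 1)
    return (hi48 << 74) | (mid12 << 62) | low62, ver, var


def _join(p: int, ver: int, var: int) -> int:
    hi48 = p >> 74
    mid12 = (p >> 62) & 0xFFF
    low62 = p & ((1 << 62) - 1)
    return (hi48 << 80) | (ver << 76) | (mid12 << 64) | (var << 62) | low62


def encrypt_id(u: uuid.UUID) -> uuid.UUID:
    p, ver, var = _split(u.int)
    l, r = p >> 61, p & M61
    for i in range(3):  # three rounds: constant work, no cycle walking
        l, r = r, l ^ _round(r, i)
    return uuid.UUID(int=_join((l << 61) | r, ver, var))


def decrypt_id(u: uuid.UUID) -> uuid.UUID:
    p, ver, var = _split(u.int)
    l, r = p >> 61, p & M61
    for i in reversed(range(3)):
        l, r = r ^ _round(l, i), l
    return uuid.UUID(int=_join((l << 61) | r, ver, var))
```

Because only the 122 free bits are permuted, the version and variant bits pass through untouched, so the output is still a structurally valid UUID.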
If there's any way for the client to influence the input, it may be prone to DoS attacks: By my calculations, with a million random attempts, you would expect to find a cycle of length at least 435, which is over 13x the average. (Mind you, multiplying the number of attempts by 10 only adds about 72.5 to the expected cycle length, and probably no one has the patience to try more than 100 billion or so attempts.)
The properties of the permutation are dependent upon the encryption key, so a client being able to select malicious inputs to get long cycles implies either that the client knows the AES key, or that the client has broken AES.
In any case, as I mentioned in a sibling comment, with 3 AES encryptions one can construct a 122-bit balanced Feistel cipher with a constant amount of work.
I'm actually working on encrypting database keys like this, and I opted for a string ID with "url safe" base64. It avoids the ambiguity of looking like a UUID when it's not, and I prefer "o3Ru98R3Qw-_x2MdiEEdSQ" in a URL over "a3746ef7-c477-430f-bfc7-631d88411d49". (Not that either one is very beautiful)
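That encoding is a one-liner each way with the stdlib; 16 bytes become 22 url-safe characters once the two padding characters are stripped.

```python
import base64
import uuid


def to_external(u: uuid.UUID) -> str:
    # 16 bytes -> 22 url-safe base64 characters, "=" padding stripped.
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode()


def from_external(s: str) -> uuid.UUID:
    # Re-add the two padding characters before decoding.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))


u = uuid.uuid4()
s = to_external(u)
assert len(s) == 22 and from_external(s) == u
```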
And you have no problem with the same data encrypting to the same, identifiable ciphertext (no salt)? I see now why it might be useful in this case, though I still don’t like the idea; it feels bad somehow. (Yeah, I get that you save storage for some computation)
Encrypting an internal id with ECB into an external id still allows comparing two ids for equality, to determine whether they are the same or not, but apart from that it removes all the information contained in the structure of a UUID.
Yep, I have used an approach just like that; it worked quite well since you have a strong pattern to easily translate from one to the other. It gives you an id with the right properties for internal use (efficient indexing etc.), and in its encrypted form it gives you the properties you want from an external identifier (being unpredictable etc.), all from one source id.
It is true that your encryption key is now very long-lived and effectively part of your public interface, but depending on your situation that could be an acceptable tradeoff, and there are quite a few pragmatic reasons why that might be true, as other comments have described.
Edit: you can even do 64bit snowflakes internally to 128bit AES encrypted externally, doesn’t have to be 128-128 obvs
True, you could rotate by persisting the old value and complicating your lookup/join process; not my idea of an acceptable solution, but yep, totally possible and worth it for some set of tradeoffs.
One key per token would require a table matching internal tokens to their key for forward conversion, and another table matching external keys to their key for reversing.
Might as well just use randomly generated external keys and have one table if you were doing that.
128 bits -> 128 bits