Given that a UUID identifier fits in a single cipher block, and the whole point is that these are unique by construction (no IV needed so long as that holds true), it seems like a single round of ECB-mode AES-128 would enable quickly converting between internal/external identifiers.
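A minimal sketch of that idea, using only the stdlib: a keyed 4-round Feistel network built from HMAC-SHA256 stands in for AES-128-ECB (a real deployment would use an actual block cipher), and the key is a made-up placeholder.

```python
import hashlib
import hmac
import uuid

KEY = b"demo-secret-key"  # hypothetical key; in practice load from a secret store


def _round(half: int, i: int) -> int:
    # Keyed round function: HMAC-SHA256 truncated to 64 bits.
    msg = bytes([i]) + half.to_bytes(8, "big")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:8], "big")


def encrypt_id(internal: uuid.UUID) -> uuid.UUID:
    """Map an internal id to an opaque external id (stand-in for AES-128-ECB)."""
    l, r = internal.int >> 64, internal.int & ((1 << 64) - 1)
    for i in range(4):  # 4-round balanced Feistel network over the 128-bit block
        l, r = r, l ^ _round(r, i)
    return uuid.UUID(int=(l << 64) | r)


def decrypt_id(external: uuid.UUID) -> uuid.UUID:
    """Invert encrypt_id by running the rounds in reverse."""
    l, r = external.int >> 64, external.int & ((1 << 64) - 1)
    for i in reversed(range(4)):
        l, r = r ^ _round(l, i), l
    return uuid.UUID(int=(l << 64) | r)


internal = uuid.uuid4()          # stands in for a UUIDv7 primary key
external = encrypt_id(internal)  # opaque id handed out in URLs/APIs
assert decrypt_id(external) == internal
```

Because the mapping is a keyed permutation, no lookup table is needed in either direction; the tradeoff, as discussed below, is that the key becomes effectively permanent.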
I like the idea, but I think it's not possible to rotate the key with that approach, without introducing a breaking change. Eternal secrets are usually a very bad idea, because at some point they are going to be leaked.
If you were using something like UUIDv4, you wouldn't be exposing that information at all though, neither in cleartext nor ciphertext. It seems weird to say, "the user ID contains secret information so we encrypt it with an eternal fixed pre-shared key then share the ciphertext with the world", when you could've just said "the user ID contains no secret information".
It feels like the right solution here is to pick between: use UUIDv7 and treat the account creation time as public information, or use an identifier scheme which doesn't contain the account creation time.
That doesn't fix the info leak though, you're still leaking approximately when the account was created. Knowing whether an account was created in 2003 or 2023 is a pretty significant amount of information even if you just know that it was created some time between June 2003 and August 2003.
I mean it's certainly an improvement over telling everyone the millisecond the account was created, but if account creation times are to be considered non-public info, I would probably just not include any version of it in public facing user IDs. And if you do consider approximate account creation times to be public (such as HN, where anyone can see that my account was created July 1, 2015), then adding some fuzz to the timestamp seems to be a good way to avoid certain cryptographic issues.
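The fuzzing idea can be sketched by building the v7-style UUID by hand (stdlib `uuid` has no v7 constructor in older Pythons); the 30-day bucket is an arbitrary choice for illustration.

```python
import os
import time
import uuid


def fuzzy_uuid7(bucket_s: int = 30 * 24 * 3600) -> uuid.UUID:
    """UUIDv7-shaped id whose timestamp is truncated to a coarse bucket."""
    # Round the timestamp down to the bucket (~a month here), so the id only
    # reveals an approximate creation time rather than the exact millisecond.
    ms = (int(time.time()) // bucket_s) * bucket_s * 1000
    b = bytearray(os.urandom(16))
    b[0:6] = ms.to_bytes(6, "big")  # 48-bit millisecond timestamp field
    b[6] = (b[6] & 0x0F) | 0x70     # version 7
    b[8] = (b[8] & 0x3F) | 0x80     # RFC 4122 variant
    return uuid.UUID(bytes=bytes(b))
```

One consequence: ids minted within the same bucket no longer sort by creation order, so the index-locality benefit of UUIDv7 holds between buckets but not within one.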
With massive amounts of data this would negate the performance advantage completely: related data is no longer stored close together, and every time-sequential access is a cache miss.
An incrementing version number would once again leak time information. Even a non-incrementing version number would leak that kind of information: if you know the timestamp of another ID with the same version, you can narrow down when this one was created.
I think there is no good alternative to random external identifiers.
This is why I’ll probably just always use a UUIDv7 primary key and a secondary UUIDv4 indexed external identifier… which is extremely close to how I tend to do things today (I’ve been using ULID and UUIDv4)
Lookup speed. Direct mappings are faster. If something needs an external identifier that can be looked up for URL/API queries and such things…
Internal sorts and index on write with ULID/UUIDv7 which reveal potentially sensitive timestamp information, and so when that’s not appropriate a separate column can be used for…
External opaque identifiers which are UUIDv4, and if indexing speeds become an issue, can be switched to deferred indexing…
It’s a good balance and everything I’ve ever used supports this (I only needed a pretty strong representation wrapper for ULIDs, since PG doesn’t validate UUID bit-field structures, so I can just misuse UUID columns)
It looks like Snowflake has the same information-exposure issues as UUIDv7 or ULID do when used as a publicly visible identifier.
Purely random identifiers are the recommended primary key for plenty of databases, e.g. Spanner.
Random identifiers spread work evenly between shards of something sharded by keyspace. They also don't get 'hotspotting' on more recent records - recent records frequently get more than their fair share of updates and changes, and database query planners have no knowledge of that.
For large distributed systems like Spanner you want to avoid a hotspot on a single node, as that limits throughput.
However for a single-node system like PostgreSQL you want hot spots because they are cache friendly. (Locks aside)
Basically you want a hotspot that is as hot as a single node can manage, but no hotter. Even for things like Spanner you still want good cache hits (which is why they support table interleaving and other methods of clustering related data) but in general avoiding overloading single nodes is more important (as it hurts latency and efficiency but doesn't limit throughput like a hotspot would).
Spanner is also bespoke, and was probably designed with that in mind.
Anything with a clustering index - MySQL (with InnoDB), SQL Server - will absolutely hate the page splits from a random PK.
Postgres also doesn’t like it that much, for a variety of reasons. Tuples are in a heap, but the PK is still a B+tree (ish), so it still suffers from splits. They also use half the default page size of MySQL, AND their MVCC implementation doesn’t lend itself to easy updates without rewriting the entire page.
Go run benchmarks with some decent scale (tens of millions of rows or more) on any DB you’d like between UUIDv4 and UUIDv7 as the PK, and see how it goes.
Right. In a way, these new sequential UUIDs treat the fact that recent records are accessed more frequently as something that can be exploited, not as an issue that needs to be ironed out by randomness.
For tables this is not such a problem because of reasonably good locality (rows inserted close in time will end up close in the file too), but for indexes it's very different. In Postgres this is particularly painful for writes, because of the write amplification.
None of this really breaks sharding (IDs are still unique and can be hashed), so no new hotspots. It can't fix the "many rows with the same ID" issue, but neither can any other ID.
I’m afraid you won’t ever be able to rotate that key, will you? Since its result is used externally as an identifier, you would have to rotate the external identifiers, too.
Assuming you have a table where the identifiers are stored you'd have the internal one (UUIDv7) and the encrypted version from it (external id).
You could rotate encryption keys whenever you want for new external id calculation, so that older external ids won't change (as they are external, they need to stay immutable).
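A sketch of that stored-mapping approach (the key ring, names, and the HMAC-based derivation are illustrative; since the external id is persisted at creation time, the derivation never has to be inverted):

```python
import hashlib
import hmac
import uuid

KEYS = {"k1": b"old-key", "k2": b"new-key"}  # hypothetical key ring
CURRENT = "k2"  # rotate by minting future ids under a new key


def make_external(internal: uuid.UUID) -> uuid.UUID:
    # Keyed derivation (HMAC as a stand-in for encryption); deterministic
    # under a given key, but only ever computed once per row.
    digest = hmac.new(KEYS[CURRENT], internal.bytes, hashlib.sha256).digest()
    return uuid.UUID(bytes=digest[:16])


db = {}  # external -> internal, persisted when the row is created
internal = uuid.uuid4()  # stands in for a UUIDv7 primary key
external = make_external(internal)
db[external] = internal
# Rotating CURRENT later only affects ids minted from then on; existing rows
# keep the immutable external id they were created with.
```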
If you're storing the identifiers then you don't need encryption, you just generate a UUIDv4 and use that as the external identifier. And then we're back at where the blog post started.
That's an interesting idea, how would you deal with the bits in the UUID that are used for the version? Setting them to random bits may cause issues for clients that try to use the identifier in their own database or application, as mentioned in the article.
For your information, yes, you can [1]. For example, if you have a good enough 128-bit block cipher (e.g. AES-128-ECB), start with a 128-bit block where 6 specific bits are fixed and the rest are filled with the plaintext. Repeatedly encrypt the block until those 6 specific bits take the required values again (and do the same thing in reverse for decryption). This works because a good block cipher is also a good pseudorandom permutation, so it should have a small number of extremely long cycles (ideally just one), and each encryption lands on an allowed value with probability 2^-6 on average.
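A sketch of that cycle-walking idea, taking the six fixed bits to be UUIDv4's version nibble and variant bits. A 4-round HMAC-based Feistel permutation stands in for AES-128-ECB, and the key is a placeholder.

```python
import hashlib
import hmac
import uuid

KEY = b"demo-key"  # hypothetical; a real deployment would use AES-128-ECB


def _round(half: int, i: int) -> int:
    msg = bytes([i]) + half.to_bytes(8, "big")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:8], "big")


def perm(n: int) -> int:
    """Stand-in 128-bit block cipher (4-round Feistel)."""
    l, r = n >> 64, n & ((1 << 64) - 1)
    for i in range(4):
        l, r = r, l ^ _round(r, i)
    return (l << 64) | r


def perm_inv(n: int) -> int:
    l, r = n >> 64, n & ((1 << 64) - 1)
    for i in reversed(range(4)):
        l, r = r ^ _round(l, i), l
    return (l << 64) | r


def is_v4_shaped(n: int) -> bool:
    # The 6 fixed bits: version nibble == 0100, variant bits == 10.
    return (n >> 76) & 0xF == 4 and (n >> 62) & 0x3 == 0b10


def cw_encrypt(u: uuid.UUID) -> uuid.UUID:
    n = perm(u.int)
    while not is_v4_shaped(n):  # expected ~64 iterations (2**-6 hit rate)
        n = perm(n)
    return uuid.UUID(int=n)


def cw_decrypt(u: uuid.UUID) -> uuid.UUID:
    n = perm_inv(u.int)
    while not is_v4_shaped(n):
        n = perm_inv(n)
    return uuid.UUID(int=n)
```

The input must itself be v4-shaped (e.g. a real UUIDv4... or a UUIDv7 with its version nibble rewritten), since decryption walks backwards and stops at the first v4-shaped value it meets.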
Cycle-walking FPE has a (nearly) unbounded worst-case latency.
Breaking the 122 bits into two 61-bit halves and using AES as the round function for a Feistel cipher gives you a constant 3-encryption latency instead of the expected 64-encryption average latency of the cycle-walking format preserving encryption.
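That construction can be sketched as follows; HMAC-SHA256 truncated and masked to 61 bits stands in for the AES round function, the key is a placeholder, and the three rounds match the comment's constant-work claim.

```python
import hashlib
import hmac
import uuid

KEY = b"demo-key"  # hypothetical; the original idea uses AES as the round function
M61 = (1 << 61) - 1


def _round(half: int, i: int) -> int:
    # Keyed round function truncated/masked to 61 bits.
    msg = bytes([i]) + half.to_bytes(8, "big")
    digest = hmac.new(KEY, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & M61


def _split(n: int):
    # Pull out the 122 free bits; keep the version nibble and variant bits aside.
    hi48 = n >> 80
    ver = (n >> 76) & 0xF
    mid12 = (n >> 64) & 0xFFF
    var = (n >> 62) & 0x3
    low62 = n & ((1 << 62) - 1)
    return (hi48 << 74) | (mid12 << 62) | low62, ver, var


def _join(p: int, ver: int, var: int) -> int:
    hi48 = p >> 74
    mid12 = (p >> 62) & 0xFFF
    low62 = p & ((1 << 62) - 1)
    return (hi48 << 80) | (ver << 76) | (mid12 << 64) | (var << 62) | low62


def encrypt_id(u: uuid.UUID) -> uuid.UUID:
    p, ver, var = _split(u.int)
    l, r = p >> 61, p & M61
    for i in range(3):  # three rounds: constant work, no cycle walking
        l, r = r, l ^ _round(r, i)
    return uuid.UUID(int=_join((l << 61) | r, ver, var))


def decrypt_id(u: uuid.UUID) -> uuid.UUID:
    p, ver, var = _split(u.int)
    l, r = p >> 61, p & M61
    for i in reversed(range(3)):
        l, r = r ^ _round(l, i), l
    return uuid.UUID(int=_join((l << 61) | r, ver, var))
```

Because only the 122 free bits are permuted, the version and variant bits pass through untouched, so the output is still a structurally valid UUID.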
If there's any way for the client to influence the input, it may be prone to DoS attacks: By my calculations, with a million random attempts, you would expect to find a cycle of length at least 435, which is over 13x the average. (Mind you, multiplying the number of attempts by 10 only adds about 72.5 to the expected cycle length, and probably no one has the patience to try more than 100 billion or so attempts.)
The properties of the permutation are dependent upon the encryption key, so a client being able to select malicious inputs to get long cycles implies either that the client knows the AES key, or that the client has broken AES.
In any case, as I mentioned in a sibling comment, with 3 AES encryptions one can construct a 122-bit balanced Feistel cipher with a constant amount of work.
I'm actually working on encrypting database keys like this, and I opted for a string ID with "url safe" base64. It avoids the ambiguity of looking like a UUID when it's not, and I prefer "o3Ru98R3Qw-_x2MdiEEdSQ" in a URL over "a3746ef7-c477-430f-bfc7-631d88411d49". (Not that either one is very beautiful)
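That encoding is a one-liner each way with the stdlib; 16 bytes become 22 url-safe characters once the two padding characters are stripped.

```python
import base64
import uuid


def to_external(u: uuid.UUID) -> str:
    # 16 bytes -> 22 url-safe base64 characters, "=" padding stripped.
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode()


def from_external(s: str) -> uuid.UUID:
    # Re-add the two padding characters before decoding.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))


u = uuid.uuid4()
s = to_external(u)
assert len(s) == 22 and from_external(s) == u
```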
And you have no problem with the same data encrypting to the same, identifiable ciphertext (no salt)? I see now why it might be useful in this case, though I still don’t like the idea; it feels bad somehow. (Yeah, I get that you save storage for some computation)
Encrypting an internal id with ECB into an external id still allows comparing two ids for equality, to determine whether they are the same or not, but apart from that it removes all the information contained in the structure of a UUID.
Yep, I have used an approach just like that; it worked quite well since you have a strong pattern to easily translate from one to the other. It gives you an id with the right properties for internal use (efficient indexing etc.), and in its encrypted form it gives you the properties you want from an external identifier (being unpredictable etc.), all from one source id.
It is true that your encryption key is now very long-lived and effectively part of your public interface, but depending on your situation that could be an acceptable tradeoff, and there are quite a few pragmatic reasons why that might be true, as other comments have described.
Edit: you can even do 64bit snowflakes internally to 128bit AES encrypted externally, doesn’t have to be 128-128 obvs
True, you could rotate by persisting the old value and complicating your lookup/join process; not my idea of an acceptable solution, but yep, totally possible and worth it for some set of tradeoffs.
One key per token would require a table matching internal tokens to their key for forward conversion, and another table matching external keys to their key for reversing.
Might as well just use randomly generated external keys and have one table if you were doing that.
128 bits -> 128 bits