Goodbye integers, hello UUIDv7

jonhohle · 2023-10-02T03:42:25

This is great for internal distributed systems where having ordered keys is useful, however, it should probably be noted that these probably shouldn't be used as public identifiers (even though this will probably be the defacto standard and used publicly without thought).

Having any information, specifically time information, leaking from your systems may or may not have unanticipated security or business implications. (e.g. knowing when session tokens or accounts are created).

oconnore · 2023-10-02T04:33:40

Given that a UUID identifier fits in a single cipher block, and the whole point is that these are unique by construction (no IV needed so long as that holds true), it seems like a single round of ECB-mode AES-128 would enable quickly converting between internal/external identifiers.

128 bits -> 128 bits

andix · 2023-10-02T09:27:42

I like the idea, but I think it's not possible to rotate the key with that approach, without introducing a breaking change. Eternal secrets are usually a very bad idea, because at some point they are going to be leaked.

londons_explore · 2023-10-02T10:14:10

But the private data you are protecting (the user's account creation time) has the same properties as an eternal secret.

Therefore there doesn't seem to be much downside in this specific case.

mort96 · 2023-10-02T11:54:20

If you were using something like UUIDv4, you wouldn't be exposing that information at all though, neither in cleartext or ciphertext. It seems weird to say, "the user ID contains secret information so we encrypt it with an eternal fixed pre-shared key then share the ciphertext with the world", when you could've just said "the user ID contains no secret information".

It feels like the right solution here is to pick between: use UUIDv7 and treat the account creation time as public information, or use an identifier scheme which doesn't contain the account creation time.

ilyt · 2023-10-02T12:39:42

You could just randomize the timestamp. adding +/- month or two to the UUIDv7 won't break the advantages all that much.

mort96 · 2023-10-02T12:59:06

That doesn't fix the info leak though, you're still leaking approximately when the account was created. Knowing whether an account was created in 2003 or 2023 is a pretty significant amount of information even if you just know that it was created some time between June 2003 and August 2003.

I mean it's certainly an improvement over telling everyone the millisecond the account was created, but if account creation times are to be considered non-public info, I would probably just not include any version of it in public facing user IDs. And if you do consider approximate account creation times to be public (such as HN, where anyone can see that my account was created July 1, 2015), then adding some fuzz to the timestamp seems to be a good way to avoid certain cryptographic issues.

The_Colonel · 2023-10-02T14:36:40

With massive amount of data this would negate the performance advantage completely. Related data is no longer stored close to each other, every time-sequential access is a cache miss ...

remram · 2023-10-02T17:42:18

Sure you can, just prefix the encrypted identifier with a version number.

    https://app.example.org/file/1:abcdef12345

    -> decrypt abcdef12345 with key "1" to yield UUIDv7 key of file (no matter what the latest key is)

andix · 2023-10-03T13:17:33

An incrementing version number would once again leak time information. Even a not incrementing version number would leak that kind of information, because if you know the timestamp of another ID with the same version.

I think there is no good alternative to random external identifiers.

remram · 2023-10-04T02:17:29

Oh that's a good point.

dragontamer · 2023-10-02T07:35:55

Why not just use the AES-128 result as the UUID then? What's the benefit of the internal structure at all?

If AES-128 is an acceptable external UUID (and likely an acceptable internal one), then you might as well just stick with a faster RNG.

oconnore · 2023-10-02T07:43:42

That would be the same as using a random identifier (UUIDv4, for example) with the associated indexing issues when stored in a database.

The whole point here would be that you can expose something opaque externally but benefit from well behaved index keys internally.

amelius · 2023-10-02T08:54:59

Storage is cheap, you might as well store the extra integer.

masklinn · 2023-10-02T09:21:55

Storage is cheap, updating indexes is not.

techdragon · 2023-10-02T10:50:08

This is why I’ll probably just always use a UUIDv7 primary key and a secondary UUIDv4 indexed external identifier… which is extremely close to how I tend to do things today (I’ve been using ULID and UUIDv4)

turtles3 · 2023-10-02T18:16:50

But you still need an external->internal lookup, so doesn't that mean you still need an index on the fully random id?

RhodesianHunter · 2023-10-02T12:01:55

Why not use snowflake IDs?

techdragon · 2023-10-02T12:11:29

Lookup speed. Direct mappings are faster. If something needs an external identifier that can be looked up for URL/API queries and such things…

Internal sorts and index on write with ULID/UUIDv7 which reveal potentially sensitive timestamp information, and so when that’s not appropriate a separate column can be used for…

External opaque identifiers which are UUIDv4, and if indexing speeds become an issue, can be switched to deferred indexing…

It’s a good balance and everything I’ve ever used supports this (I only needed a pretty strong representation wrapper for ULIDs since PG doesn’t validate UUID bit field structures so i can just miss-use UUID columns)

It looks like Snowflake has the same information exposure issues that using UUIDv7 or ULID for a publicly visible identifier.

pdpi · 2023-10-02T09:28:36

> What's the benefit of the internal structure at all?

Purely random identifiers are the bane of DB indices. The internal structure is sequential-ish and therefore indexes well.

londons_explore · 2023-10-02T10:20:32

Purely random identifiers are the recommended primary key for plenty of databases - eg. spanner.

Random identifiers spread work evenly between shards of something sharded by keyspace. They also don't get 'hotspotting' on more recent records - recent records frequently get more than their fair share of updates and changes, and database query planners have no knowledge of that.

kevincox · 2023-10-02T13:27:21

Yes, you want your clusters hot, but not too hot.

For large distributed systems like Spanner you want to avoid a hotspot on a single node as that limits throughout.

However for a single-node system like PostgreSQL you want hot spots because they are cache friendly. (Locks aside)

Basically you want a hotspot that is as hot as a single node can manage, but no hotter. Even for things like Spanner you still want good cache hits (which is why they support table interleaving and other methods of clustering related data) but in general avoiding overloading single nodes is more important (as it hurts latency and efficiency but doesn't limit throughput like a hotspot would).

sgarland · 2023-10-02T12:00:38

Spanner is also bespoke, and was probably designed with that in mind.

Anything with a clustering index - MySQL (with InnoDB), SQL Server - will absolutely hate the page splits from a random PK.

Postgres also doesn’t like it that much, for a variety of reasons. Tuples are in a heap, but the PK is still a B+tree (ish), so it still suffers from splits. They also use half the default page size as MySQL, AND their MVCC implementation doesn’t lend itself to easy updates without rewriting the entire page.

Go run benchmarks with some decent scale (tens of millions of rows or more) on any DB you’d like between UUIDv4 and UUIDv7 as the PK, and see how it goes.

pgaddict · 2023-10-02T13:31:29

Right. In a way, these new sequential UUIDs approach the fact that recent records are often accessed more frequently as something that can be exploited, not as an issue that needs to be ironed out by randomness.

For tables this is not such a problem because of reasonably good locality (rows inserted close in time will end up close in the file too), but for indexes that's very difference. In Postgres this is particularly painful for writes, because of the write amplification.

None of this really breaks sharding (IDs are still unique and can be hashed), so no new hotspots. It can't fix the "many rows with the same ID" issue, but neither can any other ID.

gwbas1c · 2023-10-02T13:41:58

You could use whatever is convenient for a DB index, and then encrypt it as the public ID.

lysium · 2023-10-02T05:45:29

Neat idea.

I’m afraid you won’t be able to ever rotate that key, would you? Since it’s result is externally used as an identifier, you would have to rotate the external identifiers, too.

aurbano · 2023-10-02T10:11:09

Assuming you have a table where the identifiers are stored you'd have the internal one (UUIDv7) and the encrypted version from it (external id).

You could rotate encryption keys whenever you want for new external id calculation, so that older external ids won't change (as they are external, they need to stay immutable).

lilyball · 2023-10-02T17:03:36

If you're storing the identifiers then you don't need encryption, you just generate a UUIDv4 and use that as the external identifier. And then we're back at where the blog post started.

oconnore · 2023-10-02T08:03:10

I think you could but it would further complicate the id scheme (would need some sort of a version mask to facilitate a rotation window).

the_arun · 2023-10-02T06:00:13

What is the use case for rotating uuids? Aren’t they immutable?

Raed667 · 2023-10-02T06:01:54

i think they meant rotating the encryption key not the internal uuids

dajonker · 2023-10-02T05:38:20

That's an interesting idea, how would you deal with the bits in the UUID that are used for the version? Setting them to random bits may cause issues for clients that try to use the identifier in their own database or application, as mentioned in the article.

notpushkin · 2023-10-02T06:01:12

Is there a way to encrypt 122 bits -> 122 bits? If so – do that and set version to 4.

Alternatively, just say it's a random string ID and not an UUID.

lifthrasiir · 2023-10-02T08:33:43

For your information, yes, you can [1]. For example if you have a good enough 128-bit block cipher (e.g. AES-128-ECB), start with a block of 128 bits where specific 6 bits are filled out and others are filled with the plain text. Repeatedly encrypt the block until those specific 6 bits are reached again (and do the same thing in reverse for decryption). This is possible because a good block cipher is also a good pseudorandom permutation, so it should have a small number of extremely long cycles (ideally just one) with a 2^-6 probability of allowed values in average.

[1] https://en.wikipedia.org/wiki/Format-preserving_encryption

KMag · 2023-10-02T14:59:10

Cycle-walking FPE has a (nearly) unbounded upper-bound on latency.

Breaking the 122 bits into two 61-bit halves and using AES as the round function for a Feistel cipher gives you a constant 3-encryption latency instead of the expected 64-encryption average latency of the cycle-walking format preserving encryption.

Alternatively, use AES in VIL mode ( https://cseweb.ucsd.edu/~mihir/papers/lpe.pdf ).

akoboldfrying · 2023-10-02T11:48:08

What a cute observation!

If there's any way for the client to influence the input, it may be prone to DoS attacks: By my calculations, with a million random attempts, you would expect to find a cycle of length at least 435, which is over 13x the average. (Mind you, multiplying the number of attempts by 10 only adds about 72.5 to the expected cycle length, and probably no one has the patience to try more than 100 billion or so attempts.)

KMag · 2023-10-02T15:12:07

The properties of the permutation are dependent upon the encryption key, so a client being able to select malicious inputs to get long cycles implies either that the client knows the AES key, or that the client has broken AES.

In any case, as I mentioned in a sibling comment, with 3 AES encryptions one can construct a 122-bit balanced Feistel cipher with a constant amount of work.

bajsejohannes · 2023-10-02T06:53:59

I'm actually working on encrypting database keys like this, and I opted for a string ID with "url safe" base64. It avoids the ambiguity of looking like a UUID when it's not, and I prefer "o3Ru98R3Qw-_x2MdiEEdSQ" in a URL over "a3746ef7-c477-430f-bfc7-631d88411d49". (Not that either one is very beautiful)

matja · 2023-10-02T07:44:14

Curious on the preference there, especially when characters in base64url encoding can look similar/ambiguous in some fonts.

bajsejohannes · 2023-10-02T07:59:14

The preference is purely about the amount of characters. I hope no one will ever actually read or write these URLs, but you never know...

matja · 2023-10-02T16:56:58

Ah in that case, base58 can be useful because it doesn't use characters that may be ambiguous in some fonts (doesn't use: 0/O/I/l , but does use: 1/i)

MattPalmer1086 · 2023-10-02T07:30:58

This seems overly complex, and you need some kind of key too.

Why not just hash it with pretty much any hash function?

oconnore · 2023-10-02T07:46:49

A hash is not reversible, so you’d need a database index to recover the original efficient-to-index identifier, which misses the whole point ;)

If you didn’t care about index clustering then just use UUIDv4

masklinn · 2023-10-02T09:24:34

Although you could use a hash index to avoid the more deleterious effects of insertion, as long as you don’t need a unique constraint anyway…

bajsejohannes · 2023-10-02T07:36:51

Because a hash is (by definition) a one-way function. You need to be able to go the other way for incoming ids.

MattPalmer1086 · 2023-10-02T19:27:57

Ah, good point. Hadn't thought about the opposite direction.

fodkodrasz · 2023-10-02T06:35:47

No IV, ECB mode... why bother with encryption at all? Just expose the internal id.

computerfriend · 2023-10-02T06:59:51

ECB is perfectly secure when you use it on a single block.

fodkodrasz · 2023-10-02T11:20:17

And you have no problem with the same data being encrypted being identifyable (no salt). i see now why it might be useful in this case, though I still don’t like the idea, it feels bad somehow. (Yeah I get that you save storage for some computation)

adrian_b · 2023-10-02T06:56:05

Encrypting an internal id with ECB into an external id continues to allow the comparison for equality of 2 ids, to determine whether they are the same or not, but except for this it removes all the information contained in the structure of an UUID.

bajsejohannes · 2023-10-02T06:44:54

You would still use a secret key, so it's impossible for the end user to decrypt it.

isbvhodnvemrwvn · 2023-10-02T09:59:09

You are encrypting a single block of unique information. No other encryption mode gives you any advantages whatsoever.

remram · 2023-10-02T17:44:12

Because the internal ID exposes timing/sequence information, as per jonhohle's comment.

nhoughto · 2023-10-02T11:19:56

Yep have used an approach just like that, worked quite well if you have a strong pattern to easily translate from one to the other. Gives you an id with the right properties for internal use, efficient indexing etc, and in its encrypted form gives you the properties you want from an external identifier being unpredictable etc, all from one source id.

It is true that now your encryption key is now very long lived and effectively part of your public interface, but depending on your situation that could be an acceptable tradeoff, and there are quite a few pragmatic reasons why that might be true as has been described by other comments.

Edit: you can even do 64bit snowflakes internally to 128bit AES encrypted externally, doesn’t have to be 128-128 obvs

phkahler · 2023-10-02T11:32:17

>> It is true that now your encryption key is now very long lived and effectively part of your public interface

No need to encrypt, just store the external key in a table. Not that you're likely to change algorithms.

nhoughto · 2023-10-02T11:53:00

True you could rotate by persisting the old value and complicate your lookup/join process, not my idea of an acceptable solution but yep totally possible and worth it for some set of tradeoffs.

phkahler · 2023-10-02T18:22:50

Late edit: I meant to say No need to encrypt on the fly. Do it once and save it.

kevincox · 2023-10-02T13:24:18

You are basically describing BuildKite's previous solution.

nindalf · 2023-10-02T08:24:24

One key for all tokens or one key per token? If it’s the latter a simple XOR would do because it would be the equivalent of a one time pad.

Nevermark · 2023-10-02T09:21:01

One key per token would require a table matching internal tokens to their key for forward conversion, and another table matching external keys to their key for reversing.

Might as well just use randomly generated external keys and have one table if you were doing that.

So, one key per all tokens.

robertlagrant · 2023-10-02T08:53:56

I don't think it can be a key per token, or it will scale appallingly.

noduerme · 2023-10-02T06:01:30

Thinking that harder-to-guess IDs will mitigate attacks is an example of security by obscurity. It's better to think of any IDs in your database as being public knowledge, because they will leak anyway. Assuming that no one can guess another ID leads to shoddy practices. I generally keep IDs sequential and build security around the basic assumption that IDs are not keys, passwords, sessions, or secrets - they're just the public matching identifier for those things.

To that end, I think it's neat to be able to improve indexing on UUIDs, but it's not a security solution.

dalore · 2023-10-02T10:21:23

Having sequential ID's is more than just a security risk, it's an information risk. Competitors can use them to estimate the size of your business, the number of customers you have, and all sorts of stuff.

This was used in the war to estimate the number of German tanks based on the sequential IDs

https://en.wikipedia.org/wiki/German_tank_problem

So just for business intelligence you don't want to leak your IDs.

sgarland · 2023-10-02T12:23:02

I’ve heard this argument many times, but I’ve never seen anyone actually post a reference to it happening (as in, a company finding and using this information; not the German tank problem).

To me, it reeks of solving imaginary problems while causing new ones.

lightbendover · 2023-10-02T13:23:22

Years ago I wrote a library that would exaggerate sequential IDs to make our SaaS platform appear more popular than it actually was to anyone trying to pay attention. Not sure if I’m proud of the hack or embarrassed. But of both I suppose.

pnpnp · 2023-10-02T20:42:18

but UUIDv7 isn’t sequential (unless I’m getting it mixed up). There’s just a time-based component which can make sorting really nice & some random bits at the end.

If you don’t let an attacker iterate your data, all they can tell is when the ID was created.

lightbendover · 2023-10-03T16:01:15

I was responding in a sub-thread about risks/opportunities associated with sequential IDs with an anecdote on opportunities.

jolux · 2023-10-03T01:28:17

> If you don’t let an attacker iterate your data, all they can tell is when the ID was created.

The ordering means that you can reconstruct the sequence if you have enough of them, though.

umanwizard · 2023-10-02T12:56:50

I vaguely remember somebody figuring out from photoshop activation IDs how many sales adobe was making and trading options around their earnings report with that information.

I don’t remember the details, so maybe it was something else and not photoshop/adobe.

dbbk · 2023-10-02T12:41:57

It's perhaps embarrassing for a new startup to have a user ID of "10".

That's about the only problem I can discern.

jabroni_salad · 2023-10-02T14:40:27

Depends on how far you get, I suppose. Wozniak's apple employee badge has ID #1 on it, and that's cool as heck. I also remember ICQ users sorting themselves based on how many digits their identifier had.

vidarh · 2023-10-02T11:23:39

I agree just knowing the number is bad, but it also makes it easier to discover far worse problems as well.

My second job was for a company that provided internet enabled phone conferencing solutions (this was years before VoIP became widespread).

The customer ids were sequential. Couple that with an outright idiotic security flaw (the login process set the customer ID in a cookie and the app trusted it on ever subsequent request. Just the ID. Nothing else), I was able to iterate over all the customer ids and hand over a complete list of users to my boss to illustrate the problem, starting with a list of the accounts of the complete upper management.

They could have been used to spin up huge numbers of 30-person long distance conference calls at high costs (this company was building out nodes with 20,000 line pstn switches before they had customers... it was crazy, and they failed but would've failed far faster if that had been abused and they were on the hook for costs from their carriers)

Trusting that cookie was still stupid, but had it been a long random key it'd at least been a bit harder to discover and exploit (their next attempt was to base64 encode it and I had to explain why that didn't help; they then finally blowfish encrypted it, but without any time component, so still subject to replay attacks... I jumped at the first opportunity I got to get out of there)

TRiG_Ireland · 2023-10-02T14:54:38

Huh. That's exactly the same security flaw as Moonpig had. Tom Scott made a video about it.

madeofpalk · 2023-10-02T11:14:06

Not disagreeing with the general concept - these IDs leak information - but these are sequential IDs, not auto-incrementing IDs. The leak is the time the ID was generated, not the volume of IDs generated.

crazygringo · 2023-10-02T14:36:55

That's still a competitive risk -- it does things like reveal if a given list of customers from recent orders/posts are all new customers or long-term customers.

Or from a list of most recently added customers/users, you can figure out the rate of signups.

Revealing timestamps is bad because it can reveal way too much information about the health of your business that you prefer to keep private, if a sequential list of ID's ever gets exposed (which is hard to prevent).

pnpnp · 2023-10-02T20:43:20

They’re not even strongly sequential (is there a term for this?). The gaps between them can be arbitrarily large.

madeofpalk · 2023-10-02T23:08:28

They are sequential, where they are in a sequence where one is clearly before or after another.

They're not monotonic.

pnpnp · 2023-10-03T22:33:26

Thanks! This is what I was looking for.

jabart · 2023-10-02T15:29:18

They can still find out through LinkedIn, ZoomInfo, BuiltWith, and any number of tools.

The tank problem doesn't fit when the incrementing value is time since epoch. Integers yeah, UUIDs, KSUIDS, or any other semi-ordered thing to make your database indexes less fragmented I haven't seen a real leak issue with those.

wolletd · 2023-10-02T10:43:15

Just friday I've had a discussion with a colleague about filenames.

We do a lot of computer vision and in his project, each processed object is assigned a UUID and he wanted to save images to files for each one.

So we took some time to go over various timestamp formats to be embedded into the filename to make the files sort chronologically. UUIDv7 is just spot-on solving our problem. In this use case, there are no real security considerations.

berkes · 2023-10-02T06:14:26

Doesn't that still leak (statistical) information?

It may not be technically security, but e.g. knowing your competitor just added N products to their shop, might be a security issue for the business.

noduerme · 2023-10-02T07:24:59

It may. Certainly, for instance, sequential invoice numbers do. If a business decides to take measures to obscure that, no problem. All I'm saying is that obscuring a numbering system for data artifacts shouldn't be considered any sort of security as far as keeping your endpoints from being hacked.

berkes · 2023-10-02T08:35:22

The point on invoice-numbers brings another issue to mind.

We model our domain(s) using DDD, and often "The ID" really is best left a thing with meaning. Customer-id, Bank-account-number, invoice-number, email, etc. At least within the domain, it is. The business (and laws etc) already ensure there can only ever be one invoice with this number. Its terribly counterproductive to have two ID's for something. "Hey, can you have a look at invoice 20230233, because it seems the VAT was applied wrong. Hmm, do you have the UUID for that invoice and DM me that? You know, the long one with the hyphens".

I guess there isn't a one-size-fits all solution and that "it depends" very much on what e.g. "public" means.

hughesjj · 2023-10-02T06:20:13

You're absolutely right, this is also why you generally encrypt sessionized or "consistent view" pagination tokens for public apis (save for primitives like ddb or Kafka)

The end user should know no details about your internal key space.

Nevermark · 2023-10-02T09:31:57

Security by obscurity is a necessary step in most software security.

It hardens, completes and complements other measures.

Examples of every day security using obscurity: every password and encryption key

EDIT: Thanks for the replies.

Ignore above!

Obscurity is the low bit of security. But when it’s convenient, it still helps.

earnesti · 2023-10-02T09:48:47

Obscurity and secrecy are different things. Though I agree with you. Moderate amount of well implemented obscurity is helpful.

Sebb767 · 2023-10-02T10:14:12

> Moderate amount of well implemented obscurity is helpful.

You're getting that wrong: Everything else being equal, the more obscure system will always be the safer one. It's just that obscurity can easily be lost, so your system should, if in any way possible, still be secure even if fully known. In the end, however, no system is 100% secure, but more obscurity will make it harder to find the inevitably existing issues.

cloverich · 2023-10-02T15:50:25

I think the counter argument is, that all else is not equal when obscurity is a goal of security, because it adds a maintenance burden to some greater or lesser degree, and that maintenance burden becomes time taken away from proper security practices, or other value providing work.

Sebb767 · 2023-10-02T16:43:46

I think the main argument is that security by obscurity can easily be circumvented, be it via sidechannel, secret leak, source code leak or a surprisingly small search space (for example the whole range of IPv4 being scanned by now). It's easy to assume something is secure and spend a lot of time on obscurity, which completely falls apart thanks to a small sidechannel attack. It's (usually) just a weak defense overall. Yes, it can also be a maintenance overhead and therefore risk via proxy, but it can actually be easier in other situations.

For a personal anecdote, I used to work in a small webshop and our software was horrible, to the point where minimal effort would have been able to compromise our servers, which were running software roughly as old as I was at the time (I want to note that I worked on improving the situation while I was there). Still, the only time we had a problem was when we took over a Joomla-hosted site, as we were small enough to not get any individual attention and your off-the-shelf WordPress or Joomla-scripts did not work on our home-brewed software.

In the end, I fully agree that security by obscurity is a weak concept and the usual wisdom of not relying on it is completely correct. Still, it's important to acknowledge that obscurity can and does help security and bring actual reasons on why you should not rely on it. Just saying "it's obviously bad" leads to an easily refuted argument and will not convince some developers, leading to worse software overall.

noduerme · 2023-10-02T20:00:06

To me, the main reason to avoid obscurity in naming or numbering things, or even in code - rather than view it as a modest addendum to security - is to force yourself to do the mental exercise of what happens when that obscurity is lost.

Not doing that is how small companies seem to get away with terrible security holes for a long time, until suddenly they don't. I've seen too many cases of companies in a position where they built a small, insecure service that's now getting shared more widely than envisioned, who don't want to spend the money to make it right, because no one has compromised it yet (that they know of), and what are the chances of someone stumbling across it - where even pointing out that it's an attack vector can earn you trouble.

PhilipRoman · 2023-10-02T09:46:57

passwords and encryption keys are secrets, not obscurity.

Security by obscurity would be hiding your house key under a doormat for your friend to find - depending on the culture you live in you may be more or less safe but it is not security (just like hosting your ssh server on port 9384 will repel 99% of attackers but is not a security measure).

wolletd · 2023-10-02T11:03:33

I keep SSH on Port 22. After years, I'm still amazed about the operational model of these attacking hosts.

They are completely dumb. I haven't kept record, but I have the feeling that some IPs in my fail2ban list are practically in there for month or even years now.

I assume they are just sweeping the whole IPv4 range? No state, no cache. Either they successfully attack a host or they go to the next IP. Repeat 2^32 times, start again.

I'm not sure where I wanted to go with this comment. Is it _that cheap_ to constantly sweep the IPv4 range or is it _that profitable_ to do it once you have a successful attack?

vidarh · 2023-10-02T11:30:58

You should think of them as public, but that doesn't mean it isn't still helpful to obscure aspects of the information they carry.

Obscurity can be helpful as part of defence in depth, to reduce the impact when someone does something stupid, or to make it more difficult to extract information that might be helpful as a means to attack the system from another angle.

If you're already thinking about the implications, you can likely ensure people doesn't jump to the conclusion that the IDs can be trusted just because they look complex.

ivan_gammel · 2023-10-02T08:48:02

Security by obscurity is a working solution if implemented with other measures. It increases the cost of attack, which in the presence of unknown vulnerabilities gives you precious time to respond.

imiric · 2023-10-02T06:05:03

I'm a fan of Cuid2[1] for this reason.

They are compact, don't leak information, and make a good case why k-sortable IDs are unnecessary, or even harmful for performance.

I'm using sequential integers and created_at/updated_at timestamps for internal use, and Cuid2 IDs externally.

[1]: https://github.com/paralleldrive/cuid2

tveita · 2023-10-02T10:49:51

> But not too fast: If you can hash too quickly you can launch parallel attacks to find duplicates or break entropy-hiding. For unique ids, the fastest runner loses the security race.

> Cuid2 has been audited by security experts and artificial intelligence, and is considered safe to use for use-cases like secret sharing links.

I'm getting some snake oil vibes from this... There absolutely shouldn't be anything like a random ID that is 'too fast' to compute. You might need a rate limit to stay within your collision bounds, but CPU usage is a poor way to do it.

And there is currently no publicly available "artificial intelligence" that would be useful in a security audit, unless you want to call fuzzers "AI".

sgarland · 2023-10-02T12:18:38

The comments on performance are utterly incorrect, modulo discussions on hotspots, but you shouldn’t be sharding randomly anyway. If you get to the point where you _need_ to shard for anything other than geolocality, doing so randomly will rapidly reveal your hotspots.

> One reason for using sequential keys is to avoid id fragmentation, which can require a large amount of disk space for databases with billions of records.

Disk is cheap but not free at higher tiers. But more importantly, record fragmentation means more pages (unless you take the time to do a full table lock and rewrite it, and who’s doing that?) which means more index bloat. I assure you, that adds up once you’re into the billions of records level.

> the ids will be generated in a sequential order, causing the tree to become unbalanced, which will lead to frequent rebalancing.

Given the width of B+trees used in DBs, I doubt they generally need to go more than one or at most two levels up. I’ll take the ability to rapidly follow the leaf nodes and have a good shot at sequential reads in cache from prefetch, thanks.

ndriscoll · 2023-10-02T10:55:09

Nearly everything in this README about security or performance is wrong. I'd be very wary of using this.

BiteCode_dev · 2023-10-02T06:13:52

What benefice over uuid4 ?

matharmin · 2023-10-02T08:11:41

Reading their docs: No real benefits, just misconceptions.

1. Collision resistance / "weak" PRGNs used to generate UUIDv4. Firstly, these are properties of the implementation, not the spec. Secondly, the source for calling the browser `Crypto.getRandomValues()` insecure is an issue that has been fixed back in 2016. I would not trust the developers of this implementation to do a better job than current browsers.

2. "Not URL or name friendly": Fair, but not very strong argument.

3. "Horizontally scalable" and "offline-capable": No argument given for why UUIDv4 does not meet these requirements, apart from point 1 above.

4. "Too fast": No argument given for why having a slower algorithm to generate random ids is more secure. Both UUIDv4 and Cuid2 use a similar number of random bits (122-124). When using a secure PRNG, both are equally difficult to guess, the SHA3 hashing doesn't add anything. You don't have to try and guess the "input" of the Cuid2 - you can just try to guess the "output" and skip the SHA3 hashing. It would be impossible to actually guess a generated ID, but UUIDv4 is just as impossible. Also no argument given for why UUIDv7 is fine but UUIDv4 is not.

I've used UUIDv4 for genering unique IDs for over 10 years now. I have run into collisions, when I hand-rolled my own implementation for J2ME with major bugs many years ago - ended up with around 20 bits of entropy instead of 122. That's not a reason to not use UUIDv4, just a reason to not implement it yourself unless you really know what you're doing.

wolletd · 2023-10-02T10:13:50

I once had UUID collisions in Linux Boot IDs when we started developing our own embedded systems at my company.

Found them because systemd-journald isn't very happy when Boot IDs repeat and (apparently, then) stops showing earlier boots once it hits a repeating boot ID. And I wanted to see an earlier boot. Then I started logging the Boot ID in a textfile myself and it took less than 10 reboots to have duplicate Boot IDs.

Long story short, some weeks earlier, I "optimized" the Kernel config for that system and some config flags that didn't sound like something I'd need. As it turns out, an ARCH_ZYNQ target apparently also needs ARCH_VEXPRESS set. Otherwise it works absolutely fine, but with a broken RNG that you will notice weeks later.

That was a valuable "don't take down a fence until you know the reason why it was put up" lesson. Don't unset kernel config flags until you know why they are set.

Aside from breaking RNGs, I've never experienced any UUID collisions either.

AtlasBarfed · 2023-10-03T12:40:06

"A couple of years ago we decided to standardize the use of sequential integer IDs as primary keys, due to the significant performance issues of non-time-ordered UUIDs."

From the article. I'd like a lot more exposition on that, since it goes against some of what used to motivate UUID use in the first place. Sequential ordering across distributed nodes isn't a fun thing to do, and even if you navigate the coding, the network agreement makes it really slow.

Do they mean "sequential enough" but some locality of the node that generated it? And I guess that prng can sometimes have some performance bottlenecks, but compared to doing locks on a single incremented integer?

Yeah, I don't really get this, lots of usual "better" "faster" etc without actual numbers to back it up or detailed algorithmic discussions.

As you basically said, wake me up if it's good enough to get into the standard vetted libraries of UUID generation.

onion2k · 2023-10-02T12:58:09

No argument given for why having a slower algorithm to generate random ids is more secure.

If the algorithm is too fast it means you can detect when some other part of the system is having a significant impact on how the key is returned. Eg checking a database to see if a user exists and returning their key versus getting null back and generating a new key. That difference can be used to determine if a user exists. You want your key gen process to be slow enough that it's a significant part of the process, which makes timing attacks hard.

maronato · 2023-10-02T13:27:20

This doesn’t make sense. If you’re going to generate a new key if the user doesn’t exist then you’re creating a new user anyway, so there’s no hidden information. Unless you mean that the system should return a bogus id instead of a status 40X when the user doesn’t exist, which makes even less sense.

ID generation should usually only happen when creating new assets, so it should be as fast as possible.

pfix · 2023-10-02T06:53:44

There's a comparison in the README of the project:

https://github.com/paralleldrive/cuid2#the-contenders

Some of the arguments mentioned are explained elsewhere in the README, others are assumed.

One argument standing out for me is the lack of collision-resistance for UUIDv4 which is surprising for me and I didn't spot any sources for that argument.

Another argument is the entropy source where they go about that Math.random is not reliable as a single entropy source but glimpsing at the source code, they sprinkle the CUID with Math.random data.

I am no expert in ID security, so I am not qualified to speak about the validity of their arguments, only that there's insufficient information to validate without prior knowledge about the problem domain.

forty · 2023-10-02T07:32:34

crypto.randomUUID should generate UUIDv4 with a cryptographically secure RNG (ie not math.random)

Collision of UUIDV4 (which are 122 bits of entropy) are unlikely enough that it should fit most definitions of the word "impossible".

The argument listed in this library README feels like total bullshit to me, I'd avoid using it for this reason alone.

BillinghamJ · 2023-10-02T16:53:43

>Having any information, specifically time information, leaking from your systems may or may not have unanticipated security or business implications. (e.g. knowing when session tokens or accounts are created).

I don't think this is really true? These are not serially incrementing, they just indicate the time it happened. If you have an ID that you know exists, having the ability to know _when_ it was created is very rarely meaningful.

What could present more of a risk is being able to predict a large part of IDs that will be created. Even then though, you shouldn't depend on your IDs for secrecy - best to ensure the IDs are never used as protection by themselves (ie treat them like they're just a simple autoincrementing number, even if they're not)

WorldMaker · 2023-10-02T22:34:36

One real world security problem is the "elder account" problem: as the age of an account increases the likelihood increases that it uses an insecure old password and/or that the account owner isn't paying as much attention to the account in the present. Depending on what the account represents age may also imply more "value" in the account. (Including just "sentimentality" value in the case of ransom operations, not just financial value.) Being able to tell from an ID alone that an account is at least X years older than some other ID in the system can be a handy way to find "potentially high value/low security" accounts to focus on to social engineer.

There are certainly mitigations that can be made and not all things are equally valuable as they age. (Plus many public APIs include created/modified timestamps anyway. The information is often easy to discover even when not embedded in an ID.) I don't find it a strong reason to avoid timestamp-based IDs for the threat models of that many things beyond user accounts and other things susceptible for social engineering, but it is something to keep aware of.

Someone · 2023-10-02T16:59:16

One business implication is that third parties can detect whether your sales increase or decrease from sampling those IDs (a variation on https://en.wikipedia.org/wiki/German_tank_problem)

adamckay · 2023-10-02T17:35:01

But how?

You would be correct if the ID were an integer being serially increased. I can sign up to your website today and get an ID X and then sign up again in a week and get ID Y, I can then calculate the number of new users you've had by performing Y-X.

If this ID is a timestamp then there's no such information I can get out of it from a small sample. I sign up today and get todays timestamp, then I sign up next week and get next weeks timestamp..?

rtsil · 2023-10-02T15:34:39

Don't create them each time a record is created, create a batch in advance in sufficient number, and do the same every time the previous batch has ran out. UUIDv7 is 128 bits, you can store a large number of them without major penalty.

Macha · 2023-10-02T15:52:48

But then you need to have the client communicate with the server to identify it's newly created object or complicate your logic to have incomplete objects in a pending state, which is one of the things people were using UUIDs to avoid.

rtsil · 2023-10-02T19:53:44

You don't store the UUIDs in the database as incomplete records. You can put them in a unused_uuids table and store some of the values in memory to minimize the round-trip. You can even store them in a simple file, and remove each used UUID from that file. When the file is empty, you create a million more of them.

Macha · 2023-10-02T22:00:48

The incomplete objects refers to when someone clicks "new" in your UI. Until it's saved back to the server, that "new" object has no ID, since you need to communicate with the server somehow to get that UUID in this approach. So now the client creates objects without IDs, so now all your models need to assume IDs are optional, and you can't create your object references on unsaved objects.

rtsil · 2023-10-03T01:07:41

Then we can get a random sample of UUIDs in sufficient number from the batch, send it to the client at the beginning of the session along with the other data, lock that batch until the client releases the session to avoid duplicate use, and have the client use these UUIDs until they run out, at which point it can request a new batch.

pnpnp · 2023-10-02T20:39:54

They’re also incredibly cheap to create & don’t need knowledge of each other. I mostly see batching of IDs like this when a lock is involved to prevent collisions & maintain performance.

With UUIDv7, you are reasonably sure that there won’t be collisions (check your use-case first), and can just generate them wherever on-demand (no locks required).

I’d argue batching IDs is actually more complicated than UUIDv7 for most use-cases.

rtsil · 2023-10-03T01:05:10

This was a solution to the UUIDv7 problem of being time-dependend and therefore may leak information. Create the UUIDv7 in advance by batch of 10,000, use these randomly, and you fix that problem.

0xEFF · 2023-10-02T15:13:59

Every access and id token issued by oidc already has an issued at (iat) and expiration (exp) fields.

throwaway167 · 2023-10-02T13:53:27

Just randomise your clock.

/s

Persistent IDs are a security and information risk. If that's a concern, don't persist IDs.

greatgib · 2023-10-02T10:17:11

Fyi, the timestamp is already encoded inside uuid4. But for having a good distribution of values, the low bits half of the timestamp is stored before the high bits half.

Here uuidv7 will just re-order that. So the content of the uuid in itself does not change.

mafuy · 2023-10-02T10:24:51

No, this was the case in earlier uuids. In v4, there is no timestamp.

greatgib · 2023-10-02T10:37:13

Indeed you are right, just the other ones.

woile · 2023-10-02T06:15:52

Could you explain a bit more how it would be a risk? Maybe for session tokens is understandable. But why leaking account created info is a problem?

logicchains · 2023-10-02T10:53:21

Leaking a monotonic ID could allow outside observers to estimate e.g. number of accounts created or products sold over certain timeframe. Competitors (or traders, for a public company) could use this like a form of inside information on the company (e.g. sell the stock if the rate was falling).

gorkish · 2023-10-02T17:28:41

This would really only be possible if you leaked a monotonic sequence; the monotonic clock only would potentially leak only event ordering or absolute time.

IMO it's not the job of the identifier itself to prevent information leakage vulnerabilities though; if thee is sensitivity to this, the solution should be explicit, such as employing a secondary key derived from the UUID using a secure KDF or similar.

fanf2 · 2023-10-02T11:01:28

UUIDv7 does not leak the allocation rate of UUIDs.

andrewmatte · 2023-10-02T04:31:33

jonhohle, thanks. Do you know of examples of when milliseconds are part of the session tokens or accounts being created has been exploited?

tgv · 2023-10-02T05:14:52

The German tank production capacity was estimated by serial numbers of captured tanks. There are ways to read all kinds of information by observing energy usage. High resolution time and sequence data undoubtedly reveal more than you’d like.

CSDude · 2023-10-02T06:19:33

Most of our lives as boring SaaS etc. software developer will not be near as exciting as this, but of course you may never know.

I parsed the EV chargers APIs where I live (using Frida in Android) and one of the fields returned the daily revenue and profit.

t0mas88 · 2023-10-02T08:00:49

Could be quite useful information to competing charger networks or in future M&A discussions.

I've seen a project for a trading firm that inferred all kinds of traffic and revenue numbers for companies before their quarterly earnings were made public. It wasn't perfect, but knowing with a certain confidence level whether the numbers were going to be better or worse than estimate was profitable for them.

tgv · 2023-10-02T13:01:45

Sure, for many places it doesn't really matter, but if your URLs or user ids can be seen/scraped by others, you might expose some commercially interesting information to competitors.

And the indexing argument isn't really compelling, is it? You lose very little by sticking to fully random UUIDs.

kozak · 2023-10-02T11:38:21

Here in Ukraine trying to estimate enemy's drone and missile production capacities by serial numbers of their parts is quite a mundane task these days.

BWStearns · 2023-10-02T05:30:17

I could imagine using the timestamp segment of publicly observable ids to estimate activity patterns in an organization. Probably not super crucial and there are probably easier ways in most cases but it could be a big deal at the right moment and for the right target. This could be like a more refined version of PIZZAINT (where you can detect impending policy/operational movements by the quantity of food deliveries to a government organization).

rhaps0dy · 2023-10-02T05:33:10

Is PIZZAINT real? I thought it was a joke. How do you even find out how much pizza is being delivered?

BWStearns · 2023-10-02T05:40:48

In ye olden days you had a minion sitting in a car watching the front gate with a notepad and enough cigarettes to last the night.

Pizza itself might be a bit of a joke but looking for non-operational behavioral changes is absolutely real. The Cuban missile crisis was started in part because soviets played soccer and Cubans played baseball and the presence of soccer fields helped confirm soviet presence (in enough numbers to bother making rec centers). A more advanced version might be the public Strava data leaking US base layouts and locations or Strava helping the Ukrainians kill a Russian submarine commander.

Edit to add: you could also just figure out where your target orders pizza from and pay one of the dudes working there to tell you when there’s a spike in deliveries to your target.

littlestymaar · 2023-10-02T06:13:14

> soviets played soccer and Cubans played baseball

It's a Henri Kissinger's quote, but it's not accurate: Cubans do in fact play football. Also this quote wasn't from the 1962 Cuban missile crisis, but to another event in 1970. That being said, it is true that the US intelligence got warned by the construction of football fields (or maybe even more so by the lack of baseball grounds).

https://www.cracked.com/article_31335_that-time-soccer-field...

yard2010 · 2023-10-02T07:52:31

That's Goodhart's law, if you want to know something instead of asking just say something which is wrong and someone will correct you

Karellen · 2023-10-02T12:41:38

lol

iswydt

tarjei_huse · 2023-10-02T11:19:59

I know of people who used leaked customer ids in public facing chatbot solutions (like Intercom) to estimate how fast their competitors were growing and/or how many customers they had.

thinkharderdev · 2023-10-02T13:24:53

How would you do that with UUIDv7 though? I see how using sequential IDs would obviously leak that information, but if all you leak is the timestamp the ID was generated, how do you then infer anything about the rate of ID generation?

rvnx · 2023-10-02T09:06:52

It’s as convenient as manipulating IPv6 addresses.

tabtab · 2023-10-02T16:00:42

Plus ID's often have to be given over the phone for support. Long or complicated ID's will drive both sides bonkers. #BeenThereDoneThat.

jacobgorm · 2023-10-02T10:25:55

Not to speak of the increased risk of key collisions. https://en.wikipedia.org/wiki/Birthday_problem

hannasanarion · 2023-10-02T15:42:01

In the same millisecond, in the same database system, having rolled the same 74 bit (~22 digit) number?

jacobgorm · 2023-10-02T16:52:09

If it is the same database system, you may as well use a SERIAL or similar synchronized datatype. The point of using UUIDs is that any participant in a distributed system should be able to generate them, without having to rely on a centralized authority. If the use is purely local then UUIDs would be pointless.

If you have 74 bits of entropy, the birthday paradox says that after 2^37 keys you will hit a 50% risk of a key collision. Whether that is going to be a problem depends on the use case, and on the quality of your RNG.

hannasanarion · 2023-10-03T00:35:14

But the uuid collision chance only matters if there is even a possibility of colliding writes and lookups.

So what if your microblogging platform's tweet uuid happens to collide with my cocktail recipe generator's ingredient uuid? The likelihood that any one device will ever be running our apps at the same time and trying to read data from one into the other is even smaller than the 1-in-380000000000000000000000000000000000000 (uuidv4) probability of having a collision in the first place.

jacobgorm · 2023-10-03T08:51:30

You're right, but why would either of us be using UUIDs for that purpose, when an auto-incrementing integer would suffice?

hannasanarion · 2023-10-03T13:56:12

Because incrementing integers require a singular central node to track the universal state and even so still have a risk of collision bucause they are not guaranteed to be unique in absence of an externally carefully managed master incrementer.

Applications use uuids to avoid colliding with themselves. Uuids exist so that web-apps can create objects client-side without waiting for a database CREATE, applications can be built with multi-node dbms, operating systems can name hardware compenents.

rockwotj · 2023-10-02T02:38:19

I find it interesting that it’s quoted random IDs are bad for performance, because it’s actually better for distributed storage systems because you don’t hotspot on a single node. For example see: https://stackoverflow.com/a/53901549 and https://medium.com/google-cloud/cloud-spanner-choosing-the-r...

chacham15 · 2023-10-02T02:47:42

They can be bad for performance. It all depends on your access patterns. A common caching pattern is called "temporal locality" which means that theres a high likelihood that data created at the same time will be accessed at the same time. Therefore, if these pieces of information are on the same machine, they can be queried / returned much faster than if they were both on separate machines. This is doubly true if theres a data dependency between them. E.g. SELECT x + y or SELECT x WHERE y = 'foo'.

stepanhruda · 2023-10-02T03:17:42

Yes but if that machine with sequential data receives 100x the traffic of other machines, it can be worse than splitting this traffic evenly across all available machines.

kijin · 2023-10-02T04:31:36

If your database simply shards keys sequentially, it's going to get hotspots in a lot of use cases, like plain old integer keys and timestamps, not just UUIDv7. In that case it would be fair to say that your database is doing it wrong.

Fortunately, there's no rule that says you should shard your keys using the sequential part up front.

One of the rules for generating randomness from environmental sources is to throw away the high bits and only use the low bits. Distributed databases should do the same if they want a good distribution.

johncolanduoni · 2023-10-02T15:12:23

What distributed databases shard on the low bits? How do they do something like a range query?

The closest I’ve ever heard of is sharding based on a hash (e.g. CockroachDB can do this on request[1]) but most distributed databases with strong consistency (Spanner descendants in particular) default to “doing it wrong”.

[1]: https://www.cockroachlabs.com/docs/stable/hash-sharded-index...

stepanhruda · 2023-10-02T13:21:32

As I understood it, a big part of the premise of the post was that they see sequential storage (either in db or cache layer) as desirable

paulddraper · 2023-10-02T05:31:39

It depends if you have a request covers a lot of sequential data, or if you have a lot of requests of sequential data.

stepanhruda · 2023-10-02T13:20:01

Correct, it speeds up latency in best case scenario, and falls over in worst case scenario. Randomly sharded keys give a more consistent performance.

chacham15 · 2023-10-02T22:42:24

You're painting with way too broad of a brush. It is not always better and not always worse. E.g. "give me a list of users who made two posts where both posts were created within 1 second of each other" This query would likely blow up on a system which has all the data completely randomly sharded (because you'd have to aggregate all the data centrally, unless you had a complicated shuffle setup (which most dbs dont)) whereas would work fine on a system which has posts sharded by time.

berkes · 2023-10-02T06:25:08

We solved that with UUIDS and updated_at and created_at columns. The latter the default sort in all views and queries. So the btree/indexing issues were hardly an issue. Whenever you fetch a set of rows, they will be bounded by these timestamps.

We even sharded on these columns, because of this (our business case made it so that hardly ever did people need data over multiple months)

But we never encountered distribution issues. I don't think the locality issue will be solved, as postgres doesn't consider other columns when distributing data, only the primary key IIRC. I don't know why we never saw this, though.

hinkley · 2023-10-02T06:22:25

If you're working on a multi-user system, particularly one with hundreds of requests per second, there is no locality of ids. Two of my actions are separated by a sea of actions by other users.

akira2501 · 2023-10-02T02:45:51

UUIDv7: Timestamp up front, random in the back.

wenc · 2023-10-02T03:29:25

I know HN doesn't like jokes, but this is really funny. And the subcomment about mullets too.

(for folks who don't get it, mullets are a 1980s haircut (think MacGyver) with a short front but a long tail in the back. A funny description of them is "business in the front, party in the back")

jjgreen · 2023-10-02T10:12:29

Sitting on a tram in Sheffield, early noughties, a chap with a magnificent unreconstructed ROCK mullet gets off. I see two women chatting on the street, they are stunned into slack-jawed amazement as he struts past -- one of the women makes a scissor-snipping hand gesture behind him as he passes, the whole tram erupts in laughter.

labster · 2023-10-02T02:54:10

Truly, the mullet of unique identifiers.

sj26 · 2023-10-02T05:18:04

It Depends(tm).

If you're using a system which is built for distribution, random is great.

When you're leaning on a Postgres database which has powered your startup through scaling but expects right-leaning btree indexes, it's a bad time.

Rearchitecting to use a new data store is ideal, but often impractical as an immediate step. UUIDv7 is a great increment walking that road via sharding etc.

fnordpiglet · 2023-10-02T02:45:22

In all the distributed systems I’ve built I hashed the keys to ensure good distribution. A nice thing of ordered keys is you can use part of the ordering to distribute keys with a tunable amount of key locality in each node for efficiency.

jillesvangurp · 2023-10-02T06:21:56

Depends how you query it. In a lot of systems, recently added data is also the most queried data or data typically gets pulled out sorted by time. Having that data on disk in more or less the order it is going to be queried makes sorting it a bit easier. Even in a sharded system, each of the shards would have less work to do for sorting. Of course a lot of these systems would have an append only write model which would effectively sort things by time anyway, even with completely random ids.

Somebody posted an interesting article for the instagram ids, which do something similar. They use 41 bits for a time from a custom epoch followed two more groups of bits for a shard id and a sequential number. Each shard has an incrementing sequence for the sequential bit, which guarantees that things on a shard are sorted by time.

This UUIDv7 is slightly weaker than that but sorting things published in the same millisecond is mostly going to be very light work. The lack of a dedicated sharding group of bits is not that important as you could just take the n least significant bits at the end for that without too much effort. Those are random so you end up with nice consistent hashing. Having 48 instead of 41 bits for the time means we won't run out of time any time soon (nearly 9K years vs. 70 years).

gregw2 · 2023-10-02T08:51:46

An explicit shard id can ensure all related data across all tables can be on the same shard. Helpful for SQL JOIN operations.

Picking the N least significant bits only a single table has good distribution and sort qualities, no cross-table properties.

stingraycharles · 2023-10-02T03:17:56

Depends on the use case. If, for example, you store things on disk ordered by these IDs, and access patterns to your dataset are related to time (e.g. more recently created data is accessed more frequently), it will help a lot to have this data ordered by time.

This is especially useful when your underlying database stores data in large "chunks", such as LSM-trees you find with e.g. rocksdb.

tveita · 2023-10-02T12:23:47

It's great for performance, up until you reach the point where a single device becomes a bottleneck, at which point it's terrible for performance.

As a sibling comment says, you ideally want to shard on some other key to get "just enough" distribution that all your machines/disks have work to do, but you are still only hitting a limited number of hot sectors on each disk that can be effectively cached. But that requires active monitoring and rebalancing of your data as it grows. Totally random keys are a safe default that will scale with any kind of data distribution and access patterns.

sroussey · 2023-10-02T04:48:48

It can be bad for performance due to how b-trees work in databases, and more pronounced when you have a clustered index.

dheera · 2023-10-02T03:03:14

It's bad for performance if you frequently access large consecutive sets of records.

hinkley · 2023-10-02T06:25:54

UUIDs are good for data where I want either lots of different users being able to insert without collision, or lots of users who I want to keep their peepers off of other user's metadata (eg, how many X they add to the system per day).

In both cases I'm melding highly disjointed data into a single schema. There are no large consecutive sets of records.

If you're using UUIDs, there's probably a reason. And that reason invalidates the justifications for not using them.

sgarland · 2023-10-02T12:49:14

> If you're using UUIDs, there's probably a reason.

Not really; think of how many architectural decisions are made purely based on imaginary scaling problems, or what the latest blog said.

I would wager that if you polled 100 backend devs, very few of them could correctly articulate the pros and cons of a randomized primary key.

moralestapia · 2023-10-02T03:01:43

Agree, but the solution is easy peasy with a simple hash function.

(Or just reverse the bits, take the last n, etc)

conradludgate · 2023-10-02T06:56:15

As far as I understand, you want to have a random shard position, but once you have found a shard you want that index operation to be cache friendly. When choosing a shard, you can always use the last N bits or use some consistent hashing strategy[1]

[1]: https://en.m.wikipedia.org/wiki/Consistent_hashing

andix · 2023-10-02T09:23:09

There are many other options that usually scale better than random distribution. For example distribution by user or tenant id.

EGreg · 2023-10-02T02:42:23

They likely mean it’s good for latency and not necessarily for throughput.

I still think that graph databases are way better for this sort of thing.

rockwotj · 2023-10-02T02:49:01

They later note most of this traffic is going to a single postgres instance. Having all the keys go to the same range probably helps throughput because they can do a better job of grouping fsync. But that probably depends on the type of drives they are using (even fast NVMe benefit from locality).

phkahler · 2023-10-02T12:59:19

Is there some reason new versions of UUID keep appearing? It seems like the desired properties are never quite achieved so new ones appear later. Is there a table with UUID version across the top and characteristics down the side, so I can see the differences and pick one that fits my needs? That might also help to explain why there are so many variants.

ricardobeat · 2023-10-02T13:25:29

Unless you have specific needs, the only type of UUID you should care about is v4.

v1: mac address + time + random

v4: completely random

v5: input + seed (consistent, derived from input)

v7: time + random (distributed sortable ids)

kozak · 2023-10-02T13:51:38

As someone who only cares about v4, I periodically wonder why don't I just use fully random 128-bit identifiers instead (without the version information).

hansvm · 2023-10-02T14:11:36

That'd be perfectly fine. You need to take some care to avoid duplicates or patterns though. Databases might not come with crypto-random out of the box but usually do have UUID support. Similarly, you need to use the crypto-random routines in your favorite standard library.

Depending on the implementation you still might have to worry about seeding issues. That's probably moot though since the UUID library would probably be compromised by something like that under the hood too.

klysm · 2023-10-02T16:32:35

Because the rest of the world uses uuidv4 and the extra couple bits doesn’t really buy you anything

Xeoncross · 2023-10-02T14:23:36

UUID v4 isn't large enough to prevent collisions, that is why segment.io created https://github.com/segmentio/ksuid which is 160bit vs the 128bit of a UUIDv4.

hannasanarion · 2023-10-02T16:19:32

I think you vastly overestimate the likelihood of collisions. If all of the 2 billion computers in the world produced a new uuidv4 every millisecond, we still wouldn't expect a collision for the next 5 quintillion years.

Or if the same number of bits are used in a more structured manner, like uuidv1 which combines a 48 bit MAC address, 60 bit 100-nanosecond timestamp, and a 14 bit uniquifier with an effective resolution of 6.5 femtoseconds, you could have 300 quadrillion computers make 160 billion uuidv1s per second with guaranteed zero collisions until the year 3400 (plus an extra 500 or so years because the timestamp is referenced to 1542 for some silly reason), after which point any collisions that do happen will be completely irrelevant because they're guaranteed to be colliding with db entries created over 2000 years prior.

Jenda_ · 2023-10-03T04:36:29

> If all of the 2 billion computers in the world produced a new uuidv4 every millisecond, we still wouldn't expect a collision for the next 5 quintillion years.

This would generate 2^127.8 UUIDs.

First, no, the collision would be expected in less than one year, approximately after exhausting square root of the available space (2^64 generated UUIDs): https://en.wikipedia.org/wiki/Birthday_problem

Second, no, the UUIDv4 has 122 random bits, not 128 as you thought: https://en.wikipedia.org/wiki/Universally_unique_identifier#...

hannasanarion · 2023-10-03T14:10:36

Sure, the presence of some collision somewhere is still likely to happen, but the chances that that collision will actually matter for anything is still vanishingly small.

In the real world, we do not spend 100% of our entire species's computing capacity generating uuids and doing nothing else. In the real world, we aren't burning through uuids at a rate of 2 billion per millisecond, and even if we were it wouldn't matter because the true denominator is the scope of the data system the uuid will be referenced in: if your hard drive partition and my webapp user entry happen to get the same uuid, we will never know, and if for some reason it does matter, then we can use uuidv1 or uuidv7 which guarantee no collisions for thousands of years by embedding a timestamp.

klysm · 2023-10-02T16:33:09

No number of bits is large enough to _prevent_ collisions.

phkahler · 2023-10-02T18:11:22

That doesn't help me. My code may generate a whole bunch of IDs in a microsecond, so using a 32bit time isn't going to keep them time sorted. I may as well just use a UUIDv4 at that point.

ricardobeat · 2023-10-02T23:29:45

Looks like you got things a bit mixed up. You’ll not see a v4 collision in your lifetime, that is not why they built ksuid. At the time, UUIDv1 was the only standard alternative that includes a time component, but in way less bits, not sortable, and the fixed mac address takes a lot of space from the random bits.

ksuid is similar to Twitter Snowflake, the main goal is distributed generation of collision-free sortable IDs. The UUIDv7 proposal is meant to address the same use case. You don’t need to worry about collisions as much here, as the timestamp is monotonically increasing, there is a 42-bit counter for every millisecond + the random 32 bits at the end. You’d have to be generating trillions of IDs per second to have the chance of a collision.

xcrunner529 · 2023-10-02T14:08:55

It would seem sequential keys for database performance is more than a 'specific' need.

ricardobeat · 2023-10-04T06:52:21

Those are not helpful for database performance in a general sense. Last product I worked on used v4 uuids as primary keys without any issues - single master database, sorting by creation time only used in admin panels. You can index on created_at if needed. The IDs being sortable would be a non-feature.

Generating sortable IDs in very high volume, in a distributed architecture, is a problem very specific to systems like social networks, metrics collection etc. You won’t need that for your average e-commerce app or machine parts database.

epcoa · 2023-10-02T14:25:11

v4 being completely random has terrible properties even on a non distributed database. Probably better to use v7.

bhouston · 2023-10-02T13:07:09

https://en.wikipedia.org/wiki/Universally_unique_identifier#...

andersa · 2023-10-02T03:11:14

> We use sequential primary keys for efficient indexing, and UUID secondary keys for external use. The upcoming UUIDv7 standard offers the best of both worlds

Unless you consider users being able to extract the generation time from the id to be an issue, of course.

giancarlostoro · 2023-10-02T03:15:31

I've been seeing a few different vendors do this already. MongoDB's ObjectIds are inherently timestamps (so you can actually generate generic MongoDB IDs to query based on time). There's also Discord's Snowflakes as well. I'm sure there's loads of others. All it tells you is when something was generated, not much else. I do love how MongoDB has it stored in such a way that it is easy to query against. I wonder if any RDBMS' will allow you to query these timestamps as well.

andersa · 2023-10-02T03:24:24

There are definitely many cases where it isn't an issue since you were going to tell the user the time anyway (like sent time on a message)

afavour · 2023-10-02T04:08:58

Can’t agree with that logic. Unless it’s specifically documented leaking timestamp data is going to get totally forgotten. So when you add (e.g.) the ability to change the sent timestamp on a message you’re going to inadvertently leak when a timestamp has been changed. Could cause embarrassment in a lot of scenarios.

zo1 · 2023-10-02T07:54:18

This reminds me of that recent investigation regarding fudged data in those Harvard studies. The fields stored the "original" ID, solidifying their creation sequence, which differed to the displayed sequence - Implying that the fields were updated out-of-sequence, thereby they were tampered with.

RhodesianHunter · 2023-10-02T12:40:51

This sounds like a completely arbitrarily invented argument against a technology that's perfectly useful in many scenarios.