Show HN: 128-bit, roughly-ordered, URL-safe UUIDs (github.com/anthonynsimon)
205 points by amzans on Jan 22, 2021 | 137 comments



For those interested, here's (IMO) a good explanation on why completely random UUIDs hurt performance, and can lead to database fragmentation issues down the road (on MySQL):

https://www.percona.com/blog/2019/11/22/uuids-are-popular-bu...

It's part of what Timeflake aims to solve, but there are plenty of alternatives too, depending on your exact requirements.


This is not universally true. It depends on the database engine and how its indexes are implemented.


It's pretty much true for any B-tree based index. You could get around it by using a combined index on another timestamp value, but that's going to be a very large index. Or reduce it by partitioning by e.g. year-month. Even hash-based indices would be slightly impacted due to less locality.


Thanks, I updated the comment to mention that the link is in the context of MySQL's case.


Is this applicable to postgres?


It's applicable to the indexes that postgres uses. Postgres does not put rows in the heap in order of primary key so the issue is less serious than in innodb.


This isn't strictly speaking a UUID. It's just a 128-bit identifier. It's "compatible with UUID" only in the sense that both are 128 bits wide, but this scheme doesn't have the structure of a UUID.

To summarize the scheme, a 128-bit id is generated by concatenating:

  - 48 bits timestamp (milliseconds since UNIX epoch)
  - 80 bits of random data
These ids have the property that they're ordered by generation time (ids generated at the same millisecond are ordered randomly). The result can then be encoded in base 16 or base 62 to make it URL-compatible.
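
A minimal sketch of that construction in Python (illustrative only; the actual Timeflake library may use a different alphabet, padding, or API):

    import secrets, time

    BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    def rough_timeflake() -> str:
        ms = int(time.time() * 1000)       # 48-bit millisecond timestamp
        rand = secrets.randbits(80)        # 80 random bits
        n = (ms << 80) | rand              # timestamp in the most significant bits
        chars = []
        while n:
            n, r = divmod(n, 62)
            chars.append(BASE62[r])
        # pad to a fixed width so lexicographic order matches numeric order
        return "".join(reversed(chars)).rjust(22, "0")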

By the way, encoding timestamps into an id is not always desirable. It has some privacy implications, since now you can tell from the id when the resource was created. If this corresponds with user activity (e.g. a user filling out a form or uploading an image) having this information encoded in a URL is possibly undesirable.


You are talking about a v1 UUID.

A random number is a valid v4 UUID, provided you set the version and variant bits to the correct value. A random 128-bit number won’t have those bits correctly set (except by chance). Set them, and you’ve made a valid v4 UUID
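
For the curious, a rough Python sketch of what "set them" means (the version nibble and the two variant bits), for illustration:

    import secrets, uuid

    n = secrets.randbits(128)
    n = (n & ~(0xF << 76)) | (0x4 << 76)   # version nibble = 4
    n = (n & ~(0x3 << 62)) | (0x2 << 62)   # variant bits = 10 (RFC 4122)
    print(uuid.UUID(int=n).version)        # 4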


As you say, you need to set certain bits to special values to encode a v4 UUID, leaving less than 128 bits for the payload, which means a random 128 bit number is not a valid UUID except by chance.

Second, a v4 UUID should set all remaining bits randomly. Encoding a timestamp in some of the bits is not random. Sure, you could argue these could have been randomly generated by chance, so any such UUID is not invalid, strictly speaking. Still, a collection of these v4 UUIDs don't have the properties that you would expect of random UUIDs.

As specified, the scheme satisfies neither of these two requirements.


There’s always the expired v6 proposal: https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-...


bit perplexed why this never went anywhere. Seems like a solid win all around really given where uuids show up.


Exactly. And the v4 UUID will have 122 bits of randomness.


Just be careful if you use these as keys for a distributed system that is susceptible to hotspotting. GCP's Cloud Spanner [1] has this problem and has a good document explaining it.

As always - understand what is going on in your stack and apply design with that in mind. There are no universals!

[1] https://cloud.google.com/spanner/docs/schema-design#primary-...


I had to pause when they use the term "anti-pattern". It may be an anti-pattern for cloud spanner but it's not an anti-pattern in other distributed systems, for example cassandra.


The very next word after "anti-pattern" is if. "The following table definition that uses a timestamp-based primary key as the first key part is an anti-pattern if the table will see a high rate of insertion:"

Once your data gets big enough there's no easy one-size-fits-all. You gotta look at how you're going to be using the system.


Right. This spanner thing is new to me, but looking at it, I don't see why spanner doesn't/couldn't shard by modulo / checksum to reduce this risk. And I think other systems do


It shards it this way (by primary key ranges) to support efficient range scans.


Good point. I guess it would be cool if you could choose the sharding method per-table. I haven't put too much thought into this, but off the top of my head I can imagine a number of different methods suitable for different situations.


Have you considered using a 32 bit time value with second resolution?

That way you still get sort-ability, you get 16 bits more entropy, fare better on 32-bit systems, and still get 130 years of usage out of them.

Also you actively discourage people from trying funny stuff like synchronisation and conflict resolution via the time component. Because 1ms "that sounds kinda good enough :?", while 1s "that'll never work!!!"

Amendment:

Thinking about it: using a 16-bit timestamp prefix with second resolution and rollover should probably provide a good tradeoff between improving insertion performance (most expensive in terms of index touching), a 112-bit random value, and dense usage of the timestamp field.

It also allows you to do things like "forgetting" old values within an 18h timeframe.
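
A rough sketch of what that amendment describes (illustrative only):

    import secrets, time

    def rolling_id() -> int:
        prefix = int(time.time()) & 0xFFFF            # 16-bit seconds counter, wraps every ~18.2 hours
        return (prefix << 112) | secrets.randbits(112)  # 112 random bits below it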


> Because 1ms "that sounds kinda good enough :?"

Ugh. You wouldn't think so, but I caught a case of this in production, using millisecond timestamps as unique ids for browser analytics. The developer felt that even though two users could in rare cases click at exactly the same millisecond, their clocks weren't likely to be in perfect sync with each other, and that this "additional level" of randomness made a collision virtually impossible.


Oh dear god.

I suppose it can be counter-intuitive, but people really need to learn about timer precision... and, more importantly, statistics (basics, at least!).

I absolutely hate the 'combinatorics' part of statistics, but distributions, etc. seemed pretty learnable to me... but then I like continuous math, so...


That's not even statistics. That's simply the pigeon hole principle.

A second has 1000ms. If you have more than 1000 users doing things per second, there MUST be a collision.


Then you have https://en.m.wikipedia.org/wiki/Birthday_problem

Even with a small number of users per second (like 20) it becomes likely to happen often.
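
A quick worked example of that claim (the classic birthday calculation over 1000 millisecond slots):

    def p_collision(n: int, slots: int = 1000) -> float:
        p_unique = 1.0
        for i in range(n):
            p_unique *= (slots - i) / slots
        return 1 - p_unique

    print(p_collision(20))   # ~0.17, so roughly a 1-in-6 chance with only 20 clicks in a second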


Ah, but now you're discretizing! :D

... also, it Technically depends on your throughput (something something queue theory).


> even though two users could in rare cases click at exactly the same millisecond, their clocks weren't likely to be in perfect sync with each other, and that this "additional level" of randomness made a collision virtually impossible

I think that the clocks not being in sync doesn't increase randomness in any way, as two clients who didn't click at the same time might still get the same value.


And you'd be right.


I worked on an analytics system where we appended a random 3 digits to the millisecond-resolution timestamp for exactly this reason. It turns out, collisions are far, far more common than the math would suggest. Especially back in the aughts when browsers had a lot of individual quirks to deal with.


I personally use: https://github.com/tvondra/sequential-uuids

The time component resolution is configurable, and you can make it wrap around every so often, filling in gaps left by page splits, updates, deletes, etc.


It's just a bit annoying because it is a C extension, and that can't be used in most managed DBs in the cloud.


I had to re-write the time based function from that extension in PL/PGSQL earlier in the week for someone on my team who couldn't work with the C extension: https://gist.github.com/Tostino/ca104bff40e704b6db70b9af4926...

This can be used in RDS, etc.


Nice! This is kinda what I had in mind while writing that amendment!

Thanks for sharing :D


You can also pick your own epoch and use centiseconds since that epoch for the time component. It's what I do in [1].

[1]: https://docs.racket-lang.org/buid/index.html?q=buid#%28part....


Yeah, I think the regular overflow is actually an advantage.

A.) You practice having overflows regularly, so the issue is resolved now and not left to some poor schmuck World War XXVII survivor in the year 35,000.

B.) It uses every bit, just like a fully random ID, as the key space is filled more densely with each overflow.


This approach has become my preferred way of generating IDs for database rows. It is relatively space efficient and doesn't require a round trip to the database to generate an ID like auto increment does.

While working on a C# implementation for MySQL, we found that when the DB uses little-endian binary storage, the GUID must be byte-reversed to store in order:

https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCo...
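
As an illustration of the byte-order difference (Python's uuid module happens to expose the same mixed-endian field layout that .NET GUIDs use, so it's easy to see the swap concretely):

    import uuid

    u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
    print(u.bytes.hex())     # 00112233445566778899aabbccddeeff  (big-endian / RFC order)
    print(u.bytes_le.hex())  # 33221100554477668899aabbccddeeff  (first three fields byte-swapped)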


Does a random UUID work as well as, let's say, an ordered integer id or ordered uuid? I thought the ids should have a definite order (not necessarily sequential) so that a binary search can be performed on the primary index. Similar for b-trees.


Random uuids still have a well-defined ordering, so there's no issue looking up a single record by uuid.

The problems are that they’re scattered during reads and writes, and will require updating old pages.


Note that scattering can actually be beneficial if you have too much load for a single node when the data is hot. Of course you can solve this with weird sharding such as sharding on the least significant byte.


Seems like this point is not understood very well by many. Ordered keys can create hot spots on insert and read; this can be good or bad depending on the backend.


A hot-spot in one database index is collocated data in another. Sometimes you want uniform distribution of keys, sometimes you want clusters of keys because different databases are optimized based on assumptions about the underlying distribution of keys.


The classic goldilocks problem. You want your pages to be hot, but not too hot.


The biggest thing I can highlight after having used standard UUIDs and heavily using ULID values, including building my own library and testing compatibility and behavioural compliance of the library with other database variants than the big 4 (MySQL, MS SQL, SQLite and PostgreSQL) ... is that no PK format should ever be considered suitable for all use cases; it's all about trade-offs.

If you care about the order of something you shouldn’t expect any PK to handle that alone, even auto-incremented integers will start to fail you once you have to scale your write workload horizontally. If there’s a property of your data that you want to be able to order it by then you just need to make a column of the appropriate data type and store it. If I had a nickel for every time someone has looked at a schema and said something like “Do we really need this date for ordering? The table is already sorted by insert order...” I’d have a lot more nickels than I want.


One of the issues this is trying to resolve is that v4 UUIDs' wholly random nature makes them less than brilliant in the b-trees that SQL databases use for indices (I am actually not sure how much that really matters in practice for most projects).

However, Postgres has an alternate index implementation that uses a hash function. It is crash-safe now (previously wasn't) but you still can't use it to enforce a unique constraint. If you can use that, I suspect it might resolve the problem and would be easier than switching UUID generation algorithm.


It’s a shame v6 uuids didn’t become a thing[1]. It’s the entire reason they were invented.

1: https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-...


I'd worry a bit about the chance of collisions.

They correctly state you can generate 50 million of them per millisecond with only a one in a billion chance of a collision (per millisecond).

Unfortunately this means the chance of a collision in a year is 99.999999999998% which isn't great!

(I think they've given too many bits to the timestamp.)

Also, the uuid format it creates isn't a valid uuid which is a pity.

If you want to have "only" a one in a billion chance of a collision a year, you can only generate about 300 a millisecond.


Look for a different solution if you need 1 quintillion IDs per year. Noted.


I guess you assume building one timeflake per ms for one year. That's about 3e13 timeflakes. Ordinary UUIDs have collisions on the order of 1e18 hashes (cf. https://en.wikipedia.org/wiki/Universally_unique_identifier#...). So we are "only" 5 orders of magnitude away. For me the collision quality feels the same.


If you assume one per ms then timeflake and UUIDv1 both have a 0% chance of collision as they encode the timestamp.

You might go above this rate some of the time though and we might as well strive for perfection.

One of the nice things about UUIDs is that you can just merge data together willy-nilly without worrying about collisions - for example, if you get bought by Google. You can't really do this with timeflake.


>> If you assume one per ms then timeflake and UUIDv1 both have a 0% chance of collision as they encode the timestamp.

They can be generated anywhere at any time. If two machines are generating even 1 per hour, they will have collisions if their clocks are synchronized. The random bits are important.


You'd run into a bigger problem than collisions a lot sooner (IMHO): time ordering. "ms" precision is very misleading, because keeping clocks really in sync while processing this much data is a real challenge.

Sure, maybe it doesn't matter that one server/emitter and thus a bunch of IDs are out of order, but if this is really just for helping indexes, then a simple 0.1 second precision would make more sense. (Or even simply truncating the timestamp to however many bits to get ~2-30 sec precision, and using a data store with LevelDB-like staged compaction. Which is virtually a requirement at this scale.)


> Unfortunately this means the chance of a collision in a year is 99.999999999998% which isn't great!

How was this computed? My intuition begs to differ.


Probability of no collision in a single millisecond: p_1 = 999999999/1000000000

Milliseconds in a year: t = 1000 * 60 * 60 * 24 * 365

Probability of no collision happening for every millisecond of a year: p_2 = p_1^t

Probability of a collision happening for at least one millisecond of a year: p_3 = 1 - p_2

https://www.wolframalpha.com/input/?i=1+-+%28999999999%2F100...
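
The same computation in Python, for anyone who wants to reproduce it:

    ms_per_year = 1000 * 60 * 60 * 24 * 365
    p_ms = 1e-9                            # quoted collision chance per millisecond at ~50M IDs/ms
    print(1 - (1 - p_ms) ** ms_per_year)   # ~0.99999999999998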


There are 31,536,000,000 milliseconds in a year. If you do (1-1E-9)^3.1536E10, you get a 2E-14 chance of no collision in a year.


I think the presumption is you will not be generating 50 million of them each millisecond.


Which is a pretty safe assumption, considering just generating those 128 bits 50 million times a millisecond results in over 20 million TB of data in a year. I think it's safe to say this is not a common scenario.


There are close to 32 billion milliseconds in a year. If the chance of a collision happening within a millisecond is 1 in a billion then my intuition is that the probability of a collision happening at least once a year is quite high, indeed.

It's like a die: you have a 1 in 6 chance of getting a 6 every time you roll it. Now, if you roll the die 100 times, the chance that you'll get a 6 at least once is pretty high.


Query : For people who recommend UUIDs as a unique identifier in a distributed system, does the above satisfy the need?

In the SO answer[0] uuid v4 is being recommended but that is the most random version of uuid. Isn't that going to hurt the b-tree performance by having a very high fanout? How does it help the primary index?

[0] : https://stackoverflow.com/questions/33274291/postgresql-uuid...

EDIT : Apologies. I meant low fanout. What I meant was that you need something ordered to have a better organised b-tree node. If all values are random, each node will have a lot of children, increasing the search space. It might turn into a linked list in the worst-case scenario.


Excuse my possible ignorance, but how is more entropy in a primary key going to hurt index performance? The fanout in a b-tree is completely in control of the database, and not dependent at all on properties of the data — apart, of course, from the size of the data, since typically the node fanout is precisely the number of fields that fit in one disk page. Randomness does not influence the size of a UUID on disk.


With random keys updates are going to touch way more pages than with sequential ones. It increases IO due to more dirty buffers and journaling.


Point taken. Quite true.



We moved away from UUIDs for different reasons. It is hard to work with UUIDs when multiple entities have them. You do not know if you are looking at a user or a group if both are just UUIDs. We created our own IDs with prefixes, using base58 encoding for the random part, for example: user-AhnPpn7CbbwPykmux11LJ39Jk. It has the same amount of randomness in it and we can understand what we are looking at.
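
For anyone curious, a rough Python sketch of that kind of prefixed ID (the prefix, alphabet and length here are assumptions, not their exact scheme):

    import secrets

    BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

    def prefixed_id(prefix: str, bits: int = 128) -> str:
        n = secrets.randbits(bits)
        chars = []
        while n:
            n, r = divmod(n, 58)
            chars.append(BASE58[r])
        return f"{prefix}-{''.join(reversed(chars)) or BASE58[0]}"

    print(prefixed_id("user"))   # e.g. 'user-...' with ~22 base58 characters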


Looks like it's almost twice the size of a UUID, at 30 bytes in this case with a short prefix. I'm assuming you're storing this as an ASCII string? That's a lot of extra bytes for some convenience when viewing logs or raw data, etc.

UUIDs stored properly in binary form are only 16 bytes, which can still be a concern compared to an 8- or 4-byte integer when you have a lot of them all over the place in a large db (larger indexes, more memory use).


You can store UUIDs as bytes if you wish, but when you are looking at one in a human-readable form it will still be a UUID. I think storing UUIDs as text in 2021 is not that hard.

>> UUIDs stored properly in binary form are only 16 bytes

Depending on what your use case is.


>> You can store UUIDs as bytes if you wish, but when you are looking at one in a human-readable form it will still be a UUID. I think storing UUIDs as text in 2021 is not that hard.

You're not going to store a text type prefix + UUID in a 16-byte UUID column; you will need a variable-length text column. All your DB tools will convert the UUID to human-readable form and show the column name, so there's no need to store the type in the data for every value.

If you're looking at network traffic or debugging, the id will hopefully be keyed (JSON, query string, protobuf, etc). If you're looking at plain text logs with no schema, sure, emit the type-UUID for readability, in your logging code...

>> Depending on what your use case is.

Only use case I can think of is unstructured logging, or maybe using a file name as a key? I guess if your db is just key-value with string keys maybe (basically a file system).


UUID v5 is namespaced.


But given a v3 or v5 UUID, you can’t tell what namespace it belongs to. Both are one way functions from (namespace,name) -> uuid. The gp is asking for a way to tell user uuid apart from group uuid just by looking at the uuid.

That is actually possible with the v2 uuids used with DCE. But, since just about nobody uses DCE any more, just about nobody uses v2 uuids either


What is DCE?



Why would you ever be in a situation where you have a UUID but don't know what it refers to?


Because somebody (not necessarily you) wrote the following code:

printf("%s - %s", id0, id1)

You can even do this:

printf("user-id %s group-id %s", group-id, user-id)

Do you think you will be able to quickly debug what is wrong in a high severity situation when a customer facing system is down?

I do not, so we use eids (external ids) that are easy to identify.


It happened to me a few times:

- We got a "400 bad request" because we received a UUID for something that wasn't even supposed to be a UUID. I had to fish around for what this UUID was actually referring to, to track down the bug. It was an integration with a partner, and some faulty branching logic. - We ask our customer support team to add IDs to engineering support request (customer ID, order ID, etc etc) so that we don't lose time. Sometimes they make a mistake (or it's not always clear what they're supposed to do) and we get a user ID that doesn't refer to any user. Bit of fishing around then again - Wild logs: "could not find entity f5fac1d0-5d0f-11eb-9ab7-3e22fb071270".

I'm sure I could find more example but you get the gist of it


* You stumble upon them in log files or error messages.

* Somebody asks you why some API call is not working.

* You send around Excel files to higher management (for example in incidents).


Now I'm even more baffled. What kind of stupid logger logs a uuid without a clue as to what kind of object it identifies? Why would you need to communicate uuids to higher management?

This makes me think of Douglas Adams. You know the answer is 42 but you forgot what the question was. I really can't imagine how this could ever happen in a business. "Yes, sir, there was a problem with 1087." "1087 what?" "I don't know, but it was number 1087."


> What kind of stupid logger logs a uuid without a clue as to what kind of object it identifies?

The code cannot know what kind of object a uuid identifies; it can only assume that it identifies what it expects. So you might see:

"Unknown person 550e8400-e29b-11d4-a716-446655440000" in the log files. Much more helpful for debugging, however, is: "Unknown person order-550e8400-e29b-11d4-a716-446655440000". Then you know somebody accidentally put an order id where a person id belongs.

This is an easy example, in the real world you have persons, groups, representatives, accounts, acls, etc, which are easily confused. If you just have a uuid, you will have a hard time figuring out what it actually is.


This is why you have column headers?

If you just want it to look that way in a text log, then sure, prepend the type to the id, wasting some space there, but to use it throughout your db eats up tons of space from both text encoding the id and repeating the column name in the data over and over.


> This is why you have column headers?

I'm not sure I can follow. You get a uuid from somewhere, probably external code that you cannot even look at. The uuid comes stringly-typed, let's say as JSON. Then you have no way to know what type it is; you can only assume it is the correct type that you expect.

Just putting "550e8400-e29b-11d4-a716-446655440000" into a person colum does not make it a person id.

If you get instead "order-550e8400-e29b-11d4-a716-446655440000" as a typed-uuid, you know for sure that this is not a person id. It is an order id. Somebody made a mistake somewhere. You also know that you probably have to look in the order DB to debug what it actually refers to.

> but to use it throughout your db eats up tons of space from both text encoding the id

You don't need to put your uuid into the db in this format. You can add/remove the type before/after accessing the db.


Just putting "order-550e8400-e29b-11d4-a716-446655440000" does not make it an order id either, thats what foreign keys are for, mistakes can be made either way.

If you're sending an id via JSON, you should be using JSON to encode the type, either:

{ order:"550e8400-e29b-11d4-a716-446655440000" }

Or if for some reason the value could contain multiple types:

{ thing: { t:"order", id:"550e8400-e29b-11d4-a716-446655440000" } }

Even then just use:

{ thing: { order :"550e8400-e29b-11d4-a716-446655440000" } }

This is redundant and unnecessary, adding overhead:

{ order: "order-550e8400-e29b-11d4-a716-446655440000" }

Storing that in your db as JSON is even worse (Mongo, JSONB); storing it as varchar in a normal column is not much better.

If you're not storing the ID with a type prefix in the db, then that's not the form of your ids; that's just the UI/encoding for them in your app, which eliminates half the supposed advantage, which was easy debugging, say, in db query tools. It also means you are parsing/deparsing "order-550e8400-e29b-11d4-a716-446655440000" somewhere in your db mapper code instead of just using your JSON or query string parser or protobuf or whatever. Why?


> Just putting "order-550e8400-e29b-11d4-a716-446655440000" does not make it an order id either, thats what foreign keys are for, mistakes can be made either way.

I'm not talking about the DB level here, I'm talking about what goes over the wire or ends up in log files. And if you are handing out the uuids to your consumers, you can be reasonably sure that you only get back the uuids you handed out. So if "order-550e8400-e29b-11d4-a716-446655440000" is not an order, it's an error on your side and not on your client's side.

> If you're sending an id via JSON, you should be using JSON to encode the type, either:

Yes and you should also not make any mistakes.


So it all comes down to less mistakes made with:

order:"order-550e8400-e29b-11d4-a716-446655440000"

vs

order:"550e8400-e29b-11d4-a716-446655440000"

Either one will error in a sane system that checks to make sure the UUID is actually in the db, and either one can be mishandled by sending the wrong id. You just have some extra redundant bytes double-confirming the key type, while the actual ID will still have to be verified.

The nice thing about UUIDs over, say, ints is there should be no overlap, so it should be easy to confirm an error like that. Double-encoding the key from external APIs, sure, I guess gives a very slight reduction in the time taken to verify an error; for logs, aren't you logging the JSON with the key already?

Of course, this whole discussion was about time-ordered UUIDs, which are mostly useful just for DBs. If we are just talking about how IDs are formatted for external use in the app, well geez, I thought that was what query strings and JSON were for, but ok.


> Either one will error in a sane system that checks to make sure the UUID is actually in the db, and either one can be mishandled by sending the wrong id

Yes of course both will error. What is easier to debug:

order:"person-550e8400-e29b-11d4-a716-446655440000"

or

order:"550e8400-e29b-11d4-a716-446655440000"

In which case could you locate 550e8400-e29b-11d4-a716-446655440000 in one of your DBs quicker in an emergency situation?


Yes, redundantly encoding the type everywhere in your id will be easier for a specific kind of bug, yet you're adding that overhead all the time, in both your db mapper and extra bytes everywhere.

Or: we are getting reports of an invalid order ID; let me run a scan across dbs and see if it's valid anywhere. Oh hey, it shows up in persons, somebody mixed them up somewhere; or no, it's just invalid, somebody messed up in some other way.

If it happens enough to justify encoding it everywhere in IDs, you could just have a dev tool that looks for it across your dbs with a copy-paste: a little longer to scan, vs. overhead always in the running app, so you can quickly determine that one specific bug's cause. Either way it's an invalid ID, and being a UUID it won't collide and cause worse problems with overlapping person and order IDs, as would occur with integers.

I just don't see the value but hey at least you're not storing it that way in the db right?


> Yes, redundantly encoding the type everywhere in your id will be easier for a specific kind of bug, yet you're adding that overhead all the time, in both your db mapper and extra bytes everywhere.

I can save the extra bytes by just removing the '-' from the uuids and using a more compact encoding (base whatever). And at the same time I'll save countless debugging hours, because everybody is aligned on what kind of ids we are talking about.

> Or: we are getting reports of an invalid order ID; let me run a scan across dbs and see if it's valid anywhere. Oh hey, it shows up in persons, somebody mixed them up somewhere; or no, it's just invalid, somebody messed up in some other way.

Because you have access to all the DBs and you can easily write a query over all your dbs. That is naturally assumed.


>uuids and using a more compact encoding. And at the same time I'll save countless debugging hours, because everybody is aligned on what kind of ids we are talking about.

Will never be as compact as 16-byte binary UUIDs. Will never be as compact as the same compact text encoding without the type prefix. Will always have the runtime overhead of parsing the type and id and rendering them back from the db.

Again, if that one specific bug is so common as to justify extra id and db mapping overhead ALL THE TIME in the application, maybe a simple devops tool where you can paste an id and verify its type would be a better use of computing resources, removing the constant runtime overhead.

Seriously, I have seen this type of bug once in a great while, and it sucks with integer keys because they can easily overlap, causing pretty nasty outcomes; UUIDs eliminate that. If I had to dig further I might do a search to see if it's actually a valid UUID somewhere, but seriously, someone is just sending the wrong ID; it should be pretty quick to track down without the prefix since it's globally unique.


>> but to use it throughout your db eats up tons of space

I think human resources and outages cost a lot more than disk space these days. If I can trade disk space for either of those I will do it in a blink of an eye.


Not just disk space: a larger index size means less fits in caches and memory, more I/O, etc.

These are keys which are typically used constantly by an app in the db with joins etc. They are some of the hottest data typically and you want to avoid cache misses on them if possible.


I can think of a lot of reasons why it might happen.

Maybe you have a process that accepts multiple types of objects. Maybe you think you are passing in the correct type of object but you are not. Maybe the person reporting the error to you omitted information. Maybe the person who wrote the application didn't log enough context. Yes, all of these situations are not ideal, but if the process was ideal you wouldn't be debugging it in the first place.

> I really can't imagine how this could ever happen in a business. "Yes, sir, there was a problem with 1087." "1087 what?" "I don't know, but it was number 1087."

This happens all the time in many businesses. Users rarely generate perfect error reports, and it's not uncommon for intermediaries to incorrectly communicate details either. What developer hasn't seen an error report that consists of a vague screenshot with "doesn't work, got this error" as the description?


> Users rarely generate perfect error reports, and it's not uncommon for intermediaries to incorrectly communicate details either. What developer hasn't seen an error report that consists of a screenshot with "doesn't work, got this error" as the description?

And the solution is to have them repeat a UUID to you? I don't think so...


Something can be not a solution to a problem, but contribute to making your life easier. This is that.

There are certainly applications that have type-ambiguous IDs which work just fine too. Not all engineers make the same decisions; that's ok.


I am with you. This makes no sense to me. But I also have wrapped userdata or account data in a generated hash to quickly have access to the underlying account info / expiration info etc. I think it is easier to justify why to implement vs. why not to implement.


> Isn't that going to hurt the b-tree performance by having a very high fanout?

High fanout helps minimise the depth of the tree and therefore helps performance as it minimises the number of nodes to check.


Sorry, I meant low fanout, as in a b-tree node will end up having a lot of single-valued children, ultimately degrading to a linked list.


I think random binary keys are pretty good for B trees because they also lead to shallow depth and a balanced tree instead of potentially degrading to a linked list.


There’s a spec called ULID that’s pretty much this with default base32 encoding

https://github.com/ulid/spec

I’ve also worked on a UUID-ULID bridge for Go

https://github.com/sudhirj/uulid.go

And seeing as this is just 128 bits it’s quite easy to move seamlessly between formats and representations.

I’ve found this concept especially useful in nosql stores like DynamoDB, where using a ULID primary key makes objects time sortable automatically. It’s also quite easy to query for items by zeroing out the random component and setting only the time stamp bytes.
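
A sketch of that query trick, assuming the standard ULID layout (48-bit time + 80-bit randomness, Crockford base32); the key name and the DynamoDB condition in the comment are illustrative:

    import time

    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

    def ulid_bound(ms: int, randomness: int) -> str:
        n = (ms << 80) | randomness                      # 48-bit time + 80-bit random
        return "".join(CROCKFORD[(n >> s) & 0x1F] for s in range(125, -1, -5))

    now = int(time.time() * 1000)
    low  = ulid_bound(now - 3_600_000, 0)                # an hour ago, all-zero randomness
    high = ulid_bound(now, (1 << 80) - 1)                # now, all-ones randomness
    # then e.g. a DynamoDB sort-key condition: Key("id").between(low, high)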


ULID has a serious security flaw though. In order to try to improve performance / increase prefix similarity, there is a mode where it reuses the random value, and just increments it, for each new ULID.

With purely random UUID, you can be pretty sure that nobody can guess them, so they're essentially identifier and access token in one.

However once you can easily predict the next uuid from the previous one (as by incrementing the random part by one), the access token property vanishes.

At least this proposal doesn't make the same mistake, however 80 bits isn't THAT much entropy. (Remember every bit counts with exponential growth, and 48 bits less means 281474976710656 times easier to guess.)

So it boils down to. Would you be ok to secure all your data with 10 character one time passwords, which might not be guessed all at once, but where guessing SOME is sufficient to get owned?

Death by a thousand paper-cuts.


Thanks for the comment!

Yes, I'd like to clarify Timeflake (and many alternatives suggested) is NOT meant to be used for security.

There's an important note on the readme about this:

> Since the timestamp part is predictable, the search space within any given millisecond is 2^80 random numbers, which is meant to avoid collisions, not to secure or hide information. You should not be using timeflakes for password-reset tokens, API keys or for anything which is security sensitive. There are better libraries which are meant for this use case (for example, the standard library's secrets module).


You have to ask yourself the question though: if "URL-safety" is a main feature, what URLs / user data are NOT security relevant?

Leaking users' private pictures / documents / whatever will also kill a company.

I guess one could maybe circumvent this, by using a hash of the ID, but then you have to store that somewhere too, and you're back to square one.

Nevertheless, I like that you fixed the main flaw with ULID. Also you provide way better insight into the tradeoffs, so kudos!


> using a hash of the ID

Hashing the IDs won't solve their lack of entropy. Crude example: If you hash your pincode I still have only 10^4 values to try.

The easiest way to fix this is to add an access token column that is cryptographically random and use both the ID and the token in the URL.

If you trust the 80 bits already in the ID, the token only needs to be 80 bits for a total of 160 bits of entropy. But if you do that you have to make sure that a missing ID and an invalid token are handled identically from the attacker's perspective (same reply, same timing).
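
A minimal sketch of that pattern (the names are made up; the point is a cryptographically random token plus a constant-time comparison):

    import hmac, secrets

    def new_token() -> str:
        return secrets.token_urlsafe(16)     # ~128 bits of entropy, stored next to the row's ID

    def token_matches(supplied: str, stored: str) -> bool:
        # constant-time comparison so invalid tokens can't be distinguished by timing
        return hmac.compare_digest(supplied.encode(), stored.encode())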


> Hashing the IDs won't solve their lack of entropy.

Yes and no. It would still significantly enlarge the search space, as an attacker wouldn't get a point of reference to latch on to.


Which implementation are you referring to? The Go package uses crypto.random and generation blocks if the system can’t provide enough randomness. It’s possible to override this with custom implementations, but either ways it’s no less secure than a UUID unless the implementation is horrible.

The spec doesn’t specify a source of randomness - an implementation that uses https://xkcd.com/221/ will of course not be a very good one.


If you’re referring to the monotonic counter implementation, then yes, that’s a documented choice you can make if you explicitly want strictly increasing ULIDs on the same millisecond, but given that you can’t ensure this across servers anyway, it’s more an edge case.


Yes, monotonic counters, are ULID's "billion dollar mistake".

They're not an edge case, in that the standard displays them quite prominently. They're actually the only code examples given.

The standard should deprecate them, because people WILL use them "wrong".


Dunno about that. The implementations I’ve seen so far default to /dev/random and make you jump through hoops to get the monotonic counter, so if you manage to enable it they should assume it’s what you want. I’ve actually used them effectively in a couple of NoSQL problems where I didn’t need cryptographic security, just distributed timestamps - I was creating two sequential events (create and update), and they had to always order one after another, even though I created them in the same loop and millisecond.


I've been using ULID on a couple projects. In one case, we're trying to convert an existing project to use them, and... it's a bit of a pain, but it's not ULID's fault as much as just... changing primary keys all over the place is time consuming and not simple.

The biggest pushback (and it's not a terribly big one, imo) I've read on ULID is that the time portion is sold as 'sortable' but is 'only ms-level resolution'. Apparently the critics I was initially reading on ULID needed sub millisecond accuracy, and somehow thought this made ULIDs bad. Seemed a non-useful criticism, as needing that level of time granularity isn't a day to day issue for most people, but might be something to be aware of if the idea of overloading one field with multiple uses is a requirement you have.


If you’re an edge case you’re an edge case. It’s frustrating when people start assuming their use case is normal without any considerations of the wider world.

Sub-millisecond precision sorting of high-speed database inserts is hardly a normal requirement. At that point even the size of your transaction can matter (depending on the database) and you have to start asking questions like: do I want ordering by the time the transaction starts or the time it ends? And if you're loading existing data from other sources and care about date ordering, you should just be using a dedicated column with the appropriate database time/datetime/timestamp field anyway.


I adopted ULID for a project after seeing it recently mentioned in another HN comments thread[0] and it's very useful. The project uses Azure Table Storage, which encourages querying by ranges of record IDs for efficiency, instead of filtering datasets by arbitrary record properties.

My project revolves around processing windows of time in timestamp-ordered datasets, and ULID works well as a way to generate record IDs in a distributed system that naturally sort in timestamp order. With ULID, all of my major queries can operate on ranges of record IDs, as Azure Table Storage encourages.

[0] https://news.ycombinator.com/item?id=25650907


This is super useful and exactly what I need for a current project. Thank you for that hint!


This is assuming that the time is linearly increasing, but what if you have a service running on a fast moving object between planets or in alternative dimension where time is not linear or not monotonic?


Can someone expand on this please.

> They should be random, but roughly-ordered over time so that your MySQL/Postgres indices stay fast and efficient as the dataset grows.

I'm currently facing an issue where the chosen uuid format was v4. From my understanding this is the least preferable because it's true random, which causes hell for mysql indices.

What I don't quite understand, however, is why sorted random is better if I'm using an index in both cases. I'd like to know what I should be doing and if there is a way to move/migrate if possible.


As a rule of thumb, a table that grows over time probably has a "creation time" field, and a query that touches one row probably touches other rows created around the same time (like a weekly report). Since rows are probably stored on disk in chronological order, you can index the "creation time" field and do a good job, but if your primary key is already roughly chronological you may be able to get away with one fewer index and keep inserts/updates cheaper.


In MySQL and some other database systems, primary keys are clustered indexes. A clustered index defines the order in which data is physically stored on disk.

If you have an auto-incrementing primary key like most sane people do, new rows will be sorted to the very end of the table. Inserting data to this table essentially becomes an append-only operation, which is very fast. On the other hand, inserting random values all over the place is like inserting data in the middle of a file and pushing back everything that comes afterward, over and over again. Even on modern SSD's, this is much slower than simply appending data to the end of a file.

For best performance, your primary keys should be as close as possible to auto-incrementing integers. In other words, each new value should be greater than all the values that currently exist in the table. Timestamps are quite useful for this purpose.


This is an InnoDB thing for MySQL. (The old MyISAM was a simple append only thing with separate index file.)


> What I don't quite udnerstand however, is why is sorted random better if I'm using an index in both cases?

Data locality. The default index in most RDBMS is a btree which is ordered and clustered so when you look up items in an index if they sort similarly they’ll be in the same index page(s) and thus cheaper to retrieve. For PKs that means when you go through your records it’s cheap to get siblings.

When the ids are randomly distributed it’s way more expensive to add them because they get added at random positions (writes are scattered which is bad) and when you retrieve them you have to jump through the index at random loading a bunch of pages to get just one or two records from it.

Things are even worse if the table itself is clustered on the PK, because now your scattered index is also messing with your actual data reads.


The index's advantages would be the same when you think of searching. But for inserting (and deleting) it is much faster if they are already ordered. You (or well, the DB engine) don't need to traverse any trees or anything. If they are 100% ordered, inserting new ones into that index is just adding something at the end of a list, which is much faster.


Datomic uses a similar approach by default. They called it squuid: https://docs.datomic.com/on-prem/identity.html#squuids


Just to see the big picture, what are the main differences between this and Snowflake, or ULID (https://github.com/ulid/spec)?


I like ordered UUIDs, but if you care about 100% strict and accurate ordering you need to be careful about generating them in app code vs. on the DB side. With the former + using multiple app servers, server clock drift can become an issue. Which is why I still generate my ordered UUIDs on my Postgres server. It means writing some funky PL/pgSQL (or however you decide to do it... I think you can run Python too), but it gets the job done.

I wish ordered UUIDs were built-in to more DBs natively. I remember seeing UUIDv6 being proposed, but I don't know if/when that will be standardised.


How does one generate ordered UUIDs right on the PostgreSQL server?


Use this: https://github.com/tvondra/sequential-uuids

If you cannot install extensions, I just wrote a PL/PGSQL implementation for the time-based generator I could share.


Please do, I have been looking for that. Posting it in the repo would help a few people, I am guessing.



There's probably a nicer way (or more performant at least), but I created a function that uses the built-in `uuid_generate_v1mc` and then reverses the order of the time component so that high values come first.
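
The same idea sketched in Python rather than PL/pgSQL (illustrative only; it drops the version/variant markers, unlike the draft UUIDv6 layout):

    import uuid

    def time_ordered(u: uuid.UUID) -> uuid.UUID:
        ts = u.time                                # the 60-bit v1 timestamp, reassembled
        low64 = u.int & ((1 << 64) - 1)            # clock_seq + node, kept as-is
        return uuid.UUID(int=(ts << 68) | low64)   # timestamp now in the most significant bits

    print(time_ordered(uuid.uuid1()))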


UUIDs in URLs reminds me of dakboard's use of UUIDs for accessing their user customizable displays. There isn't any auth, just guess the right $uuid in dakboard.com/screen/uuid/$uuid and you've got access to whatever info people put on their board (lots of calendars). You can find several (mostly public churches/community/sports) just by googling the base url.


IMHO for small projects

where entities are generated in response to a manual action of a user (i.e. relatively rare - the chance that 2 users on different machines do it in exactly the same moment is very low)

or everything is done in one thread on one machine

a nanosecond/microsecond timestamp makes a practical unique identifier.
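
In Python that's just the following (and, as described above, it only works while generation really is rare and effectively single-writer):

    import time

    record_id = time.time_ns()   # e.g. 1611300000123456789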


How does this compare to MongoDB's ObjectId?

They have roughly the same structure of timestamp+random but an additional incrementing counter-field at the end, sec-resolution rather than ms, and some drivers use something host-dependent in the random component. 96bit total.


> Roughly ordered (K-sortable), incremental timestamp in most significant bits enables faster indexing and less fragmentation on database indices (vs UUID v1/v4).

What does this mean for database performance in simple CRUD scenarios?


Using an ordered ID prevents index fragmentation; with fragmentation, inserts will be slower and ultimately so will reads. In a database like MS SQL Server this is especially true since data is physically stored on disk based on the clustering key, which is the PK by default.


Not quite sure why the requirement of being compatible with the UUID 128-bit representation. I personally try to use KSUID (160-bit) when possible and can't see any downside so far.


Being 128-bits allows you to utilize the same database/object types as UUID.

That's the main benefit IMO. I couldn't stick a KSUID into my Postgres UUID column; I'd have to write a new data type to store that.


Base62? That seems strictly worse than base64 to me. Why that choice?


The two (or three if you retain padding) additional characters are symbols and can cause issues in many places. Base64-url and other variants exist for this reason, but there is no single base64 that works everywhere.


There's no "single base62 that works everywhere" either. So while there's valid criticism to be had of base64, I still don't see how this choice is better.


62 is the number of letters and digits in ASCII, so it's the most compact encoding that doesn't need special punctuation characters (which aren't always safe to use).


How many situations don't let you use _ and -? URLs certainly do.


This is really nice, anyone aware of a NodeJs alternative?


https://github.com/ulid/spec isn't timeflake, but is a good spec, and there are implementations in multiple languages. there are 2 listed for js.

https://github.com/aarondcohen/id128 and https://github.com/ulid/javascript



I filed an issue on the project, "Point out the privacy risks of timestamps in object IDs" [0].

----

Non-random object IDs have privacy issues. Developers need to learn about these before they choose to use non-random IDs over simple random IDs. How about adding a PRIVACY section to the readme?

Timeflake encodes the precise time that a user created an object. Timestamps in IDs can reveal:

- User timezone.

- Geographic location: If the client software creates multiple associated IDs at the same time (like an article and embedded media), then the differences in timestamps of the IDs can reveal the latency of the client's network connection to the server. This reveals user geographic location. This can also happen if the client creates a single ID and the server adds an additional timestamp to the object.

- User identity (de-anonymizing):

- Most Android apps include Google's libraries for working with push notifications. And some iOS apps that use Google Cloud services also load the libraries. These Google libraries automatically load Google Analytics [1] which records the names of every screen the users view in the app, and sends them to Google. So Google knows that userN switched from screen "New Post" to screen "Published Post" at time K.

- Some ISPs record and sell user behavior data. For example, SAP knows [2] that userN made a request to appM's API at time K.

- Even if the posting app does not share its user behavior data with third-parties, the user could post and then immediately switch to an app that does share user behavior data. This provides data points like "userN stopped using an app that does not record analytics at time K".

- Operating Systems (Android [3], Windows [4], macOS [5]) send user behavior data to their respective companies.

- Browsers and Browser Extensions send user behavior data to many companies. Data points like "userN visited a URL at example.com at time K" can end up in many databases and sold.

- Posting times combined with traffic analysis can perfectly de-anonymize users.

- How long the user took to write the post. This can happen if the app creates the ID when the user starts editing the post and also shares a timestamp of the publication or save time.

- Whether or not the user edited the post after posting it. This can happen if the post's displayed time doesn't match the timestamp in the ID.

- Whether or not the user prepared the post in advance and set it to post automatically. If the timestamp is very close to a round-numbered time like 21:00:00, it was likely posted automatically. If the posting platform does not provide such functionality, then the user must be using some third-party software or custom software to do it. This information can help de-anonymize the user.

----

[0] https://github.com/anthonynsimon/timeflake/issues/3

[1] https://firebase.google.com/docs/cloud-messaging/android/cli...

[2] https://www.eff.org/deeplinks/2017/03/five-creepy-things-you...

[3] https://digitalcontentnext.org/wp-content/uploads/2018/08/DC...

[4] https://www.eff.org/deeplinks/2016/08/windows-10-microsoft-b...

[5] https://www.eff.org/deeplinks/2020/11/macos-leaks-applicatio...



