Things you should never do: use integers as IDs. This is literally a solved problem, and the solution is UUIDs, which were invented for exactly this job.
The current domain model I'm working with uses a global integer sequence to key all entities. This eliminates the class of bugs where keys of different entity types overlap and silently mask errors. It also enables powerful domain modeling techniques in which the identities of things are themselves first-class entities, referable as a common class of thing. This is a little mind-bending at first, but it enables some really powerful abstractions that would otherwise be infeasible if we had to switch over all possible key types.
The benefits of an integer key vs a guid key are quite profound when you get into the academics of information theory. They provide implicit creation order of things, whereas GUIDs cannot. They are deterministic in that there will never be a collision. Their range can be made effectively infinite. Integers are perfectly efficient in principle, even if a particular computer representation isn't - BigInteger types scale gracefully.
> The benefits of an integer key vs a guid key are quite profound when you get into the academics of information theory. They provide implicit creation order of things, whereas GUIDs cannot.
We had this requirement, but a semi-UUID solution is desirable in a distributed setting. Ref prior art by Instagram engineering: https://archive.is/Dydln
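For reference, the scheme in that post packs a 64-bit ID out of a millisecond timestamp, a logical shard number, and a per-shard sequence, so IDs are roughly creation-ordered yet mintable on many shards without coordination. A minimal Python sketch of that shape (the bit widths follow the post; the epoch constant and function name are illustrative, and the real thing runs inside database functions):

```python
import time

EPOCH_MS = 1314220021721  # an arbitrary custom epoch (illustrative value)

def make_id(shard_id, seq, now_ms=None):
    """Pack a roughly time-ordered 64-bit ID:
    41 bits of milliseconds since the epoch, 13 bits of
    logical shard ID, 10 bits of a per-shard sequence number."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ms = now_ms - EPOCH_MS
    return (ms << 23) | ((shard_id & 0x1FFF) << 10) | (seq & 0x3FF)

a = make_id(shard_id=5, seq=1, now_ms=EPOCH_MS + 1000)
b = make_id(shard_id=900, seq=2, now_ms=EPOCH_MS + 2000)
assert a < b  # later creation sorts later, regardless of shard
```

Because the timestamp occupies the high bits, sorting by ID approximates sorting by creation time, which is the property plain UUIDv4 gives up.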
(from your linked comment)
> If you are worried about security (i.e. someone hitting sequential keys in your URLs), then this is arguably an application problem. You should probably generate an additional column that stores some more obfuscated representation of the primary key and index it separately.
Generating sequential integers doesn't scale for large data volumes. You also need a central place to issue them, plus a failover location, which adds complexity. Multiple machines can each generate their own guids independently, in a very simple manner.
And the odds of a guid collision are extremely low, which for most applications is acceptable. Having worked with petabytes of data, guid performance isn't really an issue; there are more important factors to worry about.
I would recommend reviewing my prior comments on this, as I address the concerns of multiple nodes needing to be able to independently produce identities without collisions or coordination.
If you know beforehand the maximum number of participants in your system, you can divide the keyspace across that quantity. If you are using BigInteger or equivalent, you have an infinite number of these things to work with, so it doesn't really matter if you wind up skipping trillions of identities at first. The original article even advocates for this as its first point, but without as much practical justification.
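One standard way to divide the keyspace like that is by residue class: with N known participants, node k hands out k, k+N, k+2N, and so on, so no two nodes can ever issue the same value. A toy sketch (class and method names are mine):

```python
class PartitionedSequence:
    """Each of num_nodes known participants draws from its own residue
    class: node k issues k, k+N, k+2N, ...  After the one-time assignment
    of node IDs, no further coordination is needed."""

    def __init__(self, node_id, num_nodes):
        assert 0 <= node_id < num_nodes
        self.next_value = node_id
        self.step = num_nodes

    def allocate(self):
        value = self.next_value
        self.next_value += self.step
        return value

a = PartitionedSequence(node_id=0, num_nodes=4)  # issues 0, 4, 8, ...
b = PartitionedSequence(node_id=1, num_nodes=4)  # issues 1, 5, 9, ...
```

The cost is exactly the sparseness described above: node 0 "skips" three quarters of the integers, which is harmless if the range is unbounded.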
If you're going to use a key space in the trillions, and partitioned and sparsely populated rather than sequential... isn't that just reinventing what the UUID already is?
A UUID is just a 128-bit integer, with creation algorithms designed to partition that space by things that already have enough entropy to need no further synchronization.
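To make the "just a 128-bit integer" point concrete, Python's standard `uuid` module exposes the underlying integer directly:

```python
import uuid

u = uuid.uuid4()
assert u.int.bit_length() <= 128   # the whole UUID is one 128-bit integer
assert uuid.UUID(int=u.int) == u   # the familiar text form round-trips losslessly
```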
What you're proposing sounds like roll-your-own-UUID, which might be similarly inadvisable as roll-your-own-crypto.
Don't try to compare this to rolling your own crypto. The stakes are nowhere near the same for IDs as for crypto, and the stakes are the defining feature of the "don't roll your own crypto" meme.
Yes I'm comparing it and you don't get to tell me I can't.
It's not about the stakes. It's the idea that in rolling your own, you're going to get it wrong, or otherwise do worse than existing ways that have already solved the same problem.
That applies to literally all code. So if it's not about the stakes, only experts are ever allowed to deploy code of any kind that they've written themselves.
UUIDs also introduce performance challenges. Using them as the primary key will absolutely torpedo your insert performance. They also subtly harm performance on pretty much every other operation. You can fit half (or 1/4, if you can live with 32-bit ints) as many UUIDs in a cache line, and can't even fit one of them in an integer register. The impact on join performance can be pervasive at production scale even when it's not visible at dev scale. And UUIDs tend to result in records being physically arranged in ways that defeat the page cache.
> If you order things by id (often useful for pagination, since incrementing ids are usually ordered by creation date) uuid are of no help.
The use of ordering by a surrogate key is to have a stable ordering with no semantic meaning, so UUIDs work just as well as IDs there.
> You would need to sort by created_at and this would require an additional index.
Yes, if you want to sort by a semantically important data element in a table with surrogate primary key, you'll probably want an index on the data element. So?
> Not to speak of pagination by id ranges or whatever is used sometimes.
Assuming a serial column is creation-ordered is not a good idea, but it's usually not too wrong; assuming it's dense, as well, is going to be wrong more often than it is right, except where data is never deleted. Assuming it is dense over the range of a query is even less reliable.
Paginating by serial ID ranges is, almost without exception, a horrible idea.
So, for example, one previous database I worked on had an access pattern where we always wanted to get results sorted by the order in which they were inserted, and we needed this to be a completely stable ordering. Even two rows that get inserted during the same microsecond, as part of the same transaction, needed to have some official (even if arbitrary) designation of which one is first and which one is second. And both of those requirements also meant that the ordering key needed to be unique. And we also needed table access to be as fast as possible, and disk-friendly, and tend to offer the fastest access for the most recent data.
All of this points to using serial numeric IDs as both a unique key and the clustering key. UUIDs do nothing to help with any of those requirements, and generally hurt all of them. They do that because of certain characteristics they have that are specifically there in order to solve one of the few problems we didn't have to worry about.
Tangentially, in software systems design, I've long since realized that the word "always", when left to roam around freely, unchaperoned by any qualifiers to limit its universality, is an indicator of limited breadth of experience. So, when you encounter it, it's useful to mentally insert "in my experience" as a stand-in qualifier. Having done so, the next conundrum is that "in my experience" advice is only actionable to the extent that you know what experience the advice giver has to draw on.
Integer ids are still often used internally for database primary keys, with UUIDs being the done thing for external interfaces.
Personally I've never experienced the "whole class of bugs" that starting with a big integer is supposed to solve. I'm not using PHP so maybe that's why?
I've run into one, but it was pretty dumb: a multitenant Rails app using the account ID as the first element in the URL. When it got to account 404, my address.com/404 went to the static 404 page rather than account 404. That user was pretty confused for a while.
I don't know rails, but that is just terrible design. address.com/accounts/404 is what the devs should have used, and "404" is a response code and an error page, not a URL you redirect/rewrite URLs to.
The reason (other than "don't expose integer PK externally") people say you should use integer as PK and UUID as a secondary/external facing id is that conventional B-tree indexing of UUID is not as efficient as B-tree indexing of autoinc integers in most databases.
However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one.
I choose to forego the integer PK and just use UUID since I have to create index over it anyway.
> However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one
Yes and no. It depends on whether your database treats primary keys differently than other indexes. For example, in InnoDB primary keys are always clustered indexes: the row data is directly stored in a btree arranged by the PK; secondary indexes just store PK values in their leaf nodes, so that they can do a lookup on the clustered index.
As a result, in InnoDB smaller PKs are preferable. So performance is generally better when using an incremental ID as PK and then UUID as a secondary index, as opposed to the reverse. Assuming you have multiple secondary indexes, the total table size will also be smaller in InnoDB with integer PK than with UUID PK.
> However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one.
Well. Many systems already do a form of this. They store a session ID (the browser cookie) in something like redis which maps to an internal database id which is often incremented. The internal database ids are never seen outside the DB.
In this case it's fine because the external IDs are (relatively) ephemeral and centralization is a hard criterion (you typically can't have two people creating an account with the same user name, or one email address linked to multiple record IDs, etc.).
This is really why these discussions are pointless without specifics and a concrete system.
Odds are very high that the set of externally visible entities is much smaller than the set of database entities.
That is, unless you decide to put the same interface into your database and your API, which is not rare for OOP-only programmers to do, but always ends in tears.
One bug I saw came from picking an integer id as “one more than the largest id of the current elements we have.” This worked fine until they added support for deleting elements, which also worked fine most of the time.
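A toy in-memory version of that bug, for illustration: once deletion exists, "max + 1" can reissue an ID that stale references still point to.

```python
# Hypothetical in-memory store.  IDs are allocated as max(existing) + 1,
# which is only safe while nothing is ever deleted.
items = {}

def add(value):
    new_id = max(items, default=0) + 1
    items[new_id] = value
    return new_id

a = add("first")    # gets id 1
b = add("second")   # gets id 2
del items[b]        # delete the newest element...
c = add("third")    # ...and id 2 is silently reused
assert b == c       # any stale reference to b now points at "third"
```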
A common disadvantage of integer keys is that programs will have bugs and use a foo_id as a bar_id. If most ids are small integers then it is likely that a valid foo_id may be a valid bar_id, whereas uuids probably won’t collide. This can be somewhat mitigated with a sufficiently strong type system. Even in a dynamic language like lisp you can represent your ids as e.g. (foo . <id>), and only need to get the tags right on the boundary.
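In Python, for instance, `typing.NewType` gives you that tag at type-check time with zero runtime cost; the IDs are still plain ints underneath, so this only mitigates rather than prevents (names here are hypothetical):

```python
from typing import NewType

# Distinct nominal types over the same runtime representation (int).
UserId = NewType("UserId", int)
DocumentId = NewType("DocumentId", int)

def fetch_user(user_id: UserId):
    """Stub: would look up a user by its tagged ID."""
    ...

doc = DocumentId(42)
fetch_user(doc)  # runs fine, but mypy/pyright flags the mismatch;
                 # with bare ints this mix-up would pass every check
```

The Lisp `(foo . <id>)` tagging above is the dynamic-language equivalent, paying a small runtime cost for the same boundary check.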
An advantage of integer ids is density: if your ids are likely close together, there are probably some better or more compressed data structures you can use.
This is terrible from a UX perspective though. Nobody wants 36 characters of random hex in their URLs.
Assuming you can stomach the latency hit, the best solution is usually a lookup table where you take all the friendly URL keys and map them to internal identifiers. So if you're making a multiplayer game site and want to create a page where folks can find their friends, then you might support yourgame.net/user/username, yourgame.net/character/charactername, yourgame.net/steam/steamlogin, yourgame.net/xbox/xboxgamertag, etc. Internally you have an inverted index that maps [type, string] to the internal ID for the player, then proceed normally.
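A toy in-memory version of that inverted index (all names hypothetical; a real one would live in the database or a cache in front of it):

```python
# Maps (namespace, friendly_key) -> internal player ID.
friendly_index = {}

def register(namespace, key, player_id):
    """Publish a friendly URL key for a player in a given namespace."""
    friendly_index[(namespace, key.lower())] = player_id

def resolve(namespace, key):
    """Translate a URL path like /user/Alice into an internal ID."""
    return friendly_index.get((namespace, key.lower()))

register("user", "Alice", 1001)
register("steam", "alice_steam", 1001)

assert resolve("user", "alice") == 1001       # case-insensitive lookup
assert resolve("steam", "ALICE_STEAM") == 1001
assert resolve("xbox", "alice") is None       # unmapped namespace/key
```

One internal ID can carry any number of friendly aliases, and none of the aliases need to be stable for the internal ID to stay valid.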
sooner or later you'll make a mistake in your code and use a document_id where you meant to use a user_id, and send a bunch of someone else's data to a user.
the "big integer" stops this happening for other integers that might be used in your code and that you might accidentally send to the database as a user_id.
Tell that to your database which is highly optimized for using integers as IDs. Using a UUID everywhere when you could be using an integer is an invitation to low-performance-city.
This sounds like a case of "premature optimization is the root of all evil." Do you have any performance numbers that show UUIDs being any noticeably slower?
A UUID is only the size of two 64-bit integers anyway. And either identifier likely will be a tiny fraction of the data in a particular row. So I'd doubt this is any real performance problem in the vast majority of applications.
UUIDs look big and scary in hex notation, but underneath it's a compact and fast binary format, just a 128-bit integer.
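Concretely, the 36-character text form is just a display encoding of 16 bytes:

```python
import uuid

u = uuid.uuid4()
assert len(str(u)) == 36   # hex-with-hyphens text form, what you see in URLs and logs
assert len(u.bytes) == 16  # what a database actually needs to store: 128 bits
```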
One of the best pieces of programming advice I got at Google was to consider IDs carefully. This is not a solved problem, and there isn't a one-size-fits-all solution. Rather, there are a bunch of guidelines, and then you need to understand your problem domain very well to choose good identifiers.
An ID should be:
1.) Actually an identifier, i.e. it needs to identify objects uniquely and must be immutable throughout the lifetime of the object.
2.) Assigned at object-creation time; you can't have an object without an ID (in persistent storage), lest you have no way to refer to that object later. This presents particular problems for a lot of workflows where you want to introduce the concept of a user early, before soliciting personal data. Think of storing browsing histories for guest users, or persisting shopping carts before checkout.
3.) Integral to the storage system. IDs will usually be the ways that you lookup and join objects; the performance characteristics of your database can influence the type of data you choose for an ID.
4.) Because of #2 & #3, oftentimes a significant concurrency bottleneck for object creation. This is the downfall of many auto-increment integer schemes.
5.) The foreign key for other objects. This has space implications that you often have to trade-off against latency implications. If you store useful information within the ID, you can use that info without needing to make a separate query or join against the DB. However, you need to carefully consider whether that'll run afoul with #1, and the more information you put in the ID, the bigger the size bloat for other objects that reference that ID.
6.) Oftentimes a security & PII risk. Because IDs are the primary means for lookup and are guaranteed to be unique, it's awfully tempting to put them in your public APIs (like HTTP URLs). But then anyone who has a URL has all the information included in the ID. If you use sequential IDs, they also have the ability to scan your entire database; this was the downfall of Parler.
Integer keys do really well for #1, #2, #3, and #5, but are very problematic for #4 and #6. But then, there are fairly easy workarounds for those: #4 is often solved by hashing a unique natural key of the object (also solving #6), while #6 can be solved by never exposing internal IDs to the outside world and instead using a lookup table on some friendly URL scheme, which is better for UX anyway (at the cost of #3). GUIDs do well for #1, #2, #4, and #6, but perform worse for #3 and #5.
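As an illustration of the "hash a unique natural key" workaround (the field names and function here are hypothetical): the ID is deterministic, so any node can compute it without coordination (#4), and it leaks no creation order (#6).

```python
import hashlib

def object_id(tenant, email):
    """Derive a stable, non-sequential 64-bit ID from fields that
    uniquely identify the object.  NUL-separate the fields so that
    ("ab", "c") and ("a", "bc") can't produce the same input."""
    digest = hashlib.sha256(f"{tenant}\x00{email}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

a = object_id("acme", "a@example.com")
assert a == object_id("acme", "a@example.com")  # deterministic, coordination-free
assert a != object_id("acme", "b@example.com")  # no ordering information leaks
```

The trade-off against #1 is exactly as stated above: if the natural key fields can ever change (an email update, say), this ID is no longer immutable.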
Natural keys (where you make the identifier some combination of the actual data) are also frequently underrated. The most obvious use for these is relation tables, where each row just indicates that two entities have the relationship that the table describes; you wouldn't normally put an auto-increment ID on that.

But think also of something like a search refinement: the most natural key for that is [query, language] => [list of suggestions]. That uniquely identifies each set of refinements, and it has other useful properties: the rest of your search engine doesn't need to know that refinements exist (it already has the query), you don't need a lookup call from query to some search_refinement_id or vice versa, you can easily enumerate the set of languages that a given query has data for, and you can follow a chain of refinements without any intermediate lookups. If you were to suggest either integers or UUIDs for this problem, I'd say that you're overengineering.

And if the problem domain changes such that the key-set changes (for example, you want to include past search context, or you want to personalize refinements to each user), then I would recommend you start a different system from scratch and eventually replace or merge in the current one, because those problems have sufficiently different requirements that your whole data pipelines are going to be different. (In particular, personalized search requires PII handling, a lot of care in logging, encryption of the data at rest, knowledge of a lot more entities in the system, etc.)
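A toy dict-based illustration of the [query, language] natural key (the contents are invented): lookups and enumeration fall straight out of the key with no surrogate ID anywhere.

```python
# Natural composite key: (query, language) -> list of refinement suggestions.
refinements = {
    ("cheap flights", "en"): ["cheap flights to europe", "cheap flights tonight"],
    ("cheap flights", "de"): ["billige fluege nach europa"],
}

# Direct lookup: the caller already holds both key components.
assert "cheap flights tonight" in refinements[("cheap flights", "en")]

# Enumerate the languages a query has data for, straight off the keys.
langs = sorted(lang for (q, lang) in refinements if q == "cheap flights")
assert langs == ["de", "en"]
```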
The main issue in my experience is so-called "user enumeration attacks", especially with the sequential ids that are normally used.
This is where an attacker is able to leak information from your system just by guessing ids. If you used a sequential id then you can cycle from 1 to X and probably easily find which are valid users. You can then likely see how many valid users/ids there are and potentially pull their data if the authentication has been implemented incorrectly.
By using a UUID, the key space is so large that you cannot reasonably guess user ids. So you can't use those same brute force techniques and it makes extricating data much harder.
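The back-of-envelope math, with illustrative numbers: even a large user base is a vanishing fraction of a 122-bit random keyspace (UUIDv4 has 122 random bits; 6 of the 128 are fixed version/variant bits).

```python
users = 10_000_000

seq_space  = 10_000_000   # sequential IDs: the live range is the whole space
uuid_space = 2**122       # random bits in a version-4 UUID

hit_rate_seq  = users / seq_space   # 1.0: every in-range guess hits a real user
hit_rate_uuid = users / uuid_space  # ~1.9e-30: effectively zero per guess
```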
Obscure user IDs could be a defense in depth measure but really you need to be authorizing the data you release against the authenticated session cookie. A view meant for the user's own consumption shouldn't take a user ID at all, just pull it from the session.
This is really an ancillary benefit and not a real solution to enumeration vulnerabilities. The concern is usually over sharding the data, unique ID generation guarantees, and performance. You shouldn't really be deciding UUID vs. Int based on the possibility for ID enumeration. It's almost certainly easier to come up with a solution to slow/prevent enumeration than move from Ints to UUID.
There's also the leaking of financial and business data. E.g. you create an order and see your order id being 3000 the first day and 4000 the next; then you can estimate how many orders this company processes, etc.
The primary key serves a database function, and its data type should be optimized to suit that function. If you need an enumeration resistant id for public use, a uuid is a great idea, but it's a mistake to make the primary key in a database do double duty as a public identifier and impose inappropriate constraints on it as a consequence.
This, and also there's no realistic possibility of collision: you won't have two processes generate the same ID (which means you can generate your IDs in code and send them to the database, which is really useful in some situations).
Also, and this is the main one for me - if I mess up my code and accidentally use a document_id instead of a user_id, I'll just get "not found" instead of someone else's data.
UUID collision is not possible in practice. You can absolutely code in the absolute knowledge that your application will never generate two identical UUIDs. The odds are astronomically small. You have bigger concerns.
And yes, every DB backend has solved collision. But as I said, it's sometimes useful to be able to generate IDs in code. This is possible with UUIDs and not possible with integers.
UUID is a loosely defined concept. There are many implementations. Some of them have no chance of collision at all, some have an astronomical chance, and some have very real odds that you'll receive a call at 3AM during the new year's celebration.
In the types of systems that need UUIDs there is probably no easy way to check for collisions. The prospect of mystery data corruption with no ability to trace it down frightens the hell out of me.
The only reason that issue was reported is because someone was actually doing the collision checking. That's not going to be the norm in UUID systems. Think about it.
Many shops (I presume, since I work at non-FAANG) have their own version of UUID generation that includes info about the machine/instance it was generated from. I don’t think we’ve formally proved it but I think ours is guaranteed unique in our fleet.
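A generic sketch of that idea (not your scheme, obviously): embed a fleet-assigned machine number so no two hosts can mint the same value. Within one machine, uniqueness here relies on minting fewer than 2^16 IDs in any one second, since the sequence field would wrap.

```python
import itertools
import threading
import time

class FleetIdGen:
    """Mint 64-bit IDs that are unique across a fleet, assuming each
    machine gets a distinct machine_id and mints < 2**16 IDs/second."""

    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.counter = itertools.count()
        self.lock = threading.Lock()  # counter increments must be atomic

    def next_id(self):
        with self.lock:
            seq = next(self.counter)
        # layout: 32 bits of seconds | 16 bits of machine | 16 bits of sequence
        return (int(time.time()) << 32) | ((self.machine_id & 0xFFFF) << 16) | (seq & 0xFFFF)
```

Proving fleet-wide uniqueness then reduces to proving the machine-ID assignment is collision-free, which is a much easier operational claim.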