For me #1 would be to add a version to any data format or communication
protocol. If you want to know how hard not doing so can bite, look no further than Git and its torturous migration from SHA-1.
Why? We use protocol buffers for communication between services, evolve them all the time, and don't use version numbers. Any new functionality is enabled by booleans or optional fields in the proto. The approach works fine; I haven't noticed any problems in years.
I've started to lean more and more against versioning as time has gone on. The problem is that it's an invitation for users to stay on a specific version indefinitely. Which makes your life much harder as you have to ensure all of these old versions continue to work while backporting critical fixes. I prefer a continuous deprecation cycle where you support things in parallel for a while, then add deprecation warnings, and finally remove the functionality. How fast this happens depends on how critical the issue is.
Versions make sense in a lot of areas where stability is needed. But they should be seen as issuing binding contracts to your users. You should spend a lot of time thinking about what the terms and conditions are before doing so.
> With each change, we explicitly assign a distinct version to serializers.
> We do this independent of source code or build versioning. We also store the serializer version with the serialized data or in the metadata. Older serializer versions continue to function in the new software. We find it’s usually helpful to emit a metric for the version of data written or read. It provides operators with visibility and troubleshooting information if there are errors. All of this applies to RPC and API versions, too.
Backwards compatibility and the ability to roll out new changes (that can co-exist with current data) are the primary drivers.
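As a minimal sketch of the quoted practice (the registry, envelope format, and names below are my own, not from the quoted text): each serializer version is registered explicitly, the version travels with the payload, and old versions keep decoding in newer software.

```python
import json

# One entry per serializer version: (encode, decode). Old versions stay
# registered so data written by older software keeps working.
SERIALIZERS = {
    1: (lambda obj: json.dumps(obj).encode(), lambda raw: json.loads(raw)),
    2: (lambda obj: json.dumps(obj, separators=(",", ":")).encode(),
        lambda raw: json.loads(raw)),
}
CURRENT_VERSION = 2

def serialize(obj):
    encode, _ = SERIALIZERS[CURRENT_VERSION]
    # The serializer version is stored alongside the serialized data.
    return {"v": CURRENT_VERSION, "data": encode(obj)}

def deserialize(envelope):
    _, decode = SERIALIZERS[envelope["v"]]
    # A metric tagged with envelope["v"] would be emitted here for operators.
    return decode(envelope["data"])

assert deserialize(serialize({"user": 42})) == {"user": 42}
```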
Things you should never do: use integers as IDs. This is literally a solved problem, and the solution is UUIDs, which were invented for exactly this job.
The current domain model I am working with uses a global integer sequence to key all entities. This eliminates the class of bugs where keys of different types overlap, masking exceptions that would otherwise surface. It also enables powerful domain modeling techniques in which the identities of things are themselves treated as first-class entities and referred to as a common class of thing. This is a little mind-bending at first, but it enables some really powerful abstractions that would otherwise be infeasible if we had to switch over all possible types of keys.
The benefits of an integer key vs a guid key are quite profound when you get into the academics of information theory. They provide implicit creation order of things, whereas GUIDs cannot. They are deterministic in that there will never be a collision. Their range can be made to be infinite. Integers are perfectly efficient, even if the computer representation isn't necessarily so - BigInteger types scale gracefully.
> The benefits of an integer key vs a guid key are quite profound when you get into the academics of information theory. They provide implicit creation order of things, whereas GUIDs cannot.
We had this requirement, but a semi-UUID solution is desirable in a distributed setting. Ref prior art by Instagram engineering: https://archive.is/Dydln
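A rough sketch of the scheme in the linked Instagram post, as I understand it (the epoch and bit widths below are illustrative, not authoritative): a millisecond timestamp in the high bits, a shard ID, and a per-shard sequence, so IDs minted on independent shards still sort roughly by creation time.

```python
import itertools
import time

CUSTOM_EPOCH_MS = 1_314_220_021_721  # an arbitrary recent epoch (illustrative)
_sequence = itertools.count()

def next_id(shard_id: int) -> int:
    """64-bit ID: 41 bits of milliseconds, 13 bits of shard, 10 bits of sequence."""
    millis = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    seq = next(_sequence) % 1024
    return (millis << 23) | ((shard_id & 0x1FFF) << 10) | seq

a = next_id(shard_id=7)
b = next_id(shard_id=1500)
print(a >> 23, b >> 23)  # the high bits are the creation timestamps
```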
(from your linked comment)
> If you are worried about security (i.e. someone hitting sequential keys in your URLs), then this is arguably an application problem. You should probably generate an additional column that stores some more obfuscated representation of the primary key and index it separately.
Centralized creation of sequential integers doesn't scale for large data volumes. Plus you need a place to create these integers, and a failover location, which adds complexity. Multiple machines can each create their own GUIDs in a very simple manner.
And the odds of a GUID collision are extremely low, and for most applications that's acceptable. Having worked with petabytes of data, I can say GUID performance isn't really an issue, as there are more important factors to worry about.
I would recommend reviewing my prior comments on this, as I address the concerns of multiple nodes needing to be able to independently produce identities without collisions or coordination.
If you know beforehand the maximum number of participants in your system, you can divide the keyspace across that quantity. If you are using BigInteger or equivalent, you have an infinite number of these things to work with, so it doesn't really matter if you wind up skipping trillions of identities at first. The original article even advocates for this as its first point, but without as much practical justification.
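A sketch of that partitioning, assuming the number of participants is fixed up front: node n of k hands out n, n+k, n+2k, ..., so the ranges never overlap and no runtime coordination is needed.

```python
class PartitionedSequence:
    """Each of `node_count` participants takes every node_count-th integer,
    so IDs never collide across nodes without any coordination."""

    def __init__(self, node_index: int, node_count: int):
        assert 0 <= node_index < node_count
        self._next = node_index
        self._step = node_count

    def next_id(self) -> int:
        value = self._next
        self._next += self._step  # Python ints are unbounded, so this never overflows
        return value

node_a = PartitionedSequence(node_index=0, node_count=2)
node_b = PartitionedSequence(node_index=1, node_count=2)
print([node_a.next_id() for _ in range(3)])  # [0, 2, 4]
print([node_b.next_id() for _ in range(3)])  # [1, 3, 5]
```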
If you're going to use a key space in the trillions, and partitioned and sparsely populated rather than sequential... isn't that just reinventing what the UUID already is?
A UUID is just a 128-bit integer, with creation algorithms designed to partition that space by things that already have enough entropy to need no further synchronization.
What you're proposing sounds like roll-your-own-UUID, which might be as inadvisable as rolling your own crypto.
Don't try to compare this to rolling your own crypto. The stakes are nowhere near the same for IDs as for crypto, and the stakes are the defining feature of the "don't roll your own crypto" meme.
Yes I'm comparing it and you don't get to tell me I can't.
It's not about the stakes. It's the idea that in rolling your own, you're going to get it wrong, or otherwise do worse than existing ways that have already solved the same problem.
That applies to literally all code. So if it's not about the stakes, only experts are ever allowed to deploy code of any kind that they've written themselves.
UUIDs also introduce performance challenges. Using them as the primary key will absolutely torpedo your insert performance. They also subtly harm performance on pretty much every other operation. You can fit half (or 1/4, if you can live with 32-bit ints) as many UUIDs in a cache line, and can't even fit one of them in an integer register. The impact on join performance can be pervasive at production scale even when it's not visible at dev scale. And UUIDs tend to result in records being physically arranged in ways that defeat the page cache.
> If you order things by id (often useful for pagination, since incrementing ids are usually ordered by creation date) UUIDs are of no help.
The use of ordering by a surrogate key is to have a stable ordering with no semantic meaning, so UUIDs work just as well as IDs there.
> You would need to sort by created_at and this would require an additional index.
Yes, if you want to sort by a semantically important data element in a table with surrogate primary key, you'll probably want an index on the data element. So?
> Not to speak of pagination by id ranges or whatever is used sometimes.
Assuming a serial column is creation-ordered is not a good idea, but it's usually not too wrong; assuming it's dense, as well, is going to be wrong more often than it is right, except where data is never deleted. Assuming it is dense over the range of a query is even less reliable.
Paginating by serial ID ranges is, almost without exception, a horrible idea.
So, for example, one previous database I worked on had an access pattern where we always wanted to get results sorted by the order in which they were inserted, and we needed this to be a completely stable ordering. Even two rows that get inserted during the same microsecond, as part of the same transaction, needed to have some official (even if arbitrary) designation of which one is first and which one is second. And both of those requirements also meant that the ordering key needed to be unique. And we also needed table access to be as fast as possible, and disk-friendly, and tend to offer the fastest access for the most recent data.
All of this points to using serial numeric IDs as both a unique key and the clustering key. UUIDs do nothing to help with any of those requirements, and generally hurt all of them. They do that because of certain characteristics they have that are specifically there in order to solve one of the few problems we didn't have to worry about.
Tangentially, in software systems design, I've long since realized that the word "always", when left to roam around freely, unchaperoned by any qualifiers to limit its universality, is an indicator of limited breadth of experience. So, when you encounter it, it's useful to mentally insert "in my experience" as a stand-in qualifier. Having done so, the next conundrum is that "in my experience" advice is only actionable to the extent that you know what experience the advice giver has to draw on.
Integer IDs are still often used internally for database primary keys, with UUIDs being the done thing for external interfaces.
Personally I've never experienced the "whole class of bugs" that starting with a big integer is supposed to solve. I'm not using PHP so maybe that's why?
I've run into one, but it was pretty dumb: a multitenant Rails application using the account ID as the first element in the URL. When it got to account 404, my address.com/404 went to the static 404 page rather than account 404. That user was pretty confused for a while.
I don't know rails, but that is just terrible design. address.com/accounts/404 is what the devs should have used, and "404" is a response code and an error page, not a URL you redirect/rewrite URLs to.
The reason (other than "don't expose integer PK externally") people say you should use integer as PK and UUID as a secondary/external facing id is that conventional B-tree indexing of UUID is not as efficient as B-tree indexing of autoinc integers in most databases.
However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one.
I choose to forego the integer PK and just use UUID since I have to create index over it anyway.
> However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one
Yes and no. It depends on whether your database treats primary keys differently than other indexes. For example, in InnoDB primary keys are always clustered indexes: the row data is directly stored in a btree arranged by the PK; secondary indexes just store PK values in their leaf nodes, so that they can do a lookup on the clustered index.
As a result, in InnoDB smaller PKs are preferable. So performance is generally better when using an incremental ID as PK and then UUID as a secondary index, as opposed to the reverse. Assuming you have multiple secondary indexes, the total table size will also be smaller in InnoDB with integer PK than with UUID PK.
> However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one.
Well. Many systems already do a form of this. They store a session ID (the browser cookie) in something like Redis, which maps to an internal database ID that is often an incrementing integer. The internal database IDs are never seen outside the DB.
In this case it's fine because the external IDs are (relatively) ephemeral and centralization is a hard requirement (you typically can't have two people creating an account with the same user name, or one email address linked to multiple record IDs, etc.).
This is really why these discussions are pointless without specifics and a concrete system.
Odds are that the set of externally visible entities is much smaller than the set of database entities.
That is, unless you decide to put the same interface into your database and your API, which is not rare for OOM-only programmers to do, but always ends in tears.
One bug I saw came from picking an integer id as “one more than the largest id of the current elements we have.” This worked fine until they added support for deleting elements, which also worked fine most of the time. (Delete the element with the largest ID, though, and the next insert silently reuses that ID, so any stale references now point at the wrong object.)
A common disadvantage of integer keys is that programs will have bugs and use a foo_id as a bar_id. If most ids are small integers then it is likely that a valid foo_id may be a valid bar_id, whereas uuids probably won’t collide. This can be somewhat mitigated with a sufficiently strong type system. Even in a dynamic language like lisp you can represent your ids as e.g. (foo . <id>), and only need to get the tags right on the boundary.
An advantage of integer ids is density: if your ids are likely close together, there are probably some better or more compressed data structures you can use.
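In Python, for instance, typing.NewType gives you that boundary check cheaply; the names below are made up for illustration. A static type checker flags the mix-up even though both values are plain ints at runtime.

```python
from typing import NewType

UserId = NewType("UserId", int)
DocumentId = NewType("DocumentId", int)

def fetch_document(doc_id: DocumentId) -> str:
    return f"document #{doc_id}"

user = UserId(42)
doc = DocumentId(42)

fetch_document(doc)   # fine
fetch_document(user)  # a type checker (mypy, pyright) rejects this call
```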
This is terrible from a UX perspective though. Nobody wants 36 random characters in their URLs.
Assuming you can stomach the latency hit, the best solution is usually a lookup table where you take all the friendly URL keys and map them to internal identifiers. So if you're making a multiplayer game site and want to create a page where folks can find their friends, then you might support yourgame.net/user/username, yourgame.net/character/charactername, yourgame.net/steam/steamlogin, yourgame.net/xbox/xboxgamertag, etc. Internally you have an inverted index that maps [type, string] to the internal ID for the player, then proceed normally.
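A toy version of that inverted index (the names and structure are hypothetical): the friendly [type, string] pair resolves to the internal player ID, which never appears in a URL.

```python
# In production this would be an indexed table or key-value store; a dict
# stands in here. Keys are (namespace, friendly name), values are internal IDs.
friendly_index: dict[tuple[str, str], int] = {
    ("user", "alice"): 1001,
    ("steam", "alice_on_steam"): 1001,
    ("xbox", "alicegt"): 1001,
}

def resolve(kind: str, name: str) -> int | None:
    """Map a friendly URL like /user/alice to the internal player ID."""
    return friendly_index.get((kind, name.lower()))

print(resolve("user", "Alice"))        # 1001
print(resolve("character", "nobody"))  # None
```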
Sooner or later you'll make a mistake in your code and use a document_id where you meant to use a user_id, and send a bunch of someone else's data to a user.
The "big integer" stops this happening for other integers that might be used in your code and that you might accidentally send to the database as a user_id.
Tell that to your database which is highly optimized for using integers as IDs. Using a UUID everywhere when you could be using an integer is an invitation to low-performance-city.
This sounds like a case of "premature optimization is the root of all evil." Do you have any performance numbers that show UUIDs being any noticeably slower?
A UUID is only the size of two 64-bit integers anyway. And either identifier likely will be a tiny fraction of the data in a particular row. So I'd doubt this is any real performance problem in the vast majority of applications.
UUIDs look big and scary in hex notation, but underneath it's a compact and fast binary format, just a 128-bit integer.
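You can see this directly in, say, Python's standard library: the dashed hex form is only a display encoding of a 16-byte value.

```python
import uuid

u = uuid.uuid4()
print(u)             # the familiar 36-character text form
print(u.int)         # the same value as a plain 128-bit integer
print(len(u.bytes))  # 16 bytes, i.e. two 64-bit words
assert uuid.UUID(int=u.int) == u
```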
One of the best pieces of programming advice I got at Google was to consider IDs carefully. This is not a solved problem, and there isn't a one-size-fits-all solution. Rather, there are a bunch of guidelines, and then you need to understand your problem domain very well to choose good identifiers.
An ID should be:
1.) Actually an identifier, i.e. it needs to identify objects uniquely and must be immutable throughout the lifetime of the object.
2.) Assigned at object-creation time; you can't have an object without an ID (in persistent storage), lest you have no way to refer to that object later. This presents particular problems for a lot of workflows where you want to introduce the concept of a user early, before soliciting personal data. Think of storing browsing histories for guest users, or persisting shopping carts before checkout.
3.) Integral to the storage system. IDs will usually be the ways that you lookup and join objects; the performance characteristics of your database can influence the type of data you choose for an ID.
4.) Because of #2 & #3, oftentimes a significant concurrency bottleneck for object creation. This is the downfall of many auto-increment integer schemes.
5.) The foreign key for other objects. This has space implications that you often have to trade off against latency implications. If you store useful information within the ID, you can use that info without needing to make a separate query or join against the DB. However, you need to carefully consider whether that'll run afoul of #1, and the more information you put in the ID, the bigger the size bloat for other objects that reference that ID.
6.) Oftentimes a security & PII risk. Because IDs are the primary means for lookup and are guaranteed to be unique, it's awfully tempting to put them in your public APIs (like HTTP URLs). But then anyone who has a URL has all the information included in the ID. If you use sequential IDs, they also have the ability to scan your entire database; this was the downfall of Parler.
Integer keys do really well for #1, #2, #3, and #5, but are very problematic for #4 and #6. But then, there are fairly easy workarounds for those: #4 is often solved by hashing a unique natural key of the object (also solving #6), while #6 can be solved by never exposing internal IDs to the outside world and instead using a lookup table on some friendly URL scheme, which is better for UX anyway (at the cost of #3). GUIDs do well for #1, #2, #4, and #6, but perform worse for #3 and #5.
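A sketch of the "hash a unique natural key" workaround for #4 and #6 (the function and key below are made up for illustration): the ID is derived deterministically from the object's own unique attribute, so creation needs no central counter and the value reveals nothing sequential.

```python
import hashlib

def object_id(natural_key: str) -> int:
    """Derive a stable 64-bit ID by hashing a unique natural key."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

assert object_id("alice@example.com") == object_id("alice@example.com")
assert object_id("alice@example.com") != object_id("bob@example.com")
```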
Natural keys (where you make the identifier some combination of the actual data) are also frequently underrated. The most obvious use for these is relation tables, where each row just indicates that two entities have the relationship that the table describes; you wouldn't normally put an auto-increment ID on that. But think also of something like a search refinement: the most natural key for that is [query, language] => [list of suggestions], and that uniquely identifies each set of refinements, and it has other useful properties where the rest of your search engine doesn't need to know that refinements exist (it already has the query), and you don't need a lookup call from query to some search_refinement_id or vice versa, and you can easily enumerate the set of languages that a given query has data for, and you can follow a chain of refinements without any intermediate lookups. If you were to suggest either integers or UUIDs for this problem I'd say that you're overengineering. And if the problem domain changes such that the key-set changes (for example, you want to include past search context, or you want to personalize refinements to each user) then I would recommend you start a different system from scratch and eventually replace or merge in the current one, because those problems have sufficiently different requirements that your whole data pipelines are going to be different. (In particular, personalized search requires PII handling, a lot of care in logging, encryption of the data when at rest, knowledge of a lot more entities in the system, etc.)
The main issue in my experience is so-called "user enumeration attacks", especially with the sequential IDs that are normally used.
This is where an attacker is able to leak information from your system just by guessing IDs. If you use a sequential ID then you can cycle from 1 to X and probably easily find which are valid users. You can then likely see how many valid users/IDs there are and potentially pull their data if the authentication has been implemented incorrectly.
By using a UUID, the key space is so large that you cannot reasonably guess user ids. So you can't use those same brute force techniques and it makes extricating data much harder.
Obscure user IDs could be a defense in depth measure but really you need to be authorizing the data you release against the authenticated session cookie. A view meant for the user's own consumption shouldn't take a user ID at all, just pull it from the session.
This is really an ancillary benefit and not a real solution to enumeration vulnerabilities. The concern is usually over sharding the data, unique ID generation guarantees, and performance. You shouldn't really be deciding UUID vs. Int based on the possibility for ID enumeration. It's almost certainly easier to come up with a solution to slow/prevent enumeration than move from Ints to UUID.
There's also the risk of leaking financial or business data. E.g. you create an order and see your order ID is 3000 the first day and 4000 the next, so you can estimate how many orders this company handles, etc.
The primary key serves a database function, and its data type should be optimized to suit that function. If you need an enumeration resistant id for public use, a uuid is a great idea, but it's a mistake to make the primary key in a database do double duty as a public identifier and impose inappropriate constraints on it as a consequence.
This. Also, there's no possibility of collision: you can't have two processes generate the same ID (which means you can generate your IDs in code and send them to the database, which is really useful in some situations).
Also, and this is the main one for me - if I mess up my code and accidentally use a document_id instead of a user_id, I'll just get "not found" instead of someone else's data.
UUID collision is not possible in practice. You can absolutely code in the absolute knowledge that your application will never generate two identical UUIDs. The odds of it are astronomically small. You have bigger concerns.
And yes, every DB backend has solved collision. But as I said, it's useful to be able to generate IDs in code sometimes. This is possible with UUIDs and not possible with integers.
UUID is a loosely defined concept. There are many implementations. Some of them have no chance of collision at all, some have an astronomical chance, and some have very real odds that you'll receive a call at 3AM during the new year's celebration.
In the types of systems that need UUIDs there is probably no easy way to check for collisions. The prospect of mystery data corruption with no ability to trace it down frightens the hell out of me.
The only reason that issue was reported is because someone was actually doing the collision checking. That's not going to be the norm in UUID systems. Think about it.
Many shops (I presume, since I work at non-FAANG) have their own version of UUID generation that includes info about the machine/instance it was generated from. I don’t think we’ve formally proved it but I think ours is guaranteed unique in our fleet.
I literally came here to comment that the very first bug I had to chase down in a production system in my first job was caused by this! There was a 'fuzzy' search field available to users and when one input a client's ID number (formatted as NNNNNNN) it was showing results for a different client because NNNNNNN was also the format of the unique ID for records in the database.
I encountered a similar problem recently while migrating an old app. It modeled IDs as 64-bit BIGINT columns (a decision made in the 2000s) and populated them with MySQL’s UUID_SHORT function. This worked fine for over a decade until we migrated to a different MySQL system and started getting conflicts for IDs already existing or not being able to find a row which had just been inserted.
They’d missed the part where UUID_SHORT() returns an unsigned integer and created the column as the default signed integer. MySQL uses an algorithm where the top bits are based on the server ID, which worked on the old server, which had id=1 and so never returned a number where the first bit was 1. The new cluster always did, so the problem was identified quickly, but the diagnosis was muddled by one of those bonus MySQL data-destruction features: silent truncation meant that every new ID was being truncated to the same value, while the logged value wasn’t in the database at all.
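A rough reconstruction of the failure, assuming the UUID_SHORT layout described in MySQL's documentation (server ID in the top byte, then the server startup time, then a counter): with server_id=1 every value fits in a signed BIGINT, but once the server ID sets the top bit, a signed column can no longer hold the result.

```python
SIGNED_BIGINT_MAX = 2**63 - 1

def uuid_short(server_id: int, startup_seconds: int, counter: int) -> int:
    # Layout per the MySQL docs: (server_id & 255) << 56, then the server
    # startup time in seconds << 24, then an incrementing counter.
    return ((server_id & 255) << 56) | (startup_seconds << 24) | counter

old_server = uuid_short(server_id=1, startup_seconds=1_600_000_000, counter=7)
new_server = uuid_short(server_id=200, startup_seconds=1_600_000_000, counter=7)

print(old_server <= SIGNED_BIGINT_MAX)  # True: always fit in the signed column
print(new_server <= SIGNED_BIGINT_MAX)  # False: overflows a signed BIGINT
```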
A better tip would be to use something like time-ordered UUIDs: you can't misuse them as something else, and the added bonus is that no one can iterate your DB records by just incrementing the URL.
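A hand-rolled sketch in the spirit of UUIDv7 (illustrative only, not a standards-compliant implementation): the leading bits are a millisecond timestamp and the rest is random, so IDs sort by creation time but are still not guessable.

```python
import os
import time
import uuid

def time_ordered_uuid() -> uuid.UUID:
    """48 bits of millisecond timestamp followed by 80 random bits."""
    millis = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    return uuid.UUID(int=(millis << 80) | rand)

a = time_ordered_uuid()
b = time_ordered_uuid()
assert (a.int >> 80) <= (b.int >> 80)  # later IDs never sort before earlier ones
```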
If you are implying that having an un-guessable URL secures your data, you might want to reconsider that approach. Your data should be secured server-side based on the actual auth in your app.
Those shorter strings can come with their own caveats. Long-form UUIDs have the nice property of being URL-safe - but IIRC, path pieces in HTTP urls are not meant to be case-sensitive, and browsers have historically tried to "help" by re-casing URLs. For correct semantics, short UUIDs have to go in the request parameters instead.
Most of the url (including the path pieces) is supposed to be case sensitive unless otherwise specified for a given protocol (and treating e.g. %3A the same as %3a). This should be the most recent RFC on the topic: https://tools.ietf.org/html/rfc3986#section-6.2.2.1
Not a bug, but I've encountered a consequence which is worse than a bug.
In a recent project I arrived at, I started seeing 612 in random places in the code. It was, naturally, the ID of a very specific user. Because it's such an easy ID to remember, it tempts sloppy programmers to just hardcode certain checks against a particular ID instead of following proper procedures. It was a bad project and a bad team, and after some 10 years or so, the code was flooded with 612 and a couple of other IDs.
Sure, you could avoid such a thing with code reviews and such, but then again it's better to simply not put the temptation in front of the programmers, isn't it?
I once debugged a case where the ID had left zero-padding so the whole ID was a fixed number of digits; for example, `238974` might be represented as `00238974`.
In a process that joined two tables, the match counts were off. Somehow, on one side of the join the ID was being converted to an integer and then back to a string, which stripped the zero padding, meaning that `00238974` failed to match `238974`.
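The mismatch is easy to reproduce: round-tripping the key through an integer type throws away the padding, so the string comparison in the join fails.

```python
padded_id = "00238974"

# Somewhere in the pipeline the ID was cast to int and back to str...
round_tripped = str(int(padded_id))

print(round_tripped)               # '238974'
print(padded_id == round_tripped)  # False: the join silently drops these rows
```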
I like that pattern too. I've tried using it in a rather ad-hoc way in C/C++ enums. I suppose the proper solution would be to have a header file, enum_starting_vals.h, to centralise them.
Personally I would also disallow anything below 32 (the ASCII control characters) to avoid having filenames contain escape sequences.
I have absolutely no need for a filename that contains an escape character, and would see this as a major bug, like his description of SQL injection. Better to fail fast.
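A minimal fail-fast check along those lines (the cutoff of 32 covers the ASCII control characters, which is where escape sequences live):

```python
def validate_filename(name: str) -> str:
    # Reject control characters (code points below 32) rather than trying
    # to sanitize them: fail fast, as with the SQL-injection analogy above.
    if any(ord(ch) < 32 for ch in name):
        raise ValueError(f"control character in filename: {name!r}")
    return name

validate_filename("report-2011.txt")          # fine
# validate_filename("evil\x1b[2Jname.txt")    # raises ValueError
```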
Serious answer: this is still very much in flux. We're not at the point where you'd get audible gasps from using these terms, but in certain audiences you'll get some raised eyebrows.
If you believe that having "black=bad, white=good" connotations built into our jargon is harmful in some way, then yes, you should strive to use alternate terms.
If you don't think that's harmful, and you're forceful in defending "black=bad, white=good" as being too established and too inconvenient to change, that's when you're likely to get pushback.
Personally I've been trying to use "deny-list" and "allow-list" (with partial success – changing your jargon is hard). They feel a bit clunky, but I suspect that clunkiness will fade with time.
It's not blacklist and whitelist because "black=bad, white=good"; black is the absence of light, hence the light is blocked, and white is obviously the opposite. So it's black because you are blocking things, and white because you aren't.
“Blacklist” and “whitelist” are confusing terms. Why does “black” mean “no”? There are plenty of reasons to use better terms in new projects other than “they upset people” – though that should be a good enough reason on its own.
You’re not wrong that a segment of the population has deemed this term offensive, but it’s maybe a very extreme fringe segment. A lot of self-proclaimed “woke” types would roll their eyes at this one. But the same language reformers did convince GitHub to rename “master” branches to “main”, so who knows.
Regardless, this article is from 2011, so it predates the discussion.
While I'll avoid the political nature, I do find the terms get misused. In Firefox extensions, the usage is most intuitive (based on prior usage) for security-related extensions like an ad blocker: anything on a whitelist is allowed (including its ads), and anything on the blacklist is not allowed (its ads are blocked).
But I have another extension that just modifies the format of some pages to make them easier on my eyes. It took experimentation for me to realize that items on the whitelist are modified, while those on the blacklist are not. My intuition told me whitelists are for sites that are good the way they are, and sites that are hard on the eyes should be put on the blacklist.
It's not the worst mix-up in the world, and perhaps many smarter than myself would have no such confusion, but I think it could be avoided with words that mean what they do. There's plenty of room in the extension configuration to put "apply to these sites" or "transformed", as well as "allow original format" or "unchanged."
What? Did I miss a memo or something? I would have hoped to get an actual answer to my question instead of rhetorical sighs (and downvotes). Note: I am not a native English speaker.
Yes, but why? How? I'm a non-native speaker as well, and these things might be obvious to you, but they aren't to me. Is it a "should not use it", "must not use it", "maybe", or "it's fashionable"? How bad would it be if I accidentally use it, more like a four-letter word or more like "well, he's a foreigner, he doesn't know"?
I'm the opposite of the other person: Use the known terms until someone complains. Very few people have a problem with `blacklist` just like very few people had a problem with `master` branch.
That's the problem with flamewars. You can't really recommend in any direction without baiting the flames. You'll just have to look for previous discussions and decide for yourself.
Yes, there was never some contract that we all signed, or something a majority of us formally voted on, that says "we're not supposed to call it a blacklist/whitelist anymore."
Just what some random groups decided and enforced at their own domains (companies, orgs, etc.).
It's also based on an American preoccupation with race issues, seeing everything through its own guilt-ridden history, concerns not relevant to other parts of the world (where code is written and English is also spoken, as a first or second language for IT).
The connotations of black/white as terms have nothing to do with slavery or blacks; the term blacklist was first used (first recorded, at least) in an English theater play, as the list of the enemies of the king (black alluding to shady, dark motives, etc., not to skin color), and its common colloquial use in the 20th century was also not about blacks and had nothing to do with slavery: it was the list where employers put union members, strikers, etc., not to hire.
It's better for people in the US to concentrate on fixing actual racial issues (from incarceration rates and cop shootings, to school funding, redlining, and loan access) than to play with words to pat themselves on the back.
People all over the world have applied black/white to certain things (sometimes inverted, too; e.g. in some Asian cultures white is associated with death), and it has nothing to do with the US practicing slavery, segregation, and racism against blacks.
We used those terms with these connotations for centuries before blacks were sold as slaves to pick your cotton, even at times when slaves were whites working for other whites (as in Ancient Persia, Greece, the Roman Empire, feudal times, and so on).
"As of 2011 Facebook is in the second group, and spends several milliseconds of CPU time sanitizing every display string on its way to the browser, which multiplies out to hundreds of servers worth of CPUs sitting in a datacenter paying the price for the invalid UTF-8 in the databases."
I can imagine how companies will be taxed extra for this somewhere in the (probably not so near) future.
Why not sanitize it once when accessed, write the sanitized result back to the database, and set a flag on the record? Use only pre-sanitized strings if the flag is set. After some time, when enough of your strings are flagged, run a sanitizer over the rest that hasn't been touched. That way you don't waste those milliseconds. However, I can imagine that this would use quite some storage bandwidth, because such a read is now a write as well.
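A sketch of that lazy migration (the row layout and sanitize() below are stand-ins): sanitize on first read, write the clean value back with a flag set, and skip the work on every later read.

```python
def sanitize(raw: bytes) -> str:
    # Stand-in: decode, replacing any byte sequences that aren't valid UTF-8.
    return raw.decode("utf-8", errors="replace")

def read_display_string(row: dict, save) -> str:
    if row["sanitized"]:
        return row["text"]                 # fast path: already clean
    clean = sanitize(row["text"])
    row["text"], row["sanitized"] = clean, True
    save(row)                              # this read becomes a one-time write-back
    return clean

row = {"text": b"caf\xe9", "sanitized": False}   # invalid UTF-8 from the database
print(read_display_string(row, save=lambda r: None))  # sanitizes, flags, writes back
print(read_display_string(row, save=lambda r: None))  # served pre-sanitized
```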