Just because the table scan is under some threshold doesn't automatically make it better. If a table scan takes 250ms vs 0.01 for an indexed lookup, you're still gonna have to justify to me why making silicon work that damn hard is worth even the electricity used. Are you inserting and deleting so many rows that maintaining the index is prohibitive? Do you have space concerns, and are not able to keep the index in memory, or maybe even on disk? The Bash history thing makes sense, because indexing is a lot of work. But otherwise, just use an index and move on to the next problem.
EDIT: Has anyone else forgotten to add an index, then noticed the performance regression a year later? That's always fun, because now adding the index could mean downtime, and even knowing if or how much is difficult because putting that much data somewhere else to test isn't trivial. No thank you.
For small enough tables doing a full scan will be faster than an index. This is also true for regular non-database applications like checking if an item is present in a small collection: it's faster to do a linear scan over a small vector than it is to do a lookup in a hash table or b-tree. With a linear scan (whether it's over data on disk or a data structure in memory) you don't have to do hashing or branching (except possibly to terminate the loop). With a binary search you basically get the worst possible case for the branch predictor, because if you're searching random values, whether you go right or left at each recursion level is basically random. This is true for in-memory data structures, and it's even more true on disk, since disks (even SSDs) are far better at linear scans than at random accesses.
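To make the comparison concrete, here's a minimal Rust sketch of the two approaches being contrasted; the collection, keys, and sizes are invented purely for illustration:

```rust
use std::collections::HashMap;

fn main() {
    // A small collection of (key, value) pairs.
    let pairs: Vec<(u32, &str)> = vec![(3, "three"), (7, "seven"), (11, "eleven"), (42, "answer")];

    // Linear scan: one tight loop over contiguous memory, no hashing;
    // the only branches are the equality test and the loop exit.
    let by_scan = pairs.iter().find(|(k, _)| *k == 42).map(|(_, v)| *v);

    // Hash lookup: hash the key, index into the table, resolve the bucket.
    // Asymptotically O(1), but the constant (hashing + indirection) is
    // exactly what the linear scan avoids at small sizes.
    let map: HashMap<u32, &str> = pairs.iter().cloned().collect();
    let by_hash = map.get(&42).copied();

    assert_eq!(by_scan, by_hash);
}
```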
It's been a while since I looked at this, but I think MySQL and Postgres both take this into account and will let you add indexes to small tables but will actually ignore them and do a full table scan for all queries if the table is small enough.
Yes, that's correct. If you do an explain for a tiny table, as long as your stats are up to date, the index will be ignored. In that case, it's there for insurance if the table grows in the future.
Benchmarking this (for non-database applications) is really interesting because you get to find the crossover point, and it's usually in the hundreds of items. An algorithm that searches at most 50 items in each of many hash tables could be made much faster by replacing those lookups with linear searches.
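A rough sketch of how you might hunt for that crossover point; for trustworthy numbers you'd want a proper benchmarking harness (e.g. criterion) rather than wall-clock timing like this, and the sizes and iteration counts here are arbitrary:

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    for n in [8usize, 32, 128, 512, 2048] {
        let keys: Vec<u64> = (0..n as u64).collect();
        let map: HashMap<u64, u64> = keys.iter().map(|&k| (k, k)).collect();
        let probe = (n as u64) / 2;

        // Linear scan over a contiguous vector.
        let start = Instant::now();
        let mut hits = 0u64;
        for _ in 0..100_000 {
            if keys.iter().any(|&k| k == probe) {
                hits += 1;
            }
        }
        let scan = start.elapsed();

        // Hash table lookup for the same key.
        let start = Instant::now();
        for _ in 0..100_000 {
            if map.contains_key(&probe) {
                hits += 1;
            }
        }
        let hash = start.elapsed();

        // Print hits too, so the optimizer can't discard the work entirely.
        println!("n={n:5} scan={scan:?} hash={hash:?} ({hits})");
    }
}
```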
Some of his examples aren't great, though. The cost of adding an index to a database you're already using and which has a reasonably small amount of data is low, and the benefit of getting a few hundred ms less latency for your users is underappreciated. A hundred million records is definitely above the limit at which e.g. prefix search on a mobile device becomes painful.
> With a binary search you basically get the worst possible case for the branch predictor, because if you're searching random values, whether you go right or left at each recursion level is basically random.
I think folks writing databases know a thing or two about writing faster search and sort algorithms.
I'm not sure what you are trying to say, given the context of what you quoted. Binary search has optimal worst-case runtime complexity, but for small tables it is, on real computers, still slower overall than a linear scan, which effectively has the worst possible worst-case runtime complexity (unless you get into pathological cases). This is because the constant in front of the runtime complexity, the one that O-notation explicitly ignores, starts to matter: branch prediction and cache locality come into play.
What exactly do you mean with "writing faster search and sort algorithms"?
Those are neat optimizations of the constant, but even with that, a linear scan can still be much faster under the right (and not unrealistic) circumstances.
But I see now you were referring to the branch predictor part specifically, so all good.
All of this looks accurate, but it's worth contextualizing: this is an optimization that bears the most fruit in "local frames of reference" -- the timescales where linear scans beat index lookups are likely to be strictly dominated by the network round-trip to the database. The conclusion is then that the optimization ~only optimizes for cases that effectively don't matter.
What do you think the threshold number is where scanning a list will outperform a hash table? 10 items? 100? 1000? For what it’s worth, for small data sets I agree with you. But I think it’s very hard to be calibrated correctly on what small means here.
Re: binary search, the biggest cost to scanning data in a database is loading that data from persistent storage. A few mispredicted branches aren’t going to matter much if it means you can do fewer reads.
> But I think it’s very hard to be calibrated correctly on what small means here.
It depends.
If you measure (or otherwise determine) the sizes of the L1/L2/L3 caches in the CPU, the relative speed and size of main memory, and the relative speed and size of local storage (in contrast with, for example, network-remote data), then it is a matter of performing some arithmetic, and, just to be on the safe side, some benchmarks (a rough sketch of that check follows below).
Any data set that fits into the CPU cache is definitely small, for a particular CPU.
If you want to decide a number that applies to every and all hardware, then it goes from hard to basically impossible.
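For what it's worth, here is a rough, Linux-only sketch of that arithmetic: read the cache sizes the kernel exposes in sysfs and compare them against the size of the data you intend to scan. The sysfs paths and the "32K"-style size strings are Linux conventions, and the 64-byte row size is just a guess for illustration:

```rust
use std::fs;

// Linux-only: cpu0's cache descriptions live under
// /sys/devices/system/cpu/cpu0/cache/indexN/{level,size}.
fn cache_sizes_bytes() -> Vec<(String, u64)> {
    let mut out = Vec::new();
    for idx in 0..8 {
        let base = format!("/sys/devices/system/cpu/cpu0/cache/index{idx}");
        let level = match fs::read_to_string(format!("{base}/level")) {
            Ok(s) => s,
            Err(_) => break,
        };
        let size = match fs::read_to_string(format!("{base}/size")) {
            Ok(s) => s,
            Err(_) => break,
        };
        // Sizes typically look like "32K" or "8192K".
        let size = size.trim();
        let bytes = match size.strip_suffix('K') {
            Some(n) => n.parse::<u64>().unwrap_or(0) * 1024,
            None => size.parse::<u64>().unwrap_or(0),
        };
        out.push((format!("L{}", level.trim()), bytes));
    }
    out
}

fn main() {
    let rows = 500usize;
    let row_bytes = 64usize; // guess at the size of one record
    let data_bytes = (rows * row_bytes) as u64;

    for (level, bytes) in cache_sizes_bytes() {
        let fits = data_bytes <= bytes;
        println!("{level}: {bytes} bytes, data set fits: {fits}");
    }
}
```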
The Rust BTreeMap implementation is an example of this as well. It does a linear search within each node instead of a binary search, because that's actually faster at node sizes that small!
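To illustrate the idea (this is a simplification, not the actual std implementation): a B-tree node holds a small, bounded number of sorted keys, so the search inside a node can be a plain left-to-right scan rather than a binary search:

```rust
// std currently uses B = 6, i.e. at most 2 * B - 1 = 11 keys per node
// (an internal detail that could change).
const NODE_CAPACITY: usize = 11;

struct Node {
    keys: [u64; NODE_CAPACITY],
    len: usize,
}

impl Node {
    /// Returns Ok(index) if `key` is present, Err(index) for the child
    /// slot to descend into -- found with a simple linear scan.
    fn search(&self, key: u64) -> Result<usize, usize> {
        for (i, &k) in self.keys[..self.len].iter().enumerate() {
            if k == key {
                return Ok(i);
            }
            if k > key {
                return Err(i);
            }
        }
        Err(self.len)
    }
}

fn main() {
    let node = Node { keys: [2, 3, 5, 7, 11, 13, 0, 0, 0, 0, 0], len: 6 };
    assert_eq!(node.search(7), Ok(3));
    assert_eq!(node.search(6), Err(3));
    println!("linear in-node search works");
}
```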
I am glad the query execution planner (mostly) takes the strain. When it doesn't, there is a 99% chance you need to promote an ORM-generated query to a human... well, probably GPT-generated, explicit one.
it's more than that: PostgreSQL will choose to ignore the index on a per-query basis; if the query will return more than some % of the table, it's faster to just do the scan.
Justify the dev time to save micro-pennies worth of electricity to me instead.
A typical naive index won't help with my regular expression based queries, which aren't easily accelerated by an index. Or given an in-memory index, you've just increased memory use from O(1) to O(N), and I'll OOM on large files. Perhaps you'll throw a database at the problem, complicating I/O (especially when the data is generated/accessed by third party tools that aren't database based), tying me to yet another library's update cycle, and perhaps forcing me to tackle the additional problem of cache invalidation. Perhaps I need reverse lookups, in which case whatever forward indexing I might've pessimistically added ahead of time will be of no help at all.
If it's a 5 second "this is probably the right choice" kneejerk reaction, maybe it's fine. Or if you're Google indexing the internet, I suppose. But I am frequently plagued by shit, buggy, useless indexes that merely distract from the proper Alexandrian solution to the Gordian knot - brute force - wasting more time than they ever saved, for both people and CPUs.
If you're already using a relational database you should almost certainly set up indexes on your table ids and foreign keys. But that's pretty different from the examples in the post!
I'm not anti-index, I'm anti-"if you ever have a full table scan in production you're doing it wrong".
> Do you need to know how fast I type before you run the calculation?
I'll assume 100WPM, call that two words, billed at $200/hour and call that $0.06, falling under "too cheap to be worth arguing against", which falls under the aforementioned:
>> If it's a 5 second "this is probably the right choice" kneejerk reaction, maybe it's fine.
That said, there's a decent chance those 6 cents won't pay for themselves if this is a login on a single user system, with any theoretical benefits of O(...) scaling being drowned out by extra compile times, extra code to load - and I'd be plenty willing to NAK code reviews that merely attempt to replace /etc/passwd and /etc/shadow with this, as the extra time code reviewing the replacement still has negative expected ROI, and it'll be a lot more involved than a mere dozen characters.
Now, maybe we're attempting to centralize login management with Kerberos or something, perhaps with good reasons, perhaps which does something similar under the hood, and we could talk about that and possible positive ROI, despite some actual downtime and teething concerns?
First tell me the minimum amount of time typing this would have to take for you to agree it's not worth it, and I will keep adding things - the time it takes for someone to ask you to do this, for you to start VS Code, find the file, press ctrl-s, deploy the changes, and possibly some estimate of how long it takes a new developer to read and understand this system vs a simpler one (please tell me how fast you speak, and an agreeable value for how fast the average developer reads, for this part) - until it goes over that limit.
> Justify the dev time to save micro-pennies worth of electricity to me instead.
The time spent justifying it is longer than the dev time itself. Any semi experienced engineer will throw basic indexes into their data model without even thinking and cover the most common use cases.
If they never use them… who cares? It took no time to add.
An RDBMS is not the best data store for all data. Sometimes flat files or other are the simplest tool for the job, as shown in the article. Adding a database to any of those tools would definitely not be worth the trade-off.
My statement assumed that as a starting point. To be fair, the decision to use a database or other format, given a reasonable amount of experience, is also a fairly quick thought process. By the time I open my editor to write my first line of code I’ve already made my choice.
> If a table scan takes 250ms vs 0.01 for an indexed lookup
I'm not sure whether the units of that latter number are still ms or now s or whatever, but either way isn't that where you are wrong? On real computers, there are lots of situations where linear access is trivially faster than a "better" data structure with theoretically better runtime complexity.
1000 accesses to a single page for a linear scan are going to be many, many orders of magnitude faster than 5 accesses chasing pointers across 5 pages that have no TLB entry, or, excruciatingly worse, that need to be paged in from disk (see the sketch below).
This is an extreme (but absolutely not unrealistic!) example for illustration. Slightly more subtle scenarios involving e.g. branch prediction have already been named by eklitzke.
This problem exists across multiple layers, not just close to the CPU like above. (As eklitzke also already mentioned with disk involvement, even if it's an SSD.)
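A rough sketch of that contrast, contiguous scan vs. pointer chasing through scattered allocations; the structures are invented for illustration and the absolute numbers are entirely machine-dependent:

```rust
use std::time::Instant;

// N separately allocated nodes linked by pointers, vs one flat buffer.
struct Node {
    value: u64,
    next: Option<Box<Node>>,
}

fn main() {
    const N: usize = 1_000_000;

    // Contiguous: one allocation, scanned front to back.
    let flat: Vec<u64> = (0..N as u64).collect();

    // Scattered: N separate heap allocations chained together.
    let mut head: Option<Box<Node>> = None;
    for v in (0..N as u64).rev() {
        head = Some(Box::new(Node { value: v, next: head }));
    }

    let start = Instant::now();
    let flat_sum: u64 = flat.iter().sum();
    println!("sequential scan: {:?} (sum {flat_sum})", start.elapsed());

    let start = Instant::now();
    let mut sum = 0u64;
    let mut cur = head.as_deref();
    while let Some(node) = cur {
        sum += node.value;
        cur = node.next.as_deref();
    }
    println!("pointer chasing: {:?} (sum {sum})", start.elapsed());

    // Tear the list down iteratively so the deep recursive Drop
    // doesn't blow the stack.
    let mut cur = head;
    while let Some(mut node) = cur {
        cur = node.next.take();
    }
}
```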
So, forgetting an index and then growth pushing you over the threshold is a valid concern. I think every dev has run into that at some early point in their career.
But what your comment is skipping past is that there's potentially a 100x or higher difference in throughput for sequential scans vs index lookups. If you know the data will have a bounded size, this means indexes aren't necessarily a good choice. SSDs have narrowed this gap a great deal, but it still exists because of latency and unpredictable prefetching. It even exists with pure in-memory applications. Another key aspect is how much of the data the query ends up touching. If you're hitting 25% of the data anyhow, a linear scan is likely faster.
There's also more niche ideas like arranging to convoy multiple queries along one linear scan, something impossible with indexed scans.
It's useful to know about this asymmetry and that sometimes a brute force linear scan is in fact the best tool.
There are situations where the indexes end up larger than, or the same size as, the actual data and the query doesn’t meaningfully benefit from having the data indexed because, for instance, all the data is going to be analyzed anyway, or the search type doesn’t index usefully with the normal index types (like geo searches, or clustering type queries).
Adding indexes that never get used has real costs on an ongoing basis with inserts/updates and schema updates too, as it adds potentially significant overhead on every operation and can make certain schema operations impossible without downtime.
Foreign key columns, ‘soft delete’ columns (like deleted/not), basic auditing type stuff (created on, updated on, etc.), ‘unique’ or external reference values like an order id or whatever (even if not a primary/unique key in the schema), and basic numeric/analysis columns are almost always worth indexing though, to your point.
Stuff that is not always a clear win without some real thinking: freeform text fields, structured binary data (like a PDF or image), geo location data without a clear existing use (it tends to require specialized index types which are also expensive to load/use), etc.
Many times some preprocessing is necessary anyway to convert what you have to what you actually want, and putting that in a secondary column to be indexed is far more valuable.
AWS is particularly bad with their performance credit system on RDS... and there's to my knowledge no way to tell MySQL to limit index creation IOPS, which means in the worst case you're stuck with a system swamped under load for days and constantly running into IO starvation, if you forget to scale up your cluster beforehand.
Even if the cluster is scaled to easily take on your normal workload, indexing may prove to be too much for IO burst credits.
You can use gh-ost (or most other online schema change tools) to get around that. You'll still have some necessary increase in write load since you are double-writing that table but you can control the pacing of the data backfill (gh-ost gives even more control as it doesn't use triggers to duplicate writes). It's not quite as simple as just running ALTER TABLE on the master though.