Just because the table scan is under some threshold doesn't automatically make it better. If a table scan takes 250ms vs 0.01 for an indexed lookup, you're still gonna have to justify to me why making silicon work that damn hard is worth even the electrical use. Are you inserting and deleting so many rows that maintaining the index is prohibitive? Do you have space concerns, and are not able to keep the index in memory, or maybe even on disk? The Bash history thing makes sense, because indexing is a lot of work. But otherwise, just use an index and move on to the next problem.

EDIT: Has anyone else forgotten to add an index, then noticed the performance regression a year later? That's always fun, because now adding the index could mean downtime, and even knowing whether it will, or how much, is difficult because putting that much data somewhere else to test isn't trivial. No thank you.




For small enough tables doing a full scan will be faster than an index. This is also true for regular non-database applications like checking if an item is present in a small collection: it's faster to do a linear scan over a small vector than it is to do a lookup in a hash table or b-tree. With a linear scan (whether it's for data on disk, or a data structure in memory) you don't have to do hashing or branching (except possibly to terminate the loop). With a binary search you basically get the worst possible case for the branch predictor, because if you're searching for random values, whether you go right or left at each recursion level is basically random. This is true for in-memory data structures, and it's even more true on disk, since disks (even SSDs) are heavily optimized for linear scans compared to random accesses.
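
Concretely, for a small in-memory collection the three approaches look roughly like this (Rust, purely illustrative):

    use std::collections::HashSet;

    // Linear scan: one pass over contiguous memory, no hashing, no pointer
    // chasing; the "found it?" branch is almost always false, so it predicts well.
    fn contains_linear(items: &[u32], needle: u32) -> bool {
        items.iter().any(|&x| x == needle)
    }

    // Hash lookup: O(1) on paper, but every probe pays for hashing the key
    // and then a load from wherever the bucket happens to live.
    fn contains_hash(items: &HashSet<u32>, needle: u32) -> bool {
        items.contains(&needle)
    }

    // Binary search: O(log n) comparisons, but each go-left/go-right decision
    // is a branch that's ~50% unpredictable for random needles.
    fn contains_binary(sorted: &[u32], needle: u32) -> bool {
        sorted.binary_search(&needle).is_ok()
    }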

It's been a while since I looked at this, but I think MySQL and Postgres both take this into account and will let you add indexes to small tables but will actually ignore them and do a full table scan for all queries if the table is small enough.


Yes, that's correct. If you do an explain for a tiny table, as long as your stats are up to date, the index will be ignored. In that case, it's there for insurance if the table grows in the future.


Benchmarking this (for non-database applications) is really interesting because you get to find the crossover point, and it's usually in the hundreds of items. An algorithm that searches at most 50 items in a hash table, but does so across many hash tables, could be made much faster by switching to a linear search.
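
A rough harness for finding that crossover might look like this (Rust; a sketch only, and the exact crossover depends heavily on key type, hit rate, and hardware):

    use std::collections::HashSet;
    use std::time::Instant;

    // Build with --release, or the numbers are meaningless.
    fn main() {
        for n in [8u32, 16, 32, 64, 128, 256, 512, 1024] {
            let vec: Vec<u32> = (0..n).collect();
            let set: HashSet<u32> = vec.iter().copied().collect();
            // Half the lookups hit, half miss.
            let lookups: Vec<u32> = (0..100_000u32).map(|i| i % (n * 2)).collect();

            let t = Instant::now();
            let hits_linear = lookups.iter().filter(|&&k| vec.contains(&k)).count();
            let t_linear = t.elapsed();

            let t = Instant::now();
            let hits_hash = lookups.iter().filter(|&&k| set.contains(&k)).count();
            let t_hash = t.elapsed();

            assert_eq!(hits_linear, hits_hash);
            println!("n={n:4}  linear={t_linear:?}  hash={t_hash:?}");
        }
    }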

Some of his examples aren't great, though. The cost of adding an index to a database you're already using and which has a reasonably small amount of data is low, and the benefit of getting a few hundred ms less latency for your users is underappreciated. A hundred million records is definitely above the limit at which e.g. prefix search on a mobile device becomes painful.


> With a binary search you basically get worst possible case for branch predictor, because if you're searching random values whether you go right/left on each recursion level is basically random.

I think folks writing databases know a thing or two about writing faster search and sort algorithms.


I'm not sure what you are trying to say, given the context of what you quoted. Binary search has optimal worst-case runtime complexity, but for small tables it is, on real computers, still slower overall than a linear scan, which effectively has the worst worst-case runtime complexity (unless you start to get pathological). This is because the constant in front of the runtime complexity, the one that O-notation explicitly ignores, starts to matter: branch prediction and cache locality come into play.

What exactly do you mean with "writing faster search and sort algorithms"?


> What exactly do you mean with "writing faster search and sort algorithms"?

- Branchless binary search: https://news.ycombinator.com/item?id=35737862 / https://news.ycombinator.com/item?id=14598098

- Static B-Trees for faster binary search: https://news.ycombinator.com/item?id=30376140

Presumably there's even more literature on the topic.
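
The core trick in the branchless variant is small enough to sketch here (my rough Rust rendition of the idea, not code lifted from those links):

    // Index of the first element >= x, or a.len() if there is none.
    // The comparison result is used as an integer, so the compiler can emit a
    // conditional move instead of a branch the predictor would have to guess at.
    fn lower_bound(a: &[u64], x: u64) -> usize {
        let mut base = 0usize;
        let mut len = a.len();
        while len > 1 {
            let half = len / 2;
            base += (a[base + half - 1] < x) as usize * half;
            len -= half;
        }
        if base < a.len() && a[base] < x { base + 1 } else { base }
    }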


Those are neat optimizations of the constant, but even with that, a linear scan can still be much faster under the right (and not unrealistic) circumstances.

But I see now you were referring to the branch predictor part specifically, so all good.


Which is why they use b-trees and not binary search.


Why do you think that? Seems like a hard problem.


All of this looks accurate, but it's worth contextualizing: this is an optimization that bears the most fruit in "local frames of reference" -- the timescales where linear scans beat index lookups are likely to be strictly dominated by the network latency of a round trip to the database. The conclusion is then that the optimization ~only optimizes for cases that effectively don't matter.


What do you think the threshold number is where scanning a list will outperform a hash table? 10 items? 100? 1000? For what it’s worth, for small data sets I agree with you. But I think it’s very hard to be calibrated correctly on what small means here.

Re: binary search, the biggest cost to scanning data in a database is loading that data from persistent storage. A few mispredicted branches aren’t going to matter much if it means you can do fewer reads.


> But I think it’s very hard to be calibrated correctly on what small means here.

It depends.

If you measure, or somehow determine, the sizes of the L1/L2/L3 caches in the CPU, the relative speed and size of main memory, and the relative speed and size of local storage (in contrast with, for example, network-remote data), then it's a matter of doing some arithmetic and, just to be on the safe side, some benchmarks.

Any data set that fits into the CPU cache is definitely small, for a particular CPU.

If you want to decide on a number that applies to any and all hardware, then it goes from hard to basically impossible.
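
As a sketch of that arithmetic (the cache sizes below are placeholder assumptions; substitute your own machine's numbers):

    // Assumed sizes for illustration only; look up your actual CPU's caches.
    const L1: usize = 32 * 1024;        // per-core L1 data cache
    const L2: usize = 1024 * 1024;      // per-core L2
    const L3: usize = 32 * 1024 * 1024; // shared L3

    fn how_small(items: usize, bytes_per_item: usize) -> &'static str {
        match items * bytes_per_item {
            s if s <= L1 => "fits in L1: a linear scan is almost certainly fastest",
            s if s <= L2 => "fits in L2: still small; benchmark to be sure",
            s if s <= L3 => "fits in L3: crossover territory",
            _ => "spills to RAM or disk: an index probably starts to win",
        }
    }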


The Rust BTreeMap implementation is an example of this as well. It does a linear search through the keys within each node instead of a binary search, because that's actually faster in this case!

https://doc.rust-lang.org/std/collections/struct.BTreeMap.ht...
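
Very roughly, the idea is something like this (my sketch of the approach, not the actual std code):

    use std::cmp::Ordering;

    // Keys inside one B-tree node live in a small contiguous array, so finding
    // a key (or the child edge to descend into) is just a short linear walk:
    // Ok(i) means keys[i] == key, Err(i) means "descend into child i".
    fn search_node<K: Ord>(keys: &[K], key: &K) -> Result<usize, usize> {
        for (i, k) in keys.iter().enumerate() {
            match key.cmp(k) {
                Ordering::Greater => continue,
                Ordering::Equal => return Ok(i),
                Ordering::Less => return Err(i),
            }
        }
        Err(keys.len())
    }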


Did we both just watch the latest Crust of Rust, or did you just know that? If the latter, I’m impressed!


>Did we both just watch the latest Crust of Rust

Yes that's exactly how I knew about this!


Interesting. But the doc doesn't go as far as to say it's faster, just fast - any measurements for eg string keys?


I can't seem to find any benchmarks for this but Gankra did a long write up about Rust's btree implementation!

I did not read this entire thing so it's very possible it contains a benchmark somewhere but I couldn't find any when doing a quick skim.

https://faultlore.com/blah/rust-btree-case/


Yes, but when the table is that small, even if the index is slower it hardly matters - precisely because the table is so small.

Or in other words: add the index every single time, unless you have so many writes and such a large table that maintaining the index actually slows the writes down.

Leaving out an index on a small table never helps anything.


I am glad the query execution planner (mostly) takes the strain. When it doesn't, there's a 99% chance you need to promote an ORM-generated query to a human-written (well, probably GPT-generated) explicit one.


It's more than that: PostgreSQL will choose to ignore the index on a per-query basis. If the query will return more than some percentage of the table, it's faster to just do the scan.


Justify the dev time to save micro-pennies worth of electricity to me instead.

A typical naive index won't help with my regular-expression-based queries, which aren't easily accelerated by an index. Or, given an in-memory index, you've just increased memory use from O(1) to O(N), and I'll OOM on large files. Perhaps you'll throw a database at the problem, complicating I/O (especially when the data is generated/accessed by third-party tools that aren't database based), tying me to yet another library's update cycle, and perhaps forcing me to tackle the additional problem of cache invalidation. Perhaps I need reverse lookups, in which case whatever forward indexing I might've pessimistically added ahead of time will be of no help at all.

If it's a 5-second "this is probably the right choice" kneejerk reaction, maybe it's fine. Or if you're Google, indexing the internet, I suppose. But I am frequently plagued by shit, buggy, useless indexes that merely distract from the proper Alexandrian solution to the Gordian knot - brute force - wasting more time than they ever saved, for both people and CPUs.


> Justify the dev time to save micro-pennies worth of electricity to me instead.

KEY (user_id)

I mean, it's a dozen characters. Do you need to know how fast I type before you run the calculation?


(author)

If you're already using a relational database you should almost certainly set up indexes on your table ids and foreign keys. But that's pretty different from the examples in the post!

I'm not anti-index, I'm anti-"if you ever have a full table scan in production you're doing it wrong".


‘If you ever have a non-deliberate full table scan in production you are doing it wrong’ then?


If you ever have non-deliberate anything in production, there's a non-zero chance you're doing it wrong.


> Do you need to know how fast I type before you run the calculation?

I'll assume 100WPM, call that two words, billed at $200/hour and call that $0.06, falling under "too cheap to be worth arguing against", which falls under the aforementioned:

>> If it's a 5-second "this is probably the right choice" kneejerk reaction, maybe it's fine.

That said, there's a decent chance those 6 cents won't pay for themselves if this is a login on a single-user system, with any theoretical benefits of O(...) scaling being drowned out by extra compile time and extra code to load. I'd be plenty willing to NAK code reviews that merely attempt to replace /etc/passwd and /etc/shadow with this, as the extra time spent reviewing the replacement still has negative expected ROI, and it'll be a lot more involved than a mere dozen characters.

Now, maybe we're attempting to centralize login management with Kerberos or something, perhaps with good reasons, perhaps which does something similar under the hood, and we could talk about that and possible positive ROI, despite some actual downtime and teething concerns?


Now document the two words, run the test suite to verify no breakage, commit to source control, and push to production.

Suddenly those two words cost a lot more than $0.06, and that's IF everything goes smoothly.


This falls under the “if we need to test that indexes still work on our mysql/postgres we have bigger problems” header.


Depending on your IO pattern, adding an index can make writes quite a bit slower.


First tell me the minimum amount of time typing this would have to take for you to agree it's not worth it, and I will keep adding things - the time it takes for someone to ask you to do this, for you to start VS Code, find the file, press Ctrl-S, deploy the changes, and possibly some estimate of how long it takes a new developer to read and understand this system versus a simpler one (please tell me how fast you speak, and an agreeable value for how fast the average developer reads, for this part) - until it goes over that limit.


> Justify the dev time to save micro-pennies worth of electricity to me instead.

The time spent justifying it is longer than the dev time itself. Any semi-experienced engineer will throw basic indexes into their data model without even thinking and cover the most common use cases.

If they never use them… who cares? It took no time to add.


An RDBMS is not the best data store for all data. Sometimes flat files or other formats are the simplest tool for the job, as shown in the article. Adding a database to any of those tools would definitely not be worth the trade-off.


My statement assumed that as a starting point. To be fair, the decision to use a database or other format, given a reasonable amount of experience, is also a fairly quick thought process. By the time I open my editor to write my first line of code I’ve already made my choice.


Sometimes, but I lean towards going with the RDBMS first, and then switch to flat text files if that proves to be a better choice.

Because 90% of the time the RDBMS is.


not sure if this is trolling?


> Justify the dev time

And this is exactly the sentiment that got us where we are.


The mistake in your statement, I think, is assuming there is a course of action that wouldn’t result in a problem somewhere.

It may look a little different, but there is nothing without tradeoffs.

Including indexing everything.


> If a table scan takes 250ms vs 0.01 for a indexed lookup

I'm not sure whether the units of that latter number are still ms or now s or whatever, but either way isn't that where you are wrong? On real computers, there are lots of situations where linear access is trivially faster than a "better" data structure with theoretically better runtime complexity.

1000 accesses to a single page for a linear scan are going to be many, many orders of magnitude faster than 5 accesses chasing pointers across 5 pages that have no TLB entry, or excruciatingly worse, that need to be paged in from disk.

This is an extreme (but absolutely not unrealistic!) example for illustration. Slightly more subtle scenarios involving e.g. branch prediction have already been named by eklitzke.

This problem exists across multiple layers, not just close to the CPU like above. (As eklitzke also already mentioned with disk involvement, even if it's an SSD.)


If your table is that small won't the index also be a single page? I don't understand this scenario.


So, forgetting an index and then growth pushing you over the threshold is a valid concern. I think every dev has run into that at some early point in their career.

But what your comment is skipping past is that there's potentially a 100x or higher difference in throughput between sequential scans and index lookups. If you know the data will have a bounded size, this means indexes aren't necessarily a good choice. SSDs have narrowed this gap a great deal, but it still exists because of latency and unpredictable prefetching. It even exists in purely in-memory applications. Another key aspect is how much of the data the query ends up touching: if you're hitting 25% of the data anyhow, a linear scan is likely faster.

There are also more niche ideas, like arranging to convoy multiple queries along one linear scan - something that's impossible with index scans.

It's useful to know about this asymmetry and that sometimes a brute force linear scan is in fact the best tool.
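
To make the asymmetry concrete, here's a rough in-memory sketch (Rust; the numbers vary wildly by machine, and the gap gets much bigger once disk is involved):

    use std::time::Instant;

    // Build with --release. ~16M u64s = 128 MiB, well beyond any CPU cache.
    fn main() {
        let n: usize = 1 << 24;
        let data: Vec<u64> = (0..n as u64).collect();

        // Sequential: the hardware prefetcher sees the pattern and streams it in.
        let t = Instant::now();
        let mut sum = 0u64;
        for &x in &data {
            sum = sum.wrapping_add(x);
        }
        println!("sequential: {:?} (sum={})", t.elapsed(), sum);

        // Scattered: same number of touches, but most of them miss the caches.
        let t = Instant::now();
        let mut sum = 0u64;
        let mut state: u64 = 1;
        for _ in 0..n {
            // cheap LCG, just to jump around unpredictably
            state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
            sum = sum.wrapping_add(data[(state % n as u64) as usize]);
        }
        println!("scattered:  {:?} (sum={})", t.elapsed(), sum);
    }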


I agree - for 90% of cases.

There are situations where the indexes end up larger than, or the same size as, the actual data and the query doesn’t meaningfully benefit from having the data indexed because, for instance, all the data is going to be analyzed anyway, or the search type doesn’t index usefully with the normal index types (like geo searches, or clustering type queries).

Adding indexes that never get used has real costs on an ongoing basis for inserts/updates and schema updates too, as it adds potentially significant overhead to every operation and can make certain schema operations impossible without downtime.

Foreign key columns, ‘soft delete’ columns (like deleted/not), basic auditing type stuff (created on, updated on, etc), ‘unique’ or external reference values like an order id or whatever (even if not a primary/unique key in the schema), and basic numeric/analysis columns are almost always worth indexing though, to your point.

Stuff that is not always a clear win without some real thinking: freeform text fields, structured binary data (like a PDF or image), geolocation data without a clear existing use (it tends to require specialized index types which are also expensive to load/use), etc.

Many times some preprocessing is necessary anyway to convert what you have to what you actually want, and putting that in a secondary column to be indexed is far more valuable.


Agree but you can always add an index without downtime. It just becomes a money and complexity issue ;)


can you do this (on a large table) without adding significant IO load/clogging up replication for an extended period of time?


It’s worth noting that if your DB instance is so heavily loaded that this is a real concern, you already have a huge problem that needs fixing.


It may be so loaded from all the full table scans it's doing.


AWS is particularly bad with their performance credit system on RDS... and there's to my knowledge no way to tell MySQL to limit index creation IOPS, which means in the worst case you're stuck with a system swamped under load for days and constantly running into IO starvation, if you forget to scale up your cluster beforehand.

Even if the cluster is scaled to easily take on your normal workload, indexing may prove to be too much for IO burst credits.


That does seem like a real problem! Adding indexes periodically is a pretty regular thing for any production system where I come from.


You can use gh-ost (or most other online schema change tools) to get around that. You'll still have some necessary increase in write load since you are double-writing that table but you can control the pacing of the data backfill (gh-ost gives even more control as it doesn't use triggers to duplicate writes). It's not quite as simple as just running ALTER TABLE on the master though.


I have never had any problems with CONCURRENT index creations under significant load using Postgres, fwiw



