“We’ll add an index when it gets slow” is saying “screw the guy who’s on call the night this system starts timing out”.
Invisible cliffs in your code that it will fall off at some unknown point in the future are the antithesis of engineering.
If you deliberately aren’t implementing indexing, know at what point indexing would begin to matter. Put in place guard rails so the system doesn’t just blow up unexpectedly.
There are 100% cases where you shouldn’t use indexes. That’s not the same thing as being able to get away without an index.
> Invisible cliffs in your code that it will fall off at some unknown point in the future are the antithesis of engineering.
I think this describes the general shape of a problem seen in any engineering domain. Past a certain (unknown) point in time, we can no longer guarantee certain properties of the system. This is why we add all kinds of qualifiers and constraints in our discussions with the customer.
Certainly, you can make some predictions about how tables will grow over time, but there are also potential headaches if you arbitrarily add indexing. "We might need to expire records to manage growth," so you go ahead and index that CreatedUtc column. Turns out, the usage patterns of your app changed such that that table is now experiencing 20x more insert volume than you could have ever anticipated, and your prediction is now a gigantic liability. At 20x insert volume you would have _never_ indexed that column; you would have used some alternative path involving temporary tables, etc. Now you are stuck with essentially the same problem: a judgement call at 2am about whether or not to drop an index while the rest of your team is in bed.
Since I am unable to predict the future, I feel like waiting until the actual problem arises and then dealing with it at that moment is the best option. Strategically, a missing index will likely fail more gracefully than an index that is bottlenecking requests during peak load (i.e. instantaneous vs. cumulative data volume).
The key to sleeping at night is to add metrics and alarms near those cliffs or 'edges of the map' so they aren't invisible anymore. It's hard to anticipate every angle to this but in any case where you're making this kind of assumption or tradeoff it's a good idea.
A simple example is setting an alarm on your max load test traffic. When you get that alarm you know your system is now operating in unknown territory and it's probably time to do another load test or at least take a close look at scaling.
Largely agree with this take; it often becomes "add an index because the query might be slow" without much discussion of the trade-offs around query patterns. There's a lot of "what if" over-engineering where you end up designing for hypothetical scenarios.
Look at your real-world use cases and query patterns, and think hard about it.
> screw the guy who’s on call the night this system starts timing out
This was a very small billing practice, and that person was going to be me. I thought then, and still think now, that I made a reasonable trade-off between the work required at the time and potential urgent work in the future.
Additionally, this wasn't the sort of thing that would fail suddenly when you hit a critical point. Instead, running full text searches (a very small fraction of what the database was handling) would just slow down in a way that would have been very clear to the users.
Stealing time from future you also doesn’t pay off. Future you wants to take vacations and change jobs and have family events and sleep and stuff.
It doesn’t take much work to do the back-of-the-envelope math:
How long should these queries take? Less than two seconds?
How much slower do they get as record counts increase? 100ms every 250,000 records? Okay, so this will become intolerably slow when the record count hits about 5 million. When’s that going to happen? Never? Then we’re done here. Within a few months? Well let’s plan for that now. Within a few years? Maybe not within the lifetime of this system? Judgement call. Let’s put something in place that forces us to at least look back at this before it gets bad. Even if that’s just a google calendar reminder.
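That arithmetic can be written down directly. The 100ms/250,000-record figures are the hypothetical from the comment above, and the growth-rate numbers below are equally made up, just to show the shape of the calculation:

```python
# Back-of-envelope: if each additional 250,000 records adds ~100 ms to
# the query, when does it cross a 2-second tolerance? All numbers are
# hypothetical, taken from or in the spirit of the comment above.

tolerable_ms = 2000
ms_per_batch = 100
records_per_batch = 250_000

records_at_cliff = (tolerable_ms / ms_per_batch) * records_per_batch
print(f"Intolerably slow at ~{records_at_cliff:,.0f} records")  # ~5,000,000

# Compare against a (hypothetical) growth rate to turn that into a date:
records_per_month = 50_000
current_records = 800_000
months_left = (records_at_cliff - current_records) / records_per_month
print(f"Roughly {months_left:.0f} months until the cliff")
```

Five minutes of this tells you whether the answer is "never", "next quarter", or "set a calendar reminder".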
When building systems we are always making trade-offs between the present and the future. It's only "stealing time from future you" when you make a bad trade-off; otherwise it's prioritization.
In this case, leaving out an index for full text search meant I didn't need to maintain the software for that index either, which would have been stand-alone at the time I was building this. Possibly this even was a choice that was, in expectation, time-saving for future me.
Feeling a little bit like you’re applying a motte-and-bailey argument here.
The bold claim in the article was that there are many circumstances where adding an index isn’t necessary. Diverse examples were given. These included a MySQL database, where adding an index is no additional maintenance overhead whatsoever (FULLTEXT idx in the table DDL). The implication was that there are many circumstances that affect many people where even cheap and easy to implement indexes were not needed.
Now, when challenged, you’re retreating to the more defensible position that when you’re querying an ad hoc data structure where you would have to implement your own index, that is often a bad trade-off. Or maybe it was a bad trade-off because this was a small app you were building that you knew you would be the long-term maintainer of.
But that’s not what I’m talking about, is it? And I don't know all the specific circumstances of your single specific project where this was the right choice, so we’re not going to get anywhere.
> These included a MySQL database, where adding an index is no additional maintenance overhead whatsoever (FULLTEXT idx in the table DDL)
MySQL didn't add support for FULLTEXT until v5.6, released 2013-02, a few years after I was working on this. At the time if I had wanted a full text search index it would have needed to be an additional service.
Looking now (no longer on my phone) it's a bit more complex than that: while MySQL has supported FULLTEXT indexes since 3.23.23 (2000-09-01) [1] if you wanted to use InnoDB (and you probably did -- it was much better than MyISAM [2]) you initially couldn't use FULLTEXT. That was added in v5.6 [3], and at the time I was developing this software the standard option was to set up Sphinx.
I've edited the post to add some of this history, so future readers understand this was about whether to add a dependency on an external indexing service.
Most systems get scrapped and never go to production. Early stage is validation and search for a useful concept. It often pays off very well to steal time from the future.
Indices are far more likely to cause a cliff than to remove one, IME. Full table scans are linear and grow linearly. Indices are crazy fast until one day they're suddenly not.
I don't think I've ever seen such a cliff where the database is using the same query plan?
There's sometimes small incremental slow downs as you pass btree depth thresholds and have to add more, but tbh even that's not usually noticeable.
I have seen plenty of instances where the database decides an index is no longer the right access path, but gets that wrong, and the result is terrible. But I see that as a different issue.
> There's sometimes small incremental slow downs as you pass btree depth thresholds and have to add more, but tbh even that's not usually noticeable.
Even this kind of "narrow" slowdown can be substantial. Well, I don't know specifically that it's depth thresholds, but that kind of behaviour - the table passes a certain size and it's suddenly taking 30% longer to insert/read than it did when it was 2% smaller.
> I have seen plenty of instances where the database decides an index is no longer the right access path, but gets that wrong, and the result is terrible. But I see that as a different issue.
IMO it's the same issue, at least from the perspective of the poor chump on call.
> Even this kind of "narrow" slowdown can be substantial. Well, I don't know specifically that it's depth thresholds, but that kind of behaviour - the table passes a certain size and it's suddenly taking 30% longer to insert/read than it did when it was 2% smaller.
This has usually been a memory problem for us - our hot dataset no longer fits in it.
As the number of cliffs increases, so does the likelihood that you will hit a second cliff in the middle of fixing the first one. At some point you hit a Kessler Syndrome situation where people start quitting and the people who should have fixed the problem in the first place aren't around anymore.
Nobody wants to stick around for a lecture they can see coming.
In enterprise software, I've noticed systems tend to work in a WUCT'ed-up manner. Things may slow down over time, but no one complains about it because "the software is just slow", then suddenly one day you hit the timeout of some higher layer and the software is completely and utterly broken.
Well, this isn't a valid reason. It's not the software that is causing the WUCT.
If you let WUCT run free in any complex process, you are just asking for everything to break all the time and for people to do nothing but fire-fighting. So you are much better off dealing with the causes of the WUCT instead of insisting that software, or any other component, scale forever. (Because it never will, and you will just make something else break.)
WUCT is one of the main reasons why we monitor system operations.
Everything has a critical threshold. Even your perfect db schema will still eventually run out of disk. There are no magic solutions that work on all scale levels.
Yup. Quarter end reporting has been going swimmingly for years but this time the orders query is hitting a connection timeout and the numbers are due tomorrow.
That threshold doesn’t gradually creep up on you, it hits you all at once at the worst possible time.