Demystifying Database Performance for Developers (crunchydata.com)
173 points by winslett on May 6, 2022 | 29 comments



Generally I'd say do the same as you would with your compiler when trying to optimize your code: spend time with the query analyzer.

What the query optimizer will or won't do depends on your database software and your data, in addition to your query of course.

For example, the article states "publish_date < now() - interval '3 days'" is always faster than "publish_date + interval '3 days' < now()".

Well, not for the database server we use, because it evaluates now() per row, turning both into table scans... yay.
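
In Postgres, at least, where now() is handled as a per-statement constant, you can see the difference yourself with EXPLAIN. A minimal sketch, assuming a hypothetical articles table with a plain index on publish_date:

  -- Sargable: the indexed column stands alone on one side of the
  -- comparison, so the index on publish_date can be used.
  EXPLAIN ANALYZE
  SELECT * FROM articles
  WHERE publish_date < now() - interval '3 days';

  -- Non-sargable: the per-row expression hides the column from
  -- the index, typically forcing a sequential scan.
  EXPLAIN ANALYZE
  SELECT * FROM articles
  WHERE publish_date + interval '3 days' < now();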

In addition to the suggestions in the article, I'd say use common table expressions (CTEs), or sub-queries if your database server doesn't have CTEs or treats them as an optimization barrier.

I've found the optimizer tries to be smart but doesn't always recognize which of the joins will be the most specific filter, causing it to join tons of rows just to throw most of them away with a later join. Using a CTE or sub-query can force the optimizer's hand.

Also I've often found it useful to use CTEs or sub-queries for returning row values, typically aggregates. This can especially work well if you're limiting the rows returned using TOP or LIMIT. Again this can avoid the DB throwing away work.
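
For instance, in Postgres 12+ you can request the optimization-barrier behavior explicitly with MATERIALIZED (a sketch with a hypothetical schema; older Postgres versions always materialized CTEs):

  -- Apply the most selective filter first, in isolation, so the
  -- planner can't reorder it behind a less selective join.
  WITH recent_orders AS MATERIALIZED (
    SELECT id, customer_id
    FROM orders
    WHERE created_at > now() - interval '1 day'
  )
  SELECT c.name, r.id
  FROM recent_orders r
  JOIN customers c ON c.id = r.customer_id;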


What database server evaluates now() per row?


In Postgres terminology, all functions have a volatility classification: IMMUTABLE (input x will always return y, forever), STABLE (input x will always return y for the duration of the current statement), and VOLATILE (input x can return anything). That classification impacts how the query gets planned and optimized. All the DBs I know have some version of this.

now() should be STABLE in pretty much every db, but not every db gracefully handles STABLE function calls in predicates.
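
For illustration, here's how volatility is declared in Postgres (a minimal sketch; the cutoff_date function and articles table are hypothetical):

  -- STABLE tells the planner the result won't change within a
  -- statement, so it can be computed once and compared against
  -- an index, rather than re-evaluated per row.
  CREATE FUNCTION cutoff_date() RETURNS timestamptz AS $$
    SELECT now() - interval '3 days'
  $$ LANGUAGE sql STABLE;

  SELECT * FROM articles WHERE publish_date < cutoff_date();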

You probably know all this but figured I'd share the database jargon for everyone else.


That seems like it would cause a lot of problems, other than the optimization.


That's how my colleague described it to me; it seems he misspoke or I misremembered.

I checked the documentation and it says now() is evaluated at most once per request.

The point was that now() and friends prevent optimization on the server we use, because they're not treated as constants.


It depends on context; for example, this is true for SQL Server if you have a filtered index based on a datetime column.

If you use a constant or set a @variable ahead of time, it will use the filtered index, but if you just use the now() equivalent (in SQL Server, GETDATE()) inline, it will not. (You may also need to use OPTION (RECOMPILE).)
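
Roughly, the workaround looks like this (a T-SQL sketch; table, index, and column names are hypothetical):

  -- A filtered index covering only recent rows:
  CREATE INDEX IX_Articles_Recent
    ON dbo.Articles (PublishDate)
    WHERE PublishDate > '20220101';

  -- Compute the cutoff ahead of time instead of calling
  -- GETDATE() inline in the predicate:
  DECLARE @cutoff datetime = DATEADD(day, -3, GETDATE());

  SELECT Id, Title
  FROM dbo.Articles
  WHERE PublishDate > @cutoff
  OPTION (RECOMPILE);  -- recompiles with @cutoff's actual value,
                       -- so the plan can match the filtered index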


I'm of the opinion that non-deterministic queries should be avoided in the first place.


This is a good intro! It's nice to have quick articles like this that you can read easily to pick up the basics.

It's also pretty easy to build your own basic in-memory database if you have the time/interest to demonstrate these characteristics to yourself through experiment. And doing it experimentally ingrains the concepts better than just reading about it, IMO.


How might one go about this?

Do you have any relevant articles?


I highly recommend "How Query Engines Work" by Andy Grove (author of the Apache DataFusion query engine + the Apache Ballista distributed scheduler)

It's written in Kotlin and is very intelligible + succinct (99 pages):

https://leanpub.com/how-query-engines-work

https://github.com/andygrove/how-query-engines-work

As a self-assessment, I rewrote the entire implementation from a columnar/vector-based engine (it uses Arrow) to a row-based one using JDBC types, and it took me only two read-throughs of the book to be able to do that.

There's also "Database Design and Implementation" by Edward Sciore



I feel like the use of indexes is one of the least mysterious parts of database performance.


I don't.

Well, to be honest, for a long time I had just assumed that the database "took care" of that kind of stuff. In reality, the people who managed the database had indexed columns for us, and/or provided guard-rails for us developer folks in the form of stored procedures. For a long time I never worried about indexes because I didn't have to. I still don't have even the slightest idea what "a cursor" is.

It's only when you _actually_ have to do it yourself that you learn, sometimes harshly, the nature of databases.


There are things like ottertune that do this.


Does it work?

Seems it's now a commercial enterprise that has sprung from a CMU DB group project. An old version's still on GitHub: https://github.com/cmu-db/ottertune

They also had Peloton, a "self-driving SQL database management system", that's a dead project now.


A cursor lets you step through a table one row at a time. You can use it to move between rows and read and update them.
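
A minimal sketch in T-SQL (table and column names hypothetical):

  DECLARE article_cursor CURSOR FOR
    SELECT Id, Title FROM dbo.Articles;

  DECLARE @id int, @title nvarchar(200);

  OPEN article_cursor;
  FETCH NEXT FROM article_cursor INTO @id, @title;

  WHILE @@FETCH_STATUS = 0
  BEGIN
    -- process the current row here, one at a time
    FETCH NEXT FROM article_cursor INTO @id, @title;
  END

  CLOSE article_cursor;
  DEALLOCATE article_cursor;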


Superficially… being able to tell from a query plan which index is missing or misconstructed isn't always obvious, and that's where speed-ups are usually needed the most.


In the sense that it's a sorted tree pointing to data in other trees, sure. But less obvious is how they're used by cost-based optimizers and in execution plans, which is where all the real nuance is.


I remember being surprised by how slow GROUP BY year(date), month(date) was in MySQL, despite the date column being properly indexed.

The solution was to store each of those in separate indexed columns instead of re-computing them at query time. I think I set up an INSERT trigger to add these to a separate "index" table that I then joined to get the GROUP BY to be fast.
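
A sketch of that approach (MySQL; table and column names are hypothetical, not the original schema):

  CREATE TABLE event_dates (
    event_id INT PRIMARY KEY,
    y SMALLINT NOT NULL,
    m TINYINT NOT NULL,
    INDEX (y, m)
  );

  -- Precompute year/month once, at insert time:
  CREATE TRIGGER events_after_insert
  AFTER INSERT ON events
  FOR EACH ROW
    INSERT INTO event_dates (event_id, y, m)
    VALUES (NEW.id, YEAR(NEW.date), MONTH(NEW.date));

  -- The GROUP BY can now run off the (y, m) index instead of
  -- recomputing YEAR()/MONTH() for every row:
  SELECT d.y, d.m, COUNT(*)
  FROM events e
  JOIN event_dates d ON d.event_id = e.id
  GROUP BY d.y, d.m;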


That concept was a big part of the data modeling story when I was helping people understand how to use Riak, a distributed key/value store. Generate your results as you receive the data, whenever possible, and run batch processes to generate them when you can’t.


Better databases will let you create indexes on whatever functions of a field you want.
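
For example, an expression index in Postgres (names hypothetical; note the extra parentheses each expression needs, and that the expressions must be immutable, which holds for EXTRACT on a plain date column):

  CREATE INDEX events_year_month_idx
    ON events ((EXTRACT(YEAR FROM date)), (EXTRACT(MONTH FROM date)));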


MySQL has supported index expressions for nearly four years, since MySQL 8.0.13.

Prior to that, you could accomplish the same thing indirectly by creating a virtual column with the desired expression, and then creating an index over that virtual column. This has been supported for six years, since MySQL 5.7. So at this point all non-EOL versions of MySQL support this.
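
Sketches of both approaches (table and column names hypothetical):

  -- MySQL 8.0.13+: a functional index directly over expressions.
  CREATE INDEX idx_year_month
    ON events ((YEAR(date)), (MONTH(date)));

  -- MySQL 5.7: a virtual generated column, then a plain index on it.
  ALTER TABLE events
    ADD COLUMN y SMALLINT AS (YEAR(date)) VIRTUAL,
    ADD INDEX idx_y (y);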


The query planner, optimizer, etc. are still black boxes. As a table's size grows, the query plan, and thus the way a query runs, can change. In some systems, stale statistics on tables can get you weird performance.

The biggest hang-up: some devs see the DB like anything else and think lookups are O(1) when looking for a key. But B-trees are more like O(log(n)).


All memory access is O(log(n)) latency for cache misses, though. You really shouldn't need to worry about B-trees being used instead of hash maps or something like that. For joins, using sorted and even B-tree-based data structures can give asymptotic benefits, due to being able to skip keyspace ranges, potentially even at higher levels of the B-tree.


Good article. Story of my career: Accepting as well as teaching that a faster CPU won't make my database faster.


Did a bunch of migrations against two Oracle databases once for different environments of the same app. Same table structure, same indices, exact same SQL for migrations, pretty similar data (e.g. tables with ~40k and ~50k records in each of the environments).

And yet, a SELECT query that worked great against one environment essentially hung on the other, with lots of CPU usage. No locks to speak of. SELECTs against individual tables worked without issues, but whenever two particular tables appeared together in a JOIN, it all broke down.

Worked:

  SELECT ... FROM A
  WHERE ...;
Worked:

  SELECT ... FROM A
  JOIN B
  WHERE ...;
Worked:

  SELECT ... FROM A
  JOIN C
  WHERE ...;
Stalled:

  SELECT ... FROM A
  JOIN B
  JOIN C
  WHERE ...;
Tried re-creating the indices, did not help. Tried looking at the data, did not help. Killed all of the sessions and restarted the app, did not help. Inspected query plans, they were different - for the same query and the same indices. Made the DB re-generate the table statistics, that solved everything. It's unsettling that the process didn't happen in the background after the migration and that issues like this can eventually surface.

There are some things that you just cannot easily demystify, because they don't make a lot of sense, given that everything had worked properly for years previously without this kind of an issue. Just know that depending on the RDBMS that you use, theory will be different from what you'll see in practice once you venture past the very basics.

On a similar note, Oracle lets you generate indices automatically if you so choose, something that I discovered some time later and used to add hundreds of missing indices for noticeable performance gains. Now, I hate the Oracle implementation of it (you cannot manually delete those indices, and the entire set of functionality is behind a paywall, much like the DB itself), but personally I think that dynamic index optimization is something that should be present in PostgreSQL and MySQL/MariaDB as well.

Perhaps not to the point of creating new indices, but just telling you that you might want to add certain ones: "If you add new index X by executing the code Y, then the performance should increase by Z%, based on the analysis that has been conducted in the background over the past W months, with Q queries analyzed."


"It's unsettling that the process didn't happen in the background after the migration and that issues like this can eventually surface."

It's something you learn the hard way (I did, on a prod DB whose performance suddenly dived after months of good work, with clients worried and phoning us). Stats are expensive to update, which is why it doesn't happen automatically. Just something you now know for next time.
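
For reference, the manual refresh looks something like this (schema and table names hypothetical):

  -- Oracle:
  EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'ORDERS');

  -- Postgres equivalent:
  ANALYZE orders;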


Shameless plug(s). To address both of the problems you described, I have developed the following Postgres extensions.

Index Adviser can be used to automatically analyze your workload, and it will suggest the indexes that might help your queries perform better.

Postgres Plan Guarantee (under development) helps you generate+freeze a plan, and ensure that the Query Optimizer will never pick a different plan for your query.

Postgres Index Advisor: https://github.com/DrPostgres/pg_adviser

Postgres Plan Guarantee: https://github.com/DrPostgres/pg_plan_guarantee


Great drawings! Thanks!



