Deconstructing the Postgres planner to find indexing opportunities (pganalyze.com)
101 points by Tomte on Nov 4, 2021 | 16 comments



I used to do this the "old fashioned" way:

* log the queries from the application

* parse the queries into an AST

* collate the ASTs to form the most common query patterns

* cross-reference existing indexes

* create index recommendations with and without predicates

that worked pretty well, right up until you have to decide which index type is going to be best.
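
for anyone wanting to try something similar, here's a rough sketch of the first few steps using pg_stat_statements instead of log parsing (it already normalizes and groups queries for you), assuming the extension is installed:

    -- most common (normalized) query patterns, by call count
    SELECT queryid, calls, query
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 20;

    -- cross-reference against the indexes that already exist
    SELECT indexrelid::regclass AS index_name,
           indrelid::regclass  AS table_name,
           pg_get_indexdef(indexrelid) AS definition
    FROM pg_index;

it's not as precise as collating full ASTs, but it surfaces the common query shapes quickly.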

this is pretty darned neat in comparison. I do hope they open-source it like they did pg_query.


(author here)

Nice approach! The latter part of that workflow is where we are focused now, i.e. now that we can plan and recommend B-tree indexes for individual queries, how do we utilize this to recommend a set of indexes for a whole database (taking into account different predicate variations).

On open-sourcing this code: Not at this moment - we do however offer a WASM build of this logic for free use at https://pganalyze.com/index-advisor


> The latter part of that workflow is where we are focused now, i.e. now that we can plan and recommend B-tree indexes for individual queries, how do we utilize this to recommend a set of indexes for a whole database

what I've ended up doing is approximating the number of rows, creating dummy data with a similar spread, and then creating indexes and querying until I have a good understanding of exactly how each index behaves at that data volume. It ends up being a lot of work, but sometimes a 50% performance increase is worth it.
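
for example, something like this (made-up table and distributions) to mimic the production row count and skew, then test a candidate index:

    -- dummy table approximating production's row count and value spread
    CREATE TABLE orders_test AS
    SELECT g AS id,
           (random() * 1000)::int AS customer_id,     -- roughly uniform
           CASE WHEN random() < 0.95 THEN 'shipped'
                ELSE 'pending' END AS status,         -- heavily skewed
           now() - (g || ' seconds')::interval AS created_at
    FROM generate_series(1, 1000000) g;

    ANALYZE orders_test;

    CREATE INDEX ON orders_test (customer_id) WHERE status = 'pending';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders_test
    WHERE customer_id = 42 AND status = 'pending';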

> On open-sourcing this code: Not at this moment

completely understandable. The selfish side of me hopes it will be open-sourced eventually, but the practical side sees the benefit to your company of keeping it private for now.


> create index recommendations with and without predicates

I messed around with Postgres hypothetical indexes[1], which was a cheap way to create a lot of them and see what happens.
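
e.g. (table and column names made up):

    CREATE EXTENSION IF NOT EXISTS hypopg;

    -- hypothetical index: no disk writes, created instantly
    SELECT * FROM hypopg_create_index(
        'CREATE INDEX ON orders (customer_id) WHERE status = ''pending'''
    );

    -- plain EXPLAIN (without ANALYZE) will consider hypothetical indexes
    EXPLAIN SELECT * FROM orders
    WHERE customer_id = 42 AND status = 'pending';

    SELECT hypopg_reset();  -- discard them all when done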

[1] - https://hypopg.readthedocs.io/en/rel1_stable/


those also require real or simulated data to get good recommendations, which is sometimes hard. if you're going to go that far, you might as well create real indexes.

besides, in some environments access to the real data (if there's PII) is much more difficult to get, so simulating is as far as you can go.


This is one of the features MS SQL Server has, and it works very well. It can suggest indexes with pretty good accuracy.

Moreover, at SIGMOD 2019 there was a paper where they do this automatically in Azure, forking the databases and workloads! https://www.microsoft.com/en-us/research/uploads/prod/2019/0...

not sure if this is where pganalyze is going; if not, it's certainly worth a read. Whatever you end up building, it may be a good idea to architect it in a way that makes Azure-style auto-indexing possible.


> This is one of the features MS SQL Server has, and it works very well. It can suggest indexes with pretty good accuracy.

Not really: it's just a comma-delimited list of columns to consider indexing, sorted by the column order in the table. It just looks scientific to folks because they don't realize that. https://dba.stackexchange.com/q/208947/426
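
you can see this directly in the missing-index DMVs those suggestions come from; the "recommendation" is literally comma-separated column lists (a sketch - check your version's docs for the exact columns):

    -- SQL Server's missing-index DMVs: equality/inequality/included
    -- columns are plain comma-separated lists, in table column order
    SELECT d.statement AS table_name,
           d.equality_columns,
           d.inequality_columns,
           d.included_columns,
           s.user_seeks
    FROM sys.dm_db_missing_index_details AS d
    JOIN sys.dm_db_missing_index_groups AS g
      ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats AS s
      ON s.group_handle = g.index_group_handle
    ORDER BY s.user_seeks DESC;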


Thank you very much for this insight about MS SQL!

My point about building an indexing hint tool for auto-indexing in PostgreSQL still stands, though: it could be great for cloud deployments.


I tried to use your index advisor, but lots of functions present in my query are not supported (bool_or, numeric, etc.). Moreover, it crashes on big, complex queries. I have also tried to set up the pganalyze agent, but it's been a few days and I'm still not able to set up Log Insights for a self-hosted Postgres (and our Postgres logs do work). Basically, we spent our whole trial just trying to set up the agent, and it's still not done.

As feedback, I think you should invest more in the onboarding phase, and not count trial days until logs are actually collected. Honestly, even recording a screencast of installing pganalyze on a DigitalOcean Docker host would be enough.


(author here)

Thanks for trying out the pganalyze app! Sorry to hear about the challenges with the log monitoring setup - happy to restart the trial and do a screen share to debug. You can email me directly at lukas AT pganalyze.com

On the index advisor errors: Sounds like you are running into errors with cases where the planner calls functions (e.g. for type conversion, expression simplification). We've recently improved some cases there, and for numeric specifically I know there is a pending bugfix. Happy to take a closer look.


I use this "visualizer" all the time: https://explain.dalibo.com/

Clean UI, free.
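
if you haven't used it: you generate a plan like this and paste the output in (query is just an example):

    EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
    SELECT * FROM orders WHERE customer_id = 42;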


> We utilized libclang to automatically extract source code from Postgres, just like we've done for pg_query.

How?


(author here)

Just realized that could use a more direct link (will add). Here is the relevant section from our post on pg_query:

https://pganalyze.com/blog/pg-query-2-0-postgres-query-parse...

The quick summary: call libclang via the Ruby FFI bindings to parse the C code into an AST and locate function/symbol definitions, build a tree of dependent code, remove all code that is not referenced, and add a few stubs where there is direct disk access.


Thanks for that info. I was looking recently at extracting the parsing layer so this will be pretty interesting.


I believe DuckDB has also extracted the Postgres parser into a separate package.


pg_query is very nice! But I'm not sure how valuable it is to have the standalone analyzer. Why not just ask the database? You need the statistics and the catalog to do the math anyway, don't you?



