FWIW, Marginalia Search has had nearly a billion rows in a table. It runs on a single PC with 128 GB of RAM.
It's survived several HN death hugs.
Though I would say there are things you can do with a 20M-row table that you can't with a 1B-row table. If a query can't use an index, it will take hours. Even SELECT COUNT(*) takes like 10 minutes.
Generally speaking, why does SELECT COUNT(*) take so long? I'd expect the database to maintain internal bookkeeping structures with table metadata that include the number of rows in each table.
I reckon this is probably not true? If so, is it because keeping a counter like that up-to-date would be inefficient?
Edit: I just realized I might be misunderstanding what that query does
While not quite "SELECT APPROXIMATELY COUNT(*)", both MySQL and PostgreSQL offer metadata tables with approximate (albeit completely unfiltered) total row counts.
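For example, something like this (schema and table names are placeholders):

    -- MySQL: estimated row count from the table statistics
    SELECT TABLE_ROWS
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'big_table';

    -- PostgreSQL: the planner's row estimate from pg_class
    SELECT reltuples::bigint AS approx_rows
    FROM pg_class
    WHERE relname = 'big_table';

Both are effectively free to read, but they're only as fresh as the last statistics update.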
Many DB systems have some sort of HLL function to provide a similar approximation (although I think you're overestimating the costs that MVCC imposes on large datasets).
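For instance, BigQuery and Spark SQL expose an APPROX_COUNT_DISTINCT / approx_count_distinct aggregate backed by HyperLogLog (the table and column here are just placeholders):

    -- BigQuery-style syntax; trades a small error bound for a much cheaper aggregation
    SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;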
Essentially, what SELECT COUNT(*) does in InnoDB is choose the smallest index and fully scan it, in parallel if there's no WHERE clause.
Meanwhile the primary key is typically the largest index, since with InnoDB's clustered index design, the primary key is the table. So it's usually not the best choice for counting unless there are no secondary indexes.
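Rough sketch of what that means in practice (hypothetical table and column names): a narrow secondary index gives the optimizer a much smaller structure to walk than the clustered PK.

    -- The secondary index stores only (created_at, PK), so it's far
    -- narrower than the full clustered index.
    ALTER TABLE big_table ADD INDEX idx_created_at (created_at);

    -- With no WHERE clause, InnoDB counts by scanning the smallest
    -- available index, which is now idx_created_at.
    SELECT COUNT(*) FROM big_table;

It's still a full scan of that index, so it won't be instant on a billion rows, but it's a lot less I/O than walking the clustered index.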
As other commenters mentioned, the query also must account for MVCC, which means properly counting only the rows that existed at the time your transaction started. If your workload has a lot of UPDATEs and/or DELETEs, this means traversing a lot of old row versions in UNDO spaces, which makes it slower.
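A minimal two-session illustration of that snapshot behavior under InnoDB's default REPEATABLE READ (table and columns are hypothetical):

    -- Session A
    START TRANSACTION;
    SELECT COUNT(*) FROM big_table;  -- first read establishes the snapshot

    -- Session B
    INSERT INTO big_table (id, val) VALUES (1000000001, 'x');
    COMMIT;

    -- Session A again
    SELECT COUNT(*) FROM big_table;  -- same count as before: rows committed
                                     -- after the snapshot are filtered out
                                     -- via the undo log

The more concurrent churn there is, the more of that version-filtering work the count has to do.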