keep in mind TPC-C scales out in a trivial manner. It was designed before distri...

jordanlewis · on Sept 2, 2022

Actually, the designers of TPC-C were ahead of their time - they deliberately designed TPC-C to not be trivially shardable. The new order transaction, which is the heart of the workload, is required to read from outside its home shard 1% of the time [0] for each item in a new order, and an order has 10 items on average. So roughly 10% of new order transactions should end up being cross-shard.
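To make the "roughly 10%" arithmetic concrete, here is a quick back-of-the-envelope sketch. It assumes each order line independently picks a remote warehouse with probability 1%, that the item count is uniform over 5 to 15 (average 10), and that every warehouse lives on its own shard. The function names are made up for illustration.

```python
# Assumptions (labelled, not taken from any implementation):
# - each order line is remote with probability 1%, independently
# - the number of items per order is uniform over 5..15 (mean 10)
# - every warehouse maps to a different shard
REMOTE_PROB = 0.01
ITEM_COUNTS = range(5, 16)


def p_cross_shard(n_items, p_remote=REMOTE_PROB):
    """Probability that at least one of n_items order lines is remote."""
    return 1.0 - (1.0 - p_remote) ** n_items


def average_cross_shard():
    """Average cross-shard probability over the assumed item-count range."""
    return sum(p_cross_shard(n) for n in ITEM_COUNTS) / len(ITEM_COUNTS)


if __name__ == "__main__":
    print(f"10 items: {p_cross_shard(10):.4f}")
    print(f"average over 5..15 items: {average_cross_shard():.4f}")
```

Under these assumptions the figure comes out slightly below 10%, since several remote items in one order still count as a single cross-shard transaction. If several warehouses share a shard, the real cross-shard rate is lower still.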

A correctly implemented TPC-C workload does actually need to issue cross-shard transactions. There are a lot of "TPC-C-like" workloads that don't actually implement this behavior, though, or other harder-to-handle attributes like foreign keys and sleep/wait time.

TPC-C is really an enduring, interesting benchmark.

[0]: See section 2.4.1.5, item 2 of the spec https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c...

AdamProut · on Sept 2, 2022

That's true, I should have been a bit more specific.

Each individual query runs within a single shard as far as I recall (it's been a few years...). There may be transactions (like new order) that run multiple queries such that the transaction reads and writes to more than one shard. This doesn't change how easy the workload is to scale out by much, though. Harder-to-scale workloads require a single query to run across many (often all) shards, or require data reshuffling for joins (there is no single good shard key that keeps all joins shard-local). In these cases adding more nodes or shards will make each query a bit slower.
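A toy sketch of that distinction, assuming a hypothetical modulo placement of warehouses onto shards (no real database routes exactly like this; all names are invented for illustration):

```python
# Hypothetical routing model: warehouse_id % num_shards picks the shard.
def shard_for(warehouse_id, num_shards):
    """Map a warehouse id to a shard under the assumed modulo placement."""
    return warehouse_id % num_shards


def shards_for_transaction(warehouse_ids, num_shards):
    """Shards touched by a transaction made of single-warehouse queries."""
    return {shard_for(w, num_shards) for w in warehouse_ids}


def shards_for_unkeyed_query(num_shards):
    """A query with no usable shard key has to fan out to every shard."""
    return set(range(num_shards))


if __name__ == "__main__":
    for num_shards in (4, 16, 64):
        keyed = shards_for_transaction([1, 2], num_shards)
        unkeyed = shards_for_unkeyed_query(num_shards)
        print(num_shards, len(keyed), len(unkeyed))
```

In this model a new-order-style transaction that touches a home and one remote warehouse hits at most two shards no matter how many shards exist, while the fan-out of the unkeyed query grows with the cluster.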

derekperkins · on Sept 2, 2022

That matches many, if not most, real-world workloads. When hotspots happen, you just continue to shard, as shards don't have to be equally sized, even if that means moving a single warehouse id to a dedicated node and scaling it up vertically.

AdamProut · on Sept 2, 2022

Yep, lots of real-world workloads shard very well. That wasn't my point. My point was that "most" distributed SQL databases that do sharding will do very well on TPC-C. You can search Google for equivalently impressive results for many distributed SQL databases. Each of them can keep adding more warehouses to the TPC-C workload, along with more hosts and shards to the database, to push the tpmC results into the millions.