The Expensify test was targeted at demonstrating the upper limits of SQLite itself. The hardware used:
> 1TB of DDR4 RAM
> 3TB of NVME SSD storage
> 192 physical 2.7GHz cores (384 with hyperthreading)
And a very simple read-only query w/o overlapping rows, with the full dataset already loaded in memory:
> The query itself just sums a range of 10 rows randomly located inside 10B row (447GB) two column database of random integers, one of which is indexed and the other not. This means that the test is fully cached in RAM, and the database is precached prior to the test. No writes are done during this test, only reads.
All that is good and fine for that purpose -- to demonstrate the limits of what SQLite can do in scaling UP.
The purpose of the PlanetScale post was demonstrating linear scalability by scaling OUT using a realistic workload (TPC-C is a warehouse / ecommerce site simulation) on small commodity VMs (ec2 isntances).
As noted, you're comparing two very different tests that had very different goals.
i strongly dislike planetscale and even i think this is not a useful or fair comparison .. sqlite works with local files, much simpler, no network latency, no HA
> i strongly dislike planetscale and even i think this is not a useful or fair comparison .. sqlite works with local files, much simpler, no network latency, no HA
To be fair, if you accomplish the same effect without needing distributed computing, then the solution gets much simpler, you cut out worrying about network latency and HA*
* the probability of node failure grows very rapidly with the number of nodes. Same reason you probably don't need RAID for 1 disk, but you definitely do for a dozen
If you actually crunch the numbers, you'll see that the likelihood of multiple node simultaneous failure of a larger cluster is actually often as high, or higher than the likelihood of a single node failure on a non-distributed application.
Despite running redundant distributed set-ups, services today really don't have particularly spectacular availability. Google, Facebook, Twitter, Github; all go down or brown out frequently.
No one is running their OLTP and batch processing workloads on sqlite (bedrock or not) at scale. There may be unique cases for specific situations but show me an org with large and broad concurrency demands running sqlite at scale and I'm happy to be proven wrong.
No, it's pretty much an sqlite problem. As good as it is for limited concurrency situations (and synthetic benchmarks like this expensify blog), it does not compare to mysql or postgres for mediating hundreds or thousands of transactions per second.
Maybe, maybe not, but that wasn't the claim here. It was just a shitty argument against sqlite, one that could equally apply to mysql, lisp, questdb, apl, and a host of other tech if it were valid. Organizations choose tech stacks for a ton of reasons, most of which are only loosely related to performance and maintenance characteristics. The (proposed) fact that most organizations don't scale out sqlite doesn't say much at all about whether doing so I'd a good idea.
My personal DB solution actually averages 45 GiB / second on my laptop on a single thread and scales 1-1 with threads. It's way faster than SQLite. This is painfully mediocre performance.
You have to make engineering trade-offs for this, though. In my case, you can only recover 1 KiB records. It doesn't support compression. And you can only write to /dev/null.
To read data from this DB (which I built with command-line tools any Unix has), run:
dd if=/dev/zero of=/dev/null bs=1K count=${count}
You can vary the $count parameter to get more data.
So what you're saying is that the same type of queries were ran in expensify's case and in planetscale's case, therefore we should not be excited for running 1m queries per second using MySQL?
I know their devs (or the vitess devs) are on here...
No foreign keys are supported - are there any other schema level constraints that are not supported that would also enable scaling out and achieving these numbers?
1. You can't have single database view of all shards
2. No distributed ACID
3. No distributed JOINs
If I were to do it in PostgreSQL 14 natively without any other third party plugins, I can get help of Table Partitioning (say root server) + Tables on Foreign Servers (say leaf servers) through FDW. Parallel scans and JOINs are allowed. Also, PostgreSQL 15 (in beta now) will allow parallel commits to foreign server tables through FDW.
1. You can't have single database view of all shards
- Vitess does query planner and routing for the application so application does not have to know about shards. Yes, Vitess gives a view of a single database.
2. No distributed ACID
- This has been talked a lot feature and it is in our roadmap to provide. Currently, we have a twoPC implementation which we do not recommend for production use.
Right now the users of vitess either choose their sharding key in such a way that all the queries in a transaction goes to single shard or the application can handle best effort multi shard transaction.
3. No distributed JOINs
- Vitess does support distributed joins, aggregation, subqueries and other complex queries. Vitess thrive to be 100% Mysql compatible. If anyone encouter such query, a github issue helps in knowing and prioritizing.
> If I were to do it in PostgreSQL 14 natively without any other third party plugins, I can get help of Table Partitioning (say root server) + Tables on Foreign Servers (say leaf servers) through FDW. Parallel scans and JOINs are allowed. Also, PostgreSQL 15 (in beta now) will allow parallel commits to foreign server tables through FDW.
I wish you all the best in operating that at scale and providing a reliable general purpose PostgreSQL DBaaS (even if only internally) around it. :-) PostgreSQL is great, not going to argue otherwise -- every solution has tradeoffs and if the tradeoffs of one solution fit your usage best then go with it. If you need to scale out and manage a large cluster of PostgreSQL instances as if it was a single logical database then I'd recommend something that's in a similar space as Vitess and PlanetScale -- Citus and Citus on Azure respectively. At the hand waving level it's always easy to "do this, and that", but it's a whole other matter when you develop a general purpose production-grade product that anyone can use and operate at virtually infinite scale. :-)
There is also Aurora Postgres, though I've heard it still suffers from vacuum issues for some workloads.
In my limited experience DB scaling usually necessitates knowing some specific quirks of whatever implementation is chosen. Otherwise their side effects or sharp edges will cause problems.
Postgres FDW [1] seems to me to be a crazy cool (i.e. unexpected and good) feature of Postgres, but I haven't seen much chatter about it on HN in the rdbms stories.
Has anyone here used it? What is/was your experience like?
FDW is not nearly as usable as you think. Joins of even 2 or 3 remote tables with more than a few hundred rows in the join are enough to make simple queries take minutes to return even if you tune all the settings and have statistics of remote tables.
I know this is not as care free as in NewSQL dbs like CockroachDB but its not that difficult either.
Choosing a shard key ultimately comes down to breaking down your ER diagram and inspecting the incoming write flow to certain tables which are bound to grow.
In any of the cases you don't have to let your user suffer from downtime. At least I have not. Even if you chose wrong shard key, You can gracefully transfer or re-distribute the data records by creating new tables and renaming then with old ones. The window of appearant downtime here is few milliseconds. This is not something which will happen frequently.
Yes, I read it and that is what I meant with "even if you tune all the settings and have statistics of remote tables".
I have tried all the options, if you have any remote join which is more than a few rows you'll have a bad time.
I'm not even talking about changing the shard key, even growing the cluster isn't easily done and much less with "millisecond latency". Unless you use Citus which technically isn't a pure PostgreSQL solution.
Many (in fact I would bet most) other AWS services themselves use DynamoDB under the hood for various control-plane and data-plane uses-cases. It is not unimaginable that the actual traffic to DynamoDB service in total at regular daily peaks would be in the nine or ten digits per second.
Granted, these are across different AWS accounts, across different tables but still delivered by a single logical service.
keep in mind TPC-C scales out in a trivial manner. It was designed before distributed databases existed. The workload itself is perfectly sharded on warehouse_id. All joins happen on this column and the query load is uniformly distributed on this key. This makes it perfect for distributed sql databases to scale out by sharding on warehouse_id (no hot spots and all joins and transactions will be local to one shard in the cluster)
This is one of the big reasons TPC-E was developed as a successor to TPC-C. It’s much harder to scale it out.
Actually, the designers of TPC-C were ahead of their time - they deliberately designed TPC-C to not be trivially shardable. The new order transaction, which is the heart of the workload, is required to cause a read outside of a shard 1% of the time [0] for each of the 10 on average items in a new order. So, 10% of the time on average, the workload should emit a cross-shard transaction.
A correctly implemented TPC-C workload does actually need to issue cross-shard transactions. There are a lot of "TPC-C-like" workloads that don't actually implement this behavior, though, along with other harder to handle attributes like foreign keys and sleep/wait time.
TPC-C is really an enduring, interesting benchmark.
That's true, I should have been a bit more specific.
Each individual query runs within a single shard as far I recall (its been a few years...). There maybe transactions (like new order) that do multiple queries such that the transaction does read and writes to more then one shard. This doesn't really change how easy the workload is to scale out too much though. Harder to scale workloads require a single query to run across many (often all) shards or require data reshuffling for joins (there is not one good shard key that keeps all joins shard local). In these cases adding more nodes or shards will make each query a bit slower.
That matches many, if not most most real world workloads. When hotspots happen, you just continue to shard, as they don't have to be equally sized, even if you shard a single warehouse id to a dedicated node and vertically scale it up.
Yep, lots of real world workloads shard very well. That wasn't my point. My point was that "most" distributed SQL Databases that do sharding will do very well on TPC-C. You can google search for equivalently impressive results for many distributed SQL databases. Each of them can keep adding more warehouses to the TPC-C workload along with more hosts and shards to the distributed SQL database to push the TPC-C TpmC results into the millions.
I think as headline "one million queries per second" is not particularly exciting these days. MySQL 8 had shown more than 2mil simple queries/sec few years ago and more than 1 million just IO bound queries
The great thing with PlannetScale and Vitess is what it is horizontally scalable while MySQL itself is not, though you need higher headline number for things to get exciting.
Too bad it doesn't support a whole bunch of MySQL features which probably makes it a non starer for us. Some of them are quite boring, like auto increment.
I'm curious about the latency between a backend app that runs on different cloud provider than aws(eg hetzner), because I've noticed that they run on aws(it shows up when choosing the region for your database). Has anyone used this and connected from a different cloud provider? Have you noticed big latencies?
You will generally run into different types of latencies in this case since you have the general connection latency when those are getting established plus the regular physical/network latency between where the database is located and your own servers.
For connections, since a TLS handshake is required, the impact of physical distances can have a greater impact on the connection time. The following article: https://sking7.github.io/articles/44961356.html actually provides a good 3.5x-4x figure which correlates with some connection tests I've completed.
In other words, if an initial TCP packet takes ~100ms to get from the database to your server, then establishing a TLS connection to the database will probably be around 400ms.
Once the connection is established, running queries over an open connection is generally going to be quicker, at least for simpler queries. More complex queries will still take whatever time they need to process on the database end before they can start sending results back so results will generally vary there.
But going back to that 100ms example...if the amount of data being returned from a very simple query is minimal than the response time here would be very close to that 100ms figure over an already open connection and likely would go up from there depending on the complexity of the query and amount of data needing to be returned.
Since the connection hostnames are publicly accessible and TLS is always required for connections you can easily test from your own provider's location. So long as the general physical location isn't too far away from a supported region, the latency overall shouldn't be unusable.
I may have mangled some terminology/analogies above but hopefully that helps provide a bit of a ballpark for you. If you have specific to/from regions in mind I might be able to try and collect some specific numbers for you!
I haven't spent time optimizing TLS between a database client and server, but in HTTPS, using TLS 1.3 without early data (or TLS 1.2 with somewhat optimistic handshake handling) gets you to one added roundtrip, TLS 1.3 early data gets you down to zero added round trips. Early data isn't always appropriate, because there's potential for replays, but the latency improvement might be worth considering for some selects.
I'm not an expert on the TLS 1.3 but the 0-RTT feature seemed like it wasn't implemented by a lot of clients so the new QUIC protocol used in HTTP/3 seems to be the workaround for that. The following recent comment and the first video linked actually had some great related info that I was recently reviewing on that topic:
https://news.ycombinator.com/item?id=32572825
I don't know if the MySQL protocol itself though be able to utilize the TCP-based TLS 0-RTT functionality or not however so connecting via a regular client may still end up with a lot of the back and forth handshaking.
The newer serverless driver for JavaScript has some opportunities to take advantage of QUIC within HTTP/3 in the future as Matt mentioned over here recently:
https://news.ycombinator.com/item?id=32513043
So that will be interesting to continue seeing how it evolves/improves over time.
It's depends on how far the two data centers are from one another. But the big cloud providers have data centers in similar locations since they usually want to be near an internet exchange[1]. For example most companies that provide cloud services have a data center in Virginia, USA. So if both data centers are in the same region then you can get numbers in the sub 5ms range. As long as you're not crossing the ocean the latency should be fine.
Given the restrictions I’d be curious to hear if anyone has compared the two approaches:
1. Something similar to the article
2. Use a generic key value store, pipeline the data to a rdbms like SQLite or Postgres. Use the kv store for all simple queries, delegate to the rdbms for joins/complicated queries
If you used a cluster of syclladb or foundationDB you could get similar read/write performance but those are kv stores.
If you happen to shard by customer you could duplicate keys and values into separate databases per tenant. You could effectively reindex common queries to use kv directly and have others go to rdbms.
—-
More generally I wish there was a rdbms with easily pluggable storage engines or one where you can use a key value store as the storage.
I'm unfamiliar with vitess, but what exactly is the achievement here?
If you have a shared-nothing architecture, you can keep indefinitely adding nodes to get more throughput. You can easily do a billion qps if you wanted to.
The downside is that you have to eliminate relations (or at least don't expect them to be performant/scalable).
You understand the point. It's a real point. Show me a modern relational database that can scale this predictably on the cloud with this level operability.
A relational database without relations is an oxymoron. As folks pointed out, you also have to throw ACID away. So what's left of the original database, SQL-dialect? I bet that gets limited too.
Look, I get it, you have to sell your product. Some folks want semi-infinitely scalable storage, and they don't understand that the only way to achieve it is turning their DB into a key-value store. As a side effect they would have to rewrite their whole application from scratch, but they would only realize it after they get vendor locked in.
You can advertise your solution as MySQL-compatible. And I can claim that it's dishonest.
> A relational database without relations is an oxymoron.
OK. You're the only one talking to this straw man though. :-) Every Vitess user that I'm aware of has a pretty typical 2NF/3NF schema design. A small sampling of them being listed here: https://vitess.io
You setup your data distribution/partitioning/sharding scheme so that you have data locality for 99.9999+% of your queries -- meaning that the query executes against a data subset that lives on a single shard/node (e.g. sharding by customer_id) -- and you live with the performance hit and consistency tradeoffs for those very rare cases that cross shard queries cannot be avoided (Vitess does support this). You should do this even if the solution you're using claims to have distributed SQL with ACID and MVCC guarantees/properties. There's no magic that improves the speed of light and removes other resource constraints. In practice most people say they want perfect security/consistency/<name your desired property here> but then realize that the costs (perf, resources, $$, etc) are simply so high that it is not practical for their business/use case.
I know MySQL fairly well (I started working at MySQL, AB in 2003) and you can certainly claim that "MySQL-compatible" is dishonest but I would offer a counter claim that either you don't know this space very well or you're not operating in good faith here.
To be fair, I skimmed through your docs and did misread them initially: I thought you don't allow foreign keys, but you actually don't allow foreign key constraints.
If you are still allowing JOINs within a shard, then I need to apologize.
To pile on your answer a bit the manual bucketing you describe is exactly how ClickHouse works in most cases. It won't allow joins / IN on multiple distributed tables--i.e., sharded/replicated tables--unless you explicitly set a property called distributed_product_mode. [0] This is to prevent you from shooting yourself in the foot either by bad performance or improperly distributed data.
This constraint will eventually be relaxed but most apps are able to work around it just fine. The ones that can't use Snowflake.
Indeed. Unfortunately the terminology is so easy to confuse. I have heard the term "relationship" used to describe foreign keys to distinguish it from "relation". But usage is inconsistent and everyone who has heard of "relational databases" but has not heard of "relational algebra" and "relational calculus" will think "relational" refers to foreign keys.
Sure, but what you are describing is no longer a multi-tenant application/database. It's essentially a single-tenant deployment of your tech stack per customer. Which is not very cost effective.
I would still consider it a multi-tenant system. It's a single database to manage, which distributes your customers using their identity as a partitioning key.
Why is that? It is very unlikely to answer more random requests with more nodes. You have to pay for the communication that is required in the cluster to locate the data. It will only be faster if you have some special knowledge and are able to partition the data in a smart way. Which will allow shortcuts when storing or receiving data.
I'm sure you could find some edge-case scenarios YOU dont like with MySQL but all round its a pretty sensible and stable choice if thats what you want to work with.
IDK that rename-table corruption bug was brutal around 8.14 or so. Configured properly (legacy defaults were kind of janky) and only running vetted versions is probably OK.
I feel like this will inevitably lead to some not-very-productive discussions, as it already seems to have started in some sibling comments.
Personally, I have to admit that there's definitely a trend of looking more in the direction of PostgreSQL, which is a pretty great choice - it follows SQL standards pretty well, has adequate performance, advanced procedural language with support for custom types and even nice projects like PostGIS, which let you work with geospatial data (previewing which is even integrated in pgAdmin). In addition, it has support for JSON (though personally I find the syntax uncomfortable) and other plugins as well, whilst being able to scale both up and down pretty far.
That said, I still think that MySQL/MariaDB is a sensible choice for lots of use simpler use cases, where you don't necessarily need custom types and advanced in-database processing with procedural languages (which in my opinion can be risky to begin with, mostly due to challenges in regards to testing and debugging such logic, outside of cases where you need to process a large amount of data and sending it over the network would be disastrous).
MySQL/MariaDB has a simpler structure (databases and users, instead of databases, schemas and users), good performance, very good driver support, excellent tooling (in my opinion, MySQL Workbench is ahead of pgAdmin, especially reverse/forward engineering and the whole ER visualization functionality), passable resource usage (a bit more RAM at idle and under load than PostgreSQL in my experience, at least with default container setup which you may/may not want to alter) and a nice ecosystem around it - be it clustering solutions or integrations with almost any language that you might want to use.
Also, projects like TiDB are interesting to behold, even though personally I wouldn't want to give up foreign key support for dynamic clustering.
At the end of the day, I think that either is a better choice than being stuck with Oracle which just feels more cumbersome and isn't as easy to run wherever you need it, licensing being a concern all of the sudden (and Oracle XE not letting you use certain features either, such as dynamic indices, which you might want to test on your dev environment/locally). Also, the modelling functionality on SQL Developer leaves a lot to be desired.
Cannot really comment on the SQL Server offering, because haven't worked a lot with it.
For anyone who's on the fence about what RDBMS to use, just pick whatever looks interesting from the top 10-20 here, do some benchmarks, read what others have said and come to your own conclusions for any given project: https://db-engines.com/en/ranking
It's hard to go absolutely wrong nowadays with any of the popular technologies (that have proven that they can survive for a decade or longer and still receive updates), unless you really misinterpret what would be a reasonable fit for a given workload.
The largest and mot successful companies in the world are still powered by MapReduce.
It doesn't mean it's a great technology anymore... it just means it's incredibly hard to migrate tech stacks once you're a huge and successful company.
"Hadoop" is never going to die. Even if it fades into the background, some vestigial remnant within Spark will use a HDFS wrapper to access S3 or something like that.
But MapReduce will eventually, finally, kick the can.
This was also accomplished from just 1 server, not needing 40 shards like blog post.
[0] https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...