How Uber Manages a Million Writes per Second (highscalability.com)
133 points by danjoc on Nov 3, 2016 | 56 comments



tl;dr: 20 clusters, one of which (the one doing a million writes) records only GPS data, running an application in a container on an API-driven OS virtualization package

--

I like the part where a 10% performance overhead is not a big deal, while at the same time they make a big deal about how Mesos can save you 30% in overhead.

And notice how at first they had a group of clusters for completely separate things like API, storage, etc., but now with Mesos they program against an entire datacenter by using a different specific API for separate things like APIs, storage, etc. Much... better.

Also, now with Mesos they can run different kinds of services on the same machine. Kind of like containers. Kind of like separate applications on any kind of machine at all.

"If, for example, one services uses a lot of CPU it matches well with a service that uses a lot of storage or memory, then these two services can be efficiently run on the same server. Machine utilization goes up."

Because if I have a machine pegging 90% CPU, I totally want an I/O-heavy job running on that machine, because interrupts are, like, imaginary, man. Yeah, I'm sure the machine's utilization goes up... right up to triple-digit load.

"Agility is more important than performance."

So the ability to automate rolling out a virtual machine and application in as abstracted a way as possible is more important than their 1-million-writes-per-second goal. I think I know why they don't find 10% loss to be a problem now... they can just keep spinning up machines until they reach their performance goal. 150 machines per DC to store ride-sharing data seems a bit much, but maybe there's a lot more going on behind the scenes than the few features on their app.


All those microservices add overhead. They read that Google does it this way (multiple apps on a server) and now they all must do it to fully utilize AWS VPSes.


> Because if I have a machine pegging 90% CPU, I totally want an I/O-heavy job running on that machine, because interrupts are, like, imaginary, man. Yeah, I'm sure the machine's utilization goes up... right up to triple-digit load.

They're clearly tracking their own latency and reliability and comparing it to their availability goals - there are numbers right in the article - so I would think they would notice this and handle it somehow.

So I guess either they're lying in the presentation, or your sarcasm is targeted at a problem you conjured up on your end for your personal satisfaction.


Because if I have a machine pegging 90% CPU, I totally want an I/O-heavy job running on that machine, because interrupts are, like, imaginary, man. Yeah, I'm sure the machine's utilization goes up... right up to triple-digit load.

Honest question: can't you dedicate some cores to handling the interrupts, leaving the others for the other processes?


As a matter of fact, yes... that's a "common" but very advanced optimization.

I've mostly seen it for load balancers. Pin the network card interrupt on one core and the load balancer on another.
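
To make that concrete, here's a minimal Python sketch of the split the parent describes; the IRQ number, PID and core choices are made-up examples, it needs root, and irqbalance has to be disabled so it doesn't undo the pinning:

    # Hypothetical sketch of "pin the NIC IRQ on one core, the app on another".
    # IRQ number, PID and core numbers are invented for illustration.
    import os

    def pin_irq_to_cpu(irq: int, cpu: int) -> None:
        """Steer a hardware interrupt to a single CPU via its affinity mask."""
        mask = 1 << cpu                      # bitmask: bit N selects CPU N
        with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
            f.write(f"{mask:x}")             # the file expects a hex mask

    def pin_process_to_cpu(pid: int, cpu: int) -> None:
        """Restrict a process (e.g. the load balancer) to one CPU."""
        os.sched_setaffinity(pid, {cpu})

    if __name__ == "__main__":
        pin_irq_to_cpu(irq=57, cpu=0)        # NIC interrupt -> core 0 (assumed IRQ)
        pin_process_to_cpu(pid=1234, cpu=1)  # load balancer -> core 1 (assumed PID)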


Or turn interrupts completely off and poll. If you always have work in the queue, polling is less intensive, not more.


That makes no sense. Polling means the core is running at 100% all the time.


Yes, but it's actually faster. Source: ScyllaDB.


Still doesn't make sense in the context of a network card. There must be an interrupt to pre-empt the system or there's gonna be packet loss.


No, you just need to service the work before the buffer overflows.


Actually it does make sense, and it's quite measurable via standard tools like netperf. I have done this, but only because I've worked in finance as a Linux monkey for the past 9.5-ish years. I specialize in Linux tuning and reading far too much Linux kernel source code. That, and big distributed systems.

In fact, modern kernels support this quite extensively and the RHEL7 documentation[1] has great tips on it for someone new to it.

You can tune an interface for max throughput via interrupt coalescing[2] on Linux: ethtool -C to change and ethtool -c to view the current coalescing settings. Going to one interrupt per packet helps with latency for certain workload types, in addition to SO_BUSY_POLL[3] and the global or per-interface busy-polling sysctl. However, interrupt-per-packet can trivially overwhelm CPUs if you don't isolate those CPUs using things like isolcpus= on the Linux GRUB command line or cpu_exclusive=1 in a cpuset. RFS/RSS and NICs with multiple receive queues make this much easier to tune.

It is guaranteed to use more power, but you can add idle=poll[4] on the Linux kernel boot command line to keep CPUs out of the deeper C-states and help with network latency.

You don't really always need an interrupt either, as apps can DMA directly to and from the NIC, bypassing the CPU. How do you think RDMA works at a lower level? Finally, Linux "kernel bypass" networking, a.k.a. 100% userspace TCP/IP stacks, can be written to be truly interrupt-less - take a look at Mellanox's VMA or Solarflare's OpenOnload. Heck, the default OpenOnload config is interrupt-less. The more you know :)

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

[2] https://en.wikipedia.org/wiki/Interrupt_coalescing

[3] http://man7.org/linux/man-pages/man7/socket.7.html

[4] https://access.redhat.com/articles/65410

EDIT: This is a fantastic and simple overview with actual netperf results showing how this helps with latency: http://www.intel.com/content/dam/www/public/us/en/documents/...
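
If anyone wants to poke at the SO_BUSY_POLL part specifically, here's a rough Python sketch; the port and poll budget are arbitrary example values, it needs a reasonably recent kernel, and older kernels require CAP_NET_ADMIN:

    # Rough sketch of per-socket busy polling (SO_BUSY_POLL); Linux-only.
    import socket

    # Some Python builds don't expose the constant; 46 is its value in
    # <asm-generic/socket.h> on Linux (assumption: adjust for your platform).
    SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, 50)  # busy-poll up to ~50 us per read
    sock.bind(("0.0.0.0", 9999))
    data, addr = sock.recvfrom(2048)  # kernel spins on the NIC queue instead of sleeping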


Are you talking about the NIC interrupts or all hardware interrupts? If the latter, how do you configure that?


It wouldn't matter because all your interrupts would be handled by a single core, becoming a bottleneck.

It also doesn't matter, because heavy I/O will take up both system and user CPU, and the 90/10+ split (is the app even multi-core?) puts you into full utilization, which is fine for bulk jobs but terrible for high-performance requests. Even a single machine at 100% can (in unfortunate circumstances) cause domino effects. Better to build in excess capacity as a buffer for unexpected spikes, which means managing your clusters not to stack jobs that could compete for resources, but also not to stack jobs that could unintentionally starve other jobs - this requires intelligent load balancing that's application- and job-specific. Or a cluster dedicated to specific jobs (which they have, ironically).


You can assign different interrupts to different cores. That's another advanced optimization.

e.g. One core per network card.


And you can go one step further using receive flow steering[1] or transmit flow steering. Most modern performance-oriented network cards (Intel 10G, Solarflare 10G, anything from Mellanox, Chelsio, etc.) surface each of these receive queues differently, and they show up as separate entries in the right-hand column of /proc/interrupts. You can distribute said rx/tx queues across different cores, ideally on the same socket (but potentially a different core) as the application, for minimum latency.

Linux has some really impressive knobs[2] for optimizing these sorts of weird workloads.

[1] https://lwn.net/Articles/382428/

[2] https://www.kernel.org/doc/Documentation/networking/scaling....
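
A quick illustrative sketch of the RPS side of this, per the kernel scaling doc in [2]; the interface name and CPU mask are assumptions, and it needs root:

    # Spread packet processing for each receive queue across a set of CPUs (RPS).
    import glob

    IFACE = "eth0"        # hypothetical interface
    CPU_MASK = "f"        # hex mask: CPUs 0-3

    for path in sorted(glob.glob(f"/sys/class/net/{IFACE}/queues/rx-*/rps_cpus")):
        with open(path, "w") as f:
            f.write(CPU_MASK)   # allow these CPUs to handle packets for this rx queue
        print(f"{path} -> {CPU_MASK}")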


I don't follow the 'Why run Cassandra in a container and not just on the whole machine?' arguments. The first two seem to argue against the concept (because you need to coordinate which machine the container is running on), and the third talks about something completely different (it's about a single cluster vs. multiple clusters, nothing to do with containers).

I'm curious how they can start up one new node in a cluster per minute, when last I saw, Cassandra required 2 minutes after bringing a new node online for things to settle before starting another one. It's a pain point for me running tests that need to start up and tear down Cassandra clusters, and I'd love to know when and how it can be avoided.

(Maybe I need to watch the actual talk)


I haven't worked with Cassandra much, but I know the Elasticsearch guys suggest running multiple nodes per box if your hardware is too big. They claim that the JVM works better if you keep its heap size under 32GB. Since Cassandra also runs on the JVM, maybe similar logic applies to it? Later on he mentions each Cassandra process gets a 32GB heap.


The 32GB figure is a general guideline for ALL JVM applications.

Under 32GB, the JVM uses 32-bit pointers. [actually slightly under 32GB]

Over 32GB, the JVM uses 64-bit pointers.

Pointers consume a lot of space in a running program. Basically, a 30GB heap (with 32-bit pointers) holds about as much as a 40GB heap (with 64-bit pointers).

That means there is a chasm between roughly 30GB and 40GB of heap where it makes no sense to run a server: the additional space is wasted on the bigger pointers.

Second, 30GB is a sweet spot. It's simple to justify and enough for most applications. (A bigger heap will give longer GC times and other side effects; most people don't need to get into that kind of optimization.)
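
A back-of-envelope check of that 30GB-vs-40GB claim; the per-object numbers below are rough assumptions about typical Java objects, not measurements:

    # Compressed-oops "chasm": same object graph, 4-byte vs 8-byte references.
    AVG_REFS_PER_OBJECT = 3          # assumed average reference fields per object
    AVG_DATA_BYTES      = 24         # assumed header + non-reference payload per object

    def heap_gb(n_objects: int, ref_bytes: int) -> float:
        per_object = AVG_DATA_BYTES + AVG_REFS_PER_OBJECT * ref_bytes
        return n_objects * per_object / 2**30

    N = 900_000_000                  # enough objects to fill ~30GB with compressed oops
    print(f"compressed oops (4-byte refs): {heap_gb(N, 4):.1f} GB")
    print(f"full 64-bit refs (8 bytes):    {heap_gb(N, 8):.1f} GB")
    # -> roughly 30 GB vs 40 GB for the same data, which is the chasm described above.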


Sounds like it would make tuning your RF (replication factor) for durability a pain in the ass if you run multiple Cassandra nodes on the same box. Cassandra uses non-JVM-managed memory for a fair amount of work and relies heavily on the OS filesystem cache (much like Pg), so unless you are consolidating multiple Cassandra clusters onto one box, I would keep one node per physical machine for durability reasons alone.


They've been moving as much stuff as possible off the heap or to the filesystem.


> For their largest clusters they are able to support more than a million writes/sec and ~100k reads/sec.

Why are reads 10x more expensive than writes?


They've probably tuned Cassandra to favor write performance, accepting a write as valid before it has actually replicated everywhere. The cost is that when reading the data, you now need to do more work to verify you got the latest value.


This is extremely common and is normally caused by an optimization for streaming writes that leaves data relatively disorganized, in which case you need to perform numerous reads to find data which only required a single write to store.


On the same topic, dunno if it applies to Cassandra.

I've seen some databases work in "append-only" mode. They write new data to the end of the file. They never erase existing data.

It's generally a very efficient write pattern (even on good old spinning drives) and it allows you to always write in batches.

On the flip side, reads are expensive: they have to "find" stuff in various places, read it, verify it, and [if configured] repeat that on multiple nodes to compare the values.


Cassandra has an append-only log file during operation, but unlike an RDBMS it's not just used for transaction replay, so you're half right. Cassandra periodically compacts the log and writes SSTables to disk, but newer data and tombstones are stored in the log for a while, which, as you surmised, does have a performance hit.


The performance hit is not from the log though, but from writes that don't read first (and TTLs). So you have to look at other versions on disk too to find the latest value.

I think newer data and tombstones would stay for a while in SSTables, not in the log.
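
A toy sketch of the read/write asymmetry being described here (not Cassandra's actual code): a write touches one in-memory table, while a read may have to consult every on-disk segment to find the newest version.

    # Toy LSM-style store: cheap writes, reads that scan multiple segments.
    class ToyLSM:
        def __init__(self):
            self.memtable = {}       # newest data, in memory
            self.sstables = []       # older immutable segments, newest first

        def write(self, key, value, ts):
            self.memtable[key] = (ts, value)        # O(1): no read-before-write

        def flush(self):
            self.sstables.insert(0, self.memtable)  # dump memtable to an "sstable"
            self.memtable = {}

        def read(self, key):
            best = self.memtable.get(key)
            for sstable in self.sstables:           # may scan many segments...
                hit = sstable.get(key)
                if hit and (best is None or hit[0] > best[0]):
                    best = hit                      # ...keeping the latest timestamp
            return best[1] if best else None

    store = ToyLSM()
    store.write("trip:42", (40.7, -74.0), ts=1); store.flush()
    store.write("trip:42", (40.8, -74.1), ts=2)
    print(store.read("trip:42"))   # -> (40.8, -74.1), after checking memtable + sstables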


Two writes from the driver and two from the passenger each minute is 0.0666... writes per second per car.

To get 1M/s, we need 15M cars.

[1] says there are about 2M trips per day, so does this mean Uber records the location of all users every 30 seconds, rather than only when they're waiting for or in a car?

[1] http://expandedramblings.com/index.php/uber-statistics/
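
The parent's arithmetic, spelled out:

    # Sanity-checking the numbers from the comment above.
    writes_per_car_per_min = 2 + 2                          # driver + passenger
    writes_per_car_per_sec = writes_per_car_per_min / 60    # ~0.0667
    target = 1_000_000                                      # writes/sec from the article
    print(target / writes_per_car_per_sec)                  # -> 15,000,000 concurrent cars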


They record it every second, but it is not written out continuously.


So assuming 75 hosts per cluster, that's about 13,333 writes per second per host. How is that impressive at all?


Depends on the spec of the servers :D

If they all have 16 cores and 100GB of RAM, it's really unimpressive and we just found out why Uber is not profitable.


No, if they are single-core with 1GB of RAM they should be able to write 15k/sec easily.


> Existing Riak installations will be moved to Cassandra.

What's the reason behind this?


Wouldn't dividing their database per city make the system a lot simpler? Transfers between cities should be rare overall, and it would limit the reliance on replication.

Is there something obvious that I don't see?


What is a "city"? Where does it start, where does it stop?

What if a rider starts in city X and ends in city Y (New York to Hoboken)? How do you elegantly handle that transaction? You would need consistent transactions across two databases. Usually it's not possible, or at the least extremely inefficient, to implement transactions at the application layer - especially in a distributed system under heavy load.

Also, you end up with a lot more operational responsibility. If (when) nodes fail, how many nodes failed? How many instances of Cassandra were affected? If there were X node failures, was there data loss in any of the affected instances? If there was data loss, how can we recover from a snapshot and replay all recent data to that instance without replaying it to others?

Meanwhile, is the application layer really where you want to solve these scaling issues? What happens if a single city becomes too large to handle? Do you partition again by neighborhood? If you're going to invest in anything, invest in improving the database. Maybe create a specialized database to handle your use case more generally. Ideally, you want the application layer to only provide heuristics that the database can use to efficiently partition its data.


I think that Uber is organized per city, at least on the business side. I'm talking about cities as used in [1]. I'm not even sure that you can do inter-city rides using Uber.

1: https://www.uber.com/cities/


I wonder if it would have been simpler for Uber to use RethinkDB instead of this complicated scheme they set up.


OT: So, Uber does a ton of shady, illegal things and then gets to continue on with cool tech posts on HN? Seems like the message is clear: If you're in tech, do as much illegal shady shit as you can, because once you make money your problems will all disappear and those you trampled will be forgotten.


Cassandra is designed to be eventually consistent; that means that if the node didn't yet get the update(s), a client will get stale data. This is basically the same issue as with DNS.

If I define incomplete data as equal to corrupt data, which I do, I am incapable of conceiving of a scenario where corrupt data would be acceptable.

Why would one define it so? Because I'm convinced that computers should always deliver correct data, and in my conviction, data can never be correct unless it is also complete. Therefore, atomicity or bust; therefore, Cassandra = false, and in that case, a million writes per second is of no consolation.


At Uber scale (or Facebook scale, or Google scale, or any non-trivial scale) you simply cannot say "atomicity or bust", because there isn't a distributed database that can give you atomicity at 1M writes a second: that technology simply doesn't exist at this price point, or it would mean a 10x increase in server costs.

You DON'T need atomicity all the time. Sometimes you do (financial transactions), but a lot of the time you don't:

Facebook likes. Do they need to be atomic? Does it matter if a web request gets a stale "like count" and the number of likes for an item is 12,544 instead of 12,545 on someone's page view because the database isn't atomic? No, it doesn't matter.

Uber GPS data: Uber writes the GPS locations as your trip progresses - does it matter that the last 1 or 2 locations might not be visible to the app or external reports for a few seconds? No, it doesn't matter.

Ad-click data: does it matter that the last one or two hundred impressions are not visible in a report whose absolute values are in the millions or tens of millions? No, it doesn't matter - even though this can represent money, the value of a couple of hundred impressions missing from the report is chump change, not even worth thinking about.

With the exception of financial transactions, some safety critical machine data, and a few other cases, atomicity is not required, and that's where these distributed databases shine.


> Facebook likes. Do they need to be atomic? Does it matter if a web request gets a stale "like count" and the number of likes for an item is 12,544 instead of 12,545 on someone's page view because the database isn't atomic? No, it doesn't matter.

Actually, it does matter. Atomicity isn't the specific issue there, but rather it's read-after-write consistency guarantees that are important. That is, product-data writes need to be immediately consistent (synchronous) in the context of whichever data store is serving subsequent read requests for the same user/session.

Read-after-write consistency is essential for social networks. If you take an action and then click a new link -- or even just refresh the current page -- the effect of your action definitely needs to still be visible.

Facebook doesn't use Cassandra, fwiw.


> You DON'T need atomicity all the time. Sometimes you do (financial transactions), but a lot of the time you don't:

In fact, financial transactions rather famously do not implement strict consistency.


Actually, most of those companies run consistent systems compared to the Cassandra-style stuff. They may have a Kafka log at the front, but mostly consistent DBs at the other end.


You DON'T need atomicity all the time. Sometimes you do (financial transactions)

Bingo. No atomicity, bad money or no money. I'd have a problem. I also have this extreme dislike of the "garbage in, garbage out" principle, so either what I'm delivering is correct, or I should be fired, because using computers at that point is pointless. If we don't care about correctness or completeness at all, then we just discounted two of the three reasons why computers exist at all. That leaves us with speed, and at that point, what use is it?


It's easy to sit back on principle and point out all the ways that other people's work fails to live up to your ideals.

It's a lot harder to go out and build something that actually delivers your ideals.

If you think a distributed database should provide full correctness and completeness from all nodes while hitting 1 million writes per second, go build one that does it. If you succeed, not only will you make yourself happy, you will make yourself a shitload of money.


If you think a distributed database should provide full correctness and completeness from all nodes while hitting 1 million writes per second, go build one that does it.

I do that for a living. If I didn't, I would not have commented.


> the three reasons why computers exist at all

In a commercial context, computers exist to serve human needs. If increasing correctness or completeness serves neither a user need nor a business need, then the work to do that is just developer self-indulgence.

And honestly, I don't have a problem with a little developer self-indulgence now and again. Sometimes I'll just clean up something because it's ugly and I don't like it. But it's important to recognize when we're doing that, because it's decidedly not what we're paid for.


But if Uber went the "correct and complete" route, which would cost them either a lot more or drastically reduce the number of writes, that would actively hamper their business. At what point do you consider it a valid tradeoff to keep the business alive?


I would consider correctness to be critical to the survival of any business; therefore, it is my unshaken conviction that this is a thing where no compromises can be made. I'm deeply convinced of that.


Then you're simply wrong. Spending 10x the money on making sure all data is atomically written and freshly read is idiotic if there's no problem with reading slightly stale data.


You are assuming that ten times more money needs to be spent, but without any real basis that's nothing but a meaningless assertion.

It doesn't have to be expensive to be fast, reliable and correct.


Do keep in mind that's entirely irrelevant for uniquely-keyed data points. (append-only database usage)


If one is only ever going to append to a database and never read it or use data in it, then one does not even need to store the data being appended, let alone need a database. However, since the use case here appears to be storing GPS data, it would be reasonable to conclude that data is being actively consumed.


Reading is fine in an append-only datastore; that's my point. There are no stale reads for a key whose data never changes, given quorum-style reads, as far as I understand.
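
The rule of thumb behind "quorum-style reads" is R + W > N; for write-once keys, any successful read returns the only version that exists, which is the parent's point. A tiny illustrative check (the level names just mirror Cassandra's consistency levels, the function itself is made up):

    # With replication factor N, reads see the latest acknowledged write when R + W > N.
    def is_strongly_consistent(n_replicas: int, write_acks: int, read_acks: int) -> bool:
        return read_acks + write_acks > n_replicas

    print(is_strongly_consistent(3, 2, 2))   # QUORUM writes + QUORUM reads -> True
    print(is_strongly_consistent(3, 1, 1))   # ONE + ONE -> False, stale reads possible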


Sometimes there are no good (efficient) alternatives to eventual consistency. I guess it all depends on the level of complexity that a developer or architect is willing to accept.


Yep, correct, it comes down to acceptance. And I've yet to see a large-scale scenario where it was not possible to implement software that is both correct and performant (I specialize in large-scale applications for a living).

Therefore, I do not accept the premise that it is impossible to design systems which are both correct and fast on a large scale, and neither should anyone else have to accept such a thing. It's just not right.


>And I've yet to see a large-scale scenario where it was not possible to implement software that is both correct and performant

CAP theorem says otherwise. I suggest you try implementing something as simple as distributed counters.
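
For what it's worth, the usual answer to distributed counters without strong consistency is a CRDT. A minimal grow-only counter (G-counter) sketch, not any particular library's API:

    # Each node increments only its own slot, so replicas can merge in any order
    # and still converge; reads are eventually consistent rather than atomic.
    class GCounter:
        def __init__(self, node_id: str):
            self.node_id = node_id
            self.counts = {node_id: 0}

        def increment(self, n: int = 1):
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

        def merge(self, other: "GCounter"):
            for node, c in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), c)

        def value(self) -> int:
            return sum(self.counts.values())

    a, b = GCounter("a"), GCounter("b")
    a.increment(3); b.increment(2)
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 5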



