Scalability, but at What Cost [pdf] (usenix.org)
132 points by wglb on April 24, 2021 | 32 comments



(2015), but still a paper I’d encourage developers and technical managers to read as a counterpoint to the cloud advocacy of recent years.

Note to mods: The title is correctly written “COST”; it’s short for “Configuration that Outperforms a Single Thread”, the central concept being discussed.


“While nearly all such publications detail their system’s impressive scalability, few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?”

That’s one of the things that stuck with me after having taken a class in high-performance computing. What matters is the absolute performance, not so much the speedup one achieves through parallelisation.


This reminds me of my previous job where we had an optimized single-server solution, whereas the competition scaled horizontally. They couldn't out-scale us at what we did until they got to 20 servers or so, which only the biggest customers needed anyway, and even then they would be really slow. You can achieve some impressive speedups when you really focus on the problem.

Still, it wasn't that helpful from a business perspective, because apparently it is easier to justify a markup on solutions with lots of servers.


I’m currently working on a project that sounds similar to this. A single server processes and aggregates streaming data at about 60 Gbps. Customers want horizontal scalability, and lots of buzzwords are being suggested. An active-active configuration would be best (in my opinion), as that solves for redundancy and failover. Overall a fun and challenging project, though somewhat frustrating at times because of all the opinions about which technology or architecture would be “better”.


What happens when your single-server solution is disconnected from the network for whatever reason, or you exceed the current configuration? Bad times, I bet. Cost of components isn’t the only factor here.


This may seem trite, but if you can get 20 servers’ worth of performance out of one, you can afford to run two active-active and still reap a 10x capex/opex saving. The technology for simple but reliable systems has been around for decades. You also can’t assume that the cloud is never going to fail, so you always have to defend against failure, whether that means running two servers or two availability zones.


Also, a lot of the time you're not trying to achieve (and can't anyway) uninterrupted uptime - you just need rapid recovery from infrequent outages.


Apples to oranges - you’re not going to get a 10x saving by running active-active in a strong-consistency mode.


I don’t see how this follows. If you have one single-threaded server doing the job of 20 similarly specified servers running the distributed system, you could run every job twice, on two completely independent servers on two completely independent networks, and still be 10x as efficient unless both failed catastrophically during the same job. Or you could run three completely independent copies of the same job and still be at nearly 7x efficiency, etc. There is no need for any sort of “consistency mode” here. This is just brute force, without any synchronisation between servers or resumption of aborted jobs at all.
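
Back-of-the-envelope, as a toy model (the 1% per-job failure rate and the independence assumption are mine, not from the paper):

    # One fast server replaces n_replaced machines of the distributed setup;
    # run the whole job on `replicas` fully independent copies instead.
    def efficiency_and_risk(n_replaced=20, replicas=2, p_fail=0.01):
        efficiency = n_replaced / replicas   # hardware saved vs. the cluster
        p_all_fail = p_fail ** replicas      # job lost only if every copy fails
        return efficiency, p_all_fail

    print(efficiency_and_risk(20, 2))   # 10x cheaper, ~1-in-10,000 chance both copies fail
    print(efficiency_and_risk(20, 3))   # ~6.7x cheaper, ~1-in-a-million joint failure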


That only works if the servers don't rely on external dependencies or if every single operation is idempotent.


Weren’t we comparing parallelised and single-threaded data processing jobs, though? If you’re introducing external communications in real time as a factor, wouldn’t that be a fundamentally different situation?


TP specifically refers to many servers. Conceptually it isn’t actually that different - the cores and memory are on a "network" of sorts.


That’s only if there is no contingency plan in place and no backup server. Usually these things are thought out beforehand. And it’s not like you’re excused from this type of problem when you’re running your solution in the cloud: from cloud outages to configuration nightmares, data inconsistencies, and not knowing what is happening because pinpointing issues in a complex infrastructure takes more time, and so on. Sometimes the cloud way is the way to go, but other times it is not justified.


We had a replication HA solution, but it was "warm", and customers didn't really want to run another server doing nothing. I think we also charged a lot for it.

In the area we were in, customers could tolerate some outage. Restoring from a backup didn't take that long, and I believe we were known for good support in helping people get back up.

But yeah, what you are saying covers all the arguments that kept coming at us. People were used to having a few servers for the database, different front ends, and layering things like that.


Isn't this a solved problem? Just use a failover, right?


If your consistency model allows you to use async replication and potentially lose data in a surprise switchover - sure. Otherwise you’re looking at similarly degraded performance, and worse reliability as well.


In our case async replication was acceptable, so we had a failover solution available.


Reminds me of https://adamdrake.com/command-line-tools-can-be-235x-faster-...

Cluster computing can be useful, but until you're talking about petabytes of data, it probably isn't helping you


Kind of reminds me of the early days of Hadoop, when it was all the rage. Then people realized they could do most of that stuff in a Python script on a single machine in less time, given all the bookkeeping overhead and complexity.


Interesting paper. Spurred on by Neo4j, I was recently reading the Wikipedia article on graph databases, where there is a glaring statement:

"Research demonstrated that there is no benefit to using a graph-centric execution engine and storage manager" - which I take to be the kind of systems that the paper is critical of.

Though I would have supposed graph execution engines should outperform a single-threaded "think like a vertex" approach.

Which links to this paper, "The Case Against Specialized Graph Analytics Engines": http://cidrdb.org/cidr2015/Papers/CIDR15_Paper20.pdf

They use a relational database to outperform dedicated graph databases.

Is part of the takeaway that hand-coded, single-threaded "think like a vertex" code beats a distributed system that involves communication and parallelisation?
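
For concreteness, the single-threaded baseline in question is roughly a loop like this (a toy Python sketch; the paper's actual baselines are, as I recall, carefully tuned single-threaded Rust over billion-edge graphs):

    # Single-threaded "think like a vertex" connected components via label
    # propagation: every vertex repeatedly adopts the smallest label among
    # itself and its neighbours until nothing changes.
    edges = [(0, 1), (1, 2), (3, 4)]          # toy graph with two components
    labels = {}
    for u, v in edges:
        labels.setdefault(u, u)
        labels.setdefault(v, v)

    changed = True
    while changed:
        changed = False
        for u, v in edges:                    # one sequential sweep over the edge list
            lo = min(labels[u], labels[v])
            if labels[u] != lo or labels[v] != lo:
                labels[u] = labels[v] = lo
                changed = True

    print(labels)                             # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}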


> “You can have a second computer once you’ve shown you know how to use the first one.”


I think this paper shows that Amdahl's Law[1] is just as relevant when discussing distributed systems as it is when talking about multi-core single machines.

1. https://en.wikipedia.org/wiki/Amdahl%27s_law


It's related to Amdahl's law but not identical. As I understand it, Amdahl's law is talking about the situation where you have, say, 10 units of work, and 5 of them can be parallelized while 5 can't and must be run serially.

The COST paper is talking about the situation where you have 10 units of work, but it takes 10 more units of work to distribute it across multiple computers.

Sometimes that's a win and sometimes it isn't, thus the phrase "parallelizing your overhead". If all you did was parallelize the additional overhead, then you didn't gain anything by using a distributed system.

IOW, the overhead for distribution in distributed systems can be very large. It's still there in parallel computing, but shared memory makes it less pronounced. (Although NUMA makes your single computer like a distributed system again.)
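
To put the toy numbers above into a quick calculation (the 20-machine count is arbitrary):

    # Amdahl: 10 units of work, 5 of them inherently serial.
    def amdahl_speedup(total=10, serial=5, n=20):
        parallel = total - serial
        return total / (serial + parallel / n)

    # COST-style: all 10 units parallelise, but distribution adds 10 more
    # units of overhead that every configuration has to pay.
    def overhead_speedup(total=10, overhead=10, n=20):
        return total / ((total + overhead) / n)

    print(amdahl_speedup())    # ~1.9x, capped at 2x no matter how many machines
    print(overhead_speedup())  # 10x at 20 machines, i.e. 2 machines just to match one thread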


> it takes 10 more units of work to distribute it across multiple computers.

That's just serialized work in a different form: in order for a computer to do a unit of work, it must be sent a unit of work. That's strictly serial -- a computer can't process a unit of work it doesn't have. So no matter how many units of work it takes to distribute the load, that's still 10 serial operations.

Amdahl wasn't only seeing things as separate chunks, with the parallelized work over one part of the code, and the serial work in a different part. The operations can be interleaved.


I think the more important thing is the actual cost of communication. Networks are slow: 100 Gbit is still a couple of times slower than the memory controllers in our phones.

https://en.m.wikipedia.org/wiki/Bulk_synchronous_parallel is a useful model for incorporating communication costs into design vs. PRAM.
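
In BSP terms each superstep costs roughly w + h*g + l, so the network shows up explicitly in g (per-message cost) and l (barrier latency). A toy illustration with invented numbers:

    # Cost of one BSP superstep: local compute + communication + barrier.
    #   w: max local computation on any processor
    #   h: max messages sent or received by any processor
    #   g: cost per message (bandwidth term)
    #   l: cost of the barrier synchronisation (latency term)
    def superstep_cost(w, h, g, l):
        return w + h * g + l

    # Same work on a fast interconnect vs. a slow one:
    print(superstep_cost(w=1_000_000, h=10_000, g=5, l=10_000))      # 1,060,000
    print(superstep_cost(w=1_000_000, h=10_000, g=200, l=500_000))   # 3,500,000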


BSP is a specific model for parallel computation, but it still operates on the fundamentals of Amdahl's Law, and the whole reason it exists is to find the optimal divide-and-conquer solution given a set of real-world limitations.

One might even say that the cost of implementing your algorithm with BSP is higher than with PRAM, because of the extra layers. But you can't ignore some of the things that you can in a strict PRAM model, so you have to incorporate those into the cost as well.

Given Gustafson's law, if you have enough data and enough work to do, you can sort of ignore the un-parallelizable fraction by throwing enough processors at the computation.

What starts to trip things up at this scale is synchronization.
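
For example, with a (made-up) 1% serial fraction of the scaled workload, Gustafson's scaled speedup S(N) = N - s(N - 1) stays near-linear:

    # Gustafson: grow the problem along with the machine count; the serial
    # fraction s of the scaled workload barely dents the speedup.
    def gustafson_speedup(n, s=0.01):
        return n - s * (n - 1)

    for n in (10, 100, 1000):
        print(n, round(gustafson_speedup(n), 2))   # 9.91, 99.01, 990.01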


Interesting. But breaking stuff up in this way isn't purely a performance optimization either; it's also intended as a method of organizational management, where someone is clearly responsible for failures in any part of the system, and where the parts of the program that form the contract are very clearly delineated.


You are of course right, but I couldn’t help laughing. I’ve long tried to say this: the reason for the relative success of SOA is mostly that it makes it harder to cheat and reach over into other modules to grab whatever you need. That another team is responsible also makes it less likely that we rely on implementation details of the other modules.

But the reason I laughed is that almost every reported bug starts with a minor war about whose service is really responsible, and then: how was this contract defined again? Oh, so the only specification is this example file you sent us two years ago? Aha, so the public order number isn’t what is called order_number; we should obviously have used CustONr. Etc.


Ha, well... it beats the alternative, at least. All those problems are even worse when it's just a big free-for-all in a monolith with tons of people touching it all the time.


Yes! It also lets the components be deployed and operated independently (also organizationally critical), you can enforce granular security boundaries between components, etc.


I wonder if a lot of these algorithms/libraries are a decade or more old, were optimized for older generations of clusters with different interconnects, and modern CPUs and RAM have since caught up with problems that used to be intractable on a single machine. Modern CPU packages incorporate a lot of features from world-class supercomputers of a couple of decades ago.

In theoretical CS classes there were discussions of the tradeoffs between networked, NUMA, and other fabrics. Analysis of what actually ran fastest, beyond Big O notation, was only touched on briefly, but there is a definite advantage to making problems tractable that otherwise wouldn't be. In the FAANGs it was mostly embarrassingly parallel algorithms with a skin of distributed computing for synchronization/coordination, so the focus had always been on absolute speed or efficiency.


They should also have tested a hand-coded parallel implementation.

From the paper it's not clear how much of the overhead is fundamental and how much is due to the frameworks simply being terrible (many of them are written in Java, which doesn't inspire confidence: few would pick Java over C++, or over Rust once it became available, for efficient algorithmic code).




