Containers and Distributed Systems: Where They Came from and Where They’re Going

ChuckMcM · on Oct 12, 2017

This was a lot of fun, one of the things that doesn't get much air time is that back in the early 2000's when "clusters" and "NUMA SMP machines" were competing with each other the big argument for large SMP iron was ACID compliant SQL databases like Oracle. Now that Google has implemented an ACID compliant SQL database across clusters it puts the final nail in the argument (for me at least) that "Some things only work on SMP machines"

0xbear · on Oct 12, 2017

I actually thing "NUMA SMP machines" are about due for a renaissance with the advent of extreme multicore CPUs. Next year we will see CPUs that run 32 cores (64 hyperthreads) per socket, and don't cost an arm and a leg. Current max is 28 cores per socket, at a staggering $13K per chip.

With relatively economical 64 core / 128 thread options with nearly unlimited RAM capacity appearing, a lot more workloads will "fit". Needless to say, single-node systems are much easier/cheaper/faster to get right.

ChuckMcM · on Oct 12, 2017

The irony of clusters of many core NUMA SMP machines is not lost on me :-)

If things follow the previous patterns that will open up the market to a single/dual core, large memory, machine with a high speed networking and storage ports that is lower cost and lower power.

0xbadcafebee · on Oct 12, 2017

Some things do only work on SMP, because for some workloads no network fabric (that I know of) is as fast as the system bus. It's the classic Beowulf vs Mainframe problem: some things can be split up into discrete parts, and some things have to share so much that bandwidth is the bottleneck. Then you go from SMP to NUMA because your bus is too slow. But both are still faster than most (all?) networks.

A lot of this also centers around how apps are designed. If you can rewrite it you can pick your architecture. But it's hard to take a pre-existing one and slap it on a new architecture and make it just scale.

Mosix attempted to have a magical SSI that just scaled single pre-existing apps (even threaded shared-memory ones) across machines with no extra work, but it's got such a hard set of problems it never caught on. Now we just add intermediate abstractions until we can jam an app into whatever architecture we have.

ChuckMcM · on Oct 12, 2017

It is exactly the Beowulf vs Mainframe problem. And it is this: "Some things do only work on SMP, because for some workloads no network fabric (that I know of) is as fast as the system bus." is under siege.

From a systems architecture point of view it is a really interesting exploration of Amdahl's law. So many things that people said "You'll never do that on a networked cluster of machines." have fallen (data bases being one of the larger ones). And while it used to be mainframes won on I/O channel capacity, Google and others have shown that when you can parallelize the I/O channels effectively that advantage goes away as well.

kuschku · on Oct 12, 2017

Indeed, the trend is clear, but for now, nothing has changed.

Google's implementation is not very helpful to all those companies, individuals, NGOs, and governments that have to follow privacy laws, HIPAA, etc though, because Google's implementation isn't open source, and these entities can't use Google Cloud. Or don't want to.

Until we get an open source solution for this, SMP machines will be useful.

And even then, you save money by having less and larger machines in your cluster than just having tiny ones. Larger machines means your overhead is reduced.

numbsafari · on Oct 12, 2017

Uhhhh... most orgs that run HIPAA workloads do so slavishly on Windows and EPIC, neither of which are famously what most would consider open source.

kuschku · on Oct 12, 2017

You’re right, the real distinction is on-premises vs off-premises, I should have made that more clear (although free software implies on-premises being possible)

numbsafari · on Oct 12, 2017

I’m still confused... most orgs running HIPAA workloads are still doing so on prem, AND using very much non-free software.

Google, and Amazon, and MS, undergo extensive third-party audits and people can, and do, run HIPAA workloads there.

I’m not sure what distinction you are trying to draw.

kuschku · on Oct 12, 2017

The distinction is that companies that require HIPAA workloads aren’t going to upload their entire dataset into Google Cloud Spanner, which is not available as on-prem version, and which isn’t HIPAA certified.

So either we need an on-prem version of Cloud Spanner, Cloud Spanner needs to be HIPAA certified, certified to match German privacy laws, etc, or Cloud Spanner can’t serve these situations.

jabl · on Oct 12, 2017

There's cockroachdb which is roughly open source spanner you can deploy on your own systems.

kuschku · on Oct 12, 2017

Which has still horrible join performance, and several other tradeoffs. (see https://www.cockroachlabs.com/blog/cockroachdbs-first-join/)

Maybe in a year, or two. But not today.

zzzcpan · on Oct 12, 2017

Spanner has a lot of tradeoffs too, I'm not sure what do you see as problematic in Cockroach, joins are good enough. The biggest tradeoffs are inherent to strongly consistent distributed systems. Even precise clocks and fast networks won't help as much as you might think. You still have to accept vastly different latencies and performance than in traditional single-node RDBMSs.

kuschku · on Oct 12, 2017

If I run several servers in the same rack, with a local private 10Gbps network, with CockroachDB, running on spinning HDDs, I expect to get the same (or better) throughput as with a single PGSQL instance, and a similar latency. (When accessing it from another server, via the public internet).

That’s not always the case, though.

phaemon · on Oct 12, 2017

That blog post is from a year ago. And since Cockroach DB 1.0 was only released in May this year, it's a bit misleading to link to that post as though it was the current state of the sofware.

kuschku · on Oct 12, 2017

That blog post is referenced in their FAQ today, under the topic of what it can't do right now. Sorry if I misunderstood the situation, I’d appreciate any updated links.

See: https://www.cockroachlabs.com/docs/stable/frequently-asked-q...

phaemon · on Oct 13, 2017

Ah, my apologies then. It's their fault for not updating the FAQ and you can hardly be blamed for quoting from it!

There was an update to that 6 months later: https://www.cockroachlabs.com/blog/better-sql-joins-in-cockr...

I'd assume they've made even more improvements since, so they really should update their docs.

Pyxl101 · on Oct 12, 2017

Distributed ACID transactions across multiple DBs or clusters has been possible for a while if you have been willing to pay the latency penalty for the transaction to complete, hasn't it? As far as I know, Google systems have to pay this latency cost too. From an analysis of Spanner and Calvin [1]:

> The cost of two-phase commit is particularly high in Spanner because the protocol involves three forced writes to a log that cannot be overlapped with each other. In Spanner, every write to a log involves a cross-region Paxos agreement, so the latency of two-phase commit in Spanner is at least equal to three times the latency of cross-region Paxos.

I suspect that Google's innovation in making these cross-region or cross-cluster transactions viable is partly in their network: having an extremely reliable network with consistent low latency, a lot of which allegedly uses first-party network fiber between disparate locations. This is touched on in a paper that Google published [4]:

> Many assume that Spanner somehow gets around CAP via its use of TrueTime, which is a service that enables the use of globally synchronized clocks. Although remarkable, TrueTime does not significantly help achieve CA; its actual value is covered below. To the extent there is anything special, it is really Google’s wide-area network, plus many years of operational improvements, that greatly limit partitions in practice, and thus enable high availability.

I'm also not sure that technologies like Spanner necessarily address the same use-case as SMP machines. Aren't those use-cases concerned with achieving the highest possible transaction throughput for transactions that are potentially highly contentious and require low latency? Systems that solve that problem today still largely look like Oracle DBs from 2000 -- or a fleet of replicas reading executing the same state machine on a transaction log.

Transactions against Spanner likely have high latency compared to traditional DBs[2]:

> On the downside to Spanner, Zweben noted, that the price of Spanner’s high availability is latency at levels unacceptable in some applications. If you’re executing banking transactions, like in airline reservations, and you can afford a minute of delay while you make sure they’re committed in multiple data centers, Spanner’s probably a good database for that, he said.

> Zweben says, however, if you’ve got a customer-service app that requires you to be extraordinarily responsive over millions of transactions and at the same time analyze profitability of a customer and maybe train a machine learning model, those simultaneous transactional and analytical requirements are better for a database [...]

One benchmark I found suggests that read and update latencies can be quite high [3]. The graphs are missing units, but it seems to suggest update latency is higher than with traditional DBs. I've had difficulty finding other benchmarks just now, but the one other I found reported the same finding [5]:

> An interesting pattern above is that queries 14-18, which are all updates, perform with higher latency on Spanner than the easy selects and non-bulk inserts.

It's definitely innovative technology in that it allows you to scale read-oriented SQL workloads to a far greater degree, with better availability, but from what I've read it does not tackle all of the same problems and use-cases that traditional relational DBs are best at solving: high transaction throughput and low latency for contentious data sets.

[1] https://fauna.com/blog/spanner-vs-calvin-distributed-consist...

[2] https://thenewstack.io/google-cloud-spanner-view-field/

[3] https://www.nuodb.com/techblog/benchmarking-google-cloud-spa...

[4] https://research.google.com/pubs/pub45855.html

[5] https://quizlet.com/blog/quizlet-cloud-spanner

0xbadcafebee · on Oct 12, 2017

> I suspect that Google's innovation in making these cross-region or cross-cluster transactions viable is partly in their network

I think this understates it. Production distributed systems suck balls with unreliable infrastructure, and anyone who has ever tried to do realtime replication of all of their data (and lots of it) across the internet knows how completely unrealistic it is. It's like running a power cord across a highway to power a refrigerator.

raarts · on Oct 13, 2017

> It's like running a power cord across a highway to power a refrigerator.

??? Color me baffled.

bogomipz · on Oct 12, 2017

>"back in the early 2000's when "clusters" and "NUMA SMP machines" were competing with each other"

Did you mean Numa vs SMP? Or something else maybe? How can a machine be a NUMA SMP? NUMA and SMP are fundamentally different architectures.

ChuckMcM · on Oct 12, 2017

No, I was thinking non-uniform memory architecture, which is to say an SMP machine where the "speed" at which you can access RAM is dependent on the core or 'thread' from which you accessed it. Lack of memory uniformity was the compromise to achieve larger effective address spaces and "simple" programming.

Today on a 24/48 core dual socket server you'll see the same sorts of thing, having a core using memory on the 'other' physical chip's memory bus will impact the overall performance significantly.

In the 2000's, before multi-core chips were a thing, there were two major camps, the 'super computer' camp, and the 'cluster' camp.

The 'super computer' camp insisted on cache coherent memory between all of the cores or threads. You got these very expensive fabrics from people like cray that would snoop access to memory from the cores and send coherency messages around to insure that if someone wrote something in to memory somewhere, everyone else's L1 or L2 cache got the message to invalidate what they were holding (shoot downs). These machines are very expensive and take months to build.

The 'cluster' camp said, "We can use a network fabric and just parameterize shared memory usage." So they put together independent machines connected by a network fabric and no cache coherency protocol. If you wanted to use shared memory you could build something like memcached and wrap your access with network calls. With that architecture even if it took twice as many cores to do what you wanted to do, the price of the machine was one tenth what it was for the big SMP machine.

For something that was trivially parallellizable like internet search or serving up web sites, that was a much more cost effective way to go. When people started doing stuff that they previously used 'super computers' and big SMP machines for on these Linux clusters it became a sort of race to pull apart these problems into "shared nothing" clusters.

bogomipz · on Oct 12, 2017

>"Lack of memory uniformity was the compromise to achieve larger effective address spaces and "simple" programming."

Oh interesting. Might you have any links or suggested reading on this discussion and eventual compromise?

>"In the 2000's, before multi-core chips were a thing, there were two major camps, the 'super computer' camp, and the 'cluster' camp."

Is the cluster camp Beowulf then basically?

>"You got these very expensive fabrics from people like cray that would snoop access to memory from the cores and send coherency messages around to insure that if someone wrote something in to memory somewhere, everyone else's L1 or L2 cache got the message to invalidate what they were holding (shoot downs)."

Is this the MESI protocol then?

Thanks.

EDIT I just saw the link below about CCNUMA.

zurn · on Oct 13, 2017

A big category of cluster-friendly HPC cluster codes works by running in lockstep on all the nodes. Apparently it works quite well for things like weather forecasting where the problem is naturally divided into a grid but there is still significant communication needed between grid tiles. So it's quite different from memcached type things.

https://www.cs.fsu.edu/~engelen/courses/HPC/Synchronous.pdf

godelmachine · on Oct 12, 2017

A quick Google search after reading this comment of yours gave me this → http://www.google.com/patents/US5887146

ChuckMcM · on Oct 12, 2017

Ah yes the CCNUMA patent. This bit is the relevant part:

However, SMP systems suffer disadvantages in that system bandwidth and scalability are limited. Although multiprocessor systems may be capable of executing many millions of instructions per second, the shared memory resources and the system bus connecting the multiprocessors to the memory presents a bottleneck as complex processing loads are spread among more processors, each needing access to the global memory. As the complexity of software running on SMP's increases, resulting in a need for more processors in a system to perform complex tasks or portions thereof, the demand for memory access increases accordingly. Thus more processors does not necessarily translate into faster processing, i.e. typical SMP systems are not scalable. That is, processing performance actually decreases at some point as more processors are added to the system to process more complex tasks. The decrease in performance is due to the bottleneck created by the increased number of processors needing access to the memory and the transport mechanism, e.g. bus, to and from memory.

Alternative architectures are known which seek to relieve the bandwidth bottleneck. Computer architectures based on Cache Coherent Non-Uniform Memory Access (CCNUMA) are known in the art as an extension of SMP that supplants SMP's "shared memory architecture." CCNUMA architectures are typically characterized as having distributed global memory.

zurn · on Oct 12, 2017

Is this clear cut in the terminology? A plausible definition would also be that it's still symmetric MP if remote memory access has non-uniform performance - since the nodes and their memories are symmetrical. After all, you get that just with caches and 2 sockets plugged to the same DRAM.

The historical opposite of SMP used to be asymmetric multiprocessors in the heterogenous sense - different kinds of processors, for example scalar/vector/io processors. Or for a modern day take, ARM SoCs with little low-power cores and faster & more power hungry cores, and GPUs thrown in for good measure.

bogomipz · on Oct 12, 2017

>"Is this clear cut in the terminology?"

I think in the context that OP was referring to "early 2000s" which presumably means the introduction of Opteron and HyperTransport then yes I believe the NUMA vs SMP distinction would be pretty clear.

mathattack · on Oct 15, 2017

Good to see you outside the context of HN. You're an immense value to the community.

godelmachine · on Oct 12, 2017

Hi Chuck, by SMP do you mean Symmetric Multi Processing? Sincerely,

ChuckMcM · on Oct 12, 2017

Yes. All threads have equal computation capability and cache coherent (although not uniform) access to main memory. This was the 'big iron' of the day like the Sun, SGI, and Cray "big iron" boxes.

chubot · on Oct 12, 2017

Just to double-click on container technology, why do you think it took so long for this technology that was built into Solaris to become mainstream with Docker?

I think VMWare deserves a mention here? And the terms OS Virtualization vs Hardware virtualization do as well (Ctrl-F doesn't find them.)

For awhile hardware virtualization (VMWare) was more prevalent, but it's more complex and has more overhead than OS virtualization (containers). That is how people solved the problem of having powerful machines and small workloads (or workloads with a lot of variance).

Although historically it might have been that hardware virtualization actually came first, in IBM mainframes. In the Unix world I guess OS virtualization came first.

kakwa_ · on Oct 12, 2017

I think docker succeeded not because it was a better container or virtualization solution, or because it was lighter than VMs. Jails or Solaris zones have existed for years or decades, and on Linux, we had openvz and vserver long before Docker. And I'm not even mentioning plain old chroot (with indeed some security issues).

I didn't follow the latest development, but for a while, Docker was not even that great in term of robustness and stability. I never was a big fan of the userland proxy for example. And from what I've read, upgrading from one docker version to another could be quite painful (disclaimer: I toyed with docker a little, I haven't used it in production yet).

IMHO, Docker succeeded because it brought a complete ecosystem around containers:

* Ways to generate the container image through Dockerfile

* Ways to compose an image on top of another (itself on top of another, etc...)

* Ways to share and distribute these images with Docker Registry

cat199 · on Oct 12, 2017

"You basically came up with Docker before Docker was around, or at least with things like Solaris zones and C groups."

Except jails already existed on FreeBSD, and CP/CMS on mainfraimes in the 70s existed long before this...

bpicolo · on Oct 12, 2017

Docker won the UX game, which mattered the most here. :)

havetocharge · on Oct 12, 2017

It only mattered the most because it was thought of last. If it was UX that increased impact by improving access, then it indeed mattered a great deal.

msarrel · on Oct 12, 2017

The article says " It turned out that about twenty years earlier IBM did sort of solve that problem with partition isolation, but they did that on custom mainframe hardware and architecture."

msla · on Oct 12, 2017

Right. I remember looking at technologies for virtualization on x86 chips before actual virtualization hardware existed on those chips, and they were pretty heavyweight, including rewriting the binary to replace privileged opcodes which would fail silently or behave differently in ring 3 compared to ring 0 with sequences of opcodes which would work to maintain the isolation. It was closer to emulation than virtualization given how much work the host needed to do to maintain the illusion that the guest OSes were running on bare hardware, and it really pointed up how the x86 architecture of the time just wasn't designed for that kind of thing.

For the past decade or so, mainframe-level virtualization on x86 has been possible, at least, starting on high-end chips and working its way downwards to the commodity world. And, yes, it exists in ARM as well.

https://genode.org/documentation/articles/arm_virtualization

LinuXY · on Oct 12, 2017

Minus CP/CMS and OpenVZ, here's an article on the concepts and tech Docker is built on. https://www.kentik.com/the-evolution-of-docker/

naasking · on Oct 12, 2017

Containerization long predates Solaris, even on commodity hardware. Capability operating systems dating back to the 60s and 70s support even more extreme isolation by default.

swizydo · on Oct 12, 2017

yea indeed