Testing Distributed Systems (asatarin.github.io)
271 points by ngaut on Feb 11, 2022 | hide | past | favorite | 39 comments



Related to this topic: When running integration/e2e tests, setting up the environment (all the required services, data stores etc.) in the right sequence, loading them with test data and so forth can be thorny to automate.

Good automation around preparing/provisioning the testing environment is a necessary companion to the testing tools/frameworks themselves.

Most commonly, fully capable testing environments aren't available during the inner loop of development (where the dev setup can usually only run unit tests, or integration tests for 1-2 services plus a database).

Because of this, people tend to rely solely on their CI pipelines to run integ/e2e tests, which can slow things down a lot when one of those tests fails (since the write/run/debug loop has to go through the CI pipeline).

As an industry, I think we should start taking automation and developer productivity more seriously—not least when it comes to writing and debugging tests for complex distributed systems. The more we can lower the marginal cost of writing and running tests, the more effective our test suites will become over time.

Shameless plug: My company (https://garden.io/) is developing a framework and toolchain to bring the full capabilities of a CI pipeline to the inner loop of development, so that developers can efficiently run all/any test suites (including integ/e2e tests) in their personal dev environments.

We do this by capturing the full dependency graph (builds, deploys, tests, DB seeding etc.) of the system in a way that can power CI, preview environments and inner-loop development.
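For illustration, that kind of dependency graph can be modeled as a plain DAG and executed in dependency order, with each "level" runnable in parallel. A minimal sketch using Python's stdlib graphlib; the task names are hypothetical and not Garden's actual API:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical task graph: each task maps to the tasks that must finish first.
graph = {
    "build-api":    set(),
    "build-worker": set(),
    "deploy-db":    set(),
    "seed-db":      {"deploy-db"},
    "deploy-api":   {"build-api", "deploy-db"},
    "e2e-tests":    {"deploy-api", "seed-db", "build-worker"},
}

ts = TopologicalSorter(graph)
ts.prepare()
order = []
while ts.is_active():
    ready = list(ts.get_ready())   # all deps satisfied: these can run in parallel
    order.append(sorted(ready))
    ts.done(*ready)

print(order)
# → [['build-api', 'build-worker', 'deploy-db'], ['deploy-api', 'seed-db'], ['e2e-tests']]
```

Everything within a level has no unmet dependencies, so a runner can execute those tasks concurrently; the same graph can drive CI, preview environments, or a local run.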


The issue isn't tooling, it's hardware resources and in some cases licensing.

While the problem isn't inherently trivial, the part tooling can solve is: the order of startup, which is usually handled with a wait-for-it startup script, since these services generally talk to each other over the network.
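The wait-for-it pattern mentioned above is just polling a TCP port until it accepts connections. A minimal Python sketch; the hosts, ports and service names in the usage comment are illustrative:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until (host, port) accepts a TCP connection, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True          # service is accepting connections
        except OSError:
            time.sleep(0.25)         # not listening yet; retry
    return False

# Usage (hypothetical services): block app startup until dependencies answer.
# for name, port in [("postgres", 5432), ("redis", 6379)]:
#     assert wait_for_port("localhost", port), f"{name} never came up"
```

This only proves the port is open, not that the service is ready to do work; many setups follow it with an application-level health check.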

The real challenges, like determining seed data, are too project-specific to be abstracted away.

Not to take away from your problem domain: it would be nice to have a framework so developers don't have to wire up the plumbing for these automations manually anymore. It's just not going to solve the issue that developers still have to wait for the CI pipeline to get their results.


> The issue isn't tooling, it's hardware resources and in some cases licensing.

Hardware resources are definitely an issue. That's why we generally recommend using remote development environments, which aren't as resource-constrained as the local dev machine. Making that as smooth as the local dev experience (e.g. live reloading of services without rebuilding containers) needs some clever tooling (which is partly why we're building our product).

With production-like remote dev environments, you get the same capabilities as your CI environment, but can run test suites ad hoc (and without having to spin them up and tear them down for every test run).

There's no fundamental reason why CI environments should have capabilities that individual dev environments can't have—it's all a matter of automation in the end.

> The real challenges, like determining seed data, are too project-specific to be abstracted away.

Very much agree with that! The generic stuff (dependencies, parallel processing, waiting for things to spin up etc.) should be taken care of by the tooling, but without constraining the project-specific stuff (which is highly individual).


> Hardware resources are definitely an issue. That's why we generally recommend using remote development environments, which aren't as resource-constrained

On the contrary, I almost always advocate for having a rack full of machines in a/the office, running some kind of workload management (kubernetes, vmware/proxmox or a combination of the two).

Hardware is dirt cheap, and plenty fast these days.

If you have an office, chances are you already have a server room anyway (enterprise-grade network switches require cooling), so you might as well throw a bunch of physical machines in there.

The only real issue I see is that most developers have literally no idea of the runtime resources needed by their code, for a number of reasons (like runtimes hiding that kind of information, and the general mindset that pushing out new releases/features is more important than tuning existing ones), so in the cloud developers will just provision bigger and bigger environments. It's all fun and games until two things happen: (A) environment provisioning in the cloud takes a long time, just like on developer machines, and (B) company earnings are eroded by cloud infrastructure bills (on-prem infrastructure OTOH provides tax shielding).


I would expand your point to add that another key missing part of dev on a big distributed platform is being able to run parts of the system locally. For some shops it is a lot harder than a simple docker-compose (think of envs with 10s or 100s of microservices); no laptop can handle such a load, and it is critical for devs to be able to work on their own machine, otherwise you lose a lot of time with shared dev envs, SSH tunnels or rsync... I agree that CI pipelines are a pain in a distributed system, but I've heard more complaints from devs not being able to locally test some new feature in serviceA that depends on 10 other services + DBs to work.


Running locally is great, but I would already be happy if I could step through a CI/CD pipeline with a debugger. This includes stepping through the services the pipeline calls. Also include breakpoints.


I sort of fell into distributed systems testing in 2019 after leaving my job as a dev for a startup and I’ve loved it since. It is such a fun area to work in. These resources look great :)


What are you doing with testing that interests you the most?


This is interesting, I was looking this week into using the new distributed tracing tools in CI, to catch slow queries or problematic dependencies during integration tests. There is already good support (in Python at least) for generating test reports, benchmark summaries, code coverage in nice static HTML files that can be exposed as artifacts from any CI platform.

I was surprised to see that while there is so much tooling in opentelemetry (protocols, agents, SaaS) there doesn't seem to be any option for generating static reports right now. Is my plan just wrong?


This is crazy cool to see pop up on the front page as learning how to build, manage, and ship distributed systems feels like the next evolution of software engineering. We have all these tools for doing so for a single binary - great IDEs, easily-accessed logs, debuggers, etc. - but our tooling for working with multiple systems running in parallel hasn't caught up yet.

This is the ethos of what we're building at https://www.kurtosistech.com/ : expecting devs to learn the nuances of Docker and Kubernetes is like expecting them to code in assembly. You can do it, but there are higher levels of abstraction that are more effective (e.g. "I want this binary talking to this binary on this port with these args; you figure out the rest").

The hypothesis is that a unified platform for packaging a microservice, wiring it together with other services, setting up dev & testing environments, hooking it into CI, and furnishing a toolkit for debugging them (with some of the tools on this list, like a built-in Chaos Monkey) can reduce the cost of developing distributed systems by an order of magnitude. If we can achieve that, does managing a distributed system become a single-developer job rather than a whole-team job?


This is a goldmine, thanks!


Thanks for posting this.


+1, this is an amazing collection of resources -- Jepsen and TLA+ I was aware of, but there's so much more out there.

The summary papers from cloud providers are also especially useful.


Whilst not making light of the complexity of the field:

Distributed systems are honestly just a toy for the tech monopolies.

Nobody I've ever heard of in a company with less than a billion dollars of revenue is bothering.

Who ever thought "let's build a software program which has major issues dealing with clock errors!"

My day job doing distributed compute at a "main street" company is/was completely eaten by "why the hell would you even, just throw Postgres on a 64 core 2TB RAM machine with replication."

Why indeed.


But in a way you are doing distributed systems; that's what the "with replication" part of your architecture is saying. Now you can choose to never ever test that your replication setup is working, but I think that "distributed systems testing is for big tech" is bad advice. Something that would work pretty well in your setup: take down the Postgres leader, and see if your website is still up. That's already Chaos Engineering.
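That experiment can be scripted as a first chaos test; a hedged sketch, where the container name, URL and health endpoint are all hypothetical and the fault injection assumes the leader runs in Docker:

```python
import subprocess
import urllib.error
import urllib.request

def service_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """True if the health endpoint answers HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def stop_container(name: str) -> None:
    """Inject the fault: stop a container by name (requires Docker)."""
    subprocess.run(["docker", "stop", name], check=True)

# The experiment itself (run against a real environment; names are made up):
#   assert service_is_healthy("http://localhost:8080/health")  # baseline works
#   stop_container("pg-leader")                                # kill the leader
#   assert service_is_healthy("http://localhost:8080/health")  # still up? failover works
```

If the second assertion fails, you've learned your replication setup doesn't actually fail over, which is exactly the kind of surprise chaos testing is meant to surface before production does.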


Well yes exactly.

That's kind of why I brought up the idea of clock errors.

Single leader distributed systems are honestly an entirely different paradigm to multi-leader. You have a main node, the main node controls routing and which data goes where.

There can only be one main node at one time, hence no latency/clock errors/multi leader fail-over/PAXOS/voting/stale keys/exotic application specific conflict resolution logic.

Done, simple.

(Secretly not so simple, all of the above problems are actually hugely important to resolve in any monolithic database. However, Postgres sorts it out for you.)

Really, what I am calling "distributed systems problems" are the subset with multiple simultaneous, geographically separate leaders.


> There can only be one main node at one time, hence no latency/clock errors/multi leader fail-over/PAXOS/voting/stale keys/exotic application specific conflict resolution logic.

This is not accurate.

You can certainly have clock errors in "single leader distributed systems", simply by virtue of the fact that the primary rotates throughout the cluster.

And, perhaps even more surprisingly, you can certainly have clock errors in "single node non-distributed systems like Postgres", i.e. clock errors are completely orthogonal to distributed consensus protocols (unless you're doing something dangerous like leader leases, which I don't recommend).

The problem is simple: what if the system time strobes or gets stuck in the future? How does this affect things like recording financial transactions? What if NTP stops working? Distributed systems aside, it's pretty tricky.
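One standard mitigation for a wall clock that steps backwards is to never hand out a smaller timestamp than the last one. A toy sketch of the idea; this is not TigerBeetle's actual algorithm (which combines multiple clocks across replicas), and the simulated clock readings are made up:

```python
import time

class MonotonicTimestamper:
    """Assign strictly increasing transaction timestamps, even if the
    wall clock steps backwards (e.g. a bad NTP correction)."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable for testing
        self._last_ns = 0

    def next_timestamp_ns(self) -> int:
        now_ns = int(self._clock() * 1_000_000_000)
        # If the wall clock regressed or stalled, advance by 1ns instead.
        self._last_ns = max(now_ns, self._last_ns + 1)
        return self._last_ns

# Simulate a wall clock that jumps backwards between readings (200s -> 150s).
readings = iter([100.0, 200.0, 150.0, 201.0])
ts = MonotonicTimestamper(clock=lambda: next(readings))
stamps = [ts.next_timestamp_ns() for _ in range(4)]
assert all(a < b for a, b in zip(stamps, stamps[1:]))  # strictly increasing
```

Note the trade-off: while the clock is wrong, assigned timestamps drift away from true wall time, which is precisely why systems that care (financial ledgers among them) need to detect and bound clock error rather than just paper over it.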

I've worked on this problem of clock errors specifically for TigerBeetle [1], a financial accounting database that can process a million transactions a second. In our domain, we can't afford to give transactions the wrong timestamp, because this could mean that financial transactions (e.g. credit card payments) don't roll back if the other party to the transaction times out, so liquidity could get stuck and tied up for the duration of the clock error.

[1] https://www.tigerbeetle.com/post/three-clocks-are-better-tha...


Fair point.


>> There can only be one main node at one time

This is how most distributed systems solve the consensus problem: the nodes automatically run a leader election algorithm so that eventually only one leader remains.
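The simplest illustration of that idea is a bully-style election: among the reachable nodes, the highest ID wins. A toy sketch with made-up node IDs; real systems (Raft, ZooKeeper's ZAB) add terms, quorums and replicated logs on top of this:

```python
def elect_leader(node_ids, reachable):
    """Bully-style election from one node's point of view: among the
    nodes it can reach (including itself), the highest ID becomes leader.
    `reachable` is the set of node IDs that answered a ping (hypothetical)."""
    candidates = [n for n in node_ids if n in reachable]
    if not candidates:
        raise RuntimeError("no reachable nodes; cannot elect a leader")
    return max(candidates)

nodes = [1, 2, 3, 4, 5]
assert elect_leader(nodes, reachable={1, 2, 3, 4, 5}) == 5
# Node 5 partitioned away: the remaining nodes converge on 4.
assert elect_leader(nodes, reachable={1, 2, 3, 4}) == 4
```

Note that without a quorum requirement this toy version happily elects two leaders on either side of a partition (split brain); avoiding that is exactly what the consensus protocols are for.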

>> (Secretly not so simple, all of the above problems are actually hugely important to resolve in any monolithic database. However, Postgres sorts it out for you.)

Except that in the case of a network partition of the primary node, Postgres can't promote a new primary from the replicas automatically, unlike consensus systems such as etcd/ZooKeeper.


In a way, yes. But that’s not what most people are referring to when they talk about “distributed systems”, so I think you’ve constructed a straw man.


What do those people mean by "distributed systems"? Are they running a single-node server with the web app and database on the same node? Is there a load balancer/cache/replica/cluster? If it's more than one machine, it's distributed.

In your opinion what would qualify as a "distributed system"?


"Distributed system" probably isn't precise enough in and of itself.

But I was very much thinking of multiple leader distributed systems.

Anecdotal. But dealing with clock errors and multi-leader issues are of far greater complexity than caching data.

Replicas, load balancers, and all the vagaries of single leader systems are not absurdly difficult technical problems.

They're problems that can be solved by a team with a single senior engineer who knows what they're doing.


I'm not taking a position on what would qualify as a distributed system, and neither was that the point of my comment.

I believe that when most people talk about "distributed systems", they're talking about some form of microservices architecture, where components of a single system are arbitrarily separated by network requests for nebulous reasons.


> Distributed systems are honestly just a toy for the tech monopolies.

Not so, decentralized systems are also inherently distributed. The whole point of both the old "Web 3.0" and the newer "Web3" is to make "distributed" thinking a part of the web, beyond the now trivial "linking to a document-like resource on a different host" scenario that defines hypertext/the "Web of documents".

(Of course one may then retort, as Jack Dorsey apparently has, that "Web3" itself is merely "a VCs' plaything". One has to be aware that these developments are somewhat speculative and pretty far from a real consensus.)


That's OK until you reach the logical IO and compute capacity of off-the-shelf PC hardware. Also, 20 years of insane coupling and complexity makes it pretty difficult to move away from that architecture, which is where most companies seem to end up, from experience.

Also it’s a complete fallacy that “if we’re successful or grow large enough, we’ll just rewrite it”.

And then there are the logical problems of managing complexity, distributing work to development teams, and even things like determining change impact, which become terribly difficult.

I say we should design with distribution in mind but deploy without it. Assuming you can scale up forever is an expensive and stupid mistake.

Incidentally, it turns out that even adding any deferred processing (in our case SQS + workers) to simple LOB systems can introduce nasty problems like clock skew and QoS to consider, which are very much distributed-systems pains :)


> 20 years of insane coupling and complexity makes it pretty difficult to move away from that architecture, which is where most companies seem to end up, from experience.

Pretty much. I don't want to generalise for legacy context however.

> I say we should design with distribution in mind but deploy without it.

Oh yes, absolutely. You can pretty reasonably predict when your compute needs will scale up such that you genuinely need to make the shift, and plan for it with a modular architecture.

> Assuming you can scale up forever is an expensive and stupid mistake.

See here's the thing. I think it has been, in the past.

I'm definitely sure that there will be applications in the future which will scale past single leader, multiple nodes.

What I've seen in the last few years was we just didn't ever have those demands (except for deep learning).

I wasn't working for a small business, but at a genuine mega-corp.

The total production data inflows and egresses of my mega-corp never peaked past, I think, 200MB/s in any year (for business-critical systems; there was some user-facing video that was jettison-able). The daily peak was far lower than that.

All production needs, bandwidth and compute alike, were DWARFED by employees on Zoom and YouTube.

The sum total of all proprietary OLTP data, across the company? And we had roughly 14 different legacy proprietary business systems from acquisitions. We had COBOL, we had DB2, we had it all.

The architect of our consolidation had it at less than 5TB (excluding photographs, backups and duplication etc).

40+ years of business critical OLTP data. And all the analytics was done on trailing views of the replicas.

Given the direction of Moore's law, I can safely say that for my former multi-billion-dollar employer, it's going to outpace any of our business requirements.

(Except for photographs. But all our deep learning was being done by a spin out).


One size doesn’t fit all. Wait until you’re bought by another multi billion dollar employer and expected to scale out to their workload…


Oh definitely. I'm not saying it doesn't happen.


If you have engineers and manpower good enough for distributed systems, you can also optimize your systems enough to fit on a single machine again, if you are not a billion-dollar company.

Distributed systems are exciting and interesting, but currently not worth their investment for 99% of the cases. Things might change though in some decades.


Generally speaking, the second you're running tricks to "jam" something that's overly large into a single machine, yeah you should just give up.

The issue becomes when you reach the scale of compute needs such that you cannot "jam" all your requirements into a single leader multiple node cluster.

The amount of data you'll need for THAT is becoming unimaginable.

64 core 128 thread leader nodes have outrageous potential to scale.

That said, that doesn't quite apply to my Postgres example (barring some nifty DBA tricks and third party extensions).

But you can appreciate how much data you're going to need to blow past what 128 threads can do for you.


Actually 20 years of shitty code built on assumptions like this can break an AWS 96 core instance running SQL server. That’s my life.

It does however shift 10k shitty queries per second which is impressive.


Been there done that. That gave us two years and cost a year of developer time.


Distributed systems are not limited to distributed computing: once you start breaking your monolith application into multiple services, you have yourself a distributed system.


Distributed systems are really about persistent state. As soon as you have/need that, you run into these problems. If you need to deploy while maintaining state (like DB contents), or fail gracefully without losing your entire DB, then you have entered the realm of distributed systems. These properties are present even in most "monoliths".


A bit of a shameless plug here, but I'm a Ph.D. student at Carnegie Mellon University (Ph.D. in Software Engineering) and my thesis is on the resilience of microservice applications at scale. We've published a little bit on this topic, with some new work under submission currently, but I've put together a website that talks about a.) why doing this work as a student is very difficult due to the lack of open-source applications in this style and b.) proposed a new technique for addressing these problems.

It's called Filibuster:

https://filibuster.cloud/index.html

Example of testing a small Java application with Filibuster:

https://www.youtube.com/watch?v=BhZLHpxQ7mI


I think there's a fair bit of microservices in the CNCF space. Cortex is implemented as micro services (and Kubernetes largely is, too)


Plenty of non-tech-monopolies out there with more than $1bn revenue. Also, distribution is not just relevant on the basis of data volume but also, e.g., to accommodate a variety of use cases. Don't get me wrong, Postgres is a great tool, but the moment you do replication, the only reason you potentially don't have to test your distributed system is that other people did it for you, not that it's not relevant.


High availability + uptime is now necessary for any business no matter how small. If you need to run 24/7 available high uptime services you can’t avoid distributed systems one way or another. (Distributed systems vary by complexity of course but anything that has some failover type of mechanism is going to be a distributed system )

Also, industry is now transitioning where even small business might need geo distribution , not just for compute but also for data. That is going to make distributed systems even more complex even for the small guys.

Cloud services have made a lot of this stuff easier for the average Joe, but you can't really avoid dealing with at least some fundamental concepts of distributed systems.


Not sure I agree. My Hetzner VPS has less downtime than some AWS regions. To be honest, I have yet to experience any downtime from them over multiple years. There is a good chance you introduce more potential downtime by blowing up the complexity with a distributed system.



