Towards zero-downtime upgrades of stateful systems

vlovich123 · 2024-03-09T04:13:22 1709957602

This seems like a problem you can’t solve generically and you always end up making trade offs. The main challenges I can think of:

* State in old system isn’t representable in new system (not a backwards compatible upgrade or more likely a bug exists in handling the new state)

* There’s state outside of the program that’s impossible to transition gracefully (e.g. a dirtied IO socket where you don’t know what its state is & it’s a resource owned outside of your program)

* Transitioning state means there’s a possibility of failure because the program never reaches a graceful transition point to snapshot the state. So you have to choose between running the old program forever and abandoning the graceful state transition anyway.

Distributed systems I’ve observed pick one of two strategies:

1. Using the load balancer strategy of migrating off the old version & then terminating it after some grace period.

2. Using a formal distributed state system like CockroachDB, YugabyteDB, DynamoDB, S3, etc.

This is probably a big reason why most programs use external storage solutions even if they’re less efficient - it centralizes maintenance of state onto a system that has well defined semantics and can handle repair transparently.

stevan · 2024-03-09T08:16:11 1709972171

> This seems like a problem you can’t solve generically and you always end up making trade offs.

That shouldn't stop us from solving the problem in the cases where it's possible though? We can tackle the corner cases separately with manual overrides.

> This is probably a big reason why most programs use external storage solutions even if they’re less efficient - it centralizes maintenance of state onto a system that has well defined semantics and can handle repair transparently.

This is certainly the case today, what I'm asking is: does it always have to be like that in the future?

vlovich123 · 2024-03-09T08:20:26 1709972426

I suspect that it’s impossible in the sense that the “possible” space will look like a distributed storage solution and the rest will look similar to graceful handoff of new connections to new version + shutdown of old version after some time (with forceful disconnect of sessions hanging around).
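A minimal sketch of that handoff, assuming a toy in-process model rather than a real load balancer. `Instance`, `Router`, and `reap` are hypothetical names made up for illustration: new sessions go to the new version, the old version drains, and whatever is still attached when the grace period ends is dropped.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One running version of the service (hypothetical model)."""
    version: str
    sessions: set = field(default_factory=set)
    accepting: bool = True


class Router:
    """Sends new sessions to the newest instance that is still accepting."""

    def __init__(self, instance):
        self.instances = [instance]

    def open_session(self, session_id):
        target = next(i for i in reversed(self.instances) if i.accepting)
        target.sessions.add(session_id)
        return target.version

    def close_session(self, session_id):
        for inst in self.instances:
            inst.sessions.discard(session_id)

    def upgrade(self, new_instance):
        # Old instances stop taking new sessions but keep their existing ones.
        for inst in self.instances:
            inst.accepting = False
        self.instances.append(new_instance)

    def reap(self, old, deadline, now=time.monotonic):
        """Retire `old` once it is idle or the grace period is over.

        Returns the set of sessions that were forcefully disconnected,
        or None if the instance is still draining.
        """
        if old.sessions and now() < deadline:
            return None
        dropped = set(old.sessions)
        old.sessions.clear()
        self.instances.remove(old)
        return dropped
```

The `now` parameter is only there so the deadline can be tested without sleeping. Note that nothing in this model transfers state: sessions that outlive the deadline simply lose it.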

stevan · 2024-03-09T09:31:23 1709976683

I give two examples of stateful upgrades in Erlang/OTP in the motivation; neither relies on distributed storage.

vlovich123 · 2024-03-09T11:02:57 1709982177

Unfortunately the documentation for Erlang doesn’t really describe any pros/cons for anything and I’m not an expert in it, so I don’t know what the limitations of the Erlang approach are, but there certainly must be some (e.g. if you have long running sessions and do several upgrades, are you running N versions of the code & eating up RAM because the old sessions aren’t complete?).

As I understand it, Erlang/OTP captures the entire state of the program and it’s a feature of the language and VM to accomplish this. It’s not something you can retrofit into any arbitrary language. For example, your JS app or your Python app or your Rust app won’t be able to do the same easily which means it won’t be robust and it will be error prone. Thus I stand by that there’s no “generic” solution you can bolt onto an arbitrary language.

stevan · 2024-03-09T11:24:36 1709983476

> if you have long running sessions and do several upgrades, are you running N versions of the code & eating up RAM because the old sessions aren’t complete?

I believe Erlang supports two versions running alongside each other. They capped it at two because back when this was developed there wasn't enough RAM. Joe Armstrong gave at least one talk where he says he'd have liked to support an arbitrary number of versions and garbage collect them as old sessions complete.
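To make the "arbitrary number of versions" idea concrete, here is a rough sketch. It is a toy model, not how the Erlang VM works, and `VersionRegistry` and its methods are hypothetical names: each session is pinned to the version it started on, and a superseded version is dropped once its last session ends.

```python
class VersionRegistry:
    """Keeps every code version alive while at least one session uses it."""

    def __init__(self):
        self.handlers = {}   # version -> handler function
        self.in_use = {}     # version -> number of sessions pinned to it
        self.current = None

    def load(self, version, handler):
        self.handlers[version] = handler
        self.in_use[version] = 0
        previous, self.current = self.current, version
        self._collect(previous)

    def start_session(self):
        # New sessions always start on the newest version.
        self.in_use[self.current] += 1
        return self.current

    def end_session(self, version):
        self.in_use[version] -= 1
        self._collect(version)

    def _collect(self, version):
        # Drop a superseded version as soon as nothing is pinned to it.
        if (version is not None and version != self.current
                and self.in_use.get(version) == 0):
            del self.handlers[version]
            del self.in_use[version]
```

The cost raised earlier in the thread shows up directly: `handlers` grows with every upgrade for as long as old sessions stay open.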

> Thus I stand by that there’s no “generic” solution you can bolt onto an arbitrary language.

The main point of the post is centered around Barbara Liskov saying "maybe we need languages that are a little bit more complete now". I'm not interested in the limitations of current languages, I'm interested in the future possibilities.

vlovich123 · 2024-03-09T15:55:39 1709999739

There’s no free lunch and I’m suggesting the trade offs to support this are not worth it vs simpler approaches of doing a graceful drain & upgrade approach w/ a timeout for long running sessions if those may exist (+ if you have a lot of large state to migrate, it could be insanely long to complete an upgrade). This is because availability will never be 100% anyway in any scenario and this kind of transition can easily fit within your failure budget.

toast0 · 2024-03-09T12:26:59 1709987219

> As I understand it, Erlang/OTP captures the entire state of the program and it’s a feature of the language and VM to accomplish this. It’s not something you can retrofit into any arbitrary language. For example, your JS app or your Python app or your Rust app won’t be able to do the same easily which means it won’t be robust and it will be error prone. Thus I stand by that there’s no “generic” solution you can bolt onto an arbitrary language.

I say you can do hotload in any language that supports dlsym/dlopen or eval. I've done it (rather poorly) in Perl and C, and I'm sure others have done it in other languages.
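For illustration, the eval route might look roughly like this in Python. It is a bare-bones sketch: `load_module`, `Server`, and the two handler sources are made up for the example, and it ignores everything hard (state shape changes, in-flight calls, threads).

```python
import types


def load_module(name, source):
    """Compile `source` into a fresh module object (the 'eval' route)."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


V1 = """
def handle(state, msg):
    state["count"] += 1
    return f"v1 saw {msg}"
"""

V2 = """
def handle(state, msg):
    state["count"] += 1
    return f"v2 saw {msg} (message #{state['count']})"
"""


class Server:
    def __init__(self, source):
        self.state = {"count": 0}           # survives code swaps
        self.code = load_module("handler", source)

    def hot_swap(self, source):
        # Only the code is replaced; self.state is left untouched.
        self.code = load_module("handler", source)

    def handle(self, msg):
        return self.code.handle(self.state, msg)
```

After `Server(V1)` handles one message and `hot_swap(V2)` is called, the next message is handled by the v2 code and still sees the count accumulated under v1.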

It's a lot nicer in Erlang, so IMHO, if your use case includes long running processes with expensive to construct or transfer state (such as long running sockets), it's worth considering Erlang or something that can do hot loading.

gregors · 2024-03-09T14:06:12 1709993172

Don't know if you care this much or not, but figured I'd link this Elixir talk that goes into details regarding hot upgrades.

https://www.youtube.com/watch?v=IeUF48vSxwI

vlovich123 · 2024-03-09T15:52:34 1709999554

That’s a great link, thanks! It really makes it clear that a) state changes aren’t automatically correct (there’s both a manual and automated piece and either can go wrong) b) while the language makes it possible, there’s still a lot of manual work involved & footguns (e.g. if you have a contended resource held while something is being migrated, you’re going to experience degraded availability for other sessions to the point of downtime).

Rygian · 2024-03-09T08:58:44 1709974724

> * State in old system isn’t representable in new system (not a backwards compatible upgrade or more likely a bug exists in handling the new state)

New system starts in a backwards compatible mode where it accepts all state that was representable in old system. Transition is achieved after the upgrade, with flag variables.
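A small sketch of what that could look like, assuming a made-up state format where the old version stores a timeout in whole seconds and the new one stores milliseconds. `SessionStore` and the `write_v2` flag are hypothetical names.

```python
import json


class SessionStore:
    """New-version code that starts out in backwards compatible mode."""

    def __init__(self, write_v2=False):
        # Flag variable: flipped only once every node runs the new version.
        self.write_v2 = write_v2

    def decode(self, raw):
        """Accept both formats; always return the timeout in milliseconds."""
        data = json.loads(raw)
        if data.get("v") == 2:
            return data["timeout_ms"]
        return data["timeout"] * 1000       # old format: whole seconds

    def encode(self, timeout_ms):
        if self.write_v2:
            return json.dumps({"v": 2, "timeout_ms": timeout_ms})
        # Still in compat mode: emit what old readers understand.
        # Sub-second precision is lost here until the flag is flipped.
        return json.dumps({"timeout": timeout_ms // 1000})
```

The truncation in `encode` is the kind of spot where new state is not representable in the old format, so the flag has to stay off until no old reader is left.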

vlovich123 · 2024-03-09T11:04:21 1709982261

Yes, but I’ve seen people regularly struggle to write code that accepts all back compat state + handles it correctly. It’s a very hard problem and bugs are very real. Not convinced it’s a better strategy to go for seamless transition for every session vs other approaches.

stevan · 2024-03-09T11:28:14 1709983694

> I’ve seen people regularly struggle to write code that accepts all back compat state + handles it correctly.

From the post:

> In a world where software systems are expected to evolve over time, wouldn’t it be neat if programming languages provided some notion of upgrade and could typecheck our code across versions, as opposed to merely typechecking a version of the code in isolation from the next?

vlovich123 · 2024-03-09T15:39:16 1709998756

That’s not an answer because type checking only protects against that class of errors. But you can have a logic bug in your upgrade code `if (old state) { buggy implementation of old compat } else { implementation for new state }` (or `convert(old state) -> new state` if the conversion is external). If you transparently just run the old code instead, then you’re not actually transferring state seamlessly and you run into the choice of “run N versions of code vs terminate sessions” when you have long running sessions. In any case, I think you start to run into real constraints and it’s not clear to me how Erlang solves these with its “2 simultaneous versions of the code only”.

toast0 · 2024-03-09T16:55:40 1710003340

> I think you start to run into real constraints and it’s not clear to me how Erlang solves these with its “2 simultaneous versions of the code only”.

There's different ways to handle it, but the 'easy' way is to write your new version that updates the old state to new state on first touch, and make sure either you have a no-op message you can send so the state gets updated or you have some periodic thing that means every state will be updated in X time.
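Sketched in Python rather than Erlang, and with an invented state change (a single `name` field split into `first` and `last`), the update-on-first-touch pattern could look something like this. `migrate`, `Session`, and `sweep` are hypothetical names.

```python
def migrate(state):
    """Bring a state dict to the newest shape; a no-op if already current."""
    if state.get("version", 1) == 1:
        # Hypothetical change: v1 kept one "name" field, v2 splits it.
        first, _, last = state.pop("name").partition(" ")
        state.update(version=2, first=first, last=last)
    return state


class Session:
    def __init__(self, state):
        self.state = state

    def handle(self, msg):
        self.state = migrate(self.state)    # update on first touch
        if msg == "noop":
            return None                     # exists only to force the migration
        return f"{msg} for {self.state['first']}"


def sweep(sessions):
    """Periodic job: send a no-op so idle sessions get migrated too."""
    for session in sessions:
        session.handle("noop")
```

Once a sweep has reached every session, no state of the old shape remains and the old code can be dropped.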

There's also a way to have a 'try_update' that fails without changing anything if there are two versions active already. (Or you can just YOLO and anything with the old old version in the stack gets killed).

I'm not sure if there's better tooling for it now, but there wasn't anything to help you test transitions when I was doing it. For automated tests, you'd need to build a state with the old version, load the new version, run a test, etc. It's the same hole in testing if you do mixed version frontends against a shared database, or mixed version frontend vs backend; it's just more apparent because it's on a single system.
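The shape of such a test, as a hedged sketch with two made-up versions of a hit counter (all function names here are hypothetical): build state by running the old code, apply the upgrade, keep running with the new code, and check an invariant that should hold across the switch.

```python
def v1_init():
    return {"hits": 0}


def v1_handle(state, msg):
    state["hits"] += 1
    return state


def v2_upgrade(old):
    # Hypothetical schema change: v2 tracks hits per message type.
    return {"version": 2, "hits": {"unknown": old["hits"]}}


def v2_handle(state, msg):
    state["hits"][msg] = state["hits"].get(msg, 0) + 1
    return state


def check_transition(messages_before, messages_after):
    """Build state with the old code, switch to the new code, keep going."""
    state = v1_init()
    for msg in messages_before:
        state = v1_handle(state, msg)
    state = v2_upgrade(state)           # the step that ordinary tests skip
    for msg in messages_after:
        state = v2_handle(state, msg)
    # Invariant: no hit is lost or invented by the upgrade.
    total = sum(state["hits"].values())
    assert total == len(messages_before) + len(messages_after), state
    return state
```

Running `check_transition` over many different split points in a message sequence is one cheap way to exercise the upgrade from more than just the empty state.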

Rygian · 2024-03-09T11:26:32 1709983592

Writing code without bugs is indeed a hard problem. Not only for upgrade paths.

Sometimes the tradeoffs of the application make it worth spending QA resources to validate a staggered upgrade path.

naasking · 2024-03-09T04:54:57 1709960097

> This is probably a big reason why most programs use external storage solutions even if they’re less efficient - it centralizes maintenance of state onto a system that has well defined semantics and can handle repair transparently.

Indeed, now contrast with the article on HN discussing orthogonal persistence.

mdaniel · 2024-03-09T18:18:10 1710008290

because the front page is a fickle thing: https://news.ycombinator.com/item?id=39615228

zubairq · 2024-03-09T04:33:44 1709958824

I love this and often think about how we can write more robust systems without having to rewrite the whole thing when the implementation of just one part changes. State has a way of becoming a global variable which all modules depend on

KingOfCoders · 2024-03-15T04:57:17 1710478637

"There’s one exception, that I know of, where upgrades are talked about from within the language: Erlang/OTP"

Java OSGi. Great tech, didn't take off, probably too complicated with not enough gain.

AtlasBarfed · 2024-03-09T03:42:44 1709955764

Use Cassandra?