In a distributed setup, I imagine there could be cases where you want to atomically hot upgrade multiple VMs at the same time. Is this common in practice, and if so, are there recommended patterns/techniques for doing it?



Erlang does have a mechanism that allows a module to control when it moves from the "old version" to the "new version" of its own code. Calls to the module with the fully qualified name (e.g. `module:function()`) will invoke the "new code" once it's loaded, but calls within that module using only function names (just `function()`) will continue to invoke the "old code".
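A minimal sketch of that distinction (the module name and the `upgrade` message here are hypothetical):

    -module(worker).
    -export([loop/0]).

    loop() ->
        receive
            upgrade ->
                %% Fully qualified call: continues in the NEW version
                %% of the module, if one has been loaded.
                worker:loop();
            _Msg ->
                %% Local call: keeps running the OLD version.
                loop()
        end.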

If the portion of the app you were hot upgrading was an OTP process like a GenServer, you could wait on some sort of atomic coordination mechanism and only make that fully qualified function call once the new code has loaded, at least in theory.

We use hot code reloading at my work, but haven't had a reason to atomically sync the reload. Most of the time it's a tmux session with `synchronize-panes`, and that suffices. If your application can handle upgrades within a module smoothly, it's rare to need cluster-level coordination of a code change, let alone atomic coordination.


There can't be anything atomic in a distributed system. You can't even atomically hot upgrade on a single VM anyway -- you instead load the new version of the module and let the dispatcher know to route new calls into it, the same as you would do with a load balancer and a bunch of load-bearing Docker hosts, just inside your app.
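For reference, a shell sketch of what that single-VM load step looks like (`my_module` is hypothetical):

    1> code:purge(my_module).      %% drop any lingering old version
    false
    2> code:load_file(my_module).  %% load the new .beam; the previous
                                   %% version becomes "old code"
    {module,my_module}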


Erlang has a code_change callback in OTP that allows a gen_server to update its current state and start using the new code. No client connections need be broken, no long-running processes need be stopped. It's just updated in place.

It's not just a routing change.

https://www.erlang.org/docs/24/man/gen_server
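For anyone who hasn't seen it, code_change/3 looks roughly like this; a minimal sketch inside a hypothetical gen_server, migrating an old two-element state tuple to a new record shape:

    %% Old state was {state, Count}; the new record adds a field.
    -record(state, {count, ratio = 1.0}).

    code_change(_OldVsn, {state, Count}, _Extra) ->
        %% Rewrite the old state into the new shape, in place. The
        %% process keeps its pid, mailbox, and client connections.
        {ok, #state{count = Count}}.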


It's a routing change in the sense that the gen_server is routing function calls to the new module definition. I know about gen_server and code_change; the point was that it's conceptually the same mechanism, just at a different level of abstraction.


Routing in-progress connections to a new module seems a rather different thing to me than merely routing new ones.


I mean, yes, there are cases where you want that. But there's no mechanism for it, because you would have to stop the world, do the load, and then resume.

Even within a single VM, hot loading doesn't stop the world; during the load, some schedulers will switch before others. There are guarantees, though: when a process runs new code and sends a message to another local process, that process will have the new code available when it reads the message. (It may still be running the old code when it does, depending on how it's called.)

Dealing with multiple active versions is part of life in most distributed systems, though. You can architect it away in some systems, but that usually means taking downtime in maintenance windows.

A typical pattern is making progressive updates: if you want to change a request, first you deploy a server that can handle both old and new requests, then you deploy the client that sends the new request, and finally you deploy a server that no longer accepts old requests.
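A sketch of that intermediate server step, with hypothetical request shapes; the same server accepts both forms during the transition:

    -module(compat_server).
    -export([handle_request/1]).

    %% Old-style request, still sent by not-yet-upgraded clients.
    handle_request({get_user, Id}) ->
        lookup(Id, #{});
    %% New-style request with options, sent by upgraded clients.
    handle_request({get_user, Id, Opts}) ->
        lookup(Id, Opts).

    lookup(Id, Opts) ->
        %% Stand-in for the real lookup.
        {ok, {user, Id, Opts}}.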

For new replies, if the new reply only comes in response to a new request, that works like the above: a client that sent a new request must handle the new reply. Otherwise, update the clients to handle either type of reply, then update the server to send the new reply, and finally remove handling of the old reply from the clients.
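And the client-side counterpart for the reply transition, again with hypothetical shapes; the client handles either reply until every server sends the new one:

    -module(compat_client).
    -export([handle_reply/1]).

    %% Old reply shape, from not-yet-upgraded servers.
    handle_reply({user, Id}) ->
        {ok, Id, #{}};
    %% New reply shape with metadata.
    handle_reply({user, Id, Meta}) ->
        {ok, Id, Meta}.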

It gets a bit harder if your team dynamics mean one person/group doesn't control both sides... Then you need stats to tell you when all the clients have switched.

Sometimes you do need more of a point-in-time switch. If "pretty good" is good enough, you can just set a config flag through a dist 'broadcast'. If it needs to be better than that, you can have the servers and clients change behavior at a specific time -- but make sure you understand the realities of clock synchronization, and think about what to do for requests in flight. If that's still not good enough, you can drop or buffer requests for a little bit before your target time, make sure there are no in-progress requests, then resume processing requests with the new version.
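A sketch of the timed variant, assuming reasonably synchronized clocks; the cutover timestamp and version tags are hypothetical:

    -module(cutover).
    -export([pick_version/0]).

    %% Agreed cutover instant, Unix seconds (hypothetical value).
    -define(SWITCH_AT, 1767225600).

    pick_version() ->
        %% Every node compares wall-clock time to the agreed instant,
        %% so they all flip at (roughly) the same moment.
        case erlang:system_time(second) >= ?SWITCH_AT of
            true  -> v2;
            false -> v1
        end.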



