Rainbow Deployments with Kubernetes (dimcheff.com)
157 points by bdimcheff on Feb 14, 2018 | 38 comments



It feels like it should be possible to fix the reconnect experience, especially in the case of a planned termination of the underlying container: if you ask the client to reconnect rather than abruptly disconnecting it, it could even wait until its new session is fully established before dropping the old one.

That doesn't take away from my appreciation of the pattern, though: I'm very much in favour of rolling releases forwards, rather than being limited to two colours.


We actually have this functionality as well: we can send a signal to the process which will cause it to display a message asking the user to reload to upgrade. I've seen a similar feature in Slack and Riot as well.

Making this handoff automatic is definitely possible as well, though we do want people to reload occasionally to get new client-side code.


I worked on an Autobahn-like thing. The server side could tell clients to reconnect. Useful for draining servers, etc.

I haven't checked to see if autobahn has this functionality.

https://crossbar.io/autobahn/


I agree. I can think of reasons for urgent updates - kernel security, software bugs, etc. - and this feels like it would leave their engineers supporting weeks-old versions of their systems. And they're running a multiple of the servers they actually need.


Question for the OP:

I haven't ever worked on chat services, so this may not be reasonable. Would it be possible to use some other termination endpoint in front of the service, one that lets you maintain persistent connections to the clients but makes for more transparent swaps of the backend services?

So, for example, could you leverage nginx or haproxy as the "termination" point for the chat connection, with those proxying back to the Kubernetes service endpoint, which then proxies back to the real backend service? When you go to swap out the backend service, nginx/haproxy would start forwarding to the new service transparently while still maintaining the long-lived connection with the client.

If this were doable, you'd only have to drain when you needed to swap out the proxy layer, which is likely a less frequent task, and that gives you more agility with your backend services.


Yes, as lotyrin pointed out below, the backend is already effectively an XMPP <-> Websockets bridge. Every time a user logs in, one way or the other we need to establish a connection between the websockets backend and the XMPP backend. The backend does do other things besides simply proxy, and we certainly could separate out the part that needs to keep the state. This deployment strategy is effectively a way to avoid having to do that work, at least for now.


This is essentially how Pushpin (http://pushpin.org) works. It can hold a raw WebSocket connection open with a client, but it speaks HTTP to the backend server, and the backend can be restarted without the client noticing.


There's an nginx module for this functionality:

https://nchan.io/

I have used it before. Super easy to set up, even with Kubernetes.


Interesting, I didn't think Nchan would be able to drive a raw WebSocket session with the client, but upon closer look it seems like it might be possible using the auth and message forwarding hooks. Very cool.


It sounds like the product is already a proxy (WebSocket -> XMPP), so I'm not sure what exactly they're deploying multiple times per day.

It also seems like they could have done simple blue/green and extended their WebSocket protocol and client to support a hand-off message ("hey, there's a new version of the proxy, reconnect in x seconds"), with idempotent messages, so that everyone can reconnect and then disconnect gradually over a period of minutes instead of hours, without sudden spikes or any interruption for clients.


Or just a generic reconnect in case the websocket connection breaks for any reason really.


That would require an interruption in service between the server closing the old connection and the client reestablishing a new one, which they sought to avoid.


In practice you'd probably want a pretty thin proxy layer, as you said, which then forwards requests to other services as needed... but you'd still need to redeploy that proxy layer, and would thus need a similar solution to the one proposed in the article.


Seems like the kind of thing that a Deployment should be able to manage on its own... some kind of DrainPolicy object maybe?

Also, if the previous ReplicaSet a Deployment is rolling past has several pods, maybe only some of them need to stay alive (maybe some drain sooner than others.)

Perhaps the whole endeavor should just be to make Pod drainage a bit more explicit than terminationGracePeriodSeconds... for example, letting a pod positively confirm that it's shutting down (letting connections drain) so the rest of the k8s controllers leave it alone until it terminates itself.

Although really, I think a combination of setting terminationGracePeriodSeconds to something effectively unlimited and having a health check that ensures the pod doesn't get wedged and miss the termination signal (by checking that a pod status of "shutting down" corresponds to some property of the container, like a health endpoint saying the shutdown is in progress) should be enough, with nothing else needed. Basically, color me skeptical when they say:

"We used service-loadbalancer to stick sessions to backends and we turned up the terminationGracePeriodSeconds to several hours. This appeared to work at first, but it turned out that we lost a lot of connections before the client closed the connection. We decided that we were probably relying on behavior that wasn’t guaranteed anyways, so we scrapped this plan."

(This also depends on the container obeying the usual SIGTERM contract, draining existing connections while accepting no new ones, which most web servers handle nowadays.)
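
Roughly, that combination amounts to something like the spec below; the image, port, 24h grace period, and /healthz endpoint are made-up placeholders for illustration, not anything from the article:

    # Sketch: rely on a very long grace period plus a health endpoint that can
    # report "shutdown in progress". The kubelet sends SIGTERM as soon as the
    # pod is deleted, then waits up to terminationGracePeriodSeconds for the
    # process to drain and exit on its own before sending SIGKILL.
    apiVersion: v1
    kind: Pod
    metadata:
      name: chat-backend-graceful-example   # hypothetical name
    spec:
      terminationGracePeriodSeconds: 86400  # 24h stand-in for "effectively unlimited"
      containers:
      - name: chat-backend
        image: example/chat-backend:latest  # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz      # hypothetical endpoint; per the comment above it
            port: 8080          # could also report "shutdown in progress"
          periodSeconds: 30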


Yeah, I don't know why the terminationGracePeriodSeconds hacks didn't work. It could have been a different, unrelated factor that we didn't discover. It certainly could have been service-loadbalancer/haproxy's fault instead of the termination grace period itself. I'm happy to be proven wrong there.


Not 100% sure about your scenario, but if you set a preStop hook to an exec probe you can arbitrarily delay shutdown inside the gracePeriod, because the kubelet won’t terminate the container until preStop returns.

So if you set a 5 hour grace period, and a preStop hook that invokes a script that doesn’t return until all connections are closed (but which tells the container process not to accept new ones) you can control the drain rate.
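
A minimal sketch of that (the /drain.sh script and image are hypothetical, not from the article):

    # Hypothetical pod: block termination in the preStop hook until drained.
    apiVersion: v1
    kind: Pod
    metadata:
      name: chat-backend-drain-example
    spec:
      terminationGracePeriodSeconds: 18000   # 5 hour ceiling on the drain
      containers:
      - name: chat-backend
        image: example/chat-backend:latest   # placeholder image
        lifecycle:
          preStop:
            exec:
              # /drain.sh is hypothetical: it should tell the app to stop taking
              # new connections, then poll until the existing ones are closed.
              # The kubelet sends SIGTERM only after this returns (or the grace
              # period runs out), so the script controls the drain rate.
              command: ["/bin/sh", "-c", "/drain.sh"]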

There are some app-level smarts required: rejecting new connections and having any proxies rebalance you. HAProxy does this in most cases, but the service proxy won't (in iptables mode).

If that’s not the behavior you’re seeing, please open a bug on Kube and assign me (this is something I maintain)


Yeah I think that there is still some potential in the terminationGracePeriod strategy, but we found this other way that worked reliably and stopped exploring that path. If I can repro the issue I'll let you know.

One extra thing I remember that was sort of problematic was that when a pod was Terminating it'd get removed from the Endpoints, so any tooling that was using the API info to keep an eye on connections was basically unusable at that point.


I'm not sure what problem the author is solving. I might be misunderstanding something.

The author points out that the issue with Blue/Green/AnyColors deploys is that they need 16 pods per color at all times (which in their case would end up being 128 pods) and 24/48 hours for each connection pool to drain.

But how is using a SHA instead of a COLOR any different? Unless I'm missing something, if running 128 pods and 24/48 hours of draining is the issue, then using a SHA instead of colors doesn't solve either of those issues.

You'll still need 16 pods and 24/48 hours per SHA deploy, and you're actually making it worse by not using fixed colors, since you have quite a lot more SHAs at your disposal.


It appears the issue was running pods for deployment colors that aren't in use when they only deploy once a week, and this solves that because old deployments get cleaned up regularly. It does nothing for the overhead of needing lots of pods to support a high number of deploys per week.

You could do the same with $Color; it just seems clearer, since people often think of $Color as static deployment infrastructure, whereas people are used to SHAs pointing to branches that are naturally cleaned up.


The way this Rainbow Deploy works means that if I only deploy once a month, I only have a single "color" running for that whole month, plus a couple of days of overlap where there are two. If I have blue/green, I have two colors running all month. If I have more fixed colors, even more. The sha thing is just a convenient way of creating "colors" dynamically when we need to do a new deploy, without having to use a meaningless representation like "blue", "green", "taupe", "chartreuse", etc.
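
Roughly, each new "color" is just another Deployment stamped with its sha; the names, labels, and sha below are invented for illustration:

    # Hypothetical: one Deployment per release, named and labelled with the short git sha.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: chat-backend-3f9c2ab          # the new "color" is just the commit sha
    spec:
      replicas: 16
      selector:
        matchLabels:
          app: chat-backend
          sha: "3f9c2ab"
      template:
        metadata:
          labels:
            app: chat-backend
            sha: "3f9c2ab"
        spec:
          containers:
          - name: chat-backend
            image: example/chat-backend:3f9c2ab   # image tagged with the same sha
            ports:
            - containerPort: 8080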


I think their problem is that they need to keep old deploys alive for a long time because they have stateful long-running connections. So a blue/green deploy wouldn't work, because their blue deploy has to stay around for at least a month after green is deployed. There's no difference between the SHA and COLOR; I think they chose a git hash because it's the logical choice (instead of randomly choosing a color from a list).


This was really interesting. I'm thinking about moving to Kubernetes and have wondered how to gracefully deal with websocket connections.

I'm curious, though: if the rollout happens over a couple of hours, for example, why would reconnections be that big of a problem? We host 10,000+ websocket connections on a $20 VPS, and the Go server hosting them crashes from time to time. A surge of 10,000 reconnections immediately afterwards has never lasted more than a minute or so, so why is it so bad? Are moments of peak load really that big of a deal?


In our case it's not the websockets that are the problem, it's the XMPP connection that each websocket connection creates. Logging in thousands of users takes several minutes, and while a user reconnects, any conversations they are having with their website visitors are disrupted.


(work with OP on the same team) Basically there are a lot of other things that happen when a websocket connection is established and we don't necessarily have the capacity to handle that volume in a complete reconnect scenario, especially if the system is already near the daily load peak. We have hopes that autoscaling some things in the future will make it possible to handle peaks like this more gracefully.


This is a great use case for kube-metacontroller, which was introduced in the Day 2 keynote at KubeCon. With minimal work, you can replicate a deployment or stateful set, but with custom update strategies.

Live demo: https://youtu.be/1kjgwXP_N7A?t=10m46s

Code: https://github.com/kstmp/metacontroller


Nice pattern! You could throw in an HPA to automatically scale deployments that aren't in use down to zero.


> So far we haven’t found a good way of detecting a lightly used deployment, so we’ve been cleaning them up manually every once in a while.

Am I missing something, or wouldn’t it be as “simple” as connecting to the running container and running netstat and conditionally killing the pod based on the number of connections? I bet you thought of that, so I’m curious why it didn’t work for you.


One thing I didn't put in here that's also turned out to be useful: We can prerelease things relatively easily this way too. Each deployment has a git sha, and we can have a canary/beta/dogfood version that points at an entirely different sha.
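
For illustration, that could be as simple as a second Service whose selector is pinned to the prerelease sha; the names and shas below are made up:

    # Hypothetical second Service for a canary/beta/dogfood audience, pointed
    # at a different (prerelease) sha than the main Service.
    apiVersion: v1
    kind: Service
    metadata:
      name: chat-backend-beta
    spec:
      selector:
        app: chat-backend
        sha: "9d41e07"   # prerelease sha; the main Service keeps the stable sha
      ports:
      - port: 80
        targetPort: 8080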


> We still have one unsolved issue with this deployment strategy: how to clean up the old deployments when they’re no longer serving (much) traffic.

Could probably solve this with a readiness probe / health check of sorts that is smart enough to know what low usage means.


Yeah I think if restartPolicy were changeable at runtime, we could simply have the pods exit once their connections are drained enough. If we were to exit using the current strategy, they'd just be restarted by kube.


Or a horizontal pod autoscaler that can scale down to 0 replicas, perhaps?


Curious about the 24h-48h burndown... could it potentially be longer for you guys, or is there some mechanism in place to force disconnection (and thus risk a spike) after some TTL?


We force reconnects eventually, there just aren't that many people affected at that point. There's a very long tail of people keeping their browsers open for days, but it's only a handful of people.


So... just a blue/green deployment with a 24h delay before deleting the old cluster?


I don't think so, actually; it seems like they have a series of old SHAs hanging around, not just one new and one old. I did have the same reaction you did initially, though, and had to do some reading between the lines to conclude that this is not what they're doing, so you could be right!


Using the 6 hex digits of the git commit hash for color is genius. I really like this pattern!


Dang, now I want to figure out how to print my git-log commit hashes in colors based on the hashes themselves...


TL;DR

You can drain stuff by changing a Service's selector but leaving the Deployment alone. Instead of changing a Deployment and doing a rolling update, create a new Deployment and repoint the Service. Existing connections will remain until you delete the underlying Deployment.
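
A hedged sketch in manifest form (the label names and sha are illustrative, not taken from the article):

    # Hypothetical Service: cutting over to a new deploy is just editing the sha
    # in the selector. Established connections to the old pods keep working until
    # that Deployment is deleted; only new connections go to the new sha.
    apiVersion: v1
    kind: Service
    metadata:
      name: chat-backend
    spec:
      selector:
        app: chat-backend
        sha: "3f9c2ab"   # change this to the new deploy's sha to cut over
      ports:
      - port: 80
        targetPort: 8080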



