> we have 27 SSH backends, and 18 HAproxy No autoscaling? This load pattern is a...

btilly · on Aug 27, 2019

I disagree.

Autoscaling sounds great, but in practice it notices that there is a problem after you already have once, and by the time you've scaled up, there is no problem any more. Which is the worst of all worlds.

For example in this case you'd realize there is a problem at the top of the hour. And 2 minutes later you'd have a bunch of autoscaled instances up, all wondering what the fuss was about. So they scale down again, but then at the top of the next hour the same thing happens again with the same result.

The same thing happens with ad servers. An ad buy goes in for 30 seconds. When it hits, you have a firehose. That then shuts off.

In fact this problem is so common that I recommend against autoscaling unless you have specific reason to believe that it will work for you.

hinkley · on Aug 27, 2019

And god help you if that starts your autoscaler to flapping.

Too much CPU load, start some servers. Too little, kill some servers. Oops now CPU is too high, bring them back. No, it's too low...

It reminds me of the old UI bug where you get an off-by-one error for scrollbar logic and so it appears and disappears in a cycle until you resize.

Edit: I think we want something akin to a learning thermostat, but for servers. Figure out that I get a CPU spike at 5:20 on week days and spin some servers up at 5:10. Then spin em down at 8:00 when EST starts to go to bed.

auslander · on Aug 27, 2019

> .. Oops now CPU is too high, bring them back. No, it's too low...

You just described closed-loop control system oscillation. Cause is wrong gain and/or delay parameters. AWS ASG has some knobs for tuning like Cooldown, MaxSize. What you described above is most probably long boot time (delay) problem. Service should be ready in less than 30 secs after boot. To get there, one should bake AMIs, not install afresh all the stuff on boot.

hinkley · on Aug 28, 2019

You can get an oscillation without long boot or ramp-up time but they sure don't improve the situation.

Lazy loading of modules is a common cause, but any other resources that are loaded on demand or in the background. But that's more of a situation of thinking you need two additional nodes, getting three, settling back to two after warmup, and then killing them all off again when the spike ends.

I may have said elsewhere that I'm more comfortable with scaling load based on daily and weekly cycle patterns, with perhaps a few exceptions for special events (sales, new initiatives, article publications) and making it easy for someone to dial up or down.

To use a car analogy, get really good at power steering and power brakes before you attempt cruise control, get really good at cruise control before you attempt adaptive cruise control. And don't even talk about self-driving unless you're FAANG.

sethammons · on Aug 28, 2019

The load spikes here are predictable. They could spin up right before the extra resources are needed. However, it only masks the problem. Queue the argument for horizontal vs vertical scaling.

auslander · on Aug 27, 2019

> For example .. In fact this problem is so common ..

So whats the problem here? Assuming there are no misconfigurations anymore, and just spiking loads.

btilly · on Aug 27, 2019

The problem is that people try to use autoscaling to handle a spiking load, and it fails abysmally.

In my opinion for those with a spiking load profile, autoscaling should be considered harmful.

certifiable · on Aug 28, 2019

As others have said here, autoscaling has gotchas. I won't recount those reasons, but there are two other relevant points:

1) Check out the graph in the 'An illuminating graph' section. The connection rate spikes by a factor of 3 in the space of 5 seconds (and that's actually a consolidated average over a number of hours; the worst spikes at the top of the hour are even bigger). We'd need super-responsive, practically magic autoscaling, or pro-active auto-scaling that does so at known intervals (every hour, 10 minutes, 15 minutes, etc). But given that the actual CPU usage doesn't really vary that much over those timescales (it all sort of evens out once the connections are made and git takes over), auto-scaling just to add more connection slots in SSH would be a poor use of resources, when we can just increase MaxStartups as far as necessary.

2) We do want to autoscale, and will in the medium-term future, because the access patterns at a weekly scale are quite variable and predictable, and we can save a lot of resources by scaling down at the weekends (or even the EMEA evenings) when the bulk of our users go home and stop creating MRs, and it's only bots, CI, and cron jobs (I jest, but it feels like that some days). But not because of the issue described in the blog post

toast0 · on Aug 28, 2019

Autoscaling would only work well if it could react quick enough. Since there are peaks in each second, that means a very rapid startup is needed. You would also want to stop rapidly, but some of the ssh sessions are going to be lengthy, so you can't kill instances early.

We don't even have any indication that the load on the system is particularly affected, there's just an arbitrary connection cap. Average load in the cpu graph [1] looks pretty flat, and around 50%; so it's probably already over provisioned depending on their disaster recovery plans and their normal load patterns (didn't see a daily/weekly graph to armchair that)

[1] https://gitlab.com/gitlab-com/gl-infra/infrastructure/upload...