WRT to "the cloud" are you conflating infrastructure, systems tooling, and opera...

perlgeek · on Oct 15, 2016

Reformulated: If my services run on-premise, a developer isn't the best person to investigate and deal with a hardware failure. That's a role for a traditional ops person, even in the future.

donavanm · on Oct 15, 2016

I think we're driving towards the same point with different language. Your traditional ops person is my hourly employee working manual tasks from a queue.

Your hardware failure case is actually remarkably similar for both on-premise and "cloud". In either case the development team can either invest in negating single component failures, or pay the cost in an ad hoc fashion when one occurs. If that single server dying in the night wakes anyone up, your business has chosen the latter for you.

The difference that I've seen is scale. With 100 servers it makes sense to pay for the ad hoc failures a few times per year. With 1000 servers it's a few times per month. And at 10,000 it's time to acknowledge the continual cost, get over The Really Big Server design, and hire an hourly tech to work that constant queue of broken hardware. Feel free to substitute "instance" or "droplet" or "router" in the preceding paragraph.
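
To put rough numbers on the scale argument, here is a back-of-the-envelope sketch. The 3% annual failure rate per server is an assumption picked only so that 100 servers yields "a few failures per year"; it is not a measured figure, and the function name is made up for illustration. It also assumes failures are independent and spread evenly over the year.

```python
# Hypothetical per-server annual failure rate, chosen so that a
# 100-server fleet sees roughly "a few" failures per year.
ASSUMED_ANNUAL_FAILURE_RATE = 0.03


def expected_failures(servers, annual_rate=ASSUMED_ANNUAL_FAILURE_RATE):
    """Return the expected number of failures per year, month, and day."""
    per_year = servers * annual_rate
    return {
        "per_year": per_year,
        "per_month": per_year / 12,
        "per_day": per_year / 365,
    }


for fleet in (100, 1_000, 10_000):
    rates = expected_failures(fleet)
    print(
        f"{fleet:>6} servers: "
        f"{rates['per_year']:.0f}/year, "
        f"{rates['per_month']:.1f}/month, "
        f"{rates['per_day']:.2f}/day"
    )
```

Under that assumed rate the expected load grows linearly: a handful per year at 100 servers, a couple per month at 1000, and close to one per day at 10,000, which is the point where a standing queue of broken hardware starts to look like a full-time job.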

perlgeek · on Oct 15, 2016

Even if a single server failure doesn't wake anybody in the middle of the night, it has to be dealt with eventually. Otherwise dead hardware piles up in your racks. So, work for operators.

But that's just the simplest case. Network congestion needs to be debugged, operating systems updated, security breaches investigated, and so on. To think that these types of activity can be automated in the near future is unrealistic. And burdening application developers with such tasks also seems like a weird choice.

And since you made a point about scale: The larger the scale, the harder it becomes to investigate such issues.

donavanm · on Oct 15, 2016

:/ Pretty clear that your "operator" is my hourly tech in both our examples.

> To think that these types of activity can be automated in the near future is unrealistic.

Uh, a bunch of those are automated in multiple companies. Or at least to the degree where only exceptional cases are seen by human eyes. And then those exceptional cases become more use cases to address next quarter.

> And burdening application developers with such tasks also seems like a weird choice.

I think this is where we're talking past each other. I'm saying that some companies can specialize developers into internal/infrastructure tooling. That then drives down the cost & impact of exactly the use cases we're talking about. Which (in theory) drives up total productivity for those applications or services that generate revenue.

> And since you made a point about scale: The larger the scale, the harder it becomes to investigate such issues.

Yeah, that depends. More data can wash out the signal. Or the right tool can use more samples to isolate the root cause. We're hiring https://aws.amazon.com/careers/