The Ops Identity Crisis

_qc3o · on Oct 15, 2016

I look forward to the day when every software developer has half a clue about monitoring, logging, high availability, configuration management, orchestration/scheduling, performance tuning, build/deployment pipelines, data management/archiving, security and exploit mitigation, etc.

The full-stack developer like the serverless/ops-less future is a pipe dream. Most technology organizations barely even know how to build the software to begin with let alone figure out the right way to operate it.

snovv_crash · on Oct 15, 2016

Not everybody develops applications that are hosted. Who do you think writes your hardware drivers, IDEs, display managers, browsers, command line tools?

The list you presented is about as one-sided as saying everybody needs to know how to write shaders or bluetooth drivers - it is niche to the work you do, and just because somebody isn't well versed in it doesn't mean that they don't have a large body of industry-specific knowledge that doesn't even feature on your radar.

_qc3o · on Oct 15, 2016

You're making my point even better. So how exactly is someone that is making a game going to get by with a serverless/op-less/shader-less/driver-less future? Is there a game developer crisis or a kernel driver developer identity crisis we are not aware of? When put that way the claim sounds even more nonsensical.

So before we/anyone claim there is an identity crisis maybe the terms should be defined better since a lot of the terms the author uses are marketing gimmicks. There is a trend toward employing tools and patterns for managing distributed systems that reduces operational burden and consequently requires fewer people because the tools are handling more of what used to be handled by humans. That's great but I wouldn't call that a crisis, identity or otherwise, in fact I'd call that progress.

randoTroll · on Oct 15, 2016

It's often good to look at the background of the authors of these posts.

_qc3o · on Oct 15, 2016

I know who she is that's why I expected better.

dsjoerg · on Oct 15, 2016

You missed the sarcasm of the comment you are responding to. You two are in violent agreement.

toomuchtodo · on Oct 15, 2016

The startup scene is rife with overconfidence that you have enough waking hours to know everything you listed in one role.

Regret I have but one upvote for you.

hibikir · on Oct 15, 2016

The issue is that some people can spend just enough hours to make it LOOK like they really understand enough, when in practice, they have wide breadth, but not anywhere enough depth to do the job well. When the startup becomes mid size, suddenly you have a spaghetti codebase with unfortunate tradeoffs, a wide infrastructure full of hard coding that badly attempts to mimic the state of the art of 5 years ago, and one nine of reliability.

All of this is of course undocumented, so when you bring in new people, they take forever to learn the existing system, as nothing resembles good practices. As those new people try to clean everything up, they look far less productive than those people that have made this gigantic mess in the first place, and then everyone wonders how they got so lucky with their early, uberproductive employees, which seem to be so much better than anyone else.

If the startup has a good economic situation, they might survive while compounding the problem by overhiring (I am sure you all have heard the stories). If the finances aren't quite as good, then the startup folds.

I see roads out of this, but they all pass by different early employee compensation: A good early employee is not a good employee in a midsized company, but stock options don't work all that well when early engineers should be leaving the company 3 years in, before exercising the options makes any sense.

toomuchtodo · on Oct 15, 2016

> A good early employee is not a good employee in a midsized company, but stock options don't work all that well when early engineers should be leaving the company 3 years in, before exercising the options makes any sense.

Some early startup roles should vest out in 2 years instead of 4, or (what I'd do if I ever worked for a startup again) have a single ratchet provision where, if you decide you don't need my role anymore, the role has changed dramatically, etc, the remainder of my options vest immediately.

Life lessons are learned the hard way.

lathiat · on Oct 15, 2016

As someone in the hosting industry, you should try pair the average Wordpress "developer" with the customer who wants 1,000 people to hit the site and place an order at the same time. Or just run a site at all.

I'm not sure how these guys survive on your average web host. Our team can (and) do a lot of low level and language level debugging to figure out issues. Most cPanel resellers would send you packing. I can't imagine the frustration the end customers often experience bouncing these issues around for weeks.

In other news, I'm clearly highly over qualified!

photonwins · on Oct 15, 2016

And blatant disregard for performance analysis & capacity planning. Application is slowing down at 5000 QPS? Let's upgrade to 64 core 128GB Server and while we are at it, let's throw in a bunch of SSDs too. </s>

JimmyAustin · on Oct 15, 2016

It's worth it to analyse how much that server costs vs how it costs to rearchitect the application. Those 64 cores are probably cheaper then 2 months of a senior engineer's salary.

crdoconnor · on Oct 15, 2016

That's assuming it's actually even possible to use all that bare metal. I worked for a company that tried to optimize its code by buying multiple servers with 64 core processors - one for each customer - when the critical process couldn't use more than one core.

They were mostly kept afloat by using their patent portfolio as a weapon.

falcolas · on Oct 15, 2016

The first 64 cores, perhaps. But the 30th set? The 300th set? I recently worked for a company that held very strictly that it was better to buy more cores than change the code. As a result, over half their total costs was dedicated to their AWS bill.

phamilton · on Oct 15, 2016

I think the bigger point is that by adding SSDs and increasing CPU/memory all at once, we don't figure out what is actually the bottleneck.

aaronbrethorst · on Oct 15, 2016

In the short term, it is far cheaper to fix everything that might be causing the business to collapse than it is to perform root cause analysis.

davidgerard · on Oct 15, 2016

IME this puts the problem off six months (which might be enough!) but the technical debt interest bill comes calling.

If your algorithm is fundamentally shitty, you can scale it up by brute force for a certain time, then it outstrips your ability to do so, and you may need to apply actual competence to the problem. If you have any on hand that knows your systems.

(I'm a sysadmin. I have full confidence in my job existing for many decades to come. Because even in the future, nothing works.)

snovv_crash · on Oct 15, 2016

The problem is that it means you need to fix the same problem again soon, and finding something faster than SSDs which you can store your 1TB of data on isn't really feasible. On the other hand, if you'd just added an index to the column you are querying on...

gaius · on Oct 15, 2016

Good for the Ops guys* employed at cloud providers :)

* And gals if you want to be pedantic, but guys is a gender-neutral term these days

vorg · on Oct 15, 2016

> guys is a gender-neutral term these days

"guys" sounds like "gaius"!

formula1 · on Oct 15, 2016

The world isnt guided by people who learn every facet and mechanism of every subject. They are lead by those that see an opportunity and have the humility, audacity and will power to learn just enough or make just enough to make a successful product. They dont care about standards, quality nor best practices.

The day when every developer has a hood understanding of distributed systems and devops is when there is a simple way to use it, we start learning programming concepts from s young age or when every developer graduates from MIT and only big companies are allowed to hire people.

I personally think its absurd to be on a high horse in the fastest pace ecosystem out there. We constantly need to look for te newest technology and are competing with highly specialized minds (and eventually AI) that can generally do our work 10x better than we can.

I mean these last couples months have been a PR disaster for docker which has been seen as a "standard" for years now. Though technology specific aspects is different than abstract concepts, very few flowcharts actual ACT as production code

sroussey · on Oct 15, 2016

There are plenty of us. But there are more of them. Best of luck differentiating between the two. And I am being sincere.

ownagefool · on Oct 15, 2016

As someones whos worked within both disciplines, I don't actually think it's terribly difficult to learn all those things. The problem is, we're all too busy learning the new shiney without ever really delving deeper down the stack.

I find it a bit sad that such smart people jump on the microservices bandwagon without learning how to prove their application is working or deal with the fact that the network will fail.

jsudhams · on Oct 15, 2016

In small setup yes the developer knowing this will help,but in large companies I would assume this is what solution architect and CTO's do. If we do 80/20 heck even 20/80 with Infra experts and app architect and decide on day one the characteristics and limitations of the software to be developed then you don't need every dev to know the infra and every infra guy to know code. In my experience seeign other super support professional it is that knowledge of how app and tech works make it easy to support. But yeah i training people with this kind of knowledge. I would say 1 in 5 is capable but only 1 in 10 or 20 only make it due to time it requires to know all and able to do ad-hoc work. May people hate ad-hoc challenges everyday . But there some some who take it as day to day.

emmelaich · on Oct 16, 2016

The only way to make that happen is to make them responsible for service.

It sharpens the mind when they know that.

Unfortunately this takes a culture change that management need to drive.

At the moment the typical developer contract might be only 6 months or a year so they are not invested in the proper running of their code.

FLUX-YOU · on Oct 15, 2016

>I look forward to the day when every software developer has half a clue about monitoring, logging, high availability, configuration management, orchestration/scheduling, performance tuning, build/deployment pipelines, data management/archiving, security and exploit mitigation, etc.

Just keep piling on job requirements until all developers need 20 years of college to get a junior position. Everyone seems to have written an "X things every developer must know" article.

bbcbasic · on Oct 15, 2016

When do they learn this stuff? More weekend reading and side projects? Aka free overtime.

Rapzid · on Oct 15, 2016

There is pretty much no Uni track for this stuff so, yeah, on the job and after work. I have no degree so everything after high school(which I was very fortunate to have had C++, CCNA, etc classes at)..

FREE though? Man, my current salary doesn't make me feel like I gave my time learning this stuff away for free..

Jtsummers · on Oct 15, 2016

Learning isn't about your current salary. It's to qualify you for your next salary.

jonaf · on Oct 15, 2016

I find this to be just a little bit oversimplified. Indeed, I find that learning on a daily basis keeps me employed, whether I am physically in the office or not, or what time of day it is. So, learning may be about your current salary -- keeping it, that is.

sigil · on Oct 15, 2016

Not sure exactly. Tinkering, curiosity, free reign to learn new things at a few lucky jobs, guilt at throwing problems "over the wall" to ops people, and the fact that we're all operators sometimes (we all probably run personal servers and local networks) -- these were all a factor for me. I mostly build, but running things is also fun, and building things with a knowledge of what's downstream is the most fun, imo.

sigil · on Oct 15, 2016

It feels to me like title-centric corporate culture might be a part of the problem here. Are you an SRE? An SWE? Do you Build? Or do you Run? Are you a Software Developer or a Software Engineer? They're interchangeable to the author, but she includes both on the off chance that you the reader are sensitive to some fine splitting / quantum structure...

I for one welcome the blurring boundaries. It feels weird to say, sure, I wrote this code with pathological runtime behavior, but it's your problem now, Ops Person. How do you learn to avoid that mistake if someone else absorbs the pain for you (ie learns your lesson)? Personally I feel a bit cheated when this happens.

At some point in your career, you might get the nagging suspicion there are important lessons to be learned outside of your current role -- lessons that will make you N times better at your current role. If so, heed the call, take the extra responsibility, and level yourself up. Easier to do at small- to medium-sized companies, but not impossible at large companies either. Take the initiative!

pmoriarty · on Oct 15, 2016

"Nothing is impossible for the man who doesn't have to do it himself." -- A H Weiler

spc476 · on Oct 15, 2016

At work, I'm a developer (of backend call processing stuff). I work closely with QA (one guy right now) to get the call processing back end code tested (answering questions about the product, the SS7 stack (which nobody, not even me, wants to muck with---it's nasty) and the regression tests which I wrote back when I was in QA). When it comes time to deploy an update to call processin into production, I'm there, at 2:00 am along with the ops team and QA in the deployment (to answer questions and check to see if everything is running right). If anything is wrong, anyone can initiate a rollback to the previous version (I've initiated a rollback once).

In our situation, our customers are the various Monopolistic Phone Companies, so there's quite a bit of work that goes into a deployment (some fairly nasty SLAs and what not), so I'm glad there's an ops team to deal with most of that red tape. Yes, it's annoying that I can deploy as often as I would like, but I understand the reasoning behind it (and there's more in production than just call processing, like billing, provisioning and updates to our smart phone application).

TheSpiceIsLife · on Oct 15, 2016

Reading your comment just now cause an amusing thought.

Software Developer, it makes it sound like the software is already there, you just need to pour the right sequence of chemicals on it in the correct order, and rinse. In a dimly lit room.

dmourati · on Oct 15, 2016

I think the article gets this mostly right. I describe Dev and Ops as a continuum and DevOps as the concept that each side needs to get better at doing the others job.

Noops and/or serverless are both newer terms that are early on the hype cycle. I wouldn't get too bent out of shape about them.

My advice in this regard has been the following.

Devs: start thinking more about running your code in production and solving user needs. Works on my laptop is over. There are a whole slew of things that you need to understand to build at scale. Talk to your Ops people. Especially those who have the good sense to show you mutual respect.

Ops: Move up the stack. The days of providing just OS support and load balancing are gone. You have to learn to code. Spend time in IDEs, work in source control, do code testing. Learn from your Dev counterparts. Especially those who have the good sense to learn from you.

perlgeek · on Oct 15, 2016

This article seems to assume that everything runs in the cloud. If not, Ops still have a big part in operating the self-service infrastructure that Susan talks about.

Another point I'd like to make is that if developers need to take on more operational responsibility, it creates another barrier on entry. As somebody with a family, I don't want to be regularly on-call. And on the technical side, developers already need to know about the problem domain, the programming language, algorithm and data structures, design patterns, testing, version control and so on. Piling more operational knowledge onto this heap seems hardly promising to on-board more developers, and finding good developers outside the big tech hubs is already a pain.

sqldba · on Oct 15, 2016

I also feel like the "on call" thing is oversold a little.

I work in operations and I'm on call. Some months I hear nothing, some weeks everything goes wrong, but on average I get about one call a month. That's not so bad.

The thing is I spend 40 hours a week working out how to do proactive maintenance, monitoring, and alerting on those 400 systems I look after, so that I don't get called outside of hours. I mean that's what my operations job is meant to be, right?

Developers can't do that. They're focused on one system or a few systems at most and they're developing 40 hours a week. When the SAN dies and every server in the DC crashes there's no point calling them and asking why they're not on top of it and to try to develop some solution to improve the speed of recovery next time.

Me? I built a Jenkins server so next time it happens I press a button and get an immediate dashboard of where to focus my attention.

Developers aren't operations. Things may change a little when things go into the deep cloud though and you have services instead of servers running services, seeing as operations may not know what to do with those. On the other hand maybe there is meant to be a handover of some kind so that one person can look after everything instead of 10 developers each having a small stake in their little piece of world.

gtirloni · on Oct 15, 2016

This article is myopic in that it fails to mention the premises only (mostly) work for web-only startup companies relying heavily on cloud computing for everything. And even then, only the ones that don't see operational/infrastructure efficiency as a differentiator. If you take those constraints into account, it makes a lot of sense: just ask your developers to learn the very basics of those 3rd-party services you signed up to and things will go fine.

A developer will just be making sure he/she uses the logging provider API, that their 3rd-party monitoring and alerting system is updated, the automatic third-party load balancer has a little checkbox saying backend monitoring is configured, etc. Behind those magic services there will be your traditional Ops people (now better equipped with automation tools) making sure everything is running smoothly, properly sized, monitored and troubleshooted when it fails (it WILL fail, in unexpected ways, more often than you'd like).

People parroting "Ops-less" and "Server-less" buzzwords seem to be so detached from the daily activities required to maintaining real-world computing systems that they forget it exists. It's really frustrating when the SV zeitgeist living in a distorted reality (or a very narrow one) proposes this is the way things are or will certainly be in the near future. You just need to skim over all the GitHub issues in these magical tools (or countless articles with workarounds for all sorts of issues) to learn things aren't that simple as they pretend to be (which is fine, things take a lot of time to mature).

The work is not changing (except for increased automation everywhere), it's just shifting to third-parties or far away teams in your own company.

If only the computing world was that dreamy. I'll think about that next time I spent a day analyzing packets with tcpdump, coming up with a patch for a basic flaw in some tool, profiling some code that is destroying the storage system, etc. Yeah, that won't be distracting at all for developers. Maybe they don't need to work on their team's core competencies. Everything will "just" work. </sarcasm, sorry>

perlgeek · on Oct 15, 2016

It's not just about the number of calls you receive.

Typically if you're on call rotation, you're supposed to be able to get near a keyboard in $X minutes (where in our case, $X is about 15, I think). I simply don't want to have to ask myself the question "can I do $activity with my family this weekend?", with the answer potentially being "no, because on-call duty".

I totally agree with your point about a proactive maintenance. Developers can try to make their software easier to operator, and generally do if provided with the right incentives, but that doesn't remove the need for capacity planning, for example.

sqldba · on Oct 15, 2016

I'm on a roster 1 week on 2 weeks off and so it becomes very bearable given proper resources (i.e. a laptop and 3G/4G dongle).

You throw it in your car boot and forget about it. You might get a call but unless you're out cliff climbing every weekend or something it's unlikely to be during anything important (about the only thing I avoid scheduling is dinner/movies with friends because skipping out on that would suck).

sciurus · on Oct 16, 2016

Susan works for Uber, which runs its own datacenters. See, for example, http://www.datacenterknowledge.com/archives/2016/09/19/ex-eb...

donavanm · on Oct 15, 2016

WRT to "the cloud" are you conflating infrastructure, systems tooling, and operations? Theyre not the same thing. For example I cant imagine why a monitoring system would necessitate more "operations" costs than an email or business logic system. It's possible to invest in either, driving down total "ops." Or to skip investments and pay those recurring costs in adhoc people time. Very very few subjects actually necessitate a humans continual interaction.

perlgeek · on Oct 15, 2016

Reformulated: If my services run on-premise, a developer isn't the best person to investigate and deal with a hardware failure. That's a role for a traditional ops person, even in the future.

donavanm · on Oct 15, 2016

I think we're driving towards the same point with different language. You traditional ops person is my hourly employee working manual tasks from a queue.

Your hardware failure case is actually remarkably similar for both on-premise and "cloud". In either case the development team can either invest in negating single component failures, or pay it in an adhoc fashion when it occurs. If that single server dying in the night wakes anyone up your business has chosen the latter for you.

The difference that Ive seen is scale. With 100 servers it makes sense to pay the adhoc failures a few times per year. With 1000 servers its a few times per month. And at 10,000 its time to acknowledge the continual cost, get over The Really Big Server design, and hire an hourly tech to take touch that constant queue of broken hardware. Feel free to substitute "instance" or "droplet" or "router" in the proceeding paragraph.

perlgeek · on Oct 15, 2016

Even if a single server failure doesn't wake anybody in the middle of the night, it has to be dealt with eventually. Otherwise dead hardware piles up in your racks. So, work for operators.

But that's just the simplest case. Network congestion needs to be debugged, operating systems updated, security breaches investigated, and so on. To think that these types of activity can be automated in the near future is unrealistic. And burdening application developers with such tasks also seems like a weird choice.

And since you made a point about scale: The larger the scale, the harder it becomes to investigate such issues.

donavanm · on Oct 15, 2016

:/ Pretty clear that your "operator" is my hourly tech in both our examples.

> To think that these types of activity can be automated in the near future is unrealistic.

Uh, a bunch of those are automated in multiple companies. Or at least to the degree where only exceptional cases are seen by human eyes. And then those exceptional cases become more use cases to address next quarter.

> And burdening application developers with such tasks also seems like a weird choice.

I think this is where we're talking past each other. I'm saying that some companies can specialize Developers in to internal/infrastructure tooling. That then drives down the cost & impact of exactly the use cases we're talking about. Which (in theory) drives up total productivity for those applications or services that generate revenue.

> And since you made a point about scale: The larger the scale, the harder it becomes to investigate such issues.

Yeah, that depends. More data can wash out the signal. Or the right tool can use more samples to isolate the root cause. We're hiring https://aws.amazon.com/careers/

rodgerd · on Oct 15, 2016

> As somebody with a family, I don't want to be regularly on-call.

Neither do the ops team who pick up the sacks of shit developers keep chucking over the wall. What makes your family more important than theirs?

donavanm · on Oct 15, 2016

I've been a bit perplexed by the industries obsession with "operations" for the past decade. Constantly striving to decide what is it, who does it, and whether it's shunned or exalted. My current workplace has convinced me that "ops" is simply a friction organizations pay, like "tech debt." How you prioritize and minimize it is a business decision, not a life calling.

I suspect my employer is actually the largest (many many thousands of employees) and oldest (15-20 years) practitioner of the "you deploy and own what you write" method. There are nearly zero people in "operations" roles, compared to tens of thousands of Developers. The "Systems" folks who might be called SE or SRE or PE or DevOps somewhere else sieve in to three roles: 1) Specializing in development below the application & (slightly) above hardware/os 2) Developing tooling and systems around infrastructure & distributed systems management 3) Saying #2, but mostly driving manual or adhoc actions

Groups that naively pursue #3 seem to implode or catch fire after 12-18 months. They can succeed if its an intentional choice; employees are a resource and not all problems require further investment. Investing in #2 drives down the "ops" cost on other Developers and/or improves returns on infrastructure investment. Category #1 improves medium & long term returns on the software & services that group develops.

A while back the job role was changed to accentuate that the goal is Build value by Development. The specific flavor of development is less important. My job role is Systems Development Engineer and I work with Software Development Engineers. SDE collaborating with SDE. When I need to go beyond my domain knowledge I might consult a NDE.

In short I agree with the article summary. There will never be a "post ops" world. But getting over the "ops" title obsession feels good to me.

digi_owl · on Oct 15, 2016

(Web) devs are user facing and sexy, admins stomping the server room aisles are not (or so the reasoning seems to go).

This appears to have been a prevailing notion since the dot-com days, but has acquired new energy since "cloud computing" became a management buzzword.

llama052 · on Oct 15, 2016

I've always seen operations folks in a different light then most developers. Maybe I'm an exception but I've always viewed operations people as having in-depth knowledge of infrastructure on the lower layers (Session layer and lower). They are able to walk through the inner-workings of network concepts, operating systems, etc. They can implement and design systems that work in a highly available, scalable way.

Not saying developers CANT do this, but a majority of them don't want to, and don't have the first idea on how to. Granted some "10x" developers who have full stack knowledge can do just as well, but let's be honest most people aren't 10x developers, they want to mess with the code stack and that's it.

This reminds me of the all the job posting you see online that want you to have

          expert knowledge of C++
          expert in SQL DBA 
          expert in networking 
          expert in web design
          expert in big data

Yeah you might be able to find someone who can do all of the above, but odds are they won't do them all well.

I don't think the expectation should be set that developers should have to manage the systems stack, as well as manage the code.

Personally, I can't imagine having the expectation set that I need to be on call, and work on code commits with deadlines, while debugging networking issues in production.

Even with configuration toolsets, and infrastructure moving to "code-to-deploy" solutions. Things will still break, unusual things will still happen. Taking developers out of their zone to focus on problems will slow down the entire company.

Ops can can always push to 'strive to automate themselves out of their jobs' but I'd argue that this is an endless job which always has outlets that you can continue to strive and build into.

Of course, this is talking from my personal experience, I've never worked in a big company like Uber, so the environment might be entirely different from my own.

donavanm · on Oct 15, 2016

> Personally, I can't imagine having the expectation set that I need to be on call, and work on code commits with deadlines, while debugging networking issues in production.

This seems to be the meme of "the supporting infrastructure always breaks and causes the pager engagements!" I have seen those orgs and times where networking or facility failures are the leading cause of outages. Long term those are symptoms of chronic underinvestment and tech debt accumulation.

I also see, on a daily basis, a massive business where the absolute leading cause of "outage minutes" is software defect or deficiency. With investment in supporting infrastructure you're paged 10 times for a defect your team "developed" for every one unavoidable dependency failure. Big companies can make those (huge) investments themselves. Small companies can pay someone else for access to theirs. In any case its always a business decision, not a certainty.

davidgerard · on Oct 15, 2016

I explain my job to nongeeks as "computer roadie". My job is to make sure everything is in order, devs' job is to get up there and be Eric Clapton.

People can do both (many roadies are really quite capable musicians). But they're fundamentally different mindsets, and expertise in one is not transferable to the other.

toast0 · on Oct 15, 2016

> Personally, I can't imagine having the expectation set that I need to be on call, and work on code commits with deadlines, while debugging networking issues in production.

"sorry about the deadline, I had to make the fine network work" gets you out of making deadlines.

The nice thing about being on call for your own code is that fixing your stuff at all hours is a natural consequence of writing fragile stuff. The not nice thing is that not all of the reasons your stuff breaks are your fault and some issues aren't realistically preventable.

llama052 · on Oct 15, 2016

I can understand being on call for your own code, but often that's just not the case.

Infrastructure can go off the rails as well, clusters can go on the fritz, upstream might be having issues, a release update on your infrastructure side might introduce problems. I just don't see that falling on the developers who also must maintain the code-base for the product.

pjscott · on Oct 15, 2016

That just means that being on-call for your own code results in good code, a love for automated testing, and a seething grudge against EC2 and its tendency to have multi-zone outages at 3 AM. It doesn't matter what time zone you're in, by the way; the outage will always, always page you at 3 AM.

toast0 · on Oct 15, 2016

If the issues for things that aren't your fault happen are infrequent, then it's not a huge burden, and developers may be able to help with mitigation: ex if your internet bandwidth is cut in half because of a fiber cut, how can you move or shed load to have a stable, degraded system.

If the issues are frequent, then it provides incentive to build fault tolerance. (And also seething grudges for fault prone infrastructure)

limelight · on Oct 15, 2016

> Personally, I can't imagine having the expectation set that I need to be on call, and work on code commits with deadlines, while debugging networking issues in production.

Really? As a dev, the vast majority of my projects/roles have required all 3 to some extent. I have been on call for the past 4 years.

Honestly, people really need to admit that the age of the sysadmin (who is not also a strong programmer) is over. I might not have enough deep knowledge to answer network/OS trivia off the top of my mind, [0] but I'm perfectly capable of running a production build for my code (with monitoring, configuration management, etc.) and finding appropriate resources in the instances where problems are outside my domain.

[0] https://news.ycombinator.com/item?id=12701272

llama052 · on Oct 15, 2016

> Really? As a dev, the vast majority of my projects/roles have required all 3 to some extent.

I can run photoshop and mock some images at home and it's good enough, but does that make me a photoshop guru? I think not.

edit: bad example.

limelight · on Oct 15, 2016

> To some extent maybe, but running something on a heroku engine

Talk about a straw man. I would agree with you that deploying to Heroku hardly constitutes ops. I'm talking about things like running a multi-AZ Kubernetes cluster.

llama052 · on Oct 15, 2016

Following a guide online and getting something online is different from running something in production that a company relies on. Learning a infrastructure piece thoroughly can take months alone in resilience testing, failover testing, and system design/architecture choices.

It's a huge fallacy in the infrastructure side of things that because something "works" means it's setup correctly, and that couldn't be further from the truth.

You'll get an environment built on technical debt. Outages will occur and you'll be playing sysadmin full time before long. You won't be a developer at this point anymore, you'll be the sysadmin/ops role of your company.

I've seen this happen at multiple companies, and I doubt this is the last time I will see it happen.

limelight · on Oct 15, 2016

You continue to be extremely insulting and making significant assumptions about my experience. What shibboleth do I have to provide to prove that I know what I'm doing? This inherent assumption that other people don't know anything is incredibly damaging to the industry: I don't assume a SRE is an incompetent programmer, so you shouldn't assume I am barely competent at ops.

> Following a guide online and getting something online is different from running something in production that a company relies on

Where did I say that I had simply "followed a guide online" instead of "putting something in production."

Like I said in my original comment, I have been the primary on call engineer for multiple projects with millions of users.

Despite that, I still find time to develop new features and push the product forward. Yes, I occasionally deal with outages/failures and have to write some new automation to remediate them. But that doesn't make me a full-time sysadmin.

I don't really feel like engaging with you further, since you're apparently more interested in arguing with a straw man of your own construction.

llama052 · on Oct 15, 2016

> You continue to be extremely insulting and making significant assumptions about my experience. What shibboleth do I have to provide to prove that I know what I'm doing? This inherent assumption that other people don't know anything is incredibly damaging to the industry: I don't assume a SRE is an incompetent programmer, so you shouldn't assume I am barely competent at ops.

My mistake for projecting, the comment you posted about sysadmin age being dead gave me the opinion that you think infrastructure is just something you can deploy and forget, again my mistake.

I also spoke in my original message that some 10x engineers can do both, but they are not the majority.

I guess you're a 10x full stack engineer then.

> Like I said in my original comment, I have been the primary on call engineer for multiple projects with millions of users.

That's great, so you're a developer, and operations engineer then (devops)?

Do you think that developers should be expected to do both? Genuinely curious

limelight · on Oct 15, 2016

> My mistake for projecting, the comment you posted about sysadmin age being dead gave me the opinion that you think infrastructure is just something you can deploy and forget, again my mistake.

I didn't mean to imply that. The "who is not also a strong programmer" part of my comment is important. Of course infrastructure breaks, but in the modern era we have the tooling that your next step should be figuring out how to automate things so that never happens again.

> Also looking at your linkedin on your profile led me to believe that you worked with small niche infrastructure setups that don't need to scale.

I'm not sure where you got that from. As much as people on HN hate Business Insider, it's hardly a "niche" site. And for the record it's not even the biggest site I've worked on.

> I also spoke in my original message that some 10x engineers can do both, but they are not the majority.

I agree that the majority of engineers can't do both, but I also don't think it's only small minority who can. Infrastructure is just another skill-set that some engineers can/do build.

> That's great, so you're a developer, and operations engineer then (devops)?

I wouldn't call myself an "operations engineer" but I've done quite a bit of devops work.

> Do you think that developers should be expected to do both? Genuinely curious

I don't think every engineer should be expected to do both, but I do think at least one developer on every team should have DevOps experience and be capable of coaching the rest of the team.

The role of ops teams (if they still exist) should be primarily in building tools/frameworks for the embedded DevOps engineers to use. From that perspective it's much closer to a traditional internal tools team than a traditional ops team.

advisedwang · on Oct 15, 2016

There is a trade off between development velocity and reliability. Not just because making more changes often breaks more things, but also in how much investment is made in infrastructure, monitoring. It even shows up in design choices - making a service multi-regional is more effort but results in a more stable system.

A big part of SRE is motivation for reliability. Part of the reason to have SRE is to have engineers motivated by stability, not development velocity. Have somebody who knows the better choice for reliability, even if at the time a decision is made the other way. This necessitates both a separate profession and organisation separation (e.g. for product readiness reviews or for higher level stability decision making).

(NB: I am a Googler that works closely with SRE, but not an SRE myself)

_qc3o · on Oct 15, 2016

I think this is a false dichotomy. If developers are incentivized to make stable software then they'll make stable software but that's not the case. Software engineers that work on products are promoted based on number of features they ship, not how many production outages they don't cause. It's like the senator that lobbies to put bolted doors on plane cockpits before 9/11. That senator will get zero credit for anything. Fundamentally it is harder to measure the effectiveness of preventive measures so most organizations don't and instead settle for number of features shipped.

pmoriarty · on Oct 15, 2016

Don't forget all the death marches to meet unrealistic deadlines that some exec or sales rep pulled out of their ass.

In far too many organizations devs are running around like the proverbial headless chickens to ship! ship! ship! while ops are endlessly fighting fire after fire.

Many teams don't have the time or manpower to afford the luxury of being proactive, and when anyone suggests doing so, tries to put the breaks on, or asks for more resources they're treated like troublemakers.

solipsism · on Oct 15, 2016

I don't know who writes/says dumb things like "ops is dead" but ideally we wouldn't waste time arguing against such reductive statements. Whatever the zeitgeist, there will always be homogeneous groups (all developers know/do roughly the same things), structurally heterogenous groups, and organically heterogenous groups.

drinchev · on Oct 15, 2016

I have network knowledge, linux knowledge, debugging knowledge, some best practices about logging knowledge and still ( as a developer ) I think dev-ops needs to be separated to a different role.

Indeed, you don't need a dev-ops person for your MVP or small web-app, but once you can afford it and the users you have require you to do `no-downtime-deployment` mechanism or database replication or server cluster, I think it's too much to ask your developers to do that. Anyway they will have to spend months on reading about the latest and greatest practices and in the end will have something half-baked, compared to what a dev-ops engineer can do for the same amount of time.

tristor · on Oct 15, 2016

The entire conversation around "ops-less" systems ignores that in most cases the Ops folks are stronger engineers than the SWEs. If a company wants to ditch Ops they'll need to pay far more money for SWEs, expect a higher competency in engineering tasks, and good luck ever outsourcing again.

kod · on Oct 15, 2016

Citation needed. I've personally never seen an org where ops were stronger enngineers than dev.

tristor · on Oct 15, 2016

Engineering is not writing code. Writing code is what you do to implement something you've engineered. Ops may not be better at writing code, but in every organization in my career they've been stronger engineers.

Of course that's just my anecdote.

jzelinskie · on Oct 15, 2016

If you liked this post and want to know the full details to this change in ops, please read the SRE book[0]. It's a great read for both devs and ops and can immediately help you make changes to company policy for the better.

[0]: http://shop.oreilly.com/product/0636920041528.do

pmyjavec · on Oct 15, 2016

The naming of these technologies and paradigms is unfortunate in my view, it's now made Ops sound like a bad thing. Noops, Serverless?

As a long time DevOps / Sys Admin / Generalist I feel disappointed lately because there seemed to be a very brief golden age, which in my opinion was DevOps done right, and that was just getting polished and accepted, then it feels like for little good reason except for marketing or something, that was just thrown out the window ? It was really getting results in my last org, basically meeting half-way with devs felt like the sweet spot and now it's going to extremes.

I was really into investing my time into the DevOps / SRE role, now it just feels demotivating, as any good Ops knows it's a tough job that requires dedication, but is it worth the effort anymore ? Will people still want to hire ops? Should I just move into Software Engineering (which I can do), full time ?

I think she hit the nail on the head to be honest.

digi_owl · on Oct 15, 2016

I get the impression the problem is that it's with ops/admins as it is with safety inspectors. When they do their job right, nothing spectacularly fails, and thus management starts wondering why they have this salary expense on the quarterly spreadsheet.

developers on the other hand go hand in hand with marketing, and thus is easily noticed when they do their thing right.

pmyjavec · on Oct 15, 2016

This definitely happens too :)

latch · on Oct 15, 2016

I agree. Quality is improved as developers start to understand, get involved in, and own the operation side of running their system. Ops ought to be an enabling force.

But, I do wish that anytime anyone writes about ops or infrastructure, they put a big headline:

    There's a 99% chance you just just Scale Up and use Bash.

scurvy · on Oct 15, 2016

When it comes down to brass tacks, developers don't want to be on call. They want to work on the things that they want to work on, when they want to work on them. Look at the proliferation of 20% projects ("keep them happy and let them do what they want 1 day a week rather than what they get paid to do"), and the outright refusal to fix their own bugs. How many developers do you know who would turn their noses up at being placed on a sustained engineering team rather than creating new features? There's a stigma there. You think that stigma won't be there at 3AM Pacific when Europe starts hitting their new feature to the breaking point?

I've been an ops engineer in a "lean startup" where developers were on-call for their services. It didn't work super well because people ignored their phones or put phones on silent at night. As a backstop, they put me (ops engineer) as the fallback secondary notification because they knew I would wake up. Ergo, everyone ignored everything and it all rolled to me. They'd wake up at 7AM, find everything ablaze (because they ignored all my phone calls), then would fix it and go about the rest of their day.

Let's face it though. They probably don't want to be on call. Why force them to do this? There's the concept of ownership and closing the pain loop, but most people really don't understand what it truly means to be on call. "Sorry honey, can't go to the movies tonight I'm on call." "Sorry bro, can't get wasted tonight I'm on call." Only huge organizations with huge dev teams can go through a developer on-call rotation. Most leaner (smaller) companies have 1-2 devs per project, and it's unreasonable to expect that developer to be on-call 24x7.

This stuff works at Uber, Facebook, and Google. But the vast majority of the world isn't Uber, Facebook, and Google.

Also, I'd expect more pay as a developer if the job required on-call shifts. I don't think companies are willing to pay even more than they already are.

throw2016 · on Oct 15, 2016

These cycles keeps repeating themselves. Some marketing driven term gets traction and then people start believing the hype, repeating it as some sacred truth and dismissing experience as grey beards.

A few years down the road when things don't go according to plan some other term gets traction and rinse repeat.

HN especially is guilty of perpetuating hype when one would expert a far greater degree of scrutiny.

When you get to the nitty gritty of scaling from networking, distributed storage, failover, high availability, security and managing state that's entire domains of expertise and experience that devops glosses over.

skarap · on Oct 15, 2016

To rephrase the article: everyone should and can go ops-less, you just need the devs to take over the ops roles.

gaius · on Oct 15, 2016

So get the same work done with half the people, got it.

n72 · on Oct 15, 2016

Anyone know the best place to learn more about kind of basic ops that a dev should know? I've done deployments, spun up AWS instances, configured load balancers etc., but my problem is that I don't know what I don't know. For example, when starting to debug, I get in there and muddle around, but there may be far more efficient way of doing it which I just don't know about. I'll watch ops guys use ps, netstat, etc., which are things I don't use, but presume are useful.

jimjimjim · on Oct 15, 2016

back before the agile dark ages, things like performance, logging/instrumentation, stability guidelines, security guidelines and deployment abilities were able to be specified as non-functional requirements and the product was QA'd for these just as much as the features that were added.

kasey_junk · on Oct 15, 2016

Blaming agile for this seems odd. I've worked on agile teams where all of those were set as requirements.

pmoriarty · on Oct 15, 2016

It's gotten a lot easier to learn and practice ops over the years.

No longer do you need access to a university or government lab to get your hands on Unix. Nor do you have to scour obscure corners of university libraries to get your hands on some wizard manual that finally makes sense of some bit of it for you.

Tons of free, quality tutorials are available online, and you can get help on forums and in chat groups. Online book stores are overflowing with books on just about everything you'd want to know.

Unix (or Linux) has become a lot easier to use in many ways, you can practice on VMs, and anyone who wants it can have root on their own machine. The tools have gotten a lot better too (though both the tools and the OS's have increased in complexity, layers, and interaction with other systems). Cloud providers make spinning up machines, network infrastructure, and various services easier than ever.

Computer literacy has become many orders of magnitude more common than it once was, and a lot of devs grow up being admins of their own Linux systems.

In many ways, it's never been easier to learn ops, to some extent. The same could be said for development, with languages, tools and training being far more available than they once were.

That does definitely reduce the need for a dedicated ops team or a dedicated dev team to some degree. But just as in medicine sometimes you need a specialist who's had the training and a lifetime of practice in that speciality, and when a generalist's knowledge is not enough, I think there'll always be roles for ops and roles for devs.

All other things being equal, a dev who mostly does development and dabbles in ops just isn't going to get the level of professional skill in ops as someone who focuses mostly on ops does.. just as someone who focuses mostly on ops and dabbles in development is probably not going to be able to achieve the same level of development skill that someone who does a lot of development day in and day out will.

It's like someone being both a brilliant brain surgeon and a brilliant hand surgeon. They do have something in common: they're both medical specialities that treat the human body and they both require going to medical school, but being great at both is still rare, and if I ever have hand surgery I'd usually prefer to be treated by someone who's done thousands of hand surgeries and specializes in that, not one who's mostly a brain surgeon who's occasionally operated on hands.

Some people are able to straddle both specialities and do an excellent job at both, but those are relatively rare, because the amount of knowledge and experience you need to do really master both is still quite large, despite everything. This knowledge also changes quite rapidly, so you have to spend a lot of time keeping up with new languages, frameworks, tools, services, etc. That's a lot to ask for even for one speciality, never mind two.

antod · on Oct 15, 2016

I agree with your points about it's never been easier to learn. But I personally think that needs to be balanced with the notion that I don't think things have ever been this complex before.

The last 5 years or so have seen an explosion (cambrian?) in the number of tools and platforms in use - most of which are immature to put it politely.

By the time failure modes of these new tools are well understood and fixed, the market has moved on to the next new hotness.

The knowledge and experience of old grizzled Unix greybeards of the past might've been harder to gain, but it seemed to have served them for much longer before becoming obsolete.

dredmorbius · on Oct 19, 2016

I've pretty much burnt out on the role for this reason.

By the time you've made a reasonable assessment of whether or not a tool is worth the trouble to learn, deploy, and use, (and no, it's not), it's obsolete and another tool not worth the pain has replaced it.

Grow the fuck up, tech world.

gaius · on Oct 15, 2016

No longer do you need access to a university or government lab to get your hands on Unix. Nor do you have to scour obscure corners of university libraries to get your hands on some wizard manual that finally makes sense of some bit of it for you.

It's funny, I can still remember a conversation I had it about 1995 with a colleague, we were certain that with this new Linux thing, *BSD at so on, now that everyone could get their own Unix to play with, the specialized sysadmin and the dedicated C programmer were totally obsolete, everyone would have these skills. This was around the time remember that a "real" workstation would cost 20 grand at least, and the compiler would cost as much again...

That obviously didn't happen, 20 years later, so I think we can reasonably conclude that "access to systems" was never actually the problem.

k__ · on Oct 15, 2016

All the ops over 40 I know didn't even see a uni from the inside.

Most IT people here started doing something different and tgen switched to IT.

The younger the people, the more academical they are.

jonaf · on Oct 15, 2016

This was a good read. As a DevOps Engineer at a company that does not have a distinct Ops department, who's also a Tech Lead, I have some thoughts I'd like to share.

First, while the ultimate goal of any engineer (even not among the Ops disciplines) should be to automate yourself out of a job, we have seen time and again that it is impossible to do so, as any good engineer will continue to advance the state of the art. Conclusively, there is no "finish line" for operations that will not be obsolete within 3 years. The concern that you'll just have to migrate across organizations, reaching the "finish line," rinsing and repeating is a non-issue. The notion of a finish line is really sugarcoated FUD. (The author's interesting thought experiment alludes to this and does refute the argument, so hopefully my statements simply complement the article in that regard. I call it a thought experiment because we will never arrive at this "you automated everything" goal.)

I absolutely agree with the author that we do not really need ops engineers. We do, however, need specific disciplines of software engineering. Specifically, I recommend The Systems Engineering Side of Site Reliability Engineering[1] (as well as the book), Site Reliability Engineering[2] from Google. The usenix article in particular describes three distinct disciplines of software development: systems engineering, site reliability engineering and software engineering. The individuals behind the roles have little to do with the roles themselves (rather, the causal chain is the other way around); it's often misunderstood, for whatever reason, that software engineers and operations engineers have different skillsets because of who they are. This is true, but it does not mean that a software engineer cannot, in short order relative to individuals with no software background whatsoever, transition into an operations role, or vice-versa. Orthogonally, identifying individuals with the skills in any of these three disciplines is critical to placing them in work that is personally and professionally rewarding, as well as more valuable to the organization than if they were placed in some other discipline. And sometimes, individuals do not even know of these disciplines or, for whatever reason, think they are suited for a discipline that they are not actually best at. I was one of these people (a software engineer before moving into operations). In essence, what I'm trying to demonstrate here is that these disciplines of software development are permanent (or have generations-length longevity) and we should not be concerned with being replaced or becoming obsolete. Indeed, it is the specific tasks that will change over time. Consider, for example, electrical engineers. We do not anticipate that EE's will be replaced by robots. Despite robots automating the process of manufacturing circuits, EE's will always be invaluable and irreplaceable. However, their specific responsibilities will change over time. This is why I said before, advancing the state of the art results in new work (or even types of work) to exist. Finally -- and this is just a bonus -- any experience acquired by, say, an EE will be useful even if he or she transitions to a new discipline of engineering. In my experience, the best software engineers I ever known have understood in remarkable depth CPU architecture, memory models, networking protocols, configuration management, etc.

> there is practically no difference between a software engineer and an operations engineer - they both build and run their own systems - other than the domain: the software engineer builds and runs the top-level applications, while the (ex-)operations engineer builds and runs the infrastructure and tooling underneath the applications.

The above statement from the article's thought experiment vaguely describes two (of the three) software engineering disciplines that Hixson[1] talks about. Operations engineers and software engineers alike are, in this thought experiment, responsible for leveraging their expertise and talent (understand that I use the word talent according to the definition described by Hixson) at maximum efficiency. The manifestation of these disciplines is reflected in their domain, but the individual tasks themselves are only relevant today and will change tomorrow. The third discipline not described here (systems engineering) is very much relevant and deals specifically with the interactions among systems, which neither operations nor software engineers will focus on (or necessarily have significant talent in). Later in the article, the author sort of blends SRE (site reliability engineers) and SE (systems engineers) together. The distinction isn't important for the author to make her point, but I wanted to highlight it a little bit.

Second, I think the author describes an environment that strongly reflects the ideals of the DevOps movement. From my reading, I'm inferring that the author is aligned with these ideals. I consider this a big selling point if I ever wanted to consider Uber as a place of employment. As some other comments here on HN have noted: it is extremely rare and difficult to find an organization that has embraced DevOps principles with such purity. I'm fortunate to be employed at one of them (not Uber), and it sounds like Uber has made some good decisions as an organization in this regard. (Hopefully this paragraph can be to the benefit of any employment-seeking operations engineers. The statements in the article reflect positively on Uber, particularly if you are trying to move from a traditional operations role to a DevOps/SE/SRE role.)

Third, the article does a great job refuting the 3 identified arguments. In general, I can't agree more! The author takes the time to consider the merit of each argument and qualify the conditions under which they are true before refuting them, which makes it much easier to read coming from a more traditional organization. From my biased perspective, I don't even give these arguments the light of day and refute them without thinking twice about the qualifications that can alter their accuracy; so, one takeaway for me from the article has been to not make the assumption that these arguments are being made by like-minded individuals. It's quite likely that I'm too hard on people for bringing up concerns like these and, as a result, not open to new (old) ideas.

My final thought on the article is that, while it's not really news in most of the social circles I spend my time with (as a byproduct of having learned much of what I know from a stellar colleagues in a great work environment -- not because of any personal accomplishment), I really appreciate that the author took the time to write out these thoughts and publish them so that the broader software community can grow and adopt ideals that move our industry forward in a very positive, very significant way. So thanks to the author, and to aberoham who posted the link here on HN!

[1] https://www.usenix.org/publications/login/june15/hixson

[2] https://landing.google.com/sre/book.html

sjg007 · on Oct 15, 2016

developing on top of managed services will allow you to do more with less people over time. And adopt best practices by integration.

hullsean · on Oct 15, 2016

Devs & ops seem to have different mandates.

Devs to create change, new features & build product to meet customer needs.

Ops to resist change. To provide a rate limit. And a sense of perspective. Bc they're tasked with stability & continuity.

cryptica · on Oct 15, 2016

I think that applications in the near future will be built on top of 'cloud native stacks' (made up of app servers, databases, memory stores, message queues, etc...) which are designed to run and autoscale on a Kubernetes cluster (or a similar ops/orchestration systems) and developers working for various companies will just focus on adding business logic on top of these stacks - They won't have to understand how all the components in the underlying stack interact with each other when scaling up or down (or recovering from failures) - That will be all encoded as part of the stack's configuration (E.g. as Kubernetes config .yaml files).