As a software engineer doing infrastructure work I often find myself working on operational stuff (mostly chasing weird bugs, some on-call, etc.). In my position I am also expected to release features and do development too, but I feel like it's very difficult to focus because of all the operational issues I am dealing with. How are you guys dealing with that sort of work?
Badly. We've lost a couple devs/ops people in the last year, and haven't adequately replaced them. We're stretched way too thin and everyone is getting very burned out.
I haven't done any significant development work in more than six months, just chasing bugs, doing support, and fussing with email and meetings. It blows; I've got to find a different job.
Identify points to automate. Automate them. Get the automation peer reviewed by the team. Establish testing for the automation. Deploy the automation.
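Something like this is usually where I start, just as a sketch (the log path, threshold, and retention window are made up for illustration, and a real version would go through the review and testing steps above):

    #!/usr/bin/env python3
    """Hypothetical auto-remediation: prune old app logs when disk usage gets high."""
    import shutil
    import time
    from pathlib import Path

    LOG_DIR = Path("/var/log/myapp")   # assumption: the app logs here
    USAGE_THRESHOLD = 0.90             # remediate above 90% disk usage
    MAX_AGE_DAYS = 14                  # keep two weeks of logs

    def disk_usage_fraction(path: Path) -> float:
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def prune_old_logs(log_dir: Path, max_age_days: int) -> int:
        cutoff = time.time() - max_age_days * 86400
        removed = 0
        for f in log_dir.glob("*.log*"):
            if f.is_file() and f.stat().st_mtime < cutoff:
                f.unlink()
                removed += 1
        return removed

    if __name__ == "__main__":
        if disk_usage_fraction(LOG_DIR) > USAGE_THRESHOLD:
            print(f"pruned {prune_old_logs(LOG_DIR, MAX_AGE_DAYS)} old log files")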
If it's one-offs rather than the kind of consistent misbehavior the above can deal with, improve your testing infrastructure. If you're unable to hit your feature development schedule, point to the problems in the present system and infrastructure.
Ask your boss for clear priorities: do they want a stable system, or more features? If the present system is this unstable, then more features will only exacerbate the problem. If they say they want both, and give them equal priority, ask for a pay raise and start looking for a new job.
Chasing bugs and being on-call sound like core parts of a software engineer's job, rather than operational work.
That said, some teams at my company are experimenting with a week-long rotation for "bread box" issues. Those include tending issues/PRs in open source repos, handling bugs as they come in, etc. That frees up the rest of the team to work on core feature work.
I like to keep a running list of smaller, non-urgent tasks that would otherwise get neglected. When I have a long-running script or need to take a break from another project, I can refer to the list.
Chasing bugs? Yes. Being on-call? No. Not unless you signed up for that. Too many companies think they can just get Pagerduty going and sign up all their engineering staff for operations duty. This is stupid for a number of reasons, not least of which is that managed services get rid of most of the need for it and are typically cheaper than developer time.
Do some developers on the team need to think about scale? Yes. Should all the developers be on call because the company decided to roll its own infrastructure and someone has to deal with the occasional server with a full disk? No.
The flip side to this is that being on call forces developers to care about bugs in their code that cause operational headaches, instead of just throwing releases with varying degrees of test coverage over the fence to ops. Funny how certain bugs that languished in the background get priority when it's the phone of the dev responsible for that code that rings at 3am instead of some poor schmuck's on the ops team.
This exactly. If the developers responsible for the problem (and the fix) aren't feeling the pain of being on-call, then nothing will change and the fallout will be left on support/ops (who will usually find a poorly thought out workaround).
Do developers need to be on-call to handle purely ops-related activities (low disk space, high system load, etc)? Absolutely not. Should developers be responsible for their "production-ready" code when it breaks? Definitely.
But the problem is if you assign a rotating duty to your engineering staff, you as an engineer have no direct impact on how often you will be called due to the half-assed work of other developers. It's a rocky road. Do this too much and your staff will leave. I certainly will. Life is too short.
In short, we're all describing poor management. Signing up all the developers for Pagerduty is a band-aid. So is pushing it all onto operations. In both cases, management is choosing to avoid dealing with something that requires ongoing effort and time.
On call as a core part? Really? Thankfully I've never worked anywhere with such a "duty", tbh if my current place proposed it I'd be applying for new jobs by lunch time.
What's the standard pay for being on-call as a matter of interest?
Every healthy engineering place I've worked at had developers on call. It's called "eating your own dog food". Devs should be responsible for the things they build - it affects the dev culture significantly when your shitty code can wake you up in the middle of the night.
So realistically, is the code going to be fixed at 3 a.m.? Why can't it wait til I'm in the next morning at 8 a.m. for a proper review, triage, priority listing and then fix?
I'm shocked that people would so easily give up their free time really, but to each their own.
How much does it pay extra?
> It's called "eating your own dog food".
No it's not, that's using your own product. Which I do.
Yes, if it matters. I can think of dozens of examples. E.g. you provide a payment processor, and you have clients worldwide.
At any significant size, it's going to matter if your service is down for 5 hours. Let's take an extreme example - let's say Google search goes down at 3am PST. Do you think the engineer on call is going to wait 5 hours before "triage, priority listing and then fix"? Are you kidding me?
I don't know what bubble you live in, but in the real world and for many businesses, outages out of hours matter. I'm sure some places they don't (like maybe day trading systems).
If you don't want a call out, build your systems to be resilient to failure, and self-healing.
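To give one concrete (and assumed) flavour of "self-healing": a dumb watchdog that restarts a service when its health endpoint stops answering. The URL and unit name are placeholders, and whether blind restarts are safe depends entirely on the failure mode:

    #!/usr/bin/env python3
    """Hypothetical watchdog: restart a service if its health check keeps failing."""
    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/healthz"  # assumption: service exposes this
    SERVICE = "myapp.service"                     # assumption: managed by systemd
    FAILURES_BEFORE_RESTART = 3
    CHECK_INTERVAL_SECONDS = 30

    def healthy() -> bool:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # connection refused, timeouts, HTTP errors, ...
            return False

    def main() -> None:
        failures = 0
        while True:
            failures = 0 if healthy() else failures + 1
            if failures >= FAILURES_BEFORE_RESTART:
                # Restart and start counting again; a real setup would also alert someone.
                subprocess.run(["systemctl", "restart", SERVICE], check=False)
                failures = 0
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        main()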
On the flip side - if your workplace is toxic or bad enough that you aren't allowed to fix systemic issues that cause outages, well then I can see your viewpoint. It's not worth being on call if you can't make things better.
Never agree to be on call, and if you do, make sure you are being paid at least double salary. All modern science points to working unsociable hours being a massive detriment to your health. Also, working Saturdays and Sundays does not make your team more productive, because your staff will be tired the following week; it's a false economy.
So if your software needs to run 24 hours and something breaks with your software, how do you avoid being on call?
A developer shouldn't be the first person called, there should be an operations staff but they may have to escalate.
On the other hand, any time that a developer is routinely being called in the middle of the night, there is usually either an issue with the software or the infrastructure not being fault tolerant.
In the UK there are laws that let you opt out of being asked to work more than a certain number of hours. The company should have an out-of-hours plan, but most experienced developers know that very few things get resolved in the middle of the night: things need testing and reviewing, and sometimes the solution is not simple. It is better, like you said, to have ops staff who gather data and then pass it on when devs come in fresh. However, if you have, say, a big international sale happening in another timezone, why not just pay staff as a one-off to be around?
Agree entirely. The law hasn't caught up, and doesn't require compensation for on-call/off-hours work beyond the existing salary (at least in the US).
Moving from devops/infrastructure to security I doubled my total comp (salary + 401k + vacation + health insurance) while reducing the hours per week I work down to ~37 hours (I also get to work remote and never work nights or weekends). If you're in ops/devops/infrastructure, I highly recommend the transition to others, the current demand for competent security professionals is quite literally bananas.
TL;DR If you're in ops, get out of ops (easy) or work someplace that will compensate you appropriately for on call/nights/weekends work (hard).
If you know how to securely design and build AWS environments, you qualify for “cloud security architect” positions based on my interviewing experience. Beyond that, interview for security positions and take note of your gaps to improve on.
It depends on the frequency and nature of these issues, but it sounds like you are carrying technical debt and you're paying for it with slower development speed. Solving the stability issues should take precedence over developing new features.
Is the stuff you have to intervene for under your control, or external? If you're relying on outside systems that are flaky, then you need to make your systems more resilient: things like automatically retrying a few minutes later if some third-party service is down, and/or being more transactional so you can deal with errors.
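For the retry bit, a minimal sketch of the usual backoff-with-jitter pattern, assuming the call is idempotent so it's safe to repeat (names are illustrative):

    import random
    import time

    def call_with_retries(fn, attempts=5, base_delay=1.0, max_delay=60.0):
        """Retry a flaky call with exponential backoff and jitter."""
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts, let the error surface
                # Sleep 1s, 2s, 4s, ... capped, with jitter so callers don't retry in lockstep.
                delay = min(base_delay * 2 ** attempt, max_delay)
                time.sleep(delay + random.uniform(0, delay / 2))

    # Usage, with a hypothetical third-party client:
    # result = call_with_retries(lambda: third_party.fetch_invoice(invoice_id))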
I see chasing weird bugs as part of the development job, not something separate. As long as it has weird bugs, the feature is not really done. As a side note, developers who do "only development" and offload all the weird bugs to someone else tend to create less maintainable software over time: they lack feedback and tend to favor whatever lets them produce new stuff faster over whatever would help everyone avoid those weird bugs.
As for infrastructure and first line support, lobbying management for more people continuously is just about the only long term solution.
The other thing is planning and transparency, which helps with the above. Keep a plan with realistic estimates and show it to management each time you talk with them. Do your best work, definitely don't slack, but don't cut corners to make something look done when it is not. Instead, move the dates in the plan and send it to management again. The point is to convince them that there is really more overall work than is possible for one person. (If they get offended or treat you badly over that, find a new job.)
(I haven't read this cover to cover, but I have more or less read his and Christina J. Hogan's book cover to cover, I think, and I've also bought a couple of copies of the above book to share.)
Summary of what I've learned and found useful from those and other resources:
Get someone to step in for you half the time. (If only to fill in a ticket or, in a real emergency, call you.)
Manage expectations. (You don't expect hard interrupts except for emergencies.)
Make support requests asynchronous. (Mail and support tickets, not calls. Even when you, or someone else, are available for real-time support, make chat the preferred option.)
Yeah, I really get your suffering. I really hate it when software engineers try to meddle in the ops part. It usually ends up being a stupid pile of crap on top of another pile of crap. It is also sad to see companies pushing devs to do this instead of giving it to someone who understands what they are doing.
If I don't fix bugs and I don't help my customers with setting up servers and the like I don't think I'll get new projects with them. Why would they trust a developer that disappears? It's as simple as that.
Some of those activities are paid, but fixes close to a delivery are not and it's OK. Usually I set up a maintenance contract for quick activities, like small new features or investigating puzzling events (not necessarily bugs.) I have a ticketing system to keep track of those activities. Customers have access to it.
Obviously one has to make clear that maintenance will slow down development.
Speaking as an ops person, my first thought is that you have technical or architecture debt. Obviously, big and/or very rapidly growing systems will hit limits and need constant attention, but these days designing most applications to scale is not a problem.
The root cause of many operations issues that I see these days stems from one or more deficiencies in the development process. I don't say "deficiencies in developers": to get safe development at speed, you need a disciplined development process with appropriate feedback mechanisms: unit tests, integration tests, performance tests, static analysis, code review etc. The default state of code is "buggy", because humans are not perfect.
You need a better system in place to prevent bugs from happening.
- Separation between development / staging / production environments.
- Integration tests.
- Service / System Metrics.
- Central logging.
- High availability.
- Alerts.
When you have a solid deployment pipeline, things don't usually break. Errors and regressions are caught in the staging part of the pipeline, and errors in production can be rolled back automatically (and then you add an integration test for the regression!).
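Roughly what the "roll back automatically" step can look like, as a sketch rather than our actual pipeline; deploy.sh, the smoke-test URL, and the version arguments are all assumptions:

    #!/usr/bin/env python3
    """Hypothetical deploy step: ship a version, smoke-test it, roll back on failure."""
    import subprocess
    import sys
    import urllib.request

    SMOKE_URL = "https://example.com/healthz"  # assumption: post-deploy health check

    def deploy(version: str) -> None:
        # Placeholder for whatever actually ships the release (Ansible play, Jenkins job, ...).
        subprocess.run(["./deploy.sh", version], check=True)

    def smoke_test() -> bool:
        try:
            with urllib.request.urlopen(SMOKE_URL, timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    def main(new_version: str, previous_version: str) -> int:
        deploy(new_version)
        if smoke_test():
            return 0
        # Regression slipped through: roll back, then write the integration test that would have caught it.
        deploy(previous_version)
        return 1

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))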
All this devopsy work at my company is done by software engineers with advice from systems engineers.
And we do it because neither of the groups wants to get called on the weekends :) It has been working really well. Last year we had 0 calls. Before we had this in place, things would break on a weekly basis.
You can build all of what I mentioned with OSS like:
- Ansible (deployment)
- Jenkins (ci)
- ELK stack (metrics / logging)
- Zabbix (system metrics)
This system has been serving us, on premises, without much maintenance.
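If it helps to picture the central-logging piece, here's an assumed minimal example of an app shipping one structured event straight to the Elasticsearch in that ELK setup; the host and index name are placeholders, and in practice you'd normally let Filebeat/Logstash tail log files instead of posting from the app:

    import datetime
    import json
    import urllib.request

    ES_URL = "http://localhost:9200/app-logs/_doc"  # assumption: local Elasticsearch

    def log_event(level: str, message: str, **fields) -> None:
        """Index one structured log document in Elasticsearch."""
        doc = {
            "@timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "level": level,
            "message": message,
            **fields,
        }
        req = urllib.request.Request(
            ES_URL,
            data=json.dumps(doc).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req, timeout=5).close()

    # log_event("error", "payment gateway timeout", order_id=1234)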
> As a software engineer doing infrastructure work
So you are into devops but doing more ops than dev? This doesn't sound like a problem unless your team's agenda and objectives call for delivering more than ops work.