Resilience and Waste in Software Teams

whstl · on Jan 26, 2023

Man, this hits hard. Especially this point:

Efficiency + surprise = shocking price tag. Resilience + surprise = a minor bump.

Reading it after the fact sounds obvious, and I kick myself for not realizing it before, but yes.

One of the most stressful parts of my career was when we agreed to try a crazy plan from a crazy CTO, where "every release has to have a new special feature". "Nothing big", we thought, until we saw the insanely detailed deliverables from design. I tried to push back but got called a pessimist. We then obviously worked at 100% capacity for a few months and it "worked", until it didn't. Things started collapsing due to technical issues, bit rot and "random" security issues. In the end we found the design and product teams were also working at 100% capacity. The plan had been hailed as a fantastic achievement in a company-wide meeting, so scrapping it took an insane amount of convincing.

danjac · on Jan 27, 2023

The "insanely detailed deliverables from design" is a red flag and indicates a factory conveyor belt mentality: product comes up with idea, hands off idea to design, design hands off to developers, who then have to implement the design down to the pixel. The "hand-off" approach is incredibly wasteful and counter-productive.

The design team should be building designs in tandem with input from developers and product, so that the final agreed-upon design takes into account business requirements and technical constraints.

Nor should this be "final" in the sense that during implementation there may be unexpected problems that require modification of that design - "hey, turns out this widget doesn't work in Safari, we're going to have to come up with a workaround". As long as what comes out in the end is a usable implementation of the business requirement, it doesn't matter whether it matches some arbitrary design ideas from three months ago.

whstl · on Jan 27, 2023

Yep, good observations. One of the main complaints was that we regressed to a waterfall process and we were doing Scrum only for show.

danjac · on Jan 27, 2023

I think the problem with agile (whatever the exact flavour) is that if you don't adopt the mindset across the organization, and just enforce it in your dev team, you just end up with waterfall-with-extra-steps.

whstl · on Jan 27, 2023

Good point. In this case we had a misapplied "dual-track agile" structure, which in theory means the whole organization is "doing agile", but since the feedback loop is super long, it's indeed just waterfall-with-extra-steps.

This is not the first time I've seen this happen. Product Managers love it. IMO it is classic micromanagement. Product Managers want more control of the process than is afforded by a "self managing agile team", so the only chance of control is achieved by isolating team communication and having "small" deliverables that they have full control over. IMO that's alright, but it's waterfall. You can't expect the advantages of an agile process to apply to this.

claytonjy · on Jan 26, 2023

Corollary from your anecdote: "crunch time works, but not for long".

I suppose crunch time is >100% efficiency (extra hours), so that's why your CTO's plan lasted way longer than crunch time ever should.

Looking back, how much earlier would you have had to pull back efficiency to avoid the later issues? Would a month have been fine? Maybe two?

nicbou · on Jan 27, 2023

Kitchen analogies have replaced car analogies as my favourite.

You can only cook for so long without cleaning your work area or prepping more ingredients. Eventually, the efficiency gains of not doing those things catch up with you.

whstl · on Jan 27, 2023

Good question. IMO, one sprint or two weeks was already a lot. The damage was done from the beginning. It generated a lot of stress and eroded the relationship between the design and dev teams, and between the CTO and employees, because our complaints were not addressed in time.

I think the distinction between this and "crunch time" is that crunch time is meant to be temporary, even when it isn't. In this case the prospect that this wouldn't ever change eventually became as stressful as the work itself, or perhaps more.

claytonjy · on Jan 27, 2023

That makes sense; it's not just "this sprint sucks", it's the expectation the next one will also suck, and the one after that, etc.

Was the design team fine during this time? They had no trouble designing enough stuff for you to implement, it didn't weigh on them in the same way?

whstl · on Jan 27, 2023

No, some of them also confided they were heavily stressed due to pressure to deliver. I urged them to "work less" but the pressure was still coming in from above.

jcelerier · on Jan 27, 2023

how much $$$ did the company make, before and after?

whstl · on Jan 27, 2023

That's a great question! Because of this strategy? I wouldn't say it moved the needle in the slightest, and that was entirely expected. There was however a lot of churn because the platform we operate on launched some features that made our own product obsolete.

The goal of this strategy was merely to make the process and the releases more predictable, but it ended up being at the expense of designers and developers.

manfre · on Jan 27, 2023

Crunch time is often better described as a death march.

Joel_Mckay · on Jan 27, 2023

Most projects' lifecycle:

1. exploratory phase documenting stakeholder requirements

2. analytical phase consolidating a design specification

3. prototype phase where viability and scope are further refined

4. development phase where inflated expectations get attenuated

5. refinement phase where the stakeholders and end users realize they get what they asked for... but not quite what they needed.

6. revision purgatory where the clean efficient design gets hammered back into the same abomination it was meant to replace

One can simply plot the exponential decay in job satisfaction by project phase number. If 40% of the team remains by the end, then it was a better manager than most.

Good luck fellow travelers =)

dasil003 · on Jan 27, 2023

This is definitely true and insightful, but I think it's also important to understand the difference between global logistics and operations vs software engineering teams.

In the case of an airline, the job to be done is very very clear, but is logistically difficult and subject to all manner of physical disruption, not the least of which is you simply don't have the right people in the right cities at the right times.

With software development, the job to be done is more nebulous. Sure you have operations (on-call), but the bulk of the work is building new things, which have variable scope, and are all one-off builds. In this environment you can't even really measure efficiency in any objective way. When the MBAs try, they get subtle (or open) rebellion and Goodhart's Law slaps them down. Best case sane engineering management steps in and saves the day, worst case the elves leave middle-earth and the bean counting fantasy land develops its own self-perpetuating superstitions and mythology while the business runs its natural course.

As an engineering manager, it's always a losing argument to suggest that we need slack in the system. The reasons for this are two-fold: first, upper management never wants to hear this, we're not paying people to sit around idle. More importantly, software engineers also don't want to be bored. Fortunately writing software is not operations. There is always an unbounded list of improvements that the team could undertake at any moment. So balancing efficiency with resiliency really just means having everyone working on the most important thing at the moment, while breaking things down small enough that people can switch gears without a massive context switch, or working around huge, unwieldy in-flight migrations, and also ensuring that 90% of the time people are working less than 40 hours a week, so when crunch time is necessary they are fresh and not burnt out. Easier said than done, but still easier to manage than resilient logistics.

kqr · on Jan 27, 2023

> The reasons for this are two-fold: first, upper management never wants to hear this, we're not paying people to sit around idle

Have you ever thought to frame it as "paying for lower latency"? E.g. in the spirit of https://two-wrongs.com/estimating-work-lag.html

If yes, how did that go down? (Asking out of personal interest. Expecting to have to do this shortly.)
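To put rough numbers on the "paying for lower latency" framing, here is a minimal sketch. It assumes (my assumption, not something taken from the linked article) that requests arrive at random and are handled one at a time, like a textbook M/M/1 queue, where mean turnaround is the service time divided by (1 - utilization). The function name and the 2-day service time are made up for illustration.

```python
def mean_turnaround(service_days: float, utilization: float) -> float:
    """Mean time a request spends waiting plus being worked on.

    Uses the M/M/1 result T = S / (1 - rho). Real teams are not
    M/M/1 queues, so treat the output as a shape, not a forecast.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_days / (1 - utilization)


if __name__ == "__main__":
    # Hypothetical: a typical request takes 2 days of actual work.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        days = mean_turnaround(2.0, rho)
        print(f"{rho:.0%} busy -> about {days:.0f} days turnaround")
```

Under this model a team that is 50% busy turns a 2-day request around in about 4 days, while a 90% busy team takes about 20. The slack is what buys the shorter wait.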

dasil003 · on Jan 27, 2023

Hard to answer this without more context as it really depends on the personalities and viewpoints of those you're dealing with. I could see a world where it could work with very technical leadership that had high trust in you, but my gut instinct is it doesn't come off well.

Is the problem you're trying to solve that leadership doesn't feel you're responsive enough? Or is the problem that the team is being pulled in too many directions to maintain focus and momentum?

Joel_Mckay · on Jan 27, 2023

Good Engineers usually quit doomed projects early. Example: managers not recognizing physical, budgetary, and logical constraints.

It is funny because it is accurate. =)

vsareto · on Jan 26, 2023

>Southwest hasn’t coped with changing expectations

TBF the amount of rapid change that consumer-facing technology has undergone has been crazy. The idea of making some boring, reliable, maintainable, high-traffic, and high-importance piece of software that lasts for decades isn't easy. You have to do this without new developers joining, seeing all of your "old" stuff, and having them jump ship. Not to mention if the technology company behind your old stuff decides to abandon it.

ddulaney · on Jan 27, 2023

To pick a small part of your comment: I think that retaining developers for a legacy codebase is doable as long as you have strong technical leadership.

If I come on board a new team and there's a ton of legacy code, the main question I'll have is "what's your plan here?". If the answer is "no plan, ignore it until it's a problem, then make the smallest possible fix," then of course you'll get new devs immediately jumping ship. But there are lots of other potential plans. "We're incrementally replacing each of these components, targeting one of them per quarter." "We've put this really solid interface between the legacy components and the new stuff." "We've been investigating, writing documentation, and writing tests that cover our legacy code so that we can maintain it alongside newer stuff."

As long as there's some kind of plan in place and the team is executing on that plan, most developers I've worked with will stick around. The issue comes when either there's no plan or there's nobody willing to put resources towards the plan.

darth_avocado · on Jan 26, 2023

Unfortunately due to macroeconomic conditions, we can only operate on supreme efficiency of underfunded teams that work 12 hours a day at lower wages.

dboreham · on Jan 27, 2023

But we take full responsibility.

prettyStandard · on Jan 26, 2023

Then we will suffer, and it's the nerds' fault. /s

ac50hz · on Jan 27, 2023

+10

I have had similar experiences over many years and whilst the products, tools and names may change, whether it’s a large or small organisation, the outcomes are often predictably similar and bad.

I find a general lack of experience, interest and understanding of consequences and the associated risks pervades.

Unfortunately there is rarely any enthusiasm, interest or even resources to allow for sensible discussions to help gain insight into the problems and challenges.

Resilience can come with an understanding of the short and long-term consequences of every decision.

Of note, I find that reinventing, rewriting or scratching-the-itch to make the perfect CMS, ORM, etc, offers no value to understanding consequence and resilience.

arkh · on Jan 27, 2023

What they call slack is called reserve in the military. People who can be deployed to help a failing front or to capitalize on an opportunity.

pmontra · on Jan 27, 2023

A side note:

> We store some of this data in MySQL, which supports full unicode… in a version we didn’t have yet.

I'm not sure about full Unicode. Anything that fits in 4 bytes is OK after a lot of false starts [1], but what about characters with a representation longer than that? According to a reply to [2]

> " " is seventeen bytes: '\xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'

I don't have a MySQL 8 at hand to test if it fits into a utf8mb4 column.

[1] https://dev.mysql.com/doc/refman/8.0/en/charset-unicode.html

[2] https://stackoverflow.com/questions/42778709/the-longest-cha...

Edit: HN doesn't like that 17-byte character. I was able to paste it into the text area but I see it as an empty space after I submitted the comment. It's visible at stackoverflow.
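The quoted byte string can at least be inspected without a MySQL instance. A minimal Python sketch (variable names are just for illustration) decodes it and shows how many UTF-8 bytes each code point takes. It does not test MySQL itself, so whether the whole sequence round-trips through a utf8mb4 column is still unverified here.

```python
import unicodedata

# The seventeen bytes quoted above, copied verbatim.
raw = b"\xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f"
text = raw.decode("utf-8")

# One entry per code point: (hex label, Unicode name, UTF-8 byte length).
code_points = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"), len(ch.encode("utf-8")))
    for ch in text
]

for label, name, size in code_points:
    print(f"{label:8} {size} bytes  {name}")

total_bytes = len(raw)
longest = max(size for _, _, size in code_points)
print(f"{total_bytes} bytes, {len(text)} code points, longest is {longest} bytes")
```

So the seventeen bytes are a sequence of five code points joined into one visible emoji, and no single code point needs more than 4 bytes in UTF-8, which is the per-character limit utf8mb4 is documented to support.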

kqr · on Jan 27, 2023

It's also important not to overreact to temporary hiccups. This time Southwest crapped out. The next time it will be some other airline. An important part of resilience is how they learn from their mistakes.

Southwest has a strong, resilient internal structure[1] that ought to help them learn from this, perhaps better than other airlines would. I wouldn't write them off so quickly.

When judging things on their resilience, I think it's important to not look at single events and then pass judgment. Better to apply theory to the system and make predictions under a wide variety of upsets.

[1]: https://www.forbes.com/sites/darrendahl/2017/07/28/why-do-so...

tomerbd · on Jan 28, 2023

Agile processes must have a resilience mechanism built in. If they don't, plant one: ask for a technical-debt payment sprint every couple of sprints.