Unfortunately this is a lesson that often has to be learned the hard way when yo...

tonyarkles · on April 23, 2016

I perpetually love the references to Ship of Theseus, and find it particularly applicable to this problem.

You mention fundamental functional problems, and I'd like to add something: sometimes it's not a functional problem, but a changeability problem. The code functions fine, but the process of adding a new feature is incredibly painful.

I've done big-bang partial rewrites of systems before, quite successfully, but I've got a Ship of Theseus rule of my own that I follow: no new features during the rewrite, and no missing features. The first example that comes to mind was a rather complicated front-end application that had become a total spaghetti disaster. It had been written using Backbone, and from my research it fit the Angular model of things quite well.

I took a pound of coffee with me out to the cabin for a weekend, and rewrote the whole frontend. Side-by-side. I started with the first screen, and walked through every code path to populate a queue of screens that were reachable from there. Implement a screen, capture its outbound links, repeat.
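The "implement a screen, capture its outbound links, repeat" loop reads like a breadth-first walk over the app's navigation graph. A minimal sketch of that idea, assuming a hypothetical `SCREEN_LINKS` map (the screen names are made up for illustration and are not from the original app):

```python
from collections import deque

# Hypothetical navigation map: screen -> screens reachable from it.
SCREEN_LINKS = {
    "login": ["dashboard", "forgot_password"],
    "forgot_password": ["login"],
    "dashboard": ["settings", "report"],
    "settings": ["dashboard"],
    "report": [],
}


def rewrite_order(start, links):
    """Return screens in the order they would be reimplemented (breadth-first)."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        screen = queue.popleft()
        order.append(screen)  # "implement a screen"
        for target in links.get(screen, []):  # "capture its outbound links"
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(rewrite_order("login", SCREEN_LINKS))
# ['login', 'dashboard', 'forgot_password', 'settings', 'report']
```

The `seen` set is what keeps the walk from looping when screens link back to each other, and an empty queue is the signal that every reachable screen has been covered.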

Nothing changed, but everything changed. The stylesheet was gnarly too, but I left it 100% untouched. That comes later. By keeping the stylesheet intact, I (somewhat) ensured that the new implementation used identical markup to the old system. The markup was gnarly too, but keeping it identical helped to ensure that nothing got missed.

48-72ish hours later, I emerged a bit scraggly and the rest of the team started going over it. They found 1 or 2 minor things, but within a week or so we did a full cut-over. The best part? Unlike the article, clients had no outward indication that anything had changed. There was no outcry, even though about a quarter of the code in the system had been thrown out.

enobrev · on April 23, 2016

A full rewrite absolutely should not be taken lightly (if ever). It's very much a last resort and something that requires deliberation and a clear path to success. You're spot on with your rules - nothing new; nothing lost.

I had a similar experience, sans the cabin. I was hired onto a startup that already had a functioning app in the wild. The API was written by one of the founders, who is not a developer in the professional sense, and holds my undying respect. As an early-stage startup with lots of ideas, we needed to move fast, and I wasn't going to be able to do so with the existing code base.

I stocked the fridge, locked myself into my tiny Brooklyn apartment, and got to work. I started by logging all requests to the API in order to ensure I had all the necessary endpoints covered. Then I wrote integration tests - acting as an HTTP client - for the entire API.

About a week or so later, once the rewrite was finished, I added automated tests that compared the output between the two APIs, and once those matched perfectly, ran it live beside the original API (sending requests to both) and compared results from real requests to ensure there were no discrepancies.
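The output comparison could look roughly like the sketch below. It assumes both APIs return JSON and are reachable as plain callables; `old_api`, `new_api`, and the request tuples are hypothetical stand-ins for real HTTP calls and a real request log, not the actual setup described here.

```python
import json


def normalize(body):
    """Parse a JSON body so key order and whitespace don't cause false mismatches."""
    return json.loads(body)


def find_discrepancies(logged_requests, old_api, new_api):
    """Replay each logged request against both APIs and return those that differ."""
    mismatches = []
    for request in logged_requests:
        old_status, old_body = old_api(request)
        new_status, new_body = new_api(request)
        if old_status != new_status or normalize(old_body) != normalize(new_body):
            mismatches.append(request)
    return mismatches


# Hypothetical stand-ins for the original API and the rewrite.
def old_api(request):
    method, path = request
    if path == "/users/1":
        return 200, '{"id": 1, "name": "Ann"}'
    return 404, '{"error": "not found"}'


def new_api(request):
    method, path = request
    if path == "/users/1":
        return 200, '{"name": "Ann", "id": 1}'  # same data, different key order
    if path == "/users/2":
        return 200, '{"id": 2, "name": "Bob"}'  # deliberately differs from old_api
    return 404, '{"error": "not found"}'


logged = [("GET", "/users/1"), ("GET", "/users/2"), ("GET", "/missing")]
print(find_discrepancies(logged, old_api, new_api))
# [('GET', '/users/2')]
```

Comparing parsed JSON rather than raw strings avoids flagging harmless differences such as key order, while still catching a changed status code or payload.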

Besides a couple of very small bugs after the switch, it went very well. The user base was none the wiser, besides the sudden uptick in features after the rewrite. The startup was relatively successful (acquired), and I still work with those guys from time to time.

d33 · on April 23, 2016

The thing that struck me is that they decided to rewrite the system without talking to the customer. I believe that if they kind of sold this first, they might have gone a different path.

chris_wot · on April 23, 2016

I love both your story and the parent's :-) great work!

radicalbyte · on April 23, 2016

I have similar experience: the "incremental rewrite" is usually the most effective tactic to apply. It reduces risk, cuts "time to market" and lets you apply Pareto's rule - making the process efficient.

Very rarely a rewrite is the answer: for example if the product is functioning poorly, where even low risk changes cause random regressions. Where large system-level changes are needed to make the system work - say because the original developers didn't implement authorization checks after the login.html page. Where none of the original developers are available and no-one knows the requirements. Where the system is a hodgepodge of 4 different frameworks including one custom one (whose developer is long gone, leaving 0 documentation or tests behind).

In cases like that the software artifacts are a liability; it's better to use the assets you've built up (domain knowledge) and develop a new product in parallel. Put the old one in zombie mode and spend two years building the new product with feature parity.

There is one other situation where a full rewrite is good: if v1 is a throwaway prototype. However that would never have been put into production for any significant number of users.

dpark · on April 24, 2016

>In cases like that the software artifacts are a liability; it's better to use the assets you've built up (domain knowledge) and develop a new product in parallel. Put the old one in zombie mode and spend two years building the new product with feature parity.

In two years your devs should be able to understand the existing code, fix the auth problems, and excise the worst of the code and hodgepodge framework mess. If they can't, then they certainly can't maintain the old system while building a replacement.

If you can't understand the old code, you can't replace it. If you don't understand the requirements, I'm not sure you can even maintain the existing system. It shouldn't take two years to add auth, or remove an unsupported framework.

This is the textbook case of when a rewrite will fail, because the scope is not just too big, but unknown (because you don't even understand the requirements). The choice to rewrite in this situation is not logical. It's emotional. The existing system is a mess, and the fix seems so difficult. But the rewrite estimate is likely poor because you don't actually understand the system you're rebuilding, and even if it's an accurate estimate you lose two years guaranteed just to ship feature parity. You can make a whole lot of improvements to a codebase in two years while also shipping features.

groovy2shoes · on April 24, 2016

> There is one other situation where a full rewrite is good: if v1 is a throwaway prototype. However that would never have been put into production for any significant number of users.

Tell that to my manager. Since I joined this company a little over two years ago, all of our tools have been put into protoduction, despite my warnings and protests.

dpark · on April 24, 2016

>cabin for a weekend, and rewrote the whole frontend.

The problems with big bang rewrites don't manifest in rewrites that take 2-3 days. Even if your rewrite burns a full week, if it fails you've only lost a week. Rewrites are problematic when they are expected to take months or longer. That's when the amount of code is high, the complexity is high, and the estimates tend to be bad. Losing many months of forward progress to chase a rewrite can kill a company. If a single lost week can kill your company, you're probably doomed anyway.

I've successfully done the "big bang" rewrite myself for things that needed a week or so of rewriting. I don't believe for a moment that this experience is relevant for large scale rewrites. I've only ever seen those fail spectacularly.

Anyone who tells you that a 1-week rewrite is never appropriate is just cargo culting and probably not worth listening to in general. A one week rewrite to avoid weeks of refactoring can be a very good tradeoff. A one year rewrite on the other hand is likely to end in disaster, not least because it guarantees a year of lost forward progress.

tonyarkles · on April 24, 2016

That's why I targeted a specific subsystem. Redoing the whole system, frontend and backend, would have definitely taken much longer than the 48 of 72 hours I put in over 3 days. I took the piece of the system that was the gnarliest, rewrote it, and bought us a quick win. Down the road, other vertical chunks of the backend were ripped out and replaced in a similar fashion, once they became the now-worst piece of the system.

dpark · on April 24, 2016

I guess I'm a little unclear about your argument here. You used the term "big bang" rewrite, but you're describing an incremental rewrite.

DonHopkins · on April 23, 2016

Only a pound of coffee? I'd be terrified of running out just before I finished. Stay safe, man. Always bring lots of extra coffee to your cabin in the woods.

tonyarkles · on April 23, 2016

Hah, funny enough... Not on that project, but on a different messy one, I packed up and went out there, only to discover at 9am on the first day of the project that I had accidentally brought a pound of decaf. The nearest town is about a half hour drive one-way, and the best I could find at the tiny grocery store there was a large can of Folgers of questionable vintage, and a box of Red Rose tea.

Surprisingly, if you use pre-ground cheap coffee in an Aeropress, it still turns out... sort of OK. Better than I expected, worse than I'd hoped!

groovy2shoes · on April 24, 2016

Friends don't let friends drink decaf.

chris_wot · on April 23, 2016

That is precisely what I did with a set of Qlikview dashboards at my previous firm. They were always very dodgy, but kind of worked. My new boss walked in and started fiddling with the underlying scripts and buggered up everything.

Eventually another colleague and I were tasked with fixing the mess. What I did was to look at the sources, then work out what each graph was trying to do. I then set up a new script that extracted the data into a cleaner data model, and then I (rather painfully) copied and pasted entire sheets into this new Qlikview dashboard. From here all I needed to do was to hook up the graphs to the new model by changing the source fields, expressions and calculations.

After a bit of UAT from internal departments I had fixed the reports and no client was really any the wiser. If I hadn't done it this way, there would have been awkward questions and I just know I would have been chasing my own tail modifying the rewrite to get it back to the way the old system looked.

bernardlunn · on April 23, 2016

About 500 of the Fortune 500 would like you to show them how to do that with each of their 500 worst hairballs.

p4wnc6 · on April 23, 2016

While I generally agree with this, there is one situation when scorched earth re-write should at least be kicked around as an option.

If you are dealing with a very bad system that has not been maintained and has tons of candidates for those tightly-scoped incremental fixes, but at the same time, you are embedded in a giant, faceless conglomerate sort of company where there is effectively 0% chance that any of those tightly-scoped incremental fixes will ever be greenlighted and everyone in the team knows it.

Then you should consider the scorched earth approach, because it is the only way that the incremental fixes will ever happen. Bureaucrats will always find an excuse why this particular short term time frame is not the right one for the incremental fix that slightly slows productivity and gets them dinged on their bonus. And they will always find a way to pin the blame regarding underinvestment in critical incremental fixes on the development staff.

So sometimes all you can do is deny them that option, even if you know how painful it will be. The aggregated long-run pain will be less, though it won't feel that way for a long time.

I just want to reiterate that I mostly agree with you. It's especially bad to turn your nose up at legacy code that actually has valuable tests, because the tests make the maintenance and incremental fixes so much better. When there are solid tests, you should almost never throw it away.

Nonetheless, sometimes you have to torch it all to deny the bureaucrats the chance to slowly suck the life out of it (and you).

dack · on April 23, 2016

I dunno, after you do the rewrite, then what? If the bureaucrats won't ever let you make incremental fixes/cleanup, then even your rewritten version will eventually rot.

I think the idea of building software "the right way, once and for all" is mostly just wishful thinking on our part. Even the ideal rewrite will have ongoing maintenance that needs to be done.

Therefore, my only two options in that situation would be to a) make those incremental changes as part of features without telling them (estimated as one chunk), or b) look for a new job.

DougWebb · on April 23, 2016

At my old job, we built, maintained, and operated a huge customer-facing system. We had 2-3 big releases per year, and in every release cycle we devoted around 20% of our time to maintenance, whatever time was necessary for critical bugs (generally from the new code released in the last cycle) and the rest to new development. We (the dev team) put this split into our estimates and development plans, and we had official approval from the project management team and c-level execs. That 20% time was ours to use as we saw fit, and we generally used it for refactoring code we felt was hampering productivity on new development. But we also used it for side projects; I built a performance/metrics gathering and reporting system that was incredibly useful to me, and ultimately became incorporated in marketing materials and the ceo's presentations.

We were lucky to be able to use this 20% time openly, but if we weren't we would have increased our new dev estimates to get the time.

chris_wot · on April 23, 2016

Michael Feathers considers legacy code to be any code without tests. [1] I tend to agree - only I think even code with tests can be candidates for the "legacy" moniker.

Bottom line is: you may do a rewrite, but that codebase may be considered to be a hairball by the next guy.

1. http://www.netobjectives.com/system/files/WorkingEffectively...

cubano · on April 23, 2016

All b) really does is start the loop over again at the initialization step.

aytekin · on April 23, 2016

One of the cool things about doing replace-in-place is that you can A/B test the old and new versions. We are doing that on our current Form Builder rewrite at JotForm. We rewrite one piece and make it live to 50% of users. Then we receive daily morning emails about each test. If some metric is in red, we discuss (or watch fullstory, or talk to users) what might be causing it and improve the new version.
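One common way to make a piece live to a fixed share of users is to hash each user into a stable bucket. The sketch below illustrates that general technique and is not JotForm's actual implementation; `assigned_variant`, the experiment name, and the user IDs are hypothetical.

```python
import hashlib


def assigned_variant(user_id, experiment, new_percent=50):
    """Deterministically assign a user to the "old" or "new" version.

    Hashing the experiment name together with the user ID gives each user a
    stable bucket from 0 to 99, so the same user always sees the same version.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "new" if bucket < new_percent else "old"


# Roughly half of a hypothetical user population lands in each version.
variants = [assigned_variant(user_id, "form-builder-rewrite") for user_id in range(10000)]
print(variants.count("new"), variants.count("old"))
```

Including the experiment name in the hash means a user who gets the new version of one piece is not automatically in the new group for every other piece, which keeps the tests independent of each other.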

Here is a fresh example: we released the new version of the PayPal Payments Pro/Express integration, and the success rates stayed in red. The old version was beating the new version. It was 3x better even though almost everything was the same. After some head scratching, we found that the old version had a link to a direct PayPal page where the user can get their API credentials, and the new version was missing it. From there, the fix was easy and things turned green.

This is a story that has happened over and over again. When you rewrite software, you lose all those hundreds of tiny things which were added for really good reasons. Don't do it blindly.

hinkley · on April 23, 2016

Usually these problems come down to the information architecture, and I wonder sometimes if the skills necessary to do a replace-in-place to fix those sorts of problem might be better off, at least in general, being applied to some other problem.

If people want broken stuff, or things that violate the laws of physics, information theory, and common decency, no amount of heroics is gonna keep the wheels on forever. Maybe writing software to cure cancer or improve food distribution (ie, cure hunger) would be a better use of your energy.

radicalbyte · on April 23, 2016

Funny because that's exactly the choice I just made.

Why should I - after spending 20 years turning myself into a top tier developer - spend all of that talent and investment fixing the horrible mess that two generations of management have created for me?

I can earn twice as much freelancing or moving into management.

And, honestly, I'm beginning to think that we should just let market forces do their thing. These are the software equivalent of the subprime mortgages - if I spend an enormous amount of capital I can stop it failing but I simultaneously remove any downside. So these "management" types can make the same errors all over again.

Nah, starting for myself is much more interesting. That seems to be the Valley way (although I live on the other side of the ocean).

aytekin · on April 23, 2016

Our product is used by 2 million people and some of them are literally curing cancer and hunger. That's actually a great motivational reason for us to continuously improve it.

DonHopkins · on April 23, 2016

Yes, it's a great feeling and strong motivation to work on something like that, and it gives you the will to stick to it for the long run.

I've been working on a system for improving health and preventing diabetes since 2005 [1], which is complicated by the fact that it's designed from the start to support controlled clinical trials, followup questionnaires, and analytics to measure how well it works. That's enabled us to measure how well it works (which literally involved drawing blood), publish papers proving its effectiveness, and feed that learning back into many changes to improve the system over the years.

Parts of the system really needed to be rewritten, but were originally intertwined all throughout, and there was no time for a rewrite. So over the years I've been incrementally refactoring and isolating those parts to make it easier to remove them in the future, even when we were under crunch and there was no time to rewrite.

I finally found the time to completely replace the bad parts, and it turned out to be much easier (and more satisfying) than I expected, significantly reduced the complexity and lines of code, and made it possible for us to hire contractors to work on the code without their heads exploding. But I think it went so well because I'd been thinking about it and chipping away at it for years.

[1] http://turnaroundhealth.com

reitanqild · on April 23, 2016

For anyone who has to look up the "Ship of Theseus", here it is: https://en.m.wikipedia.org/wiki/Ship_of_Theseus

HillRat · on April 23, 2016

Or Fowler's "Strangler Pattern," a similar approach.

ralphael · on April 24, 2016

...this is a good reminder that sometimes it's a good idea to hire for certain positions based on prior experience.

fallous · on April 25, 2016

Indeed, but then the organization has to know what it does not know... which presumes a certain wisdom in the first place. ;)