
> If something is used once, ignore any abstractions. If it's used twice, just copy it, it's better. If it's used 3 or more times, look at writing an abstraction...

That is a good rule of thumb, and I often follow it too. But it does take some discernment to recognize cases where something would benefit from an abstraction or some common code, even if it is only used twice.

I used to work for a company that imported airspace data from the FAA (the US Federal Aviation Administration) and other sources. The FAA has two main kinds of airspace: Class Airspace and Special Use Airspace.

The data files that describe these are rather complex, but about 90% of the format is common between the two. In particular, the geographical data is the same, and that's what takes the most code to process.

I noticed that each of these importers was about 3000 lines of C++ code and close to 1000 lines of protobuf (protocol buffer) definitions. As you may guess, about 90% of the code and protobufs were the same between the two.

It seemed clear that one was written first, and then copied, pasted, and edited here and there to make the second. So when a bug had to be fixed, it had to be fixed in both places.

There wasn't any good path toward refactoring this code to reduce the duplication. Most of the C++ code referenced the protobufs directly, and even if most of the data in one had the same names as in the other, you couldn't just interchange or combine them.
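
To make the trap concrete for anyone who hasn't hit it: protoc generates a separate, unrelated C++ class for each message, so two messages with field-for-field identical definitions still can't be passed to the same typed function. A rough sketch of the shape of the problem (all names hypothetical, not the actual FAA schema):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct LatLng { double lat; double lng; };

    // Stand-ins for two protoc-generated classes. Identical members,
    // but as far as the type system is concerned, unrelated types.
    struct ClassAirspace      { std::vector<LatLng> boundary; };
    struct SpecialUseAirspace { std::vector<LatLng> boundary; };

    // Placeholder; real airspace code would use great-circle distance.
    double Distance(const LatLng& a, const LatLng& b) {
      return std::hypot(a.lat - b.lat, a.lng - b.lng);
    }

    // Anything typed against one message cannot accept the other:
    double BoundaryLength(const ClassAirspace& a);  // no SpecialUseAirspace allowed

    // Sharing the common 90% means either a template over the identically
    // named accessors, like this, or a schema change that pulls the shared
    // geometry into its own message. Neither is cheap once thousands of
    // lines reference the concrete generated types directly.
    template <typename AirspaceProto>
    double BoundaryLengthT(const AirspaceProto& a) {
      double total = 0;
      for (std::size_t i = 1; i < a.boundary.size(); ++i)
        total += Distance(a.boundary[i - 1], a.boundary[i]);
      return total;
    }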

When I asked the author about this code duplication, they cited the same principle of "copy for two, refactor for three" that you and I approve of.

But this was a case where it was spectacularly misapplied.




I think your example illustrates why it's so important to choose the right way to generalize/share code depending on the circumstances. I've found that when there's a 90% overlap between 2-3 use cases, many people tend to go with "one common code path for all that's shared and then inject the 10% difference in via components/callbacks/config vars". This works reasonably well when the flow of execution is the same and what changes is just the specifics of some of those steps. But if the differences are also in which steps even happen, then in my experience this approach couples the whole thing too tightly and makes it harder to reason about what actually happens in a given configuration.
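
A minimal sketch of that style, with made-up importer steps:

    #include <string>

    struct RawData { std::string text; };

    // Stub steps; imagine each is a real chunk of importer logic.
    void Parse(const RawData&) {}
    void ValidateAltitudes(const RawData&) {}
    void BuildGeometry(const RawData&) {}
    void MergeSegments(const RawData&) {}
    void Emit(const RawData&) {}

    struct ImportOptions {
      bool validate_altitudes = false;  // only importer A needs this step
      bool merge_segments = false;      // only importer B needs this step
    };

    // One shared path, with the 10% difference injected via flags. Each
    // new divergence adds another flag, and the real flow for any one
    // configuration gets harder to read off the page.
    void ImportAirspace(const RawData& data, const ImportOptions& opts) {
      Parse(data);
      if (opts.validate_altitudes) ValidateAltitudes(data);
      BuildGeometry(data);
      if (opts.merge_segments) MergeSegments(data);
      Emit(data);
    }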

What I like to do instead is break the shared code paths into a palette of smaller subfunctions/subcomponents and then have each use case have its own high level code path that picks and chooses from these subfunctions: One does ABCDE, another does ACDEX. It makes it supremely easy to reason about what each of them actually does, because they read almost like a recipe. It becomes a sequence of high level steps, some of which are used by several use cases, while others are unique. I've found this way of generalizing is almost "cost free" because it doesn't really couple things at a high level, and it's the kind of readability refactor that you'd often want to do anyway even if the code wasn't being shared.
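
And the recipe version, reusing the same stub steps from the sketch above: each importer is its own short, linear top-level function, and the sharing happens one level down.

    // Importer A does A-B-C-D-E; importer B does A-C-D-E-X. Reading
    // either one tells you exactly which steps run, in which order.
    void ImportClassAirspace(const RawData& data) {
      Parse(data);              // A
      ValidateAltitudes(data);  // B (unique to this importer)
      BuildGeometry(data);      // C
      MergeSegments(data);      // D
      Emit(data);               // E
    }

    void WriteAuditLog(const RawData&) {}  // X, another stub step

    void ImportSpecialUseAirspace(const RawData& data) {
      Parse(data);              // A
      BuildGeometry(data);      // C
      MergeSegments(data);      // D
      Emit(data);               // E
      WriteAuditLog(data);      // X (unique to this importer)
    }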


> ...break the shared code paths into a palette of smaller subfunctions/subcomponents and then have each use case have its own high level code path that picks and chooses from these subfunctions: One does ABCDE, another does ACDEX. It makes it supremely easy to reason about what each of them actually does, because they read almost like a recipe. It becomes a sequence of high level steps, some of which are used by several use cases, while others are unique.

Isn't this just the Command pattern? - https://en.wikipedia.org/wiki/Command_pattern


I love this. Refactor the first time. Remix the rest of the times.


Do you know if there’s a name for this pattern? I admire it all the time in Peter Norvig’s code. It leads to very approachable code.


I don't know if there is an official name, but in my head I call it "helpers/components/mixins are better than frameworks." Or, "if one happens to want to write a framework, one ought to try hard to refactor it 'inside-out' to a set of composable components."

The most important (though not only) issue with frameworks is that you typically can't compose/mix more than one together - every framework is "exclusive" and takes control of the code flow. Whereas "components" can usually be easily mixed with each other, and leave control of the code flow to the programmer.


I generally think of this as the same principle of "prefer composition over inheritance". Leave the top-level free to compose the behaviour it requires rather than inheriting the framework's behaviour, for exactly the reasons you describe.


This is frameworks vs libraries. In the first case the framework is calling the code with config and hooks to change behaviour. In the second case there are common library functions called from completely separate “application” code.
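
A toy illustration of the difference (names made up):

    #include <functional>
    #include <string>
    #include <vector>

    // Framework style: the framework owns the loop and calls your hook.
    void RunImporterFramework(
        const std::vector<std::string>& records,
        const std::function<void(const std::string&)>& on_record) {
      for (const auto& r : records) on_record(r);  // control flow lives here
    }

    // Library style: plain functions; the application owns the loop.
    std::string Normalize(const std::string& r) { return r; }  // stub
    void Store(const std::string&) {}                          // stub

    void ApplicationMain(const std::vector<std::string>& records) {
      for (const auto& r : records)  // control flow lives in *your* code
        Store(Normalize(r));
    }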


I don't know an official name for it. It seems like it's almost too basic - "subdivide into helper functions" - to make it into the Gang of Four or other design pattern collections. But in my head I'm calling it the "Recipe Pattern"


It sounds like a version of the strategy pattern to me.

https://en.wikipedia.org/wiki/Strategy_pattern


> and it's the kind of readability refactor that you'd often want to do anyway even if the code wasn't being shared.

Couldn't disagree more tbh. Some of the worst code I've ever had to work with has been over-abstracted "recipe" code where I'm trying to discern complex processes based on two-word descriptions of them in function names.

Doing this too much is a great way to turn a readable 100 line algorithm into a 250 line clusterfuck spread across 16 files.


> Doing this too much

ok, so you're talking about overdoing it. It's still a good approach when done right.


Not really, unless "done right" is for like a 2000 line function or something.

If code is running once in order, there's no reason to break it up into functions and take it out of execution order. That's just stupid.


Martin Fowler, in his book "Refactoring", outlines circumstances where you can leave bad code alone.

Basically, if it works and you never have to touch it, leave it alone.


I think you've completely missed the point.


Oh god that reminds me. Our company did this but for a whole project.

It was back when a bunch of social networks released app platforms after Facebook's success. When hi5 released their platform, rather than refactoring our codebase to work on multiple social networks... someone ended up just copying the whole fucking thing and did a global rename of Facebook to Hi5.

For the 3rd social network I refactored our Facebook codebase to work with as many as we wanted. But we never reined in Hi5, because it had diverged dramatically since the copy. So we basically had two completely separate codebases: one that handled hi5, and one that had been refactored to be able to handle everything else (facebook, bebo, myspace, etc)


No bets on which one is buggier. Or which one's bugs (and also their fixes) break more networks.


Hi5 was less buggy because new features were just never ported to it - it was deemed not worth the effort.


I also got this heuristic from Martin Crawford. However I believe it only applies to snippets (<100 lines of code at the very most), for the reason you gave. But even then, it sometimes happens that you find a bug in a 4 line snippet that you know was duplicated once, and you have to hope you can find the copy through grep or commit history. So while one should be careful not to over-engineer and should apply KISS/YAGNI ('you ain't gonna need it'), one-time duplication can be a pain.


I cannot edit my comment anymore, but I realized Crawford is the Martin of 'Forest Garden' fame. I was obviously meaning Martin Fowler, from the 'Refactoring' book.

Maybe we'll have 'Forest Software' in some time. 'A code forest is an ecosystem where the population of bugs, slugs, trees and weed balance themselves, requiring very little input from the engineer'.


> There wasn't any good path toward refactoring this code to reduce the duplication. Most of the C++ code referenced the protobufs directly, and even if most of the data in one had the same names as in the other, you couldn't just interchange or combine them.

That makes it sound like the problem is more of a spaghetti mess than duplication.

But I think the advice to copy something when you need two versions is supposed to be applied to specific functions or blocks or types. Not entire files. Then it wouldn't have duplicated the geographical code.

It's also important to have a good answer to how you'll identify duplicated bugs. I'm not sure how to best handle that.


If I had to guess: they probably referenced the protobufs directly because there are always 2 and "You have to tell it which one!".


What if the FAA updates the coordinate format for one and not the other? Then your abstraction is moot.


Of course not, abstraction works even better there! Every point that differs will have either a conditional or an abstract part to be implemented by child classes. So the abstraction shows you at a glance exactly which points to look at.
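
A minimal sketch of the class-based version (hypothetical names): the divergence point is a single virtual method, so a change that hits only one kind of airspace has exactly one obvious home.

    #include <vector>

    struct LatLng { double lat; double lng; };

    class AirspaceImporter {
     public:
      virtual ~AirspaceImporter() = default;
      void Import(const std::vector<LatLng>& raw) {
        auto points = ParseCoordinates(raw);  // the point that may differ
        BuildGeometry(points);                // the shared 90%
      }
     protected:
      // Divergence point: shared default, overridable per airspace kind.
      virtual std::vector<LatLng> ParseCoordinates(
          const std::vector<LatLng>& raw) { return raw; }
     private:
      void BuildGeometry(const std::vector<LatLng>&) { /* shared code */ }
    };

    class SpecialUseImporter : public AirspaceImporter {
     protected:
      std::vector<LatLng> ParseCoordinates(
          const std::vector<LatLng>& raw) override {
        // A one-sided FAA format change lands here, and only here.
        return raw;
      }
    };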



