> The hoops and complexity that some people go through to DRY (don't repeat yourself) are worse sometimes than the cost of maintaining two copies.
The number of bugs is proportional to the lines of code; this is undeniable from empirical data. Ergo, fewer lines of code will tend to yield fewer bugs. So if your code is literally the same, there's no reason not to extract it into a function.
That said, the moment you have to start adding parameters in order to successfully factor out common code, i.e. to select the correct code path depending on caller context, that's when you should seriously question whether the code should actually be shared between these two callers. More than likely in this case, only the common code path between the two callers should be shared.
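To make that concrete, here's a minimal TypeScript sketch (the order/receipt scenario and every name in it are made up for illustration): a flag parameter is the usual tell that two callers are being forced through one function, and the alternative is to share only the genuinely common path.

```typescript
interface Order {
  id: string;
  items: { sku: string; price: number; qty: number }[];
}

// Smell: a flag parameter selecting the caller-specific branch inside the
// "shared" function, which now has to know about both callers' contexts.
function buildReceipt(order: Order, forEmail: boolean): string {
  const total = order.items.reduce((sum, i) => sum + i.price * i.qty, 0);
  const lines = order.items.map(i => `${i.qty} x ${i.sku}`);
  return forEmail
    ? `<p>Order ${order.id}</p><p>${lines.join("<br>")}</p><p>Total: ${total}</p>`
    : `Order ${order.id}\n${lines.join("\n")}\nTotal: ${total}`;
}

// Alternative: share only the genuinely common path (the calculation) and
// leave the caller-specific formatting with each caller.
function summarizeOrder(order: Order): { total: number; lines: string[] } {
  return {
    total: order.items.reduce((sum, i) => sum + i.price * i.qty, 0),
    lines: order.items.map(i => `${i.qty} x ${i.sku}`),
  };
}

function emailReceipt(order: Order): string {
  const { total, lines } = summarizeOrder(order);
  return `<p>Order ${order.id}</p><p>${lines.join("<br>")}</p><p>Total: ${total}</p>`;
}

function plainTextReceipt(order: Order): string {
  const { total, lines } = summarizeOrder(order);
  return `Order ${order.id}\n${lines.join("\n")}\nTotal: ${total}`;
}
```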
> So if your code is literally the same, there's no reason not to extract it into a function.
If the two pieces of code are likely to change in different ways, for different reasons, that is a strong reason not to extract it into a function even if they happen to be character-for-character identical for the moment.
Less code means fewer bugs, but that doesn't mean I should be working on the gzipped representation.
> If the two pieces of code are likely to change in different ways, for different reasons, that is a strong reason not to extract it into a function even if they happen to be character-for-character identical for the moment.
The future is much more malleable than your immediate needs.
But even if your predicted future turns out to be true, the code you need to change has already been extracted into a function, so you can easily duplicate that function, make your localized changes, and point the affected callers at the new function. So this is still the best route.
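For instance, a hypothetical sketch of that route (all names invented for illustration):

```typescript
// Step 0: the logic was extracted while the two callers were identical.
function normalizeUsername(name: string): string {
  return name.trim().toLowerCase();
}

// Step 1: when one caller's requirements diverge (say, legacy imports must
// preserve case), duplicate the function and make the localized change...
function normalizeLegacyUsername(name: string): string {
  return name.trim(); // diverged: no longer lowercased
}

// Step 2: ...and point only the affected caller at the new function.
function registerUser(name: string): string {
  return normalizeUsername(name);
}

function importLegacyUser(name: string): string {
  return normalizeLegacyUsername(name); // previously called normalizeUsername
}
```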
> Less code means fewer bugs, but that doesn't mean I should be working on the gzipped representation.
Don't be absurd. gzipping doesn't preserve your program in human-readable form. Extracting code into a reusable function makes your program more human readable, not less.
In the parent comment, you said there was no reason. I maintain that it's a strong reason. It may or may not be a sufficient reason, weighed against other considerations. In particular, if it harms readability that's also a strong consideration.
But sometimes it can help readability, too. "DRY" as a principle was originally formulated in terms of repetition of pieces of knowledge rather than code, and I think in those terms it's far more useful. If this code represents "how we frob the widget" and that code represents "how we tweak the sprocket" and there's no reason for those to agree, they should probably be separate functions. Pulling them out into a "tweaking_sprockets_or_frobbing_widgets" function is making things less readable, because it's conflating things that shouldn't be conflated. If there is not some underlying piece of knowledge - some statement about the domain or some coherent abstraction that simplifies reasoning or some necessary feature of the implementation - combining superficially similar things is just "Huffman coding".
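A tiny hypothetical illustration of that conflation (the widget/sprocket names come from the comment above; everything else is made up):

```typescript
type Widget = { enabled: boolean };
type Sprocket = { enabled: boolean };

// Today these two bodies happen to be character-for-character identical...
function frobWidget(w: Widget): void {
  w.enabled = true; // "how we frob the widget"
}

function tweakSprocket(s: Sprocket): void {
  s.enabled = true; // "how we tweak the sprocket"
}

// ...but collapsing them conflates two unrelated pieces of knowledge,
// and the combined name has to admit as much:
function frobWidgetOrTweakSprocket(thing: { enabled: boolean }): void {
  thing.enabled = true;
}
```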
> Extracting code into a reusable function makes your program more human readable
When done properly, yes. When done to the point where a five line function is created with ten inputs (yes, this is real), no. But DRY tells us that the five lines of duplication is unconditionally worse.
Hell, I've even seen things like logging/write tuples (i.e. log the error, write the error to a socket) encapsulated, even though the only non-parameter code ends up being the two function calls.
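Something like this hypothetical sketch, where the "shared" helper carries no knowledge of its own because everything has been turned into a parameter (the names and the ErrorSink interface are invented for illustration):

```typescript
interface ErrorSink {
  write(message: string): void; // stand-in for a socket wrapper
}

// Nearly everything is a parameter; the helper adds a name but no knowledge.
function logAndWrite(
  log: (msg: string) => void,
  sink: ErrorSink,
  message: string
): void {
  log(message);
  sink.write(message);
}

// The call site is barely shorter or clearer than writing the two calls:
function handleFailure(sink: ErrorSink, err: Error): void {
  logAndWrite(msg => console.error(msg), sink, err.message);
  // versus simply:
  // console.error(err.message);
  // sink.write(err.message);
}
```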
Anything, taken to extremes, is bad. The problem with DRY is that it encourages that extremism.
> But DRY tells us that the five lines of duplication is unconditionally worse.
I agree that's often how DRY is understood, and that it can be a problem.
It is not how DRY was originally formulated, which was "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." This differs from blind squashing of syntactic repetition in two important ways. First, as under discussion here, if things happen to be the same but mean different things, combining them is not "DRY-er". Second, there can be repetition of knowledge without repetition of code. For instance, if we are telling our HTML "there is a button here", and our JS "there is a button here", and our CSS "there is a button here", we're repeating the same piece of knowledge three times even though the syntax looks nothing alike.
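A hypothetical illustration of that second point, assuming a browser environment (the button id and variable names are made up):

```typescript
// The same piece of knowledge -- "there is a submit button with this id" --
// stated three times in three different syntaxes:
const htmlBefore = `<button id="submit-btn">Submit</button>`; // HTML says it
const cssBefore = `#submit-btn { color: white; }`;            // CSS says it again
document.querySelector("#submit-btn");                        // JS says it a third time

// None of those lines look alike, yet renaming the button means touching all
// three. One way to restore a single authoritative representation:
const SUBMIT_BUTTON_ID = "submit-btn";
const htmlAfter = `<button id="${SUBMIT_BUTTON_ID}">Submit</button>`;
const cssAfter = `#${SUBMIT_BUTTON_ID} { color: white; }`;
document.querySelector(`#${SUBMIT_BUTTON_ID}`);
```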
I make no claim as to whether the flawed, more common understanding or the original intent is what "DRY really means", but I think the latter is more useful.
This is correct, and the below should not be taken in argument with it.
DRY as a guiding principle sometimes has a secondary beneficial effect that was not discussed. Two pieces of code that happen to be the same but "mean different things" should not automatically be deduplicated by dumb extraction. However, the fact that those two things share code may, when viewed through the lens of "prioritize-DRY-ness", hint that the two share a common underlying goal, which can be abstracted out into functionality that can be used by both.
Put another way: if the code to control a nuclear reactor circuit and the code to turn on a landing light on a plane happen to be exactly the same, they shouldn't be blindly deduplicated into some library function, but the fact that they're the same may indicate a need for a more accessible, easily-usable-without-mistakes way of turning that kind of circuit on and off.
> When done properly, yes. When done to the point where a five line function is created with ten inputs (yes, this is real), no. But DRY tells us that the five lines of duplication is unconditionally worse.
I'm not convinced by your example. There are plenty of mathematical calculations taking numerous parameters that I think should be in a distinct function.
Even for non-mathematical calculations, 5 lines of code that are used repeatedly as some sort of standard pattern in your program should also get factored out. Take your logging example: if you consistently log everything in your program the same way, then sure, refactor that into a shared function. Then if you suddenly find you need to log more or less information, you can update it in one place.
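A minimal sketch of that kind of shared logging convention (the names and log format are made up, not prescribed by anyone in the thread):

```typescript
// One logging convention, used everywhere, defined in one place.
function logEvent(component: string, message: string): void {
  // If we later need a log level or request id, we add it here, once.
  console.log(`${new Date().toISOString()} [${component}] ${message}`);
}

logEvent("billing", "invoice generated");
logEvent("auth", "login failed");
```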
Of course, I understand your meaning that sometimes factoring out doesn't make sense, but if you find repetition more than twice as per DRY, refactoring seems appropriate.
In web frameworks, there is usually a little bit of boilerplate for each view.
You could refactor this completely away, but not without an almost total loss of flexibility and a good amount of readability too. Often the views will look very similar, then start diverging as the project grows.
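A framework-agnostic sketch of that boilerplate (the request/response types and handlers are invented for illustration, not taken from any particular framework):

```typescript
type Req = { params: Record<string, string> };
type Res = { status: (code: number) => Res; json: (body: unknown) => void };

const users = new Map<string, { name: string }>();
const posts = new Map<string, { title: string }>();

function getUser(req: Req, res: Res): void {
  const user = users.get(req.params.id);          // boilerplate: look up
  if (!user) {                                    // boilerplate: 404 on miss
    res.status(404).json({ error: "not found" });
    return;
  }
  res.status(200).json(user);                     // boilerplate: serialize
}

function getPost(req: Req, res: Res): void {
  const post = posts.get(req.params.id);
  if (!post) {
    res.status(404).json({ error: "not found" });
    return;
  }
  res.status(200).json(post);
  // ...until this view grows auth checks, caching, pagination, etc.,
  // and the two handlers stop looking alike.
}
```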
With you on refactoring common patterns out, and yes, some people don't do this enough. But really, the important thing there is that those patterns are truly common to a large degree and should stay in sync - so it's worth it to introduce and maintain a new concept to keep them that way.
You DRY up code by writing indirection. That's the expense of all abstractions. You can't believe that all indirection is worth it at all costs, so I'm not sure what point you're belaboring.
> You can't believe that all indirection is worth it at all costs
I think I've been pretty clear about the costs and when this is worth it, particularly in my first post in this subthread, which I'll quote here:
> That said, the moment you have to start adding parameters in order to successfully factor out common code, i.e. to select the correct code path depending on caller context, that's when you should seriously question whether the code should actually be shared between these two callers. More than likely in this case, only the common code path between the two callers should be shared.
Or if you want a more concise soundbite: refactor if your indirection is actually a clear and coherent abstraction.
> When done properly, yes. When done to the point where a five line function is created with ten inputs (yes, this is real), no. But DRY tells us that the five lines of duplication is unconditionally worse.
I work with people I would describe as... junior at best (lots of boot campers) and I see this all the time. Functions that just return an anonymous function for no reason, half of a JSON blob returned from a function that's called with one string instead of just repeating the blob in the code, etc.
That’s a popular claim. I wonder how many failed projects could have it as their epitaph.
Have you ever worked on a project where the requirements changed so fundamentally from one day to the next that you truly, honestly had no idea where you were going next?
I haven’t. I’m not aware that I’ve ever met anyone else who has, either.
The claim that requirements always, or even usually, change so dramatically within such short timescales that it isn’t worth laying any groundwork a little way ahead simply doesn’t stand up to scrutiny, in my experience. Any project that was so unclear about its direction from one day to the next would have far bigger problems than how the code was designed.
Otherwise, there is always a risk that by being too literal, by ignoring all of your expectations about future development regardless of your confidence in them, you climb the mountain by climbing to one small peak, then down again and up the next slightly higher peak, and so on. This could be incredibly wasteful.
Of course requirements often change on real world projects. Of course I’m not advocating coding against some vaguely defined and mostly hypothetical future requirement five years in advance. But often you will have some sense of which requirements are going to be stable enough over the next day or week or month to base assumptions on them, and insisting on ignoring that information for dogmatic reasons just seems like a drain on your whole development process.
>That’s a popular claim. I wonder how many failed projects could have it as their epitaph.
Way fewer than the over-ambitious projects that died because of things they didn't need, immortalized in lots of classic Comp-Sci literature, from Fred Brooks' books to Dreaming in Code.
There's a reason it's a popular claim. In fact, "popular" only means it's repeated by many -- but this claim (or an analogous one, e.g. the KISS principle, "Do the simplest thing that could possibly work", etc.) is one you can read repeated by the most experienced and revered programmers, from the Bell Labs guys to the most celebrated programmers today.
>Have you ever worked on a project where the requirements changed so fundamentally from one day to the next that you truly, honestly had no idea where you were going next? I haven’t. I’m not aware that I’ve ever met anyone else who has, either.
Welcome to my life :-)
Not being snarky -- rapidly changing requirements are the number one complaint in my kind of work.
Well, I didn’t say there was only one way a software project could fail! My point is simply that I believe anticipating and allowing for future requirements is a matter of costs and benefits. It’s about comparing the cost of making a wrong step and then having to backtrack with the cost of following a circuitous route to the final destination instead of a more direct one. Both are bad if we make the wrong choice, and we can’t see the future to make an informed decision about the right choice, but we can at least look at the expected cost either way and make an intelligent decision in any given case.
> If it's not actually re-used then it's just making me jump around to see what's actually happening rather than reading straight through the code.
Firstly, you only actually create a function either when it is being reused, or when its functionality is a logically separable responsibility and so you factor it out for understandability.
Either way, the function should also have a meaningful name describing its purpose so you don't have to jump around to understand what's actually happening.
Meaning, your future needs are ever changing and often unclear. Your present needs are immediate and usually obvious. Meet your present needs first and foremost without sacrificing flexibility to meet future needs. Factoring code into functions accomplishes this.
> I've come across plenty of small functions that I couldn't understand without checking the calling functions for context.
Sure, happens to me too when I don't assign meaningful names, or the functions don't actually encompass a small, meaningful set of responsibilities, or the functions use deep side-effects that require reasoning within larger contexts.
The problem with such programs isn't factoring into functions though. If anything, this step reveals latent structural problems.
>The number of bugs is proportional to the lines of code; this is undeniable from empirical data.
Do you have a link to this undeniable data? I haven't seen many empirical code quality studies that are not littered with possible confounders. It's very difficult to do these studies.
I do believe that bug density is roughly proportional to the size of the codebase if you average over large corpora of code. But the bug density of different types of code within those corpora varies a great deal in my experience.
So I think the important question is what sort of code is duplicated and why it is duplicated. Removing duplication means creating dependencies. If we create the right dependencies, i.e. the ones that enforce important invariants, that's a good thing. But that is a big if.
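A minimal sketch of a dependency that does enforce an invariant (all names made up):

```typescript
// Invariant: account IDs are always compared in canonical form.
function canonicalAccountId(raw: string): string {
  return raw.trim().toUpperCase();
}

// Every caller now depends on this one function, and that's the point:
// the invariant can't silently drift apart between duplicated copies.
function sameAccount(a: string, b: string): boolean {
  return canonicalAccountId(a) === canonicalAccountId(b);
}

function lookupAccount(accounts: Map<string, unknown>, raw: string): unknown {
  return accounts.get(canonicalAccountId(raw));
}
```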
For a summary, sure [1]. There have been loads of studies on various metrics, but none have actually been any better than simple lines of code, despite the fact that it has such a large variance as a metric.
> Removing duplication means creating dependencies. If we create the right dependencies, i.e. the ones that enforce important invariants, that's a good thing. But that is a big if.
While enforcing invariants would certainly be good, I'm not convinced that's the only reason DRY reduces bugs. Common functions get manual reviews every time the code that calls them gets reviewed and/or refactored, whether due to new features or bugfixes.
DRY increases exposure of more commonly used paths through your program.
>There have been loads of studies on various metrics, but none have actually been any better than simple lines of code
True, but that doesn't mean SLOC is a very useful metric. Say you were to rewrite a large Java codebase in a language that eliminates all getters and setters. You will have greatly reduced the number of SLOC, but it is very unlikely that you will have reduced the number of bugs very much.
In other words, going for the low hanging fruit of programming language design wouldn't necessarily help much. Bugs are not evenly spread out over the entire codebase.
> You will have greatly reduced the number of SLOC, but it is very unlikely that you will have reduced the number of bugs very much.
I agree strongly with your point about potential confounders (and was about to make it myself) but now you are making your own assertion that I'm uncertain about. Why are you confident that switching to a language that greatly reduces the number of lines of code would not reduce the number of bugs?
While I don't know of any hard evidence, it passes my internal plausibility test that if some of the "2 screen" functions become "1 screen" functions, bugs might be less likely. There might be some counter-force that would confound this, but I wouldn't eliminate it out of hand. So what makes you say "very unlikely" rather than "not necessarily true"?
I think the idea was "A language that is otherwise identical to Java but eliminates the need to write trivial getters and setters." This obviously reduces line count, but probably does not equivalently reduce bug count, as the removed lines are very unlikely to contain bugs.
> This obviously reduces line count, but probably does not equivalently reduce bug count, as the removed lines are very unlikely to contain bugs.
While I agree that the bugs are not likely to be in the trivial code, I don't think it's a given that the presence of the trivial code has no impact on the number of bugs elsewhere. Consider a "fatigue"-based model, where the human brain is distracted by the monotony of the bug-free getters and setters and thus unable to pay sufficient attention to the logic bugs elsewhere in the program. And again, I'm not making the claim that eliminating boilerplate reduces bugs, only objecting to the assumption that it does not.
My intent was to clarify (what I perceived to be) the parent's argument, more than make one of my own.
I think if our process is "1) write the software in Java, 2) remove those lines", it's clear that we've probably changed the average bug density of the project. I agree that there is much reason for concern in generalizing that result to what would have happened if we'd written in that other language to begin with.
I simply haven't found many bugs in (mostly auto-generated) getters/setters during the past 25 years, but it's purely anecdata of course.
>While I don't know of any hard evidence, it passes my internal plausibility test that if some of the "2 screen" functions become "1 screen" functions, bugs might be less likely.
Yes, mine too, but only for randomly chosen pieces of code. What I don't believe is that the linear correlation between SLOC and bugs that studies have found in large codebases allows us to pick and choose the lines of code that are easy to eliminate and expect the number of bugs to drop proportionately.
> The number of bugs is proportional to the lines of code; this is undeniable from empirical data. Ergo, fewer lines of code will tend to yield fewer bugs.
I do not think that this analysis can be applied to decisions about whether to duplicate code or not.
First, if your code is the same, you're not going to have two different bugs in the two copies of it.
Second, trying to change the number of lines of code in your project without making deeper changes is the sort of thing that very directly confounds the analysis. It's like going from "Smaller companies have happier employees" to "So we should fire half our employees because empirical data shows the rest will instantly become happier."
> First, if your code is the same, you're not going to have two different bugs in the two copies of it.
No, but you might see two different buggy behaviours due to contextual differences in how the code is used.
> Second, trying to change the number of lines of code in your project without making deeper changes is the sort of thing that very directly confounds the analysis.
Except refactoring into smaller reusable functions is precisely a deep structural change.
The comment I was replying to was about deduplicating code that is "literally the same". In that case, you wouldn't make the code more or less buggy if you collapsed the multiple copies of the code down into one, and no deep structural changes would be involved in refactoring.
If there are actual changes involved in refactoring, I have no a priori expectation of whether that reduces or increases bug count, and I can see good arguments that it's likely to increase them, since you're making the code more complex in order to satisfy the demands of multiple consumers and therefore exposing each consumer's unique complexity to the other as bug surface. (Case in point: Heartbleed resulted entirely from a little-used extension to DTLS, a variant of TLS over UDP, which 99+% of OpenSSL users never cared about.)
> The comment I was replying to was about deduplicating code that is "literally the same". In that case, you wouldn't make the code more or less buggy if you collapsed the multiple copies of the code down into one, and no deep structural changes would be involved in refactoring.
Firstly, any refactoring involves a code review of what you're factoring out. This has a non-zero probability of revealing bugs, so I already disagree with your claim that it wouldn't change the bug count.
Secondly, if you're having difficulty refactoring, that's a strong hint at deeper structural problems, so it yields information on what kinds of structural changes are needed.
> If there are actual changes involved in refactoring, I have no a priori expectation of whether that reduces or increases bug count, and I can see good arguments that it's likely to increase them, since you're making the code more complex in order to satisfy the demands of multiple consumers.
Are you making it more complex? Because that doesn't seem like a sound refactoring in my mind. Special cases require special code, you don't place special cases in a general function, unless the function itself only handles the special case. I already covered this in my original post where I discuss when DRY isn't appropriate.
> (Case in point: Heartbleed resulted entirely from a little-used extension to DTLS, a variant of TLS over UDP, which 99+% of OpenSSL users never cared about.)
I don't see how this is a point in your favour. It's a point that little used and little inspected paths are more likely to be vulnerable. But a reused function gets more use and more review than inline code. In other words, Heartbleed would probably still not have been found if it weren't part of a common function, and instead were littered in various places throughout a code base.
>The number of bugs is proportional to the lines of code; this is undeniable from empirical data. Ergo, fewer lines of code will tend to yield fewer bugs. So if your code is literally the same, there's no reason not to extract it into a function.
The first part is obvious (more code of course brings the potential for more bugs).
The deduction (hence more code = more bugs) is only valid if the studies controlled for the similarity of the code, cyclomatic complexity and other such factors.
Just because more code has an impact on the number of bugs doesn't mean more code by itself means more bugs. Correlation != causation.
Program A having more code than program B could mean e.g.:
(1) A is inherently more complex (and really needs more lines, the same way an IDE needs more lines than a simple text editor), and thus will naturally be more prone to bugs.
(2) A and B have similar functionality as programs, but A is written by people who meander and write bloated code and needless abstractions (turning a 100-line program into a 1,000-line hell of "design patterns" and factorySingletonProxy "flexibility"). Which will again bring in more bugs.
But that's not necessarily the case if A is bigger than B due to simple repetition that doesn't introduce complexity. Which is exactly what we're discussing in this thread.
For a trivial example, 10,000 lines of "print 'hello world'" repeated won't have more bugs than a 1,000-line complex C program.
> Just because more code has an impact on the number of bugs doesn't mean more code by itself means more bugs. Correlation != causation.
So the only possible causations for the correlation we're discussing are:
1. more program code causes more bugs
2. more bugs causes more program code
3. some unknown third factor(s) simultaneously causes both more bugs and more program code
I think 1 and 3 are most often the case, where 3 could be something like developer inexperience, although some studies have shown that even experienced developers still introduce bugs at comparable rates to novices (just lower constant factors). I think 2 sometimes happens to address immediate needs, ie. hotfix for specific bug X may introduce more bugs, but I doubt it's the rule.
Regardless, my original claim still seems pretty undeniable, ie. more program code tends to yield more bugs.
> For a trivial example, 10,000 lines of "print 'hello world'" repeated won't have more bugs than a 1,000-line complex C program.
But 10,000 lines of "print 'hello wrld'" would have more bugs than a 1,000 line complex C program. Probably on the order of 9,000 more bugs in fact.
The numbers we're talking about here are averages across all programs of comparable length, not to be applied literally to any specific program, because it turns out that those specific program qualities don't really matter, ie. LOC is still a more accurate predictor of bug count than cyclomatic complexity and other metrics.
Thus I can say that a 1,000 line program probably has about X bugs, and I probably won't be off by an order of magnitude unless the program was verified by a theorem prover or something along those lines. Something like verification is really the only confounder that I've come across.
> The number of bugs is proportional to the lines of code; this is undeniable from empirical data. Ergo, fewer lines of code will tend to yield fewer bugs. So if your code is literally the same, there's no reason not to extract it into a function.
Code sharing often increases coupling, and the coupling can be at odds with naive attempts at achieving DRY. The real world is not as simple as this makes it out to be.
You are right that more code means more bugs, but more code might also be the difference between a viable product and one that hasn't been built yet because everybody is locked into DRY bureaucratic hell. As in everything, there's a balancing act that needs to be performed, and that requires judgement and experience.
> The number of bugs is proportional to the lines of code; this is undeniable from empirical data.
Including unit tests? A code base with unit tests is likely to contain fewer bugs and more lines of code. And what about code golf? This just does not seem right, and I believe the original saying is related to languages with/without rich stdlibs and/or the NIH syndrome.
>The number of bugs is proportional to the lines of code; this is undeniable from empirical data. Ergo, fewer lines of code will tend to yield fewer bugs. So if your code is literally the same, there's no reason not to extract it into a function.
Except you've added a dependency that connects two pieces of code that were independent before.
It depends. If the duplicated code is supposed to do the same thing in both places, then the two pieces of code are already dependent. You change one of the duplicates, you need to change the other one too.
The number of bugs is proportional to the lines of code, but shared code can massively multiply the cost of making a change, because of the risks involved.
Sometimes duplication is worth it simply because it reduces the impact of changes.
But if I fix all X bugs in code block A, another code block containing B plus a duplicate of A still contains those X bugs.
That's one of the big problems with code duplication. "You fixed a bug? Great. Did you fix it everywhere?" It's better if there's only one place to fix it, because when you've fixed it, you've fixed all of it.
But as others have said, eliminating this is not the only good thing in programming. There are limits to how far you should go to prevent or eliminate duplication.
> It's better if there's only one place to fix it, because when you've fixed it, you've fixed all of it.
You've got one aspect, and the other aspect is that code factored into a reused function F is now reviewed more too, ie. whenever you're reviewing the callers, you work through the functions they call as well to trace the behaviour. So you're also much more likely to find any bugs in F.
It's a double-whammy for bug squashing, which is why DRY is such an important principle. Certainly there are some misapplications, but its benefits hold up pretty well in a wide variety of scenarios.
If the bug being fixed was expected behavior by all the other dependencies, you just screwed over everyone else who should have duplicated what they needed in the first place.
Nothing wrong with importing a left-pad module if you're using a sane language with a sane development toolbox. I'll leave it to you to decide whether JS fits that bill.