At the risk of sounding arrogant, isn't this a bit obvious?
To me, code smells are more about spotting when someone's been forced to do something a bit weird and convoluted. It's possibly the cleanest solution available given the immediate context, but this is actually a symptom (or "smell") that something is wrong with the wider design of the system.
This article seems more about detecting choices around data structures which are bad in any context and have no real link to the overall system design.
Decreasing code duplication is good of course. But you need to be careful that the two pieces of code are inherently equal and not just by accident. Because once you consolidate the code, for example like you suggest, and somebody comes along and changes `GOOD_NAME_FOR_MAGIC_CODE` to something else, without being aware that it is used in multiple places, he might inadvertently make a change he did not mean to do. Tests help, a good IDE helps (to find location where a symbol is used). But still, it's a fine line to walk.
The mess of duplicated code is much harder to untangle than de-consolidating code. Especially because subtle differences WILL leak into duplicated code and you won't know if it's on purpose or not.
If a dev can't be bothered to figure out where different pieces of code use a variable and whether the change is appropriate for them, do you think they will have the foresight to search for the duplicated code when they need to change it everywhere? Any decent dev should do both, and the former is a MUCH easier process than the latter.
Edit: despite the above, I don't completely disagree with your statement.
If they would need to search for the duplicated code and change it in more than one place, there's something shared to factor out. If the two pieces of code are actually only accidentally the same, changing them independently should be the right thing.
That is definitely good advice. I've seen some developers get overly obsessed with deduplicating code, and sometimes it really does just overcomplicate things. For example, it is easy to end up making things more brittle by moving a somewhat generic method with very specific pre-conditions and post-conditions into a generic utility library. It is even worse if those pre and post conditions were documented in the original location, but not necessarily copied over to the new shared module.
This whole potential problem is one of the reasons the particular example isn't that great. It's overhyped in Java-land where some people treat naked strings like lepers but in places like Clojure-land you'd get a lot of people saying "Just use a keyword" for both places. The keyword's name is its value. It's immutable. An IDE should be able to find all uses of it, or there's always 'ag'. If you really want that coupling behavior, change the interface of the thing you're calling to require conformance to a schema. The simplest way in Java is probably just by taking an enum instead of a string, but there are richer ways (especially outside of Java) worth looking into if you're needing a string value eventually in the implementation since you'll run into the other Java-ism of making a complicated enum class instead of just using enums like keywords with their string value being their actual name value.
It seems like "inexperienced" should fall under the umbrella of "bigger problems". It also (hopefully) falls under the umbrella of "temporary problems".
It's worse than that. We have met the Enemy and the Enemy is us.
Java, as introduced, claimed that nearly everything was an Object. What we got instead was nearly everything is a String.
The Real WTF in this code is that all of the important information is passed around as Strings. The iterator and its source hint at this but she fixes the wrong problem. Ever has this been the way with Java. Despite having a statically typed language nobody turns the data into information and instead your modules just vomit strings at each other. Sometimes in groups of four or more.
One reason for this is the tight coupling in developers' minds between what I'd call "types" and "representations" (these terms are very overloaded, so others may use them in different ways).
Just because, say, a function name and a string are represented in memory the same way, that doesn't mean they are the same type; in particular there are many strings which aren't function names, and there are many operations (e.g. append) which make sense for strings but not for function names.
I can think of two things to blame for this:
- Languages which make it complicated, verbose and/or inefficient to introduce new, incompatible names for existing representations (i.e. nominal typing). For example, having to declare a new class in a new file with a private field, a constructor method and a getter method; that's a lot of boilerplate, and those method calls will incur a runtime cost. Haskell's `newtype` is pretty good in comparison, e.g. saying `newtype FunctionName = FN String` lets me use `FunctionName` in type signatures, I'll get a type error if I try to use a `String` as a `FunctionName` or vice versa, and I can construct and destruct a `FunctionName` using `FN` (e.g. `let x = FN "foo"`)
- Developers only coding for the happy path. When writing a test suite, it's important to test that bad outcomes are prevented, as well as just testing known-safe inputs. The same should apply to types: types should be used such that meaningless expressions are ill-typed, as well as just allowing known-good expressions to be well-typed.
Unfortunately these sorts of considerations tend to get derailed by similar-but-unrelated issues, e.g. 'YAGN a `FunctionName` interface because there's only one implementation'.
The end result of all this is Web applications concatenating together a bunch of "strings" which are actually HTML, plaintext, SQL, user input and URL parameters :(
You can go further with the type/representation dichotomy.
An obvious example is printing: All of a sudden, you need to convert an object of that type to a representation that can be read. Read by what? Humans? Other code? Both to some extent? The representation is dependent upon both the original type and the intended recipient.
(In Lisp, "readable" means "acceptable to the read function, which parses Lisp expressions"; some objects, such as compiled functions, inherently cannot become "readable" in this sense, so they get printed in an unreadable form. Should that be a type error?)
You can even have layers of representation: Length is a type of value, whether it's expressed in inches or centimeters or light-seconds is a representation, and whether it's in ints or floats or strings is another layer to the representation. You can add inches to centimeters with the right conversion, much like you can add numerical values represented ints and strings with the right conversions. The conversions just have to be at the right layer of representation.
Your ideas sound a lot like the original Hungarian notation, BTW: If your language-level types are representations (as in, your type system says int and float, as opposed to semantic notions like pixels-from-edge or alpha-percentage) you can encode the real type information in variable names. People mutilated this to encoding language-level type information in variable names, which is utterly pointless and potentially harmful.
Regarding things like serialising/deserialising, I don't think there's much issue 'philosophically': in the case of Lisp's `read`/`write`, we can spot the ambiguity and deal with it by writing separate functions, for example:
- `serialise`/`deserialise` to produce/consume machine-readable data; must be mutually inverse (which rules out your example of functions).
- `pretty-print` for human consumption; has no inverse.
As a MVP, these could just be wrappers around `read` and `write` (macros would prevent any runtime overhead); the names convey the intended meaning. Later on we could start enforcing some checks, but since Lisp is dynamically typed, we'd have to do runtime tag-checking (e.g. "if type of 'X' is 'function', throw an error"). A static type system/checker would be better.
More generally, each application should be responsible for parsing its input into a domain-specific model, with machine-checked types. A pair of communicating applications may choose to delegate that responsibility to some common library, but they should not assume that their input can be trusted since it's 'coming from that other system'; e.g. they shouldn't pass around a `String` of input as if it were a domain object, and mangle it by pull out particular characters, concatenating things, etc. That `String` should be parsed into its components ASAP, and those should be passed around.
You're right that "apps hungarian" is another possible approach: slightly better than just documentation, but still not manchine-checked. I think the history of apps hungarian degenerating into systems hungarian is another example of this type/representation conflation.
It's not just Java! I see this all the time in C and C++ as well. Developers seem to be afraid to use types as they were intended. For example, look up any OpenGL question on StackOverflow and you'll see things like what should be an array of 3D vertices with separate x, y, and z components passed as a 1D array of floats.
In C++ we see quite a bit of abuse of std::pair<> instead of just making a fucking struct to hold 2 properly named, easy to read pieces of useful data. Nope! Apparently calling all your data members "first" and "second" is better to some people.
I don't know if it's an educational problem, a language issue, or something else, but I see it so frequently that there must be some common underlying cause (or set of causes).
It's an educational problem, Chandler Caruth of LLVM gave a really good talk on POD type slicing and how useful it can be. More to the point that for PODs it's supported in the standard. I do agree on the abuse of pair... pair is there for implementers primarily and shouldn't be used outside of deliberately templated code. Ditto tuple.
That said... with C++17 destructuring... it's a lot less painful since you can do the equivalent to a std::tie in a single line.
One problem in C++ is if you define a struct then you have to define a bunch of operators yourself (operator<, etc.) which gets annoying after you've realized you've needed them a bunch of times.
If you're just using objects to represent groupings of strings and encapsulate behavior around converting them back to strings for output at some point, why not just operate on the strings instead of all the serialization / deserialization? Maybe enough of the quick stuff I do isn't enough to warrant more complex data structures, but I get pretty far into a problem with just dicts (in Python) and generally get annoyed when something returns back objects that have no easy str representation without finagling the object into giving me what I need. I think that this is generally a bigger issue with libraries and treating object metadata (stuff like time stamps, extra attributes, etc) as of equal importance as the data it represents; if an object has one obvious attribute for what the thing is, make that the default thing you get when you just refer to the thing, instead of making everyone call some weird method to turn the instance into the representation that you likely want to get out of it.
This is one of the major reasons why I love Lisp so much. Serialization problems disappear in homoiconic languages. It eliminates an entire class of issues.
The first time I ran into this was on a project that was passing contact information around like String, String, String, int. Back and forth across the system in four or five workflows with extra data passed in here and there.
Of course they left off Address Line 2 and rather than try to fix this in thirty places as I was told I made an Address object instead.
With very rare exception, a street Address (Address 1) is meaningless without at least the zip code. And god help us if you had two addresses.
Later it took someone a couple hours to fix it for 9 digit zip codes. No further changes unless a new State joins the union (this was for a state court system and they required a domestic mailing address. One of the few times I didn't eventually need to process international addresses)
"We throw strings across system boundaries at will."
That's neither good nor bad, it's just inevitable. The point of not using "strings" is that strings generally do not have the semantics of whatever it is you are really dealing with, because strings are just raw sequences of bytes and are generally too permissive for the specific type. For instance, if you have a URI, that comes with a list of restrictions for validity such as "can't start with a colon". Programming-language strings aren't URLs because they permit starting with invalid scheme declarations, or being empty.
However, when crossing system boundaries, you are exposed in some sense to the raw truth of the hardware, which is "raw sequences of bytes" is the only thing that really exists. When someone sends you something claimed to be a "URI", you really can't blindly count on that, if you want robust code you are going to have to validate that. Even if you work very hard and use higher-level abstractions to try to map things to higher-level types, you still have the problem that the semantics have to be an exact match for this to work; in one program the URI type only permits http and https, in another it permits arbitrary (valid) schemes, and there you go, there's a difference sufficient that you must validate.
To match semantics between two programs sufficiently that you can claim with a straight face that you really, truly aren't throwing "strings" across system boundaries requires incredibly tight coupling between all participating programs.
S-expressions are irrelevant to my point. S-expressions are just a serialization format; the use of S-expressions does nothing to synchronize semantics between two programs. Labeling a string as a "uri" with an s-expression does nothing to solve the problem I mentioned about mismatched semantics between two programs. In fact, it does nothing to solve (uri "::::::////SUPERINVALID"). Behold the fundamental problem with semantics and why the semantic web has never worked out as its advocates hoped; labeling is the easy part of the problem.
The getLoadNames method returns a List of Strings, which we iterate over in order to see if a particular String value is in there.
Even in describing the algorithm we aren't informed it's a list of function names.
Edit: from the comments in the code it's pulling field names from a Mongo result set, so it's member variables more so than function names. One half of that relationship is still deterministic at runtime. Mongo records can change whenever but your ORM is dealing with Classes and their contents are known at runtime.
I'll bet you a six pack that hasName is called in a tight loop somewhere. The cost of generating the collection on every call is far greater than the cost of iterating it was.
Can we at least agree that the chosen example has too many other confounding factors in it?
Even C has types, and the UNIX API's make heavy use of them.
Awk and sed and grep and friends certainly encouraged strings, but I heard there was an even earlier thing called REPL. From what I've read it is not a coincidence that PERL is an anagram of it.
I guess - but I feel like you'd call it an "error" rather than a "smell". Better to just teach people data structures rather than which constructs they should be afraid of. Approaching it "sideways" like this just leads to people going "I heard that loops can be bad, so I copied and pasted that line 25 times instead".
Sets have a very specific use case: they only contain unique values and (unless using a specific implementation) they don't have any defined order. Switching from arrays to sets just to get rid of an iteration doesn't sound like fixing a smell to me. Also, insertion, iteration and memory usage are less efficient for a set.
Importantly Sets have different semantics than Lists. Creating a set is an information losing process. A list converted to set cannot be recovered, but a set converted to a list is easily recovered.
I don't think this is handled well in TFA. Because of some dubious iteration at the call site, the author changes the semantics of getLoadNames(), blithely assuming that duplicates should not be allowed and order is not important.
The author mentions that the function is called at two other places. For all we know the original author was aware of Sets but chose List because it more correctly matched the use of the structure.
It's obvious to anyone with a CS (or similar) degree, but may not be to someone who is learning to code from various tutorials. Having hints like this in your IDE can greatly improve the quality of your code and help you learn.
People with CS degree and decades of real world experience still benefit from automated code analyses. We deal with systems with millions of lines of code...
Exactly, this is pretty clearly aimed at beginner/intermediate programmers who haven't entirely figured out how to decide when to use List versus Set versus Map. It's probably obvious to the typical HN reader, but valuable for the target audience. Might not belong here, although the fact that it's currently at #3 on the front page suggests there's enough interest in it.
The "smell" here is that the API does O(n) allocations for a read-only check of a database schema?
Why add an API to have a method that checks for containment directly (and hopefully delete the conatiner returning one entirely?)
Also, it probably makes sense to cache the schema data somewhere.
Finally, my gut tells me the check for the existence of this name probably can be moved earlier in execution (like during initialization of the called class), which will cause the system to fail earlier and be easier to debug (so it would be good to check that before touching it).
Agree. What makes a pattern a code smell is that it is immediately recognizable and is almost never a false positive. Either the immediate implementation is problematic, or something about the larger context of the implementation is problematic.
A functional programming diehard could assert that iteration is a code smell, because recursion is the preferred solution. But there is nothing inherently wrong with using iteration to implement a solution in Java.
I'm not making the assertion :). But what I was getting at is that the larger-scope problem from this perspective would be the use of a non-functional language.
Many of the syntax highlighting that IDEA does is very obvious once you're alerted. The nice thing is it does hundreds if not thousands of analyses and presents them to you as actionable feedback.
The author proposes using data structures that use hashing without even mentioning the necessity of having a hash function and immutable data for the hashed objects, because in their case they were basically just using strings.
Something I see often and is a huge code smell to me is not using the most restrictive form of iteration.
If you see collection.map(...) you know that each iteration is simply a pure function from original element to transformed element, which is an immense help when reading the code.
If you can use only map / filter / takeWhile / join etc to express what you are doing, use those! If not, try and just use reduce / foreach. If not, try and just use for. Only use while if nothing else works!
> If you see collection.map(...) you know that each iteration is simply a pure function from original element to transformed element, which is an immense help when reading the code.
You'd think so, but I've had colleagues who managed to fuck that up and use map or list comprehension solely for side-effects.
And I thought it was abusing the map() call for side effects. However, it is still shorter than writing it out as follows:
def update(): Try[Unit] = {
parser.parse(...) match {
case Success(result) =>
updateState(result)
Success(())
case f@Failure(_) =>
f
}
}
So I didn't have a strong opinion either way since semantically both do the same (and in the case of Scala, the first one is potentially more performant since it relies on the JVM doing virtual dispatch as opposed to calling unapply() and matching, not to mention potentially less garbage being generated).
Yeah I mean my point is "trust but verify", I love restricted iteration construct, but just because they're being used does not mean they're being used "properly" unless the language ensures it.
What is the "most restrictive" form? The answer to this question is highly context dependent (for example, programming language / architectural framework) and expensive to give. Remember that everything has a cost. And especially overzealous formalism.
Personally I find simple C index-based for loops consistent and refreshing. And it's typically not a huge deal. If it is, the procedure might be doing too many things at once. (But yes, I use simple "for x in y" style loops in Python or C++ when they make the lion's share of the loops).
Many different types of loops in a single file, over a single datastructure, always remind me of odd syntax highlighting (for example in vim) in so many different colors that it's only a distraction. I don't care to make so many distinctions. I try to focus on the distinctions that we have to make to get a program done.
Map-style iteration carries more semantic information than a C-style index based loop. An index based loop might be doing anything with the data being iterated over. When you do collection.map, already at first glance you get an idea about what the code is doing.
I like this kind of article not necessarily because it's revolutionary or groundbreaking (because it's not), but because it serves as a helpful reminder sometimes, and can help stick an "observation bias" bug in my brain to notice more often the sorts of examples it calls out.
I agree with the sentiment: people repeat entirely too much code. There are very few cases where a for-loop is the right thing to write. Using generic methods which can be tested and shipped in isolation is basically always better.
But the article seems to imply there are performance concerns in some cases with its talk of using "a data structure". This irked me, because it's not like Set will beat linear time in balanced read-writes and without care the hash tables end up being non-constant as well.
Indexed loops are just a bad problem waiting to happen. They introduce arithmetic bugs, are the least efficient way to traverse most any data structure but a sequential memory segment, maximize the opportunity for error, and are often harder for compile-time optimizers to work through (since it's harder to prove what they do in situ).
Like, for(int i=0; i<10; i++) { // todo } OK well yes there can be bugs using the index and even C++ back when I was doing it had the Boost library which had for_each. In Java we got something similar back in Java 5 or 6. Also we have fancy stream operations now in Java 8 which took me the longest time to come to appreciate. The only down side is removed compatibility with other languages like C and C++ and C#, locking in the app to a specific language more.
There is on rare occasion, however, sometimes a problem that is solved more clearly or handily using indexing, similar to step indexing in BASIC.
One use case where I've seen stepped indexing useful is in the field of robotics, which uses step characteristics for synchros and servos.
> The only down side is removed compatibility with other languages like C and C++ and C#, locking in the app to a specific language more.
Using a Stream instead of a for loop does very little to make your code incompatible with C. It is already pretty much incompatible (modulo JNI).
What I mean is: if you want something that looks like C just use that, there is no gain in not learning new constructs just because older languages did not have them.
I really don't like the term Code Smells. Its just seems like it heralds finicky criticism that ignores wider context and real-world pressures. Outside of academia, you can't release a build without creating a few smells
Hashing is indexing and there is no faster and memory efficient index than an array (not vector) of sorted items. Iteration is the best if you can organize your data for it
Not sure what you're trying to get at. Indexing by surrogate index, yes. Indexing by value? That doesn't help. The best you can do is binary search -- O(log n). Hash-based structures are average amortized constant time -- O(1). This works because the key is a function of the value, rather than a surrogate key which is unrelated to the value.
Straight from Wikipedia:
"In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets."
To say that iteration should be "considered a code smell" is not a good thing to do for the programming community.
Present-day CPUs operate on structures by iterating. On a typical high-level programming language, they are at some point done by either (a) iteration or (b) recursion. If the compiler does not do automatic tail call optimization; then (b) recursion can't be made as optimal (in terms of speed and resources) than (a). The latest Java compiler does not perform tail call optimization and probably never will be able to do it. I mention Java because the article is targeted to the Java crowd.
Thus, iteration IS a very important tool for enhancing performance, and real-world (production) systems might have very important performance requirements.
Iteration, in a code, should just be considered... iteration. No kind of smells.
In the past, there was a very important computer scientist called Edsger Dijkstra wrote a paper called "Goto Statement Considered Harmful", GOTO being considered a "code smell": Dijkstra is a very important guy; an unit of measure, the nanoDijkstra, was named in his honor.
However, even afterwards, new programming languages did include a GOTO statement (or some form), because there are special cases when they are needed for higher performance or (believe it or not!) producing more readable code.
The autoplaying code editing animations are really annoying. By the time I've read the preceding text, it is somewhere in the middle and I don't know what it's doing.
Please, can we just agree that animations and videos (and audio) should only play when the reader initiates it? Also, a progress bar would be handy, although it may not always be necessary (for short animal videos and similar).
These examples seem to discuss data structure choice what I thought it might be about. I have encountered two major bugs at two separate companies all involving iteration and maps.
Both of the examples were written in Go and although language shouldn't matter, iterating over a map is non-deterministic. But each time the author intended for them to construct the returned data structure in a pre-defined order.
We had been suffering from poor caching performance and saw our Cassandra reads spiking as a result. Once we spotted the incorrect cache key construction, our reads to Cassandra dropped significantly allowing for better performance all around.
Perhaps others would have spotted this but it seems innocuous in code review. I know I will be more diligent in the future.
To me, code smells are more about spotting when someone's been forced to do something a bit weird and convoluted. It's possibly the cleanest solution available given the immediate context, but this is actually a symptom (or "smell") that something is wrong with the wider design of the system.
This article seems more about detecting choices around data structures which are bad in any context and have no real link to the overall system design.