The most copied StackOverflow snippet of all time is flawed (2019) (programming.guide)
706 points by vinnyglennon on June 16, 2021 | 334 comments



I'm the author of #6 on the same list. It's definitely interesting to see it has been used thousands of times on GitHub, and who knows how many more in proprietary code. I don't think it's buggy, but I now think it could definitely be improved.

I think this shows an example of a big problem with StackOverflow compared to its initial vision. I remember listening to Jeff and Joel's podcast, and hearing the vision of applying the Wikipedia model to tech Q&A. The idea was that answers would continue to improve over time.

For the most part, they don't. I'm not quite sure if it's an issue of incentives or culture. Probably some of both. I think that having a person's name attached to their answer, along with a visible score, really gives a sense of ownership. As a result, other people don't feel empowered to come along and tweak the answer to improve it.

Then, once an answer is listed at the top, it is given more opportunity for upvotes, so other, improved answers don't seem to bubble up. This is a larger issue with most websites that sort by ratings, Hacker News included: generally they sort items based on the total number of votes. Instead, to measure the quality of an item, we should look at the number of votes divided by the number of views. It may be tough to measure the number of views of an item, but we should be able to get a rough estimate based on the position on a page, for example.

If the top comment on a HN discussion is getting 100 views in a minute and 10 upvotes, but the 10th comment down gets 20 views and 5 upvotes, the 10th comment is likely a better quality comment. It should be sorted above the top ranked comment! There would still need to be some smoothing and promotion of new comments to get them enough views to measure their quality as well.

Such a policy on StackOverflow would also help newer, but better answers sort to the top.
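
A rough sketch of that scoring idea, with made-up smoothing constants (the pseudo-counts keep an item with 2 views and 1 vote from outranking everything):

    // Hypothetical: rank by votes per view, with a prior so low-exposure
    // items aren't ranked on noise alone. The constants are tuning knobs.
    static double qualityScore(long votes, long views) {
        final double PRIOR_VOTES = 5.0;
        final double PRIOR_VIEWS = 100.0;
        return (votes + PRIOR_VOTES) / (views + PRIOR_VIEWS);
    }
    // qualityScore(10, 100) ~= 0.075  (the top comment in the example above)
    // qualityScore(5, 20)   ~= 0.083  (the 10th comment ranks higher)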


An idea I've had for a long time is that "the community" can vote to override an accepted answer. There are many times when the accepted answer is incorrect, or a newer answer is now more correct, but the only person who can change an accepted answer is the OP.

I think community-based changes to the accepted answer would go a long way to solving your problem too, but it requires someone to be reviewing newer answers and identifying when there's another that would be more appropriate.

It'd incentivise writing newer answers to older questions. Correcting accepted answers that probably weren't ideal to begin with. A new "role" where users hunt through older questions and answers looking for improvements to make.

Stack Overflow answers are supposed to be community-based, but we unfairly prioritise the will of the original questioner *forever*. I don't think that's optimal.


As a side gig I teach an intro to web development class online. Every semester I get students asking for help about why their code isn’t working. Nine times out of ten, they are trying to use some jQuery code they copied from stackoverflow because it is the accepted answer. They don’t yet know enough to recognize that it isn’t vanilla JavaScript (which they are required to use).


The best way to address these students is to ask them "Why do you think that this should work?"


What platform do you use to teach the course? I've been teaching an ML from scratch course for junior devs at my company and think it'd be useful for others.


> but the only person who can change an accepted answer is the OP.

This system makes the person arguably _least qualified_ to understand the situation the single arbitrator as to which answer is accepted.

Was it the most efficient? First to answer? Copied-and-pasted right in with no integration work? Written by someone with an Indian username? Got the most upvotes? Made a Simpsons reference? Written by someone with an Anime avatar?


>This system makes the person arguably _least qualified_ to understand the situation the single arbitrator as to which answer is accepted.

Devil's advocate: If it fixed their problem adequately, it's acceptable.

Maybe separate "acceptable" and "ideal" answers would be a nice feature?


In the vast majority of cases, OP did not check (or even define) edge cases, race conditions, memory usage, network activity, etc etc etc.


Seems like the answer is to add a community accepted answer that's easier to change over time, while keeping the accepted answer feature as is.


>This system makes the person arguably _least qualified_ to understand the situation the single arbitrator as to which answer is accepted.

What is the argument for the OP being the least qualified?


The argument is probably that while they are the best qualified to know whether it solved their issue, they're not qualified about whether it was the best way to solve their issue, since they had to ask in the first place.


Of all the people involved, he was the one who _didn't_ know how to resolve a specific issue.


Well, they do know the tech stack, the domain, the specific problem. Now they know whether the solution resolved their specific problem, if their code review/testing caught any bugs, etc, etc.

If anything, they have the most information in this context. I really don't think of them as being the least qualified.


Yeah there’s an argument for that. I think it’d hold more weight if narrow, specific, and loosely defined duplicate questions were allowed - but they aren’t.

Questions and answers belong to the community. I think the accepted answer should too - maybe after some period of time.


(or she :))


Why you always on about women, Stan?


Currently the only incentive to post a new answer to an old question is you get a special badge. That's neat but limited. I've gone through old R questions and posted answers with a more modern syntax and my answers rarely get much attention.

I'd be cautious about overriding an accepted answer. Imagine a situation where there's an easy-to-understand algorithm that's O(n^2) and the "Correct" algorithm that's O(n). If OP only has a dozen datapoints, the former might be the best answer for her specific problem, despite it clearly not being the right approach for most people finding the thread via Google in the future.


They actually recently added this feature - you have a "this answer is outdated" button you can press. Not sure what the reputation threshold to see it is.


I've browsed a few tags and haven't been able to see that button and my reputation is 40k+ so I'd expect to have all features enabled.

Are you able to point me to a Meta/Blog post or even just a screenshot please? I'd be keen to see it.

Actually it looks like https://meta.stackoverflow.com/questions/405302/introducing-... is the announcement for wanting to tackle the problem. Not sure if they've implemented it yet though.


"An idea I've had for a long time is that "the community" can vote to override an accepted answer."

I don't know if this is still a thing, but for some time in the past when an answer was edited more than a certain amount of times it automatically turned into what was called a "community wiki" answer.


Or you could just edit the accepted answer if it’s wrong? I’ve seen a few posts where the top contains an “UPDATE” that, in summary, links to another answer.


One of the things that baffles me the most about SO is that I can't sort answers by _newest first_.

If I search for something related to javascript for example, I know there will be a ton of answers for older versions that I am most likely not interested in. However I can only sort by oldest first (related to date).

Old answers are definitely useful a lot of times, but the fact that there's not even the option to sort them the other way around tells me that SO somehow, at its core, considers new answers less important.

A strange decision if you ask me, considering software changes so much over time.

If anyone has a possible explanation for this I'd love to hear it.


There are three buttons that act as sorting directions at the top of the answers section: "Votes," "Oldest," and "Active." The "Active" option sorts by most recently modified, which is _usually_ what you'd want instead of strictly newest. (i.e. an edit would update the timestamp, making that answer have a more recent activity date)

So, I guess the answer to your question of "why can't I" is "good news! you can" :)


Well, none of those options do what I want.

More often than not, sorting by "Active", "Oldest" and "Votes" surfaces the same 2 or 3 answers, and I still need to scroll down to the bottom to find the most recently posted answer that has more up-to-date info.

I don't see why I shouldn't have the choice to sort by "Reverse Oldest" if you will, when it's so useful a lot of the time.


This is why Stack Overflow has just started the "Outdated Answers project" in which users can set answers as outdated: https://meta.stackoverflow.com/questions/405302/introducing-...


I always thought they should have a language version. E.g. Python 3, PHP 7, JavaScript ES6...


Tags work to categorize by language (https://stackoverflow.com/questions/tagged/python-3.x); by having multiple languages on one site, you'll have a broader audience, because there are few developers who work with only a single language.


> If I search for something related to javascript for example

As someone that's been learning a little JS over the last year, I quickly came to the realization that you skip over the SO links that come up in the search, and you go to one of the many other sites. I've had good luck with w3schools and mdn. SO is a lost cause for JS.


I agree.

However sometimes I am looking for some error related to a botched nodejs install for example, or something that has to do with permissions being set incorrectly and other stuff that does not live in MDN and other documentation sources.

For the actual language questions I do go directly to MDN instead.


> we should look at the number of votes, divided by the number of views

Closer, but probably still not quite what you want, or a few stray votes can make a massive impact just from discretization effects. What you really care about is which answer is "best" by some metric, and you're trying to infer that as well as possible from the voting history. Average votes do a poor job. Check out this overview from the interblags [0].

[0] https://www.evanmiller.org/how-not-to-sort-by-average-rating...
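
For the up/down vote case, the lower bound of the Wilson score interval from the linked article looks roughly like this (sketch, z = 1.96 for ~95% confidence; not necessarily what any site actually runs):

    static double wilsonLowerBound(long up, long total) {
        if (total == 0) return 0.0;
        double z = 1.96;
        double p = (double) up / total;
        return (p + z * z / (2.0 * total)
                - z * Math.sqrt((p * (1 - p) + z * z / (4.0 * total)) / total))
                / (1 + z * z / total);
    }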


This isn't just a statistical problem, it's also a classical exploration/exploitation trade-off. You want users to notice and vote on new answers (exploration), but users only want to see the best answers (exploitation). The order you show will influence future votes (and future answers).

In addition, it's a social engineering problem. At least people with a western psychology seem to respond very strongly when a score is attributed to their person (as opposed to a group success like in a wiki). So you better make the score personal and big and visible, and do not occasionally sort by random just to discover the true score.


I think that's a great example of the "smoothing" that I was alluding to, though not in a format accessible to most programmers. However it is still just using a function of upvotes and downvotes. I think true rating can be much better when you also incorporate number of opportunities to vote. Because having the opportunity to vote (by viewing an item, or purchasing it, or whatnot) and choosing not to vote is still a really useful piece of data about the quality of an item. Especially when you are comparing old items that have had millions of opportunities against new items with only thousands.


> number of opportunities

Yep, definitely. The only challenges there are that there's less literature about doing so and that if you have both up and down votes there's no longer one right way to define a single objective for scoring.


> I think this shows an example of a big problem with StackOverflow compared to its initial vision. I remember listening to Jeff and Joel's podcast, and hearing the vision of applying the Wikipedia model to tech Q&A. The idea was that answers would continue to improve over time.

Interesting. As a random visitor, this was something that never came across to me from the way SO presents itself.

> For the most part, they don't. I'm not quite sure if it's an issue of incentives or culture.

I think it's more a problem of communication and UI. SO is not really the kind of site that encourages people to answer or improve things. The overall design is also more technical and strange, not motivating and user-friendly.

Today, for the first time, I realized that there is a history for answers and an "improve" button that seems to allow me to change someone else's answer. I only saw it because I explicitly looked for it because of this thread.

Wikipedia in the beginning was very vocal and motivating about engaging all kinds of people to help and improve articles. SO never had that vibe for me. Additionally, it simply doesn't have an interface that makes it simple to do this stuff. There are only those awful comments under each answer, which are not really useful for discussing an answer at full length and from all sides. It might be better to change them to a full-fledged forum with some collaborative editing and some small wiki functionality or something like that.

I remember they tried to do some kind of wiki with high-quality code snippets; what happened to that?


One of the really frustrating things about SO is that once you reach a certain rep threshold, you lose the ability to suggest edits, and instead gain the ability to just make the edits directly. I'm a lot more likely to do the former, because it helps ensure that if I actually made a mistake, it will be caught by the people voting on it. And so SO has lost out on a bunch of my suggested edits because they took away my ability to suggest edits.


What would really help with the vision here is some way to comment and associate tests against posted code. I have corrected algorithms on Wikipedia that were obviously wrong with even a cursory test. Then people can adjust the snippet, debate the test parameters, or whatever else they need to do while maintaining some sort of sanity check. If it’s good enough for random software projects used by a dozen people, it’s probably good enough for snippets used by thousands of developers and even more users.
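
Something like this, hypothetically (JUnit 5 style; humanReadableBytes stands for whatever implementation the answer proposes):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class HumanReadableBytesTest {
        @Test
        void handlesTheRoundingBoundary() {
            assertEquals("999 B", humanReadableBytes(999));
            assertEquals("1.0 kB", humanReadableBytes(1_000));
            // the boundary case from the article: roll over to the next
            // unit rather than printing "1000.0 kB"
            assertEquals("1.0 MB", humanReadableBytes(999_999));
        }
    }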


This post made me think the same thing. It would be nice to have a StackOverflow that was actually more code focused. People could write tests or code and actually run them.


I always try and improve existing answers with edits. Often just adding important context when the answer is just a line of bash and adding links to source documentation.

There's very little gamification incentive to do so, and often the edit queue is full. Still, there are lots of times where important caveats and information are pointed out in the comments and never added to the answer.


The other day I asked a question about the c/c++ plugin of vscode, and somebody swooped in to edit it to just be c++ because “c/c++ is not a programming language”. The question wasn't answered. I wonder what the incentive is for people to do something like that.


> As a result, other people don't feel enabled to come along and tweak the answer to improve it.

It's worse than that. Edits have to go through a review process that is much more selective and often arbitrarily rejects good edits.


Only if you're a low rep user, though. And no, many more bad edits are accepted, than good edits being rejected. By orders of magnitude.


> Only if you're a low rep user, though.

What qualifies as "low rep"? I'm easily in the top quintile.

> And no, many more bad edits are accepted, than good edits being rejected. By orders of magnitude.

Do you have any data to support this?

The editing and updating process for stackoverflow is broken and as a direct result I've used the site less and less over the years. Denying the problem just hastens the demise of the site.


Editing answers is a complete waste of time. You can post a correction along with a copy and paste of the relevant section from the documentation, yet have your edit disappear without explanation.


To correctly measure the quality of an item one needs to take something like Google's PageRank algorithm and apply it to people. That is, there needs to be some measure of the reputation of the person posting. This doesn't mean that a person who was correct in the past is necessarily correct right now, but it is true that people who are often correct tend to go on being correct, and people who are often wrong tend to go on being wrong. Careful people tend to continue to be careful, and sloppy people tend to continue to be sloppy. It's important to capture that reality and use it as a weight given to any particular answer.


Potentially a stupid question; why is it not possible to just make a MediaWiki site explicitly for SO questions? Does it exist already?


The technical cost/effort for someone like you or me to do that is minimal. The expensive part is the ongoing social maintenance fee aka moderation. As evident by the stack overflow drama re: Monica, it’s an unsolved (non-technical) problem that you could make your own mint to print money on, if you were able to fix any tiny part of it.


And then we would run again into people with an inflated ego, edit wars etc.


The Monica situation is probably a bad example; that was Stack Overflow (the company) royally and unilaterally messing up. It's certainly not a usual situation for resource-curating communities.

I've written, and deleted, several essays on the matter, but a TL;DR: Monica's legitimate questions to staff about a policy got caught up in a crackdown on sealioning-type harassment of trans (etc.) mods in the mod chat, and SO management basically declared war on Monica by mistake. We don't know whether they dealt with the actual harassment (though I think they did, belatedly), because if they did, proper procedure was followed and the perps weren't named-and-shamed in the press.


Wouldn't a simple TTL (time to live) solve that problem, of course with an option to see the graveyard?

This would mean that the same questions would get answered again and again over the years, but I think that could also solve the negative reputation problem of the website.

Two birds with one stone, or if you're Slovenian, two flies with one swat. ^^


For anyone else that is curious like I was, the #6 answer on that list is from 12 years ago: https://stackoverflow.com/a/140861/


>> For the most part, they don't. I'm not quite sure if it's an issue of incentives or culture.

Classic example of "good is the enemy of best".


What’s wrong with a simple loop (like the one near the top)? Why does it have to be branchless? Wouldn’t the IO take longer than missed branches/pipeline flushes?

Not to mention that the fixed version now has branches as well…
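
For reference, a loop-based version looks something like this (SI units only, a sketch; note it still shares the rounding wart discussed in the article, since 999,999 B formats as "1000.0 kB"):

    static String humanReadableBytes(long bytes) {
        if (bytes < 1000) return bytes + " B";
        String prefixes = "kMGTPE";
        double value = bytes / 1000.0;
        int i = 0;
        while (value >= 1000 && i < prefixes.length() - 1) {
            value /= 1000;
            i++;
        }
        return String.format("%.1f %cB", value, prefixes.charAt(i));
    }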


Not sure why some programmers these days have an aversion to simple loops and other boring - but readable - code.

Instead we have overused lambdas and other tricks that started out clever but become a nightmare when wielded without prudence. In this article, the author even points out why not to use his code:

Note that this started out as a challenge to avoid loops and excessive branching. After ironing out all corner cases the code is even less readable than the original version. Personally I would not copy this snippet into production code.


I'm not against using for loops when what you need is an actual loop. The thing is, most of the time for loops were previously doing something for which there are concepts that express exactly what was being done - though not in all languages.

For instance, map - I know that it will return a new collection of exactly the same number of items the iterable being iterated has. When used correctly it shouldn't produce any side-effects outside the mapping of each element.

In some languages now you have for x in y, which in my opinion is quite OK as well, but still, to change the collection it has to mutate it, and it's not immediately clear what it will do.

If I see a reduce I know it will again iterate a definite number of times, and that it will (usually) return something other than the original iterable, reducing a given collection into something else.

On the other hand forEach should tell me that we're only interested in side-effects.

When these things are used with their semantic context in mind, it becomes slightly easier to grasp immediately what is the scope of what they're doing.

On the other hand, with a for (especially the common, old school one) loop you really never know.

I also don't understand what is complex about the functional counterparts - for (initialise_var; condition; post/pre action) can only seem simpler in my mind due to familiarity, as it can have a lot of small nuances that impact how the iteration goes. Although, to be honest, most of the time it isn't complex either - it just seems slightly more complex, with less contextual information about the intent behind the code.


For me, code with reduce is less readable than a loop. With a loop everything is obvious, but with reduce you need to know what the arguments in the callback mean (I don't remember), and then think about how the data are transformed. It's an awful choice in my opinion. A good old loop is so much better.


I disagree entirely. In most imperative programming languages, you can shove any sort of logic inside a loop: more loops, more branches, creating new objects, it's all fair game.

Fold and map in functional languages are much more restrictive in a sense. For example, with lists, you reduce a collection down to a single object ([a] -> a), or produce another collection with a map ([a] -> [b]). So map and fold etc. are much more restrictive. That's what makes them clearer.


if you are used to imperative programming, then yes.

But in a for loop anything can happen- from a map to a reduce to a mix, to whatever convoluted logic the dev comes up with.


Technically you can implement map as reduce ;)

But yes - for me

    (defn factorial [n]
      (reduce * (range 1 (inc n))))
is slightly more readable than

    def factorial(n):
        result = 1
        for i in range(2,n+1):
            result *= i
        return result
I mean in this case the name kinda makes it obvious anyway :)

If the operation is conceptually accumulating something over the whole collection and if it's idiomatic in the language I'm using - I will use reduce. Same with map-y and filter-y operations.

But if I have to do some mental gymnastics to make the operation fit reduce - for loop it is. Or generator expression in case of python.
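
On the "map as reduce" remark: just as an illustration (you would obviously use map() directly in practice, and this does quadratic copying), in Java stream terms it comes out as something like the following, assuming the usual java.util and java.util.function imports.

    static <T, R> List<R> mapViaReduce(List<T> in, Function<T, R> f) {
        List<R> empty = new ArrayList<>();
        return in.stream().reduce(
                empty,
                // fold each element's image onto a copy of the accumulator
                (acc, x) -> { List<R> out = new ArrayList<>(acc); out.add(f.apply(x)); return out; },
                (a, b) -> { List<R> out = new ArrayList<>(a); out.addAll(b); return out; });
    }
    // mapViaReduce(List.of(1, 2, 3), x -> x * x)  ->  [1, 4, 9]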


Indeed. I rarely encounter basic loops in code reviews now, so seeing one is definitely a small alert to do an extra thorough review of that part.


And it is usually very easy and straightforward to see what is going on inside.


It can definitely happen, but I think more often than not the others are more readable.

To be honest this seems to be a familiarity thing:

> but with reduce you need to know what arguments in a callback mean

If I didn't know for, it would be mind-boggling what those 3 things, separated by semicolons, are doing. It doesn't look like anything else in the usual language(s) it's implemented in. It's the same with switch.

The only thing both of them, for and switch, have going for them is that languages that offer them and aren't FP usually use the same *C* form across the board, whereas reduce's args and the callback args vary a bit more between languages, and especially between mutable and immutable langs.

I still prefer most of the time the functional specific counterparts.


Guido?


> When used correctly it shouldn't produce any side-effects outside the mapping of each element.

But that's just a social convention. There's nothing stopping you from doing other things during your map or reduce.

In practice, the only difference between Map, Reduce and a For loop is that the first two return things. So depending on whether you want to end up with an array containing one item for each pass through the loop, "something else", or nothing, you'll use Map, Reduce or forEach.

You can still increment your global counters, launch the missiles or cause any side effects you like. "using it correctly" and not doing that is just a convention that you happen to prefer.


That is true (less so in FP languages though), but the for loop doesn't stop you either - indeed I do prefer it most of the time. I think it's a reasonable expectation to use the most intention-revealing constructs when possible, and it's also easier to spot "code smells" when using those. The exceptions I make are when there are significant speed concerns/gains, when what you're doing is an actual loop, or when the readability is improved by using a loop.

(and I haven't read the article so not even sure I agree with the example there, this was more in general terms)


Yeah, I'd much rather have something like

  congruence_classes m l = map (\x -> filter ((x ==) . (`mod` m)) l) [0..m-1]
than

  def congruence_classes(m, l):
      sets = []
      for i in range(m):
          sets += [[]]
      for v in l:
          sets[v % m] += [v]
      return sets
For-in is very neat and nice but it still takes two loops and mutation to get there. Simple things are sometimes better as one-line maps. Provability is higher on functional maps too.

Same one-liner in (slightly uglier) Python:

  def congruent_sets(m, l):
    return list(map(lambda x: list(filter(lambda v: v % m == x, l)), range(m)))


The one-liner is far less readable, and under the hood it actually is worse: for each value in [0, m) you're iterating l and filtering it, so it's O(n^2) code now instead of O(n). That mistake would be far easier to notice if you had written the exact same algorithm with loops: one would see a loop inside a loop and the O(n^2) alarms should be ringing already.

Ironically, it's a great example of why readability is so much more important than conciseness and one liners.


I agree, and despite being a fan of FP (kind of a convert from OO) I often wonder about the readability of FP code.

One idea I have is that often FP code is not modularized and violates the SOLID principles by doing several things in one line.

There are seldom named subfunctions where the name describes the purpose of the function - take lambdas as an example: I have to parse the lambda code to learn what it does. Even simple filtering might be improved (kinda C#):

var e = l.Filter(e => e.StartsWith("Comment"));

vs.

var e = l.Filter(ElementIsAComment);

or even using an extension method:

var e = l.FindComments();

Sorry I could not come up with a better example - I hope you get my point...


True, it is computationally worse, though it's O(nm), so fixing m at compile time for a practical use like the one I had will turn it into O(n) in practice.

But that much is immediately obvious since it's mapping a filter, that is, has a loop within a loop.

I did consider the second one to also take quadratic time though. I forgot that in python getting list elements by index is O(1) instead of O(n) which is what I'm personally used to with lists.

It's also true that you can replace the filter with

  [ v | v <- l, v `mod` m == x ]
but that's not as much fun as

(x ==) . (`mod` m)

I just love how it looks and it doesn't personally seem any less clear to me, maybe a bit more verbose.


"I forgot that in python getting list elements by index is O(1) instead of O(n) which is what I'm personally used to with lists."

Have you considered that maybe this is a sign you're too deep into using impractical programming languages?

Cleanness for immutable data structures aside, linked lists are a very poor way to store data given the way computer architectures are designed.


> Have you considered that maybe this is a sign you're too deep into using impractical programming languages?

“Languages that use ‘list’ for linked lists and have different names for other integer-indexable ordered collections” aren’t necessarily “impractical”.


> True, it is computationally worse, though it's O(nm) so applying m at compile time to form a practical use as I used it will turn it into to O(n) in practice.

Even applying it at compile time, it's still O(nm). You have to compute 'v mod m' for each possible value of v and m.

> But that much is immediately obvious since it's mapping a filter, that is, has a loop within a loop.

It's not immediately obvious because you have to parse the calls and see exactly where is the filter and the map.

    map(lambda x: do_some_things(x, another_param), filter(lambda x: filter_things(x), lst))
    map(lambda x: do_some_things(x, filter(lambda y: filter_things(x, y), another_list)), range(m))
versus

    retval = []
    for x in lst:
       if not filter_things(x):
          continue
       
       retval.append(do_some_things(x, another_param))
and

   for x in lst:
      filtered = []

      for y in another_list:
         if filter_things(x, y):
             filtered.append(y)

      retval.append(do_some_things(x, filtered))
In the first case, you have to parse the parentheses and arguments to see exactly where the map and filter calls are. In the second, you see a for with a second level of indentation.

> I just love how it looks and it doesn't personally seem any less clear to me, maybe a bit more verbose.

It doesn't seem any less clear to you because you're used to it. But think about the things you need to know apart from what a loop, map, filters and lambdas are:

- What is (x ==). Is it a function that returns whether the argument is equal to x?
- What is '.'. Function composition? Filter?
- Same thing with `mod` m. What are the backticks for?

Compare that with the amount of things you need to know with the Python code with for loops. For that complexity to pay off you need some benefits, and in this case you're only getting disadvantages.

That's the whole point of this discussion. Production code needs to work, have enough performance for its purpose and be maintainable, those are the metrics that matter. Being smart, beautiful or concise are completely secondary, and focusing on them will make for worse code, and it's exactly what happened in this toy example.


Why not just use a list comprehension?

  def congruent_sets(m, l):
    return [[v for v in l if v % m == i] for i in range(m)]


Isn't that an unnecessarily quadratic algorithm (nested loops, m*l iterations, instead of m + l)?


Yes, but that's the case with all the functional approaches proposed.

If Python were focused on functional programming it would have a utility function for this similar to itertools.groupby (but with indices in an array instead of keys in a dictionary).


> If Python were focused on functional programming it would have a utility function for this similar to itertools.groupby (but with indices in an array instead of keys in a dictionary).

itertools.groupby doesn’t return a dictionary, it returns an iterator of (key, (iterator that produces values)) tuples. It sounds, though, like you want something like:

  from itertools import groupby

  def categorize_into_list(source, _range, key):
    first = lambda x: x[0]
    sublist_dict = { 
      k: list(v[1] for v in vs) 
      for k, vs in groupby(sorted(((key(v), v) for v in source), key=first), first)
    }
    return [sublist_dict.get(i, []) for i in _range]
Then you could do this with something like:

  def congruent_sets(m, l):
    return categorize_into_list(l, range(m), lambda v: v % m)


Yeah exactly.


> For instance, map - I know that it will return a new collection of exactly the same number of items the iterable being iterated has.

Unless you're using Perl - "Each element of LIST may produce zero, one, or more elements in the generated list".


Perl implements flatMap and calls it map :-)


I can't comment on the social phenomenon here, but there is indeed a decent technical argument for avoiding for loops when possible.

In a nutshell, it's kind of like the "principle of least privilege" applied to loops. Maps are weaker than Folds, which are weaker than For loops, meaning that the stronger ones can implement the weaker ones but not vice versa. So it makes sense to choose the weakest version.

More specifically, maps can be trivially parallelized; same for folds, but to a lesser degree, if the reducing operation is associative; and for-loops are hard.

In a way, the APL/J/K family takes this idea and explores it in fine detail. IMHO, for loops are "boring and readable" but only in isolation; when you look at the system as a whole, lots of for loops make reasoning about the global behaviour of your code a lot harder, for the simple reason that for-loops are too "strong", giving them unwieldy algebraic properties.
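
The parallelization point in Java terms (a toy example; whether it is actually worth doing is a separate question, as discussed below): because the mapping lambda has no cross-iteration dependencies, switching stream() to parallelStream() is mechanical.

    import java.util.List;
    import java.util.stream.Collectors;

    List<Integer> squares = List.of(1, 2, 3, 4, 5).parallelStream()
            .map(n -> n * n)
            .collect(Collectors.toList());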


While these are all valid and well thought out arguments, in this particular example, a whole class of problems and bugs were introduced specifically by avoiding simple loops.

Not to mention the performance implications. Parallelisation, composability and system thinking are sometimes overkill and lead to overengineering.


The article says the looping versions also had the bug.


Correction: a bug. This is important to note, because only the non-loop version had precision issues.


> So it makes sense to choose the weakest version.

Only if it's actually more readable. The principle of least privilege does not give you any benefit when talking about loop implementations.

> More specifically, maps can be trivially parallelized;

This argument is repeated time and time again, but I've never actually seen it work. Maps that can be trivially parallelized aren't worth parallelizing most of the time. In the rare case where it's both trivial and worthwhile, it's because the map function (and therefore the loop body) is side-effect free, and in those rare cases you don't care too much about the slight extra effort of extracting the loop body into a function.

> when you look at the system as a whole lots of for loops make reasoning about the global behaviour of your code a lot harder for the simple reason that for-loops are too "strong"

Code is too strong in general. Reasoning about the global behavior of code is difficult if the code itself is complex. Nested maps and reduces will be equally difficult to comprehend. The fact that a map() function tells you that you're converting elements of lists does not save you from understanding what that conversion is doing and why.

Sometimes loops will be better for readability, sometimes it will be map/reduce. Saying that for loops always make it harder to reason about the code does not make too much sense in my opinion.


I agree with the no-silver-bullet thing - and as written in another reply, I don't even know if I agree with the example in the article.

> The fact that a map() function tells you that you're converting elements of lists does not save you from understanding what is that conversion doing and why.

It can, actually. Say you have a query that comes in; this calls a function that fetches records from the database. It's not a basic query: it has joins, perhaps a subquery, etc. Then you have another function that transforms the results into whatever presentational format, decorates those results, whatever, and it's also more than a basic couple of lines of logic.

And now you have a bug report come in, that not all expected results are being shown.

If you have

  func does_query -> loop transforms
You have 3 possibilities: the problem is in the storage layer, in the query, or in the loop. You read the query; because the bug is subtle, it seems OK, so now you move to the loop. It's a bit complex but seems to be correct too. Now you start debugging what's happening.

If you have

  func does_query -> func maps_results
You know it's either the underlying storage or the query. Since the storage being broken is less plausible, you know it must be the query. In the end it's a sync problem with something else and everything is right, but now you only spent time on reproducing the query and making sure that it works as expected.


Loops are easier to read. With functions like reduce you have to solve a puzzle every time to understand what the code is doing (this is also true for functional style of programming in general).

> More specifically, maps can be trivially parallelized; same for folds, but to a lesser degree, if the reducing operation is associative; and for-loops are hard.

In typical JavaScript code a reduce operation will not be parallelized. It can actually be slower than a loop because of the overhead of creating and calling a function on every iteration.

> when you look at the system as a whole lots of for loops

Code with lots of loops is still more readable than code with lots of nested reduces.


>Loops are easier to read. With functions like reduce you have to solve a puzzle every time to understand what the code is doing

I think that is a function of familiarity: if you use reduce a lot it will be as easy to read as a loop - perhaps easier, because it's more compact. There is a downside to reading more lines for some people; at some point verbosity becomes its own source of illegibility (although any loop that can easily be turned into a reduce probably won't be excessively verbose anyway).

Of course, all that is just on the personal level: you, by using and reading more code with reduce in it, will stop finding reduce less easy to understand than loops - but the next programmer without lots of reduce experience will be in the same boat.


I disagree quite strongly, in that this is simply a function of familiarity. Reduce is no more or less readable than for (especially the C style for — imagine trying to work out what the three not-really-arguments represent!)


loops are harder to read. What does it do, map, reduce, send emails to grandma?

In JavaScript, the reduce callback is created once and called repeatedly. For loops are pretty much always the fastest possible way because they use mutable state. They are also a really good way of creating unreadable spaghetti that does things you don't want them to.

I'm not sure what you mean by nested reduces. Chained reduce functions are easy to follow


You can send email to grandma from both map and reduce.


The point was that in map and reduce it's clear what's being done and what's being returned, especially in a typed language. Ideally you're also in an environment that doesn't allow for side effects, in which case, grandma gets no emails from map or reduce


It is not visible what is returned or what is input in the case of chaining, because return and input parameters are not directly visible and you have to read all previous calls to figure that out.

And they both trivially allow for side effects.


Very often processes are naturally modelled as a series of transformations. In those cases, writing manual loops is tedious, error-prone, harder to understand, less composable and potentially less efficient (depending on language and available tools) than using some combination of map, filter and reduce.


> Not sure why some programmers these days have aversion to simple loops and other boring - but readable - code.

Like goto, basic loops are powerful, simple constructs that tell you nothing at all about what the code is doing. For…in loops in many languages are a little better, and map, reduce, or comprehensions are much more expressive as to what the code is doing, though they mostly address the common cases of for loops.

While loops are also weakly expressive (about equal to for…in), but except where they are used as a stand-in for C-style for loops (in languages without them), there is less often a convenient replacement.


Disclaimer: amateur developer for 25 years, no formal education in that area

a loop that iterates over indices when I want elements is not readable, e.g. I prefer

    for element in elements:
rather than

    for (i = 0; i < len(elements); i++) { element = elements[i]; ...
This is maybe where this aversion comes from, people usually [citation needed] want to iterate over elements, rather than indices.


I find that many times in more complex loops you need the index as well. Sometimes for as mundane reason as logging.


Yes, my code is not that complicated and I use languages that are rather high level (Python, Golang, JS with Vue), so I've needed the index, I think, only once, when I had to remove an element from an array in JS and for some reason was not using lodash.

But yes, there are of course cases where the index could be needed, I was merely commenting on the aversion part for generic developers.


Yes, this plagues JDK8+ code. Every fashionable Java coder has to use an overly complex, lazy stream vs a simple loop in every case.


The irony is that a single log computation is going to take longer than the loop. (No idea if implementing a log approximation involves loops either.)


https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-6...

I don't see any loops, but there are a number of branches. The code could probably be generalized using loops to support arbitrary precision, but I think any optimized implementation for a specific precision will have unrolled them.


Sounds like textbook example of when theory is misaligned with reality.


Waiting for someone to post some fast-inverse-sqrt-esque hack to compute the logarithm. Although in Java that's probably not likely to be faster.

I wonder how fast it'd be to convert to string and count digits.


> I wonder how fast it'd be to convert to string and count digits.

When you convert the number to a string you're really transforming it to a decimal format, which is the domain where you should be solving the problem. Otherwise you're doing some sort of transformation in the binary domain and then hoping to pull the answer out of a hat when you do the final conversion to decimal.


Many architectures include a logarithm instruction. Does Java use that if available? Would it make a difference?


Many architectures? What would they be?

Regardless of whether they contain a logarithm instruction or not, how many architectures are there these days? Outside of truly embedded computing I can only come up with 2: Intel and ARM. Counting POWER and RISC-V is probably a bit of a stretch already.


x86 has two logarithm instructions, FYL2X and FYL2XP1.

FYL2X takes two arguments, Y and X, and computes Y log2(X).

FYL2XP1 takes two arguments, Y and X, and computes Y log2(X+1).

As you note, x86 and ARM are by far the most used, and I'd guess that when it comes to Java you are more likely to be running on x86 than ARM, so I figured it was arguable to say "many" when the only one I was sure had a logarithm instruction was x86.


Those x86 instructions are “legacy floating point” instructions. As in, the x87 FPU. Benchmarks I’ve seen seem to indicate that the x87 “coprocessor” is slow compared to the SSE/AVX FPUs, and only exists for backwards compatibility. I don’t think SSE/AVX has a logarithm instruction, sadly, but there are intrinsics for them: `_mm256_log_pd` for example. Considering that intrinsic generates a “sequence” instead of a single instruction, I’d be curious how it compares to x87.


Besides, log()'s implementation is certainly not branchless.

It's the ostrich approach: if you don't see the branches they don't matter.


Simplicity FTW. The simple loop version is very easy to understand. It's probably really fast, as it's just a loop over seven items. And more importantly it's more correct. It doesn't use floating point arithmetic, so you don't have to worry about precision issues.

The logarithmic approach is harder to reason about, prone to bugs (as proven by this post). I'm baffled at the fact that tons of people considered it a more elegant solution! It's completely the opposite!


the original version had branches too, in fact a majority of the lines had them! ? is just shorthand for if.


This isn't true; this form of conditional can be compiled into cmov-type instructions, which are faster than a regular conditional jump.


Both ?: and if-else have cases where they can be compiled into cmov type instructions and where they cannot. Given int max(int a, int b) { if (a > b) return a; else return b; }, a decent compiler for X86 will avoid conditional branches even though ?: wasn't used. Given int f(int x) { return x ? g() : h(); }, avoiding conditional branches is more costly than just using them, even though ?: was used.


> This isn't true, this form of conditionals can be compiled into cmov type of instructions, which is faster than regular jump if condition.

IIRC cmov is actually quite slow. It's just faster than an unpredictable branch. Most branches have predictability so you generally don't want a cmov.

Speaking of which, a couple questions regarding this for anyone who might know:

1. Can you disable cmov on x64 on any compiler? How?

2. Why is cmov so slow? Does it kill register renaming or something like that?


cmov itself isn't slow, it has a latency of 2 cycles on Intel; and only 1 cycle on AMD (same speed as an add). However, cmov has to wait until all three inputs (condition flag, old value of target register, value of source register) are available, even though one of those inputs ends up going unused.

A correctly predicted branch allows the subsequent computation (using of the result of the ?: operator) to start speculatively after waiting only for the relevant input value, without having to wait for the condition or the value on the unused branch. This could sometimes save hundreds of cycles if an unused input is slow due to a cache miss.


I wonder if there's anyone on earth who needs nicely formatted human readable file sizes that's worried about the difference between one or two cpu cycle branching instructions?

There might be a few guys at FAANG who have a planet-scale use case for human readable file sizes. But surely "performance optimising" this is _purely_ code golf geekiness?

(Which is a perfectly valid reason to do it, but I'm gonna choose the most-obvious-to-the-next-programmer-reading-it version over one that's 50% or 500% or 5000% faster in almost any use case where I can imagine needing this... I mean, it's only looking for 6 prefixes "KMGTPE"; a six-line case statement would work for most people?)


Actually, I just realised. This is (probably a small part of) why "calculate all sizes" in Mac Finder windows is so slow. I already mentioned Apple in FAANG, but I guess someone at Microsoft and people who work on Linux file browsers care too. And whoever maintains the -h flag codepaths in all the Unix-like utils that support it?


Confused what this has to do with calculating file sizes. Time spent computing file sizes is dwarfed by I/O, right?


Ahh, thank you! Makes sense.


CMOV is slow because x86 processors will not speculate past a CMOV instruction. They do speculate past conditional jumps, so those are more performant.

This same property makes CMOV useful in Spectre mitigation, see https://llvm.org/docs/SpeculativeLoadHardening.html

Keeping CMOV slow is now an important security feature.


They don't speculate past a CMOV at all? Like even if the next instruction has nothing to do with the CMOV's output?


I think out of order processing is considered different than speculative execution, but I could be remembering my architecture class wrong


Out-of-order just means it can rearrange the decoded uops in a way to keep the execution units at full capacity. So, if an instruction needs the ALU, but it’s busy, and the next one needs the AGU (address generation unit) and doesn’t depend on the results of the ALU one, it can “dispatch” the AGU one while the ALU one waits for the pipeline to move.

Speculative execution refers more towards the decoder/uop generation side of the processor (the “in-order” side). A normal “in-order” processor, upon encountering a conditional jump, would wait until the pipeline is finished to check if it should jump or not. It does it by inserting “bubbles” into the pipeline - essentially doing nothing but waiting.

Speculative execution (or branch prediction) would say, “I think the branch will be taken based on X, Y, Z,” and then keep the pipeline full in the process. If the prediction was right, congratulations! You just saved dozens of clock cycles that otherwise would’ve been wasted. If it was wrong, no worries. The pipeline is then flushed; all the speculated instructions’ results are tossed (before they’re “written back”). Then the processor resumes operation on the correct branch.

Speculative execution doesn’t necessitate an out-of-order architecture, and vice versa. Just a pipelined one. It’s perfectly possible to have an out-of-order architecture that doesn’t speculate, or a speculative one that is completely “in-order”, but they work hand-in-hand, and it makes sense to have both if you have one.


Is it safe to say speculation is about what to do with instructions following conditionals, and OoO is about what to do with instructions following non-conditionals?


Roughly, yes


This email thread from Linus might be interesting: https://yarchive.net/comp/linux/cmov.html


My understanding of out-of-order (and pipelined) CPUs is limited, but it’s interesting that CMOV isn’t interpreted as a “Jcc over MOV” by the decoder. That would allow using the branch predictor. Would it be too complex or does the microarchitecture not even allow it?


I think that thread is where I first learned this actually. Didn't remember it until you linked it now, thanks for posting it!


If the if/else is simple the compiler should be able to optimize that anyway.


Exactly. As the article itself mentions:

> Granted it’s not very readable and log / pow probably makes it less efficient

So, the "improved" solution is both less readable and probably less efficient... where is the improvement then?


If it were me in my programming language, I would just use Humanizr and be freaking done with it.


The real question is: why is it a bug to report 1 MB instead of 999.9 kB for human-readable output? It seems like a nice excursion into FP-related pitfalls, but I don't think this is a problem worth getting entangled in.


Because it doesn't print 999.9 kB or 1 MB.

It prints 1000.0 kB.
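
A minimal reproduction, assuming the SI branch of the original snippet:

    long bytes = 999_999;
    int unit = 1000;
    int exp = (int) (Math.log(bytes) / Math.log(unit));   // 1
    char pre = "kMGTPE".charAt(exp - 1);                   // 'k'
    System.out.println(String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre));
    // prints "1000.0 kB": 999999 / 1000 = 999.999, which %.1f rounds up
    // instead of rolling over to "1.0 MB"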


I still wouldn't consider it a bug, since we are throwing away lots of LSBs anyway. It matters even less when we are talking about peta/exabytes.


I guarantee a design team would take one look at 1000.0kB and kick it back as a bug


As part of the Stack Overflow April Fools' prank, we did some data analysis on copy behavior on the site [0]. The most copied answer during the collection period (~1 month) was "How to iterate over rows in a DataFrame in Pandas" [1], receiving 11k copies!

[0] https://stackoverflow.blog/2021/04/19/how-often-do-people-ac...

[1] https://stackoverflow.com/a/16476974/16476924


That’s sad, as when you find yourself iterating over rows in pandas you're almost invariably doing something wrong or very, very suboptimally.


To me it's a means to an end. I don't care if my solution takes 100 ms instead of 1 ms; it's the superior choice for me if it takes me 1 minute to do instead of 10 minutes to learn something new.


True, but sometimes these 10 minutes help you to discover something new that will improve your code.

I had a few of these cases in my life:

- discovering optimized patterns in Perl, which led to code I could not understand the next day

- discovering decorators in Python, which led to better code

- discovering comprehensions in Python (a magical thing) that led to better code, except when I wanted to be too clever and ended up with Perl-like code


The flaw of human nature on display. And I don't mean that personally, not towards you anyway, but towards the human species.


Why? It seems very rational. Especially if you're just going to run it once to get a value and not as a part of some system.


Exactly, that's not a flaw, that's rational behaviour. Why design an intricate solution for a one-off. Using 15 times the amount of time it would have taken you manually to automate a pretty standard task is just stupid, though we all do it.

Do the first thing that works, don't overthink it.


Why learn anything at all then? Why bother learning OOP paradigms if procedural just works? Why bother ...? Do you see the flaw in your argument?


I think the flaw is you misinterpreting the argument.

In my opinion, the bottom line with obvious caveats is this - Human-time is more valuable than CPU-time.

If you are shipping at scale then the calculus is different - Don't waste end-users' human-time and their cpu-time and/or server's cpu-time.

If you're writing code with a team the calculus is different - Use/Learn techniques and tools to reduce the teams' human-time wastage plus all the above.

If you're writing code just for yourself the calculus is different - Save your own human-time.


One doesn’t learn rock climbing to step over a brick, man.


Because when it's not a one off?


If that was a fair comment then we’d be writing all our code in assembly still.


I iterate over rows in pandas fairly often for plotting purposes. Anytime I want to draw something more complicated than a single point for each row, I find it's simple and straight-forward to just iterrows() and call the appropriate matplotlib functions for each. It does mean some plots that are conceptually pretty simple end up taking ~5 seconds to draw, but I don't mind. Is there really a better alternative that isn't super complicated? Keep in mind that I frequently change my mind about what I'm plotting, so simple code is really good (it's usually easier to modify) even if it's a little slower.


>That’s sad, as when you find yourself iterating over rows in pandas you’re almost invariably doing some wrong or very very sub optimally.

Humans writing code is suboptimal. I can't wait for the day when robots/AI do it for us. I just hope it leads to a utopia and not a dystopia.


I'm glad that DataFrames don't iterate by default. It's good design to make suboptimal features hard to access.


I got bitten by that prank when copying code from a question, to see what it did (it was something obviously harmless). I was rather annoyed for about two seconds before I realized what date it was. :)


> return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);

As a human, the first thing that I hate about this interpretation of "human readable" format is the inconsistency in the number of significant digits. One digit after the decimal separator is simply wrong: when you jump from 999.9 MB to 1.0 GB you go from 4 significant digits to 2; instead it should be 1.000 GB, 10.00 GB and so on. This annoys me enormously when I upload things to Google Drive from an Android phone and look at the amount of data transferred: as soon as it becomes bigger than 1 GB the digits stop changing, and I become anxious that the transfer has stopped, and my Windows Phone nostalgia goes through the roof (as WP was never infected with this problem by virtue of not using Java, and OneDrive on WP explicitly showed the current connection speed, and a frozen connection never caused any strange problems with uploaded files like it does on Google Drive on Android).

As a human not from the US, the second thing I hate here is the lack of a locale parameter to pass to the formatter, as the decimal separator is different in different cultures, and in the world of cloud computing the locale of the machine where the code runs is often different from the one where the message is displayed.

As a human from a culture using a non-Latin alphabet, the third thing I hate here should be obvious to the reader.
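
On the locale point, for what it's worth: String.format without an explicit Locale uses the JVM default, which on a server often isn't the viewer's, so passing the target locale in at least fixes the decimal separator.

    import java.util.Locale;

    String us = String.format(Locale.US, "%.1f MB", 1.5);      // "1.5 MB"
    String de = String.format(Locale.GERMANY, "%.1f MB", 1.5); // "1,5 MB"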


I don't think it makes sense to talk about significant digits here. And while you are correct that you should not go from 999.9MB to 1.0 GB you are incorrect about your reasoning and your correction is also incorrect. Significant digits signify reliability of the numbers. So if your measurement is accurate to the +/- 50kB as indicated by 999.9MB you should then move to 1.0000 GB (5 significant digits). So it should be 10.00GB and 1.00GB not 1.000GB, because the reliability should not change between your measurements.


> instead it should be 1.000 GB, 10.00 GB and so on

I had a hard time mentally parsing that sequence even when I knew what your point was so imagine regular users seeing that.


As a bonus, the thing I no longer care about here is that there is no option to output binary (KiB-style) prefixes.


> Key Takeaways:

> [...]

> Floating-point arithmetic is hard.

I have successfully avoided FP code for most of my career. At this point, I consider the domain sophisticated enough to be an independent skill on someone's resume.


There are libraries that offer more appropriate ways of dealing with it, but last time I ran into a FP-related bug (something to do with parsing xlsx into MySQL) I fixed it quickly by converting everything to strings and doing some unholy procedure on them. It worked but it wasn’t my proudest moment as a programmer.


I wish to learn a better way. FP is sure to byte again and again.


Well, manually floating the point in a string is sure to bite again and again too, but way more frequently than in binary.

There is actually no better way; if you try to calculate over the reals (with computers or whatever else), you are prone to be bitten. Once in a while there's an article about interval arithmetic on HN; those are a great opportunity to just nod along and remember all of the flaws of interval arithmetic I got to learn in my school's physics labs. (And yeah, those flaws do fit some problems better than FP, but not all.)


Pity that rational numbers (fractions) did not catch on. Of course they have flaws too, but they're a bit easier to grasp. And they handle important cases like 1/3 or 1/10 exactly.


As long as you're using it to represent what could be physical measurements of real-valued quantities, it's nearly impossible to go wrong. Problems happen when you want stupendous precision or human readability.

Numerically unstable algorithms are a problem too but again, intuitively so if you think of the numbers as physical measurements.


I am regularly reminded of William Kahan's (the godfather of IEEE-754 floating point) admonition: A floating-point calculation should usually carry twice as many bits in intermediate results as the input and output deserve. He makes this observation on the basis of having seen many real world numerical bugs which are corrupt in half of the carried digits.

These bugs are so subtle and so pervasive that it's almost always cheaper to throw more hardware at the problem than it is to hire a numerical analyst. Chances are that you aren't clever enough to unit test your way out of them, either.


Yep, floating point numbers are intended for scientific computation on measured values; however many gotchas they have when used as intended, there are even MORE if you start using them for numbers that are NOT that: money, or any kind of "count" rather than measurement (like, say, a number of bytes).

The trouble is that people end up using them for any non-integer ("real") numbers. It turns out that in modern times scientific calculations with measured values are not necessarily the bulk of calculations in actually written software.

In the 21st century, I don't think there's any good reason for literals like `21.2` to represent IEEE floats instead of a non-integer data representation that works more like how people expect for 'exact' numbers (i.e., based on decimal instead of binary arithmetic; supporting more significant digits than an IEEE float; a so-called "BigDecimal"), at the cost of some performance that you can usually afford.

And yet, in every language I know, even newer ones, a decimal literal represents a float! It's just asking for trouble. IEEE float should be the 'special case' requiring special syntax or instantiation, a literal like `98.3` should get you a BigDecimal!

IEEE floats are a really clever algorithm for a time when memory was much more constrained and scientific computing was a larger portion of the universe of software. But now they ought to be a specialty tool, not the go-to for representing non-integer numbers.
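
A quick illustration of how early the damage happens in Java: by the time BigDecimal sees a double literal, the rounding to binary has already taken place, whereas the string constructor keeps the exact decimal value (just a sketch; the exact printed digits of the stored double are elided):

  import java.math.BigDecimal;

  double d = 98.3;                             // already the nearest binary double, not exactly 98.3
  System.out.println(new BigDecimal(d));       // prints the long exact expansion of the stored double, not 98.3
  System.out.println(new BigDecimal("98.3"));  // prints 98.3 exactly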


I think you significantly underestimate the prevalence of floating point calculations; there is a reason why Intel and AMD created all the special SIMD instructions. Multimedia is a big user, for example. You also seriously underestimate the performance cost of using decimal types; we are talking orders of magnitude.


Fair! Good point about multimedia/animation/etc.

There are still a lot of people doing a lot of work in which they hardly ever want a floating point number but end up using it because it's the "obvious" one that happens when you just write `4.2`, and the BigDecimal is cumbersome to use.


I like that idea too. I wonder why Python doesn't use bigdecimals by default. Maybe because it seems to require you to choose a precision?


Notably, this is only true of 64-bit floats. Sticking to 32-bit floats saves memory and is sometimes faster to compute with, but you can absolutely run into precision problems with them. When tracking time, you'll only have millisecond precision for under 5 hours. When representing spatial coordinates, positions on the Earth will only be precise to a handful of meters.


I do a lot of floating point math at work and constantly run into problems either from someone else's misunderstanding, my own misunderstanding, or we just moved to a new microarchitecture and CPU dispatch hits a little different manifesting itself as rounding error to write off (public safety industry).


If you expect bit-for-bit reproducible results, then yea, you'd have to know about the nitty-gritty details. The values should usually still correspond to the same thing in common real world precision though.


> it's nearly impossible to go wrong

It's a matter of time if one doesn't know to look for numerically stable algorithms. Or if one thinks performance merits dropping stability.

https://github.com/RhysU/ar/issues/3 was an old saga in that vein.


Unfortunately, that doesn't work when you have to do:

1 - quantity2 / (quantity1 - quantity2)

... or some such thing. If quantity1 and 2 are similar, ouch!


Not sure if there's a mistake in that expression, since if they're similar, you're already going to get some ridiculously large magnitude (unphysical) result. Maybe you mean calculating the error between two values or convergence testing? In that case, it hardly matters whether you do

quantity2/quantity1 - 1

or

(quantity2 - quantity1) / quantity1

with double precision and physically reasonable values.


So you have problems if you want a precise answer, you want to display your answer, or if you want to use any of a large number of useful algorithms? That sounds like it’s quite easy to go wrong.


You can't want a precise answer from physical measurements unless you don't know how to measure things. Display should be done with libraries, and numerical instability makes algorithms basically useless, so you pretty much have to be inventing it yourself.


I consider the domain sophisticated enough to be an independent skill

It's been a whole field with its own patron saint for quite a while, take a look at

https://en.wikipedia.org/wiki/William_Kahan


Just this week I watched someone discover that computing summary statistics in 32-bit on a large dataset is a bad idea. The computer science curriculum needs to incorporate more computational science. It's a shame to charge someone tens of thousands of USD and not warn them that floating point has some obvious footguns.


> Just this week I watched someone discover that computing summary statistics in 32-bit on a large dataset is a bad idea. The computer science curriculum needs to incorporate more computational science.

Sadly, I suspect too many "computer science" courses have turned into "vocational coding" courses, and now those people are computing summary statistics on large datasets in Javascript...


Could you shed some light on what they did wrong, and what would be a better way to do it?


not OP, but the hint is in “computing summary statistics in 32-bit on a large dataset”.

A large dataset means lots of values, maybe we can assume the number of values is way bigger than any individual value. Perhaps think of McDonalds purchases nation-wide: billions of values but each value is probably less than $10.

The simplest summary statistic would be a grand total (sum). If you have a good mental model of floats, you immediately see the problem!

The mental model of floats which I use is 1) floats are not numbers, they are buckets, and 2) as you get further away from zero, the buckets get bigger.

So let’s say you are calculating the sum, and it is already at 1 billion, and the next purchase is $3.57. You take 1 billion, you add 3.57 to it, and you get... 1 billion. And this happens for all of the rest of the purchases as well.

Remember: 1 billion is not a number, it is a bucket, and it turns out that when you are that far away from zero, the size of the bucket is 64. So 3.57 is simply not big enough to reach the next bucket.
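
A tiny sketch of exactly that scenario in Java (hypothetical numbers matching the example above):

  float total32 = 1_000_000_000f;   // pretend the running sum is already at 1 billion
  double total64 = 1_000_000_000d;
  for (int i = 0; i < 1000; i++) {
      total32 += 3.57f;             // 3.57 is below half the float32 spacing (64) at 1e9, so it rounds away
      total64 += 3.57;
  }
  System.out.println(total32);      // still 1.0E9: a thousand purchases vanished
  System.out.println(total64);      // roughly 1.00000357E9, as expected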


Well explained! All of the later contributions to the sum are effectively ignored or their contributions severely damaged in 32-bit because the "buckets" are big.

It was precisely this problem. The individual had done all data preparation/normalization in 32-bit because the model training used 32-bit on the GPU. It's a very reasonable mistake if one hasn't been exposed to floating point woes. I was pleased to see that the individual ultimately caught it when observing that 2 libraries disagreed about the mean.

Computing a 64-bit mean was enough. Compensated (i.e. Kahan) summation would have worked too.


Thanks for the explanation!


> At the very least, the loop based code could be cleaned up significantly.

Seems like the loop based code wasn't so bad after all...


This! If I had to choose between the two snippets I would have taken the loop based one without a second thought, because of its simplicity. The second snippet is what usually happens when people try to write "clever" code.


The loop by itself isn't entirely clear on what it's doing. Stuff like the direction of the > comparison, what to do vs. >=, and the byteCount / magnitudes[i] at the end really do require you to pause and do mental analysis to check correctness. I think the real solution here is to define an integer log (ilog()?) function based on division and use that in the same manner as the log(). That way you only do the analysis the first time you write that function, and after that you just call the function knowing that it's correct.
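
Something like this rough sketch (hypothetical name and signature; assumes value >= 1 and base >= 2, and returns the largest e such that base^e <= value):

  static int ilog(long value, long base) {
      int exp = 0;
      while (value >= base) {
          value /= base;
          exp++;
      }
      return exp;
  }

  // then use it the way the original used (int) (Math.log(bytes) / Math.log(unit)):
  // int exp = ilog(bytes, unit);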


I was reading this and thought it sounded familiar. A few months ago I needed a human readable bytes format, ended up on that stack overflow article and, plot twist, copied the while loop one.


In his defence, he did admit at the start of the blog post that he was code golfing.


Loop code has the same bug.


This is Java, not JavaScript. The exponents table was likely of integer type. Then it works.


How does that avoid the misrounding bug where 999,999 B gives 1000 KB (wrong) instead of the correct 1 MB?


D'oh, you're right.


Premature optimization strikes again.


There might be an opportunity somewhere around this area to combine the versioning, continuous improvement, and dependency management of package repositories with the Q&A format of StackOverflow.

Something like "cherry pick this answer, with attribution, and notifications when flaws and/or improvements are found".

Maybe that's a terrible idea (there's definitely risk involved, and the potential to spread and create bad software), but equally I don't know why it would be significantly worse than unattributed code snippets and trends towards single-function libraries.


NodeJS did something a lot like this by having packages that are just short snippets, but half the ecosystem flipped out when someone messed up `leftpad`.


Well that and because having 20,000 packages in your project is a PITA in various ways.

Mostly but not entirely because NPM handled things poorly in various ways.


Not sure if it's quite what you had in mind, but SO is starting to address the issue of updating old answers with the Outdated Answers Project: https://meta.stackoverflow.com/questions/405302/introducing-...


Very relevant, thank you!


Sadly updates don't just remove bugs, but sometimes also add them. Silently adding a bug to previously working code is a lot worse than silently fixing a bug you didn't know you had is good, so I wouldn't want to have a load of self-updating code snippets in my codebase.


> Sebastian then reached out to me to straighten it out, which I did: I had not yet started at Oracle when that commit was merged, and I did not contribute that patch. Jokes on Oracle. Shortly after, an issue was filed and the code was removed.

Good thing it wasn't a range check function. I hear those are expensive.


> I wrote almost a decade ago was found to be the most copied snippet on Stack Overflow. Ironically it happens to be buggy.

I don’t find it ironic, I find it quite normal that even small snippets of code contain bugs (given the daily review requests I receive).

I think when copying code literally from StackOverflow, what’s more important is understanding what the code does, and why, rather than pasting it verbatim into your production code.

I also often find on StackExchange et al. that quite often the most upvoted answer is the one that ‘fixes it’ for ‘most people’, yet the correct answer is down at number 3 or 4. Again, understanding the answer and why it applies helps give you the context to see whether this is actually the solution to your problem or whether it just treats the symptom.


What I realized years ago is that the upvotes on Stack Overflow don't mean "I tried this and it works for me" or "I'm an expert and this is the answer". No, the upvotes on Stack Overflow are along the lines of the upvotes/likes one would find on Reddit or HN. More like "you sound confident" or "I was looking for this but I haven't tried it yet".


> No, the upvotes on Stack Overflow are along the line of the upvotes/likes one would find on Reddit or HN. More like "you sound confident"

I think you're right that online scoring systems tend to incentivise false confidence. This happens with blog posts too, where a student of some topic writes a confident and subtly incorrect blog post, and it then ends up on the HN front-page. Only someone with a relatively deep knowledge of the topic can then call out the errors. Ideally it should always be made clear upfront that the author is new to the material.

Somewhat related: Stack Overflow's unfortunate norm of calling out mistakes in answers in a way that goes beyond confidence and strays into condescension and borderline hostility. For a lot of people it seems it's not enough to be seen to be right, they also feel the need to paint someone else as clueless, while just about passing as acceptably polite by keeping the aggression passive. If challenged, they'll brush it off as 'directness'.


Also to create a democracy.

> Our sites are all intended to be a sort of representative democracy. Moderator elections are an important part of that plan, but voting on questions and answers is the primary mechanism through which the community governs the site on a day to day basis.

https://stackoverflow.com/help/why-vote


And it can fall to all the errors and issues of democracy too - especially the “pseudo-expert” one. At least you can leave a comment on the answer if there’s an issue.


Experts can edit Stack Overflow answers (assuming their reputation is high enough).


The number of times I've seen the only correct answer being a terse explanation with a short code snippet and having zero upvotes astounds me.

They may not have been the attention seekers like other posters. But they provided exactly what was asked for. And when I come across their post years later I upvote.


> More like "you sound confident"

Meh, I'm usually there looking for how to do something, and if a response helps me do whatever I was looking to accomplish, or at least gets me on the right track, it was helpful and worth an upvote. I've never upvoted just because someone sounded confident.... at least not on SO.


There's also the possibility that the answer was correct when written. Especially with web stuff a year old answer could be completely wrong now.


It's a user experience issue and it's hard to solve. You can't possibly expect people to come back to one of their SO tabs AFTER they get code to work.


Why not? I don't upvote an answer unless it works...


I think it was sarcasm... just missing the /s


One of the best tips I have gotten from the internet is to never copy and paste code you have not written yourself. Even rewriting it verbatim makes you think about what it is you are actually copying.

It's a pretty neat rule to have in mind.


Ladislav Vagner, a legendary programming tutor at FIT CTU, is a known proponent of being extremely cautious when copying code, even your own. He gives programming proseminars where students guide him as he codes the solution to some problem, e.g. mathjax-like typesetting in C++. It is a common theme in the proseminars that a bug is introduced by copying code. Probably on purpose, like many of the other bugs that students are supposed to point out.


I think that’s true if you’re trying to learn a new tool or technology. You probably won’t learn as much following the Rails or Django tutorials if you’re just pasting all the code. But if you’re just looking for some esoteric workaround for some very specific tool and use case, I think it’s fine to paste. And the latter makes up the overwhelming majority of my Stack Overflow visits.


It's also good legal advice. It's now legally possible for you to copy and paste code directly from stack overflow because they made an effort to assert a compatible license over works published on their site. However, the same can't be said for most other code snippets flying around out there.


Good case in point: the license of the code in this exact article is very likely incompatible with your production code.


When you get into this habit it also makes it easier to translate solutions from other languages too


Yes, and:

I didn't really grok Test Driven Development until I worked thru the book, line-by-line, experiencing the workflow.

Knowledge vs experience.


Agreed. Otherwise it's not uncommon for me to not know what 80% of the code is even capable of...


> I also often find on StackExchange et al that quite often the most upvoted is the one that ‘fixes it’ for ‘most people’

And also first, or at least early, and subject to a reinforcing cycle of 'sufficiently good' or 'fixed it enough' that it achieves stratospherically more votes than an 'even more good' or 'fixes it properly' answer that came in too late for the same traction.


> also first, or at least early, and subject to a reinforcing cycle of 'sufficiently good' or 'fixed it enough'

So exactly the solution most project managers are after? /s


Ha, well even if there's an argument for that, it quite often leads to (and is even more reinforced by in terms of votes) the original asker 'accepting' the solution; then the answer that 'fixed it for them, first' is forever ranked highest, even if it rarely works for others and something else is more up-voted.


Remember the npm left-pad disaster? That code had bugs in it. See slide 44: https://www.slideshare.net/Kevlin/good-code-73714882


This works better when the problem does not 100% match the issue you are tackling. It makes you think about how you can reshape what you found into something useful.


IMHO any code that tries to perform floating-point arithmetic on integer values and then produce exact output should be considered suspect in and of itself...too many edge cases.


> The most copied StackOverflow snippet of all time

Maybe, but sounds like it's merely the Java snippet from SO found most often on github. Not sure why blog author didn't include the word "Java" in his title or the first paragraph:

> an answer I wrote almost a decade ago was found to be the most copied snippet on Stack Overflow

There is no evidence for this claim in the blog post, just that it's the "most copied Java snippet". And it's just based on occurrences in github. Maybe the most-copied snippet is an AWK or ffmpeg one-liner? Something that wouldn't find its way into a github repo. Or maybe something undetectably vanilla, like answers to "How do you write loops in language X?" Is there a way of finding out what actually is the most-copied snippet?


I don’t know how you’d find it, but if I had to bet it would be something to do with git.


This is a bit of a tangent, but while it may be conventional to round to the value with the smallest difference, is that convention good? In a case such as this, where it's fine for the precision to vary with magnitude, I'd argue it makes sense to round to the value with the smallest ratio.


The thing that jumped out at me, as I've seen the same kind of thing on the job, is the assumption that, eg, log(1000)/log(10) is exactly 3. Does the standard guarantee that the rounded approximation of one transcendental number by the rounded approximation of a related transcendental number will give 3.0 and not 2.999999999?


Yeah that seems like a serious flaw to me too. On my Python:

  >>> math.log(1000)/math.log(10)
  2.9999999999999996
  >>> int(math.log(1000)/math.log(10))
  2
But I don't know about the guarantees provided in the JavaScript standard (or more importantly those offered by actual browsers).


Floating point math is IEEE 754 in pretty much all cases, so you should see this result in most languages. `math.log(1000, 10)` gives the same result because it's implemented using natural logs internally as it is in most languages.

In this case, there's only about six boundary cases to consider so you can just manually verify it works as expected.
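
For instance, a throwaway check along these lines (Java, since that's the snippet's language) makes the boundaries easy to eyeball:

  // Sanity-check the exponents the log-based snippet relies on at each unit boundary;
  // an off-by-one shows up immediately in the printed values.
  for (int k = 1; k <= 6; k++) {
      double boundary = Math.pow(1000, k);                      // 1 kB, 1 MB, ... 1 EB
      int exp = (int) (Math.log(boundary) / Math.log(1000));
      System.out.println("1000^" + k + " -> exponent " + exp);  // expect k; anything smaller means the identity broke
  }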


I don't think this works very well as a cautionary tale because honestly, I would not even care about a bug like this. It's something very few people would even figure out exists at all, it's incredibly inconsequential.


You’ve made a judgement call that correctness isn’t a priority in this circumstance, and as long as your judgement is sound, this approach will serve you well.

The weakness of your approach is that no one’s judgment is sound 100% of the time.

Alternatively, folks who always prioritize correctness may occasionally “waste their time”, but two things to consider: 1) their judgement is no longer an issue, and 2) in the long run they have spent more time training their correctness muscles and are in better shape.


However this "bug" is still entirely within spec. The whole article hinges on "the 1,000 “significand” is out of range according to spec", but it really isn't. 999999 Bytes are indeed around 1000kB. Sure, calling it 1MB would be preferable, but saying 1000kB doesn't violate any rule, is perfectly acceptable according to SI (just like I can say 1000g or 1kg interchangeably), fulfills the task of being human readable, and in all the examples given in the task the code behaves as requested.

Sure, the code doesn't do exactly what the programmer wanted. But that doesn't necessarily make it incorrect.


This reminds me of my favourite SO answer:

https://stackoverflow.com/a/40429822/864112

It boggles the mind that anyone could ever suggest this as a solution.


Wow. Nice comment...

> This is akin to answering "how do I bake a cake?" with "open up a bakery, walk inside, and ask for a cake"

Only it's more like answering "how much does this cake cost" by purchasing the cake and looking at the receipt.


Nerds love terrible analogies for some reason, yours is much better.


An analogy is a high-level abstraction of one domain, projected onto a different domain. Of course nerds love them.

Or should I say... analogies are like Uber, but for metaphors. No. I should stop before writing that.


You'd think that is so completely wrong that no competent programmer would ever do it...

Except the java.net.URL.equals and java.net.URL.hashCode methods do almost the same thing: they issue DNS requests (!)

"Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null."

See https://docs.oracle.com/en/java/javase/11/docs/api/java.base...

There is a bug raised[1], but it can't be fixed for backwards compatibility reasons.

I'll never forget this now, after debugging a very horrible and severe and very intermittent performance issue in some code over 20 years ago. A (slow) DNS resolver occasionally caused 1000x performance degradation on remote sites. That was horrible to work out.

[1] https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4434494


I'm going to choose to believe that is a joke answer


Ew. Yikes. Wow, that's a special kind of special. The mind truly boggles. It had a score of 4 when I first checked it, but it looks like it got slashdotted by this thread/link and now it's -1 and probably still falling.

Good.


At the time I'm writing this, the answer got 3 upvotes, 13 downvotes and 1 delete vote since the link has been posted: https://stackoverflow.com/posts/40429822/timeline?filter=Wit...


TIL: that timeline feature is really neat! I don't see a way to get to it from the mobile UI though, will need to hunt around.


I'm intrigued, why? I find multiple places where this is the most basic implementation. https://realpython.com/python-requests/#query-string-paramet... https://www.geeksforgeeks.org/get-post-requests-using-python...

etc


The question was how to build a URL, not how to send off a request to it. The answer sends off a request and then inspects the response to see what URL was used. If you wanted to send off the request, inspecting the URL on the result is probably not useful. If you didn't want to send off the request, doing it this way is wasteful or even harmful.


Yes, but unless the answer was edited before the link to HN was posted, the user specifically said you should only do it if you _really_ intend to make a request, not just generate a url. It is conceivable you'd want to log the request or something and hence reference the url after the request call. Yeah, it is a bad answer to the question as stated, but it has a logic to it considering what might be the real motivation behind the question.

EDIT: Apparently the answer was edited by another user just recently making the clarification.


Interesting, thanks!


Your parent poster makes the mistake that is rife among Python programmers, which is to assume that in a delightfully simple[0] language like Python, it's "obvious" what a given piece of code does even without explanation. Among people who don't delight in the poor taste of Python's design and have limited exposure to its standard library, of course, it's not always obvious what a given piece of code is doing despite the belief to the contrary, so there are plenty of people who aren't going to pick up on the fact that requests.get doesn't just construct a GET request for the caller, but instead constructs such a request and then goes out and performs it, too.

Shame about the hostile reaction from others towards your question. Keep asking questions (especially things that are presented without comment), and don't be afraid that doing so will make you look stupid or that you should feel like you should be punished for it.

0. https://blog.imgur.com/wp-content/uploads/2017/06/mocking.jp...


> Shame about the hostile reaction from others towards your question.

I agree that your parent is a good question that should have been well received, but, as far as I can tell, it was. Where do you see a hostile reaction? In fact the only hostility I see in this thread is what you directed at Python, which, in this context, seems unmotivated; it is surely true that one can write code in any language whose full import isn't immediately apparent. At the moment, the only other response to your parent is from hvdijk, saying (https://news.ycombinator.com/item?id=27534803):

> The question was how to build a URL, not how to send off a request to it. The answer sends off a request and then inspects the response to see what URL was used. If you wanted to send off the request, inspecting the URL on the result is probably not useful. If you didn't want to send off the request, doing it this way is wasteful or even harmful.

This seems like a response that takes the question seriously and addresses it clearly, just as it should.


At the time I wrote the comment it was clearly not well-received, and the oblique opening remark from the person you're quoting is not the kind of response that addresses the question as it should.

That comment is, just like the downvotes the question received, precisely the sort of thing that discourages asking honest questions rather than welcoming them. Note that by the way it is written it, too, assumes that it is both obvious and understood that Python's requests.get will "send off a request"—instead of merely building a request and returning it to you. Rather than just straightforwardly answering the question (by explaining what that part of the requests library is actually doing—which is the relevant missing piece here, and which no one should be expected to know), the quoted comment ("The question was how to build a URL, not how to send off a request to it") pins the misunderstanding on the questioner by tacitly implying the questioner simply wasn't paying attention.

The comment, when considered in full and in context, actually has the effect of subtly discouraging/admonishing the questioner (and likeminded people with the same question) for failing to recognize something that is, to the sophomoric Python crowd, obvious and worthy of ridicule—which is what maest's thread was all about, by the way (and almost certainly why it got moderated).


upvoted


Say what you want about the stability of the npm ecosystem, but if this were JS, a new SemVer patch release could be cut, and it would be fixed in thousands of code bases essentially instantly.


and if they get pissed, they can also remove the package and break thousands of code bases essentially instantly.


NPM hasn't allowed unpublishing packages for years.


And then thousands of scripts that parse the old output would start to fail. :-)



Seems that most comments here missed the end of the article, where he points to the "production ready" version of the solution, which is indeed very close to the original one, including a while loop.


> ..."production ready" version of the solution, that is indeed very close to the original one, including a while loop.

What's missing still is a comprehensive set of test cases to check against.

If such cases were spec'ed to go along with the original code, then at least one could have seen the applicability range, and perhaps other people would have added some challenging corner cases (just as mentioned in the OP).


It's especially ironic given that this is about a StackOverflow code snippet that many people probably also copied without reading.


Neither is production ready because they have no code comments. And if ever there was code requiring comments, this is it.


This specific example is a good illustration of why you should keep code as simple as possible, and why even "simple" math needs special care when testing (because floats are not simple).


When I'm explaining logarithms, I find it helps to relate it to the number of digits. This code is a good example of the concept: you don't need log, just convert the int to a string and check its length. A string with 1-3 digits is bytes, 4-6 is kb, etc.
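
A rough sketch of that idea (hypothetical helper; base-10 units, non-negative input, and note it still has the 999,999 -> "1000.0 kB" rounding quirk the article is about):

  static String humanReadableByDigits(long bytes) {
      String[] units = {"B", "kB", "MB", "GB", "TB", "PB", "EB"};
      int group = (Long.toString(bytes).length() - 1) / 3;  // 1-3 digits -> B, 4-6 -> kB, ...
      if (group == 0) return bytes + " B";
      return String.format("%.1f %s", bytes / Math.pow(1000, group), units[group]);
  }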


> almost no branches

I wonder whether the author is suggesting that (potentially) nine branches is a small number, or they overlooked ternary expressions and function calls and are just counting the if statement.


There are tons of branches in those log and pow calls. Programmers are lost in branch free religion.


It seems to me that the elegant solution would involve an instruction that returns the first non-zero bit of the number - but I don’t know if such an instruction exists in assembly.


Basically lzcnt or leading zero count on x86: https://www.felixcloutier.com/x86/lzcnt
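
Java exposes it as Long.numberOfLeadingZeros, an intrinsic that typically compiles down to lzcnt/bsr. A rough sketch of using it for binary (1024-based) units, ignoring the boundary-rounding issue the article is about:

  static String binaryUnits(long bytes) {
      if (bytes < 1024) return bytes + " B";
      int group = (63 - Long.numberOfLeadingZeros(bytes)) / 10;  // 1 -> KiB, 2 -> MiB, ...
      return String.format("%.1f %ciB", bytes / Math.pow(1024, group), "KMGTPE".charAt(group - 1));
  }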


> It should be noted that on processors that do not support LZCNT, the instruction byte encoding is executed as BSR.

This. Boggles.


https://stackoverflow.com/a/43443701 - interesting - it seems to be a side effect of changing meaning of an instruction.


Why are you writing code? The question was for a static method in Apache Commons, not your "I'm so clever" implementation. Think the reading comprehension is flawed.

(Of course, this static method exists in Apache Commons, going back at least 20 years. But the author's fellow "code golfers" voted to the top an answer from someone who similarly had the irresistible urge to try to be very clever. It's a scourge on StackOverflow.)


I think that the answer is what I would expect based on the question title (which doesn’t mention Apache Commons, only Java), if I was another user searching for the solution to this problem. Maybe the question should have been renamed to indicate this, but as it stands, I do think a library-agnostic solution is more helpful to people finding the question than an answer which only works for Apache Commons.


Something about this comes off as amateurish. The obsession with minimization. Just use a switch statement. Now where is the bug going to hide? The solution doesn't need to generalize, there is only a small handful of different solutions. Just break them all out. It's more maintainable and readable and requires less thinking.


I don't know why you are being downvoted. The best solution here is for the author to admit using log was a bad idea and rewrite an entirely different version.


Coincidentally I had to write this today, and perplexingly none of the answers on that popular question seem to be as efficient and readable as what I wrote:

  function prettyPrintBytesSI(x: number): string {
    const magnitude = Math.abs(x)
    if (magnitude < 1e3) return `${x} B`
    else if (magnitude < 1e6) return `${x/1e3} kB`
    else if (magnitude < 1e9) return `${x/1e6} MB`
    else if (magnitude < 1e12) return `${x/1e9} GB`
    else if (magnitude < 1e15) return `${x/1e12} TB`
    else if (magnitude < 1e18) return `${x/1e15} PB`
    else return `${x} B`
  }


Honestly. The code is fine.

This is for presenting stuff in a user interface. Who cares if you can find some weird edge case using MAX_LONG and MAX_DOUBLE which never will occur in practice.


Until it does.


The original doesn't have types, but the modified version of humanReadableByteCount() uses a "long bytes" and as such will fail if the file size is (Long.MAX_VALUE+1), because it cannot even accept the correct size as its argument in the first place. Implementing these edge cases adds one more working case out of 2^63 (2^64 if negative file sizes are valid), when the bigger problem is using the type "long" when files of that size are possible on the target system.


Readability and maintainability matter far more.

And no - the edge case cannot occur, as it would require a file / whatever to have that size.


For example - what is the meaning of the parameter "si"?

I betcha that is going to give you more bugs than the edge cases discussed here.


The parameter "si" here stands for International System of Units, or SI for short [0].

0: https://en.wikipedia.org/wiki/International_System_of_Units#...


My top Stack Overflow answer of all time is a now rather dated two lines of JavaScript: how to tell if a variable is undefined. I posted this in 2010 and have been getting points for it steadily for a decade, now exceeding the amount I garnered from all other activities while actively using the site. I don’t do js anymore but from what I understand the answer hasn’t been accurate since 2016 or something.


The proposed solution is crazy. Logarithms are extremely expensive to compute. No way is this code more efficient than the loop it "replaced".

That there then are numeric stability issues and a pretty gross fudge factor is used to fix them worsens the situation. I would have a quiet word with any programmer I worked with that came up with this "solution".


Fun fact: The reason Safari/Chrome/Firefox had to freeze the macOS version in the User-Agent string is because of a code snippet from StackOverflow that assumed the version started with "10" that snuck its way all over the place, including the Unity web player and a major WordPress template.


It disturbs me that the author’s answer became the top answer. It didn’t have a loop but was less efficient than the already accepted answer, and worse, much more complicated for a human to read and understand. It seems we are always drawn to be clever, when perfection is found in simplicity.


Overkill. KISS. n counts moving the decimal left 3 places until there are <=3 digits left. Keep one place to the right.

"BKMGTPE". n==0 => 'B'; n==4 => 'T'.

810 has 3 digits. n==0, 810B.

999950 has 6 digits. n==1, you've got 999.9K

1100000 has 7 digits. n==2, 1.1M

1234567890 has 10 digits. n==3, 1.2G


Interesting how the same things crop up over time. I remember seeing this on HN when it was written in 2019 :)

https://news.ycombinator.com/item?id=21698619

Still, a good lesson!


What sticks out for me is that people don’t even rename functions when copypasting from SO.


I know everyone is on their high horse about copying code, but if you do copy code, keeping it the same is super valuable because inevitably, when someone comes along later and googles it, it will take them to where it was copied from originally.


When I'm "inspired" by code from somewhere, I go ahead and put the URL in there in a comment


Reminds me of this article on roman numerals https://kevlinhenney.medium.com/out-of-control-97ed6efa2818


I must admit I smiled at seeing that I edited the question, back in the day. :) Can't say I remember the question, and didn't know it has that epic feature of being the most-copied. Cool!


You have to be a terrible programmer to even consider using a base 10 logarithm for this.

Their proposed improvement is also terrible, since it divides multiple times unnecessarily, and checks for negativity multiple times unnecessarily.

The proper simple solution is of course a handwritten binary search with if-else blocks that starts with the most likely range, annotated with "likely" annotations, and a single division.

If this is the main task of the program for a while, and thus a large fraction of the cache can be dedicated to it, then solutions with large lookup tables are worth trying (obviously optimizing string formatting is also essential in this case).

This is why software is so often broken, there's a lot of incompetent people programming.


"You have to be a terrible programmer to even consider using a base 10 logarithm for this."

That's not a fair statement.

Good programmers are programmers who deliver value - who build robust, maintainable features in reasonable time that address user needs.

Whether or not you would quickly find the correct approach to this specific problem is a miniscule, pedantic detail in a giant ocean of programming skills and experiences.


You are replying to a satirical comment, I think.


You may think that, but I’m partial to the ‘for-loop version’ that everybody and their dog can understand.


About the "terrible" aspect of it, to quote: "Granted it’s not very readable and log / pow probably makes it less efficient than other solutions. But there were no loops and almost no branching which I thought was pretty neat." No need to insult the author over it.

About your "proper simple solution": I don't think that's a good idea either. Based on your next paragraph, the version you suggest with the handwritten binary search and "likely" annotations is for the case where the code isn't performance critical: for where the code is performance critical, you suggest a different solution. If the code isn't performance critical, please do not turn it into an unreadable mess over what would become a negligible overall performance gain. Write it in a simple, obviously correct way, keep it boring, and you'll keep it stable; you can use the time you save on fixing bugs in your super optimised version on improving more critical parts of your program.


Wait. I always use logarithms for this kind of task, how exactly does that make me incompetent and terrible? This is literally what a logarithm represents?


You're right, however I think they were going for a "branchless" version just to see if they could.


This is why it's a good idea to have a real integer type.


Isn't it impossible? Integers go arbitrarily large but computers don't.


It is possible within the limits of available memory on the computer. Rather than the usual limit of a fixed number of bits.

I've ironically found that big integer libraries sometimes optimize math routines more than string conversion. This was quite annoying for me when I optimized the factorial function in Ruby, and found that generating the string was my bottleneck. I then optimized that as well. :-)


They can go large enough for anything that matters.

A quick Google says there's an estimate of 10^78 to 10^82 atoms in the universe. That number would be able to be stored in well under 300 bits.


The universe is tiny compared to mathematics and software; even simple RSA keys are already 2048-bit.

Lots of problems suffer from 'combinatorial explosion' [1].

I recently learned about the Archimedes's cattle problem, the solution is of order 10^206544 [2]

[1] https://en.wikipedia.org/wiki/Combinatorial_explosion

[2] https://en.wikipedia.org/wiki/Archimedes%27s_cattle_problem


10^206544 would still take less than 100 kB to store. You could fit it on a floppy disk!


Computers also go arbitrarily large. Not infinite, but arbitrarily large.

A real number type could be bounded by the amount of RAM you have.


You can use disk storage too. And network storage.


Is Math.log really faster than a simple loop with just a couple of iterations after JIT compilation?


> Granted it’s not very readable and log / pow probably makes it less efficient than other solutions


When the author started introducing all the calculations to figure out if the printed version would round up, my first thought was, why not just look at the printed version? Print it as %.1f, then chop off anything past the period and re-parse it as an integer. Now you can trivially tell if it rounded up past the threshold and you need to bump the suffix.
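
Roughly this (hypothetical helper; base-10 units only, with the decimal point forced via Locale.US so the string can be split on '.'):

  static String formatThenInspect(long bytes) {
      String[] units = {"B", "kB", "MB", "GB", "TB", "PB", "EB"};
      double value = bytes;
      int exp = 0;
      while (value >= 1000 && exp < units.length - 1) { value /= 1000; exp++; }

      String printed = String.format(java.util.Locale.US, "%.1f", value);
      int whole = Integer.parseInt(printed.substring(0, printed.indexOf('.')));
      if (whole >= 1000 && exp < units.length - 1) {   // e.g. 999,999 B prints as "1000.0" kB
          value /= 1000; exp++;                        // bump to the next suffix and reformat
          printed = String.format(java.util.Locale.US, "%.1f", value);
      }
      return printed + " " + units[exp];
  }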


Wouldn't it be cool if you could call stack overflow answers directly from your code?



You can, it's called NPM


Shows how much people enjoy writing code (puzzle solving) and hate writing unit tests.


At the very least, the loop based code could be cleaned up significantly.



Update title: this is from 2019


Now the new code is unreadable.


It's as easy as "KMGTPE"


is it reasonable to assume a strong correlation between "copied code" and number of upvotes? #trolling #kiddingnotkidding


So it's not flawed (it does compute the correct result).

The author just thinks a completely unreadable (but supposedly faster) variant using logarithms is "better" than the simple loop used in the original snippet?

Write your code for junior devs in their first week at your company, not for academic journals.


I think you might have misread the post. His logarithm code became the most used snippet and had the bug.


His code snippet had rounding errors on the boundaries towards the next unit.

However he notes:

> FWIW, all 22 answers posted, including the ones using Apache Commons and Android libraries, had this bug (or a variation of it) at the time of writing this article.


Sibling commenters have already pointed out that you seem to have misread the post, but tbh I found it quite confusing to follow myself, so here's a summary:

- the first answer posted on SO was a simple loop

- the author posted a 2nd (supposedly faster but less readable) answer. The author didn't think this answer was better than the loop, but it seems the community did and it became accepted (and extremely popular). THIS is the version that was buggy.

The author later went back and fixed their own buggy version.

So yes there's an argument to be made that the very first simple loop was better, but that's orthogonal to the point of the story.


Can you please read the article all the way before commenting next time?

The log approach _is_ the most copied snippet.


> Write your code for junior devs in their first week at your company, not for academic journals.

Hard and fast rules about coding style are silly. There's a time and place for clever code, and there's a time and place for verbose and straightforward code.

I write performance-critical code. Juniors shouldn't be mucking about there, because it's performance critical. I also write non-performance-critical code with some effort. I write that stuff for the juniors.

When writing for academic journals, it looks like the stuff I write for juniors. I'll drop a hint here or there so experts can reproduce less-obvious optimizations.


He ends the blog post with this: "Personally I would not copy this snippet into production code."

He isn't trying to get people to use the log version.


You should almost _always_ focus on code readability and simplicity over inventiveness and cleverness.

Very few people I have encountered have complained about code being 'too simple' or 'too readable', but the opposite happens on a near daily/weekly basis.

Write comments, use a for loop, avoid global state, keep your nesting limited to 2-3 levels, be kind to your junior devs.


What does "use a for loop" mean here? Aren't for loops infamously difficult for new programmers to understand?


I think I successfully avoided writing for loops in application code for the last 3 years. I don’t miss them.


Floating point is really really hard to get right, especially if you want the numbers to be stable. Which begs the question, why the heck does JavaScript, the most used language in the world, not have an integer type? Sure, there's BigInt but that's quite clunky to use. I know it's virtually impossible to add by now, but I'd love an integer type for all my bit twiddling, byte munching needs.


I just feel that if you have bit twiddling, byte munching needs, JavaScript shouldn't be the language of choice. Doing that is a rather rare edge case, and if you're doing it for performance reasons, working in JavaScript is the much bigger performance problem.


If you're already using JavaScript for some other reasons but occasionally have bit twiddling, byte munching needs, then trying to do that in JavaScript makes perfect sense. Is it the fastest option? No. But according to https://benchmarksgame-team.pages.debian.net/benchmarksgame/... it is generally within a factor of 4-5 of C++.

For an application area where this applies, consider a web-based game. Using JavaScript keeps you from shipping another application. But occasionally you may have bit twiddling and/or byte munching needs. Which you need to do in JavaScript.


My point wasn't to say you should never do byte manipulation in JavaScript, but that for these rare edge cases the existing capabilities without a native int are fine, and if you really need the native int rather than some equivalent workaround (which I assume is most likely due to performance considerations), a faster language is the better approach. A web game might be able to use wasm for that.


That may be true but another case where floating point should be avoided is with money. Now think about all the times a web developer innocuously used a JS number to represent a price. I wouldn’t be surprised if floating point errors affected billions of dollars of transactions.
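
A quick illustration of the drift in Java (the exact printed digits may differ, but the shortfall is the classic one):

  import java.math.BigDecimal;

  double doubleTotal = 0.0;
  BigDecimal decimalTotal = BigDecimal.ZERO;
  for (int i = 0; i < 100; i++) {
      doubleTotal += 0.10;                                    // a hundred 10-cent charges
      decimalTotal = decimalTotal.add(new BigDecimal("0.10"));
  }
  System.out.println(doubleTotal);   // something like 9.99999999999998, not 10.0
  System.out.println(decimalTotal);  // 10.00 exactly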


The author's lookup table is incorrect.

The question being answered clearly wanted base2 engineering prefix units, rather than the standard base10 engineering prefix units.

suffixes = [ "EB", "PB", "TB", "GB", "MB", "KB", "B" ]

magnitudes = [ 2^60, 2^50, 2^40, 2^30, 2^20, 2^10, 2^0 ] // Pseudocode, also 64 bit integers required. (Compilers might assume unsigned 32 for int)


That is not the author's code. That is pseudocode for one of the example answers that he is improving on.

The author's code gives an option for the units:

int unit = si ? 1000 : 1024;


If you do this you should add an 'i' to the prefixes to denote that you mean the binary notation. e.g: kiB, MiB, GiB, TiB, etc


That code snippet is explicitly introduced in the article as not the author's.



