Why is this C++ code faster than my hand-written assembly (2016) (stackoverflow.com)
431 points by signa11 on Aug 22, 2017 | 188 comments



A couple of weeks ago I'd never heard of Peter Cordes. Now the linked article is the third time I've seen his work. He's doing a fine job of fixing Stackoverflow's low-level optimization knowledge. Not so long ago all I seemed to find there was people saying something like, "well, you shouldn't optimize that anyway", or, "modern computers are very complex, don't even try to understand what's happening".


> "modern computers are very complex, don't even try to understand what's happening"

One thing to keep in mind is that the CPU is no longer the CPU you think it is. While the x86 instruction set used to directly control the processor, today it's just a compatibility shim. Under the covers, the CPU is converting your assembly into its own microcode and it's that microcode which is doing the real work. For example, look at some of the block diagrams in section 2 of this Intel manual:

https://software.intel.com/sites/default/files/managed/a4/60...

Combine that with all kinds of other voodoo they've added--fancy caches, branch prediction, and so forth--and the "don't even try" maxim starts to make sense. I'd amend that to say "don't even try unless you're just curious."

Personally, I think assembly is a great learning tool, e.g. for understanding how pointers work, how the stack works vs. the heap, how function calls work, and so forth. I was exposed to Motorola 68K assembly in college and it completely changed my world--lots of C++ things I considered voodoo suddenly made total sense. But I've never used assembly in production code and probably never will.


Another thing to keep in mind is that the compiler is still pretty much the same compiler you think it is. This is the case because it is very difficult to codify all of the information in the Intel Optimization Manual [1]. It's harder still to use that information, and harder yet to codify, use, and use correctly all that information for multiple targets and architectures.

LLVM Tablegen is an approach for representation but it still requires the compiler writer to write everything down and then use it effectively and correctly. This is really hard and it doesn't get done. Instead someone ports a backend and then kinda sorta improves things until it seems better than it was.

Intel has a fine compiler which goes further. It sure isn't open source and it costs a few bucks. They're also contributing to LLVM now.

The reason "don't even try" makes sense, both for the compiler writer and the assembly writer, is that this stuff is really hard. Also, CPUs are damn fast and Intel (and ARM and ...) work really hard to solve problems dynamically (branch prediction, speculation, register renaming, caching, ...). In fact, you shouldn't even try this unless you know a lot. Agner Fog and Peter Cordes should be familiar names. But if you do know a lot, beating compilers is not difficult at all. Finding the right place to beat them? That's tough.

[1] https://www.intel.com/content/dam/www/public/us/en/documents...


Even though CPUs have got a lot more complicated, it isn't that difficult to make intelligent decisions about optimizations at a high level of abstraction. There are situations where it makes sense to delve deeper as well. Not only because you are writing assembly, but also if you want to handhold the compiler. There are absolutely situations where it is not worthwhile to worry about optimizations. The problem is that the internet peanut gallery is rarely in a position to give good advice about this.


> I was exposed to Motorola 68K assembly in college and it completely changed my world--lots of C++ things I considered voodoo suddenly made total sense

Heh. I had this the other way around - picked up 68k asm for fun long before C/C++, and couldn't understand why people had such difficulty with the concepts in those.

I've never learned x86 - partly because I worry that I'd enjoy it too much and waste oodles of time "optimizing" things that don't need it. Probably useful to be able to at least read it, though.


The x86-64 bit isn't bad. You get 16 registers (half named, half numbered---historical baggage) and opcodes galore (more than even the VAX architecture). The x86-32 bit is a mixed bag (8 registers, the specialized opcodes may or may not be worth using depending upon which chip you use), but it certainly helps to know the history (the 8086 and 80286) to understand why it's the way it is.

A few weeks ago I tried my hand at optimizing some code using the vector instructions. While I was able to beat clang quite easily using assembly, I just could not beat GCC. That was sobering. There's probably some 1000 page tome I have to read to understand where I messed up in optimizing.


Is the x86 family the only processor family that had the segment:offset thing?



x86 and 68K are night and day. x86 is no fun at all; there is way too much legacy stuff.


The CPU is no longer the CPU that the CPU designers (often) think it is. A few years back someone who worked on CPUs at Intel related how his group pulled their hair out trying to characterize how a small section of code in a loop could have wildly different performance characteristics depending on the preceding instructions.


The Motorola 68K was nearly ideal for assembly programming. You're lucky to have learned on it.


Oh very much so, 68K was great. Even better, at my first job we had a custom 68K-based embedded device and no source-level debugger, so you could only debug in assembly. That reinforced everything I had learned just a few years before. Our next device was MIPS-based, and MIPS was a great RISC system to learn on. I count myself very blessed.


For someone attempting to learn 6502/10 assembly, how's the C64 for this?


"Don't even try to understand" _never_ makes sense. In this context it's the mantra of idiots who want to put dog-slow abstractions into C++ (among others). You have tried and you understand micro-ops exists, that branch prediction exists etc. This information is useful when optimising.

If you never need to optimise C or C++ code why are you using these languages? Seriously? You've almost certainly got the wrong tool for the job. (Ok sometimes someone has made a legacy decision and you're stuck with it for bad reasons. Meh).

If you don't try to understand it, you won't know things like "Never use a linked list if you care about performance at all." Even Bjarne has twigged to that one (at last!). C and C++ are languages you use when you need performance, understanding the CPU as best you are capable is crucial for performance. DO try. In fact it's insane not to work on your understanding.

Don't even try to justify a statement of "don't even try to understand". It's wrong thinking from top to bottom. Here, as everywhere. No really.


As someone who has used assembly in production code, I am much more than merely curious. I would like to assemble / generate microcode directly. I would like to know exactly what the CPUs are doing, and not rely on voodoo.


I have the same instinct, but I think in a few years we'd be in the same position, where what we write doesn't directly correspond to the architecture of the CPU. The CPU people are always going to work to make existing code faster, and since it's impossible for many generations of CPUs to "be just like the last generation, only faster", that's inevitably going to involve some kind of translation layer between your code and the CPU's "real architecture."


Won't the translation layer reduce some of that extra speed? And if so, why is it done? so that people can keep programming to the same instruction set, even though the underlying (micro)code and hardware architecture has changed?


Here's my informal and possibly screwy understanding of why translating to microcode doesn't really slow things down: When you've got a bunch of tasks to do, breaking the task-execution process into a pipeline of steps may potentially increase the latency (how long does it take to do a single task), because, well, you've now got to go through several steps to do that task. But (assuming your pipeline is designed well), it will also increase the throughput (how many tasks can you do in a fixed amount of time).

In software, it's likely that latency is not going to be a problem as long as you can get more throughput (you probably don't care if the time between when the program hits your opcode and the CPU actually executes your instruction is 1 nanosecond later than it conceivably could be as long as the number of instructions executed per millisecond is greater).
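
To put toy numbers on that (purely illustrative, not from the comment above): with a 4-stage pipeline where each stage takes 1 ns,

    latency    = 4 stages x 1 ns = 4 ns before the first instruction completes
    throughput = 1 instruction finished per ns once the pipeline is full

so chopping the work into stages costs a little latency up front but multiplies how much gets done per unit of time.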

In any case, yes, you need to expose the same machine code interface, not just so people can write new software using the same code, but so that old code continues to work. It would be a nightmare if every few years there were incompatible changes to the instruction set.


Yes, could be, interesting reason.

And your last paragraph makes sense for why the same interface.


You're basically asking why Intel doesn't expose their internal uops as an ISA that you can program in. See my answer on a SO question asking exactly that. https://stackoverflow.com/a/32866797/224132

The other answers on that question make some good points, too.

Most instructions decode to a single internal uop already, so x86 machine code is just a compact way of programming in uops.

Things like variable-count shifts are multi-uop because of x86's legacy CISC flag-handling, but BMI2 fixes that by giving you no-flags shifts that do decode to a single uop.
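
As a rough illustration of that (not from the parent comment; assumes GCC or Clang targeting x86-64 with -mbmi2):

    #include <cstdint>

    // The legacy encoding `shl rax, cl` has to leave the flags untouched when
    // the count is zero, which makes a variable-count shift multi-uop.
    // With -mbmi2 the compiler can emit a single-uop, flag-less `shlx` here.
    uint64_t shift_left(uint64_t x, unsigned n) {
        return x << (n & 63);   // masking keeps the shift well-defined in C++
    }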

You're usually not missing out on much.

Also, there's plenty of voodoo besides just translation to uops. Programming in uops directly wouldn't help you figure out why there seems to be some kind of limit on register reads per clock (http://agner.org/optimize/blog/read.php?i=415#852) for example, since the uops you'd program with would still have register renaming done to them as they're issued into the out-of-order core.


In that case, all I can advise, is to never get involved with GPU programming :)


If I ever see Knuth's quote "premature optimization is the root of all evil" in response to a question again, I think I'll puke. Not only is it hard for outsiders to know what's premature and what isn't, but sometimes it's nice to make a habit of doing things the faster way when you have two choices that are otherwise indistinguishable. For example I try to use ++x instead of x++, even though 99% of the time it makes no difference.


In my opinion, any optimization done before taking the naive approach is usually premature optimization.

That might sound a little extreme, but in the past 5 years I've run into exactly 1 problem that was solved by busting out the profiler and optimizing. In that same time, I can't count on all my digits the number of features that didn't ship, estimates that were overshot, deadlines that were slipped, etc etc. I've even been part of a team that ran out of runway while popping open jsPerf to choose between !! and Boolean(). Our app was fast as hell -- too bad no one will ever get to use it.

If you're expending cycles choosing between ++x and x++ and you're not ahead of schedule, please stop.


That was my point, I'm not expending cycles choosing between ++x and x++. I've just chosen a different default than most of the code I've seen, and you still need to realize when the default doesn't do what you want - but that's usually obvious.

Sorry to hear about your unsuccessful projects, that's a bummer. I hope that premature optimization wasn't a major part of the blame for any of them.


It's so irritating that golang has x++ but not ++x. I never remember until shit isn't compiling. Grr!


This vastly differs depending on what software you write. I have done lots of necessary performance optimizations. Of course, there is a balance. To say never first optimize, is as bad as optimizing everything, in my opinion. It should require a case by case "study" of whether it is worth it. This kind of judgement comes with experience.


Writing performant code by using the correct idioms, data structures, algorithms, etc, from the start is just common sense rather than 'premature optimisation'.

Writing unreadable, micro-optimised code in the name of performance without even attempting to profile it first is another matter.

My personal rule (as a not-very-good hobbyist) is that if I have to refactor everything in order to accommodate an optimisation in a path that's already 'fast enough', or introduce some arcane hackery that only makes sense with comments into otherwise clean code, then it must be backed up with realistic benchmarks (and show a significant improvement).


I was with you until you got to the ++x/x++ example. It is a bad example because they generate the same machine code except in situations where they are semantically different.


For simple types that's true, and that's part of the 99% I was talking about. Where it makes a difference is class types, C++ iterators being a prime example. Maybe the optimizer can make the two cases generate identical code, but why take the chance?

I chose that example because it didn't need a full paragraph to explain, but you're right that there are probably better ones. Edit: maybe a Python example? ''.join([a, b, c]) is faster than a + b + c, but again 99% of the time you won't be using it in performance critical code. But it's a useful idiom to know.
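
A minimal sketch of why it can matter for class types (an illustration, not part of the original comment):

    // For a class type, post-increment conventionally returns a copy of the old
    // value, so it has to materialize a temporary that pre-increment avoids.
    struct Iter {
        int* p;
        Iter& operator++()    { ++p; return *this; }                  // pre: no copy
        Iter  operator++(int) { Iter old = *this; ++p; return old; }  // post: copies
    };

In a plain loop the optimizer will usually elide the unused copy, which is the "99% of the time it makes no difference" above.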


From what I remember, for such small examples, a + b + c is significantly faster. (The CPython folks might have done some optimization work on this a few years back?)

It’s only if you plan to concatenate a whole lot of strings that join is a clear winner.


> Not only is it hard for outsiders to know what's premature and what isn't,

My rule of thumb is if you haven't profiled it, it's premature to try and optimize it.

> but sometimes it's nice to make a habit of doing things the faster way when you have two choices that are otherwise indistinguishable.

Now this is a good habit to be in.


> My rule of thumb is if you haven't profiled it, it's premature to try and optimize it.

The context was specifically about questions on StackOverflow. You don't know if the person asking the question has profiled it or not, and the assumption is often that they haven't. Probably true more often than not, but very condescending and unhelpful to the person who has.


especially considering that people usually do not quote the following part of the quote : "Yet we should not pass up our opportunities in that critical 3%".


> For example I try to use ++x instead of x++

++x can be slower, depending on the context -- it introduces a sequence point.


Really? If you have the time, I'd love an explanation. Or if you can point to some other resource that would be great too.


I remember seeing a blog post from Andrei Alexandrescu that I can't seem to dig up, but this SO post seems to be a nice summary [1]. In short, in 99.9999999% of usages post increment is probably better.

[1] https://stackoverflow.com/a/38947878


Thanks. I guess I'm applying this rule only when the result of the increment isn't being used directly, so there is no dependency on the operation. When the result is used, the semantic difference between ++x and x++ obviously determines the choice.


Just look at the liveness of the expression x+1: post-increment means the expression has to be alive before whatever uses it, whereas with pre-increment it can be delayed until the next usage of x.


Ah, hindsight is 20/20. It's easy to criticize premature optimization after the fact.


I rather take "correctness and safety before performance", and only if it doesn't meet the performance criteria of an user story, then with the help of a profiler analyse what are the next steps.

99% of the time I don't need to worry about the profiler.


Anything one line is prone to being misinterpreted. IMO, people who're still learning shouldn't be bombarded with a million different things, so one liners work for them. People who're experienced, realize the intent of that one liner and will easily know enough to know when it applies.


> I try to use ++x instead of x++, even though 99% of the time it makes no difference.

I think 100% of time is closer to truth, unless you're using ancient or embedded 8/16-bit compilers.


That hasn't really been a thing since the PDP-11, which had an auto-increment option on the register used for pointer offsets. That's why that feature is in C. It really mattered in 1978.


Interesting, so the C creators used a feature in their language, that was hardware-specific. I thought (from having read K&R) that one goal of C was to be hardware-independent. Maybe this was an exception, or maybe that auto-increment was common to many computer designs at that time.


Wikipedia states that its inclusion in B (the ancestor to C) was independent of the PDP-11 instruction set, which hadn't been invented yet [1].

[1] https://en.wikipedia.org/wiki/Increment_and_decrement_operat...


Cool.

Wow, B. A blast from the past, sort of. I had read a good book about BCPL (the ancestor to B) many years ago. IIRC, it was by Martin Richards, inventor of BCPL. Pretty interesting book and language. BCPL and B were both typeless languages, or languages with just one type, the machine word (16 or 32 bits, don't remember). Still I found that many algorithms and programs were expressed rather compactly in BCPL - or so it seemed to me at the time. Was quite junior then, and without exposure to more advanced programming languages - only knew BASIC and Pascal, probably; even C, I only learned a bit later.

https://en.wikipedia.org/wiki/B_(programming_language)

https://en.wikipedia.org/wiki/BCPL

https://en.wikipedia.org/wiki/Martin_Richards_(computer_scie...

Also just saw some other interesting stuff from the BCPL article above:

[ BCPL is the language in which the original hello world program was written.[3] The first MUD was also written in BCPL (MUD1).

Several operating systems were written partially or wholly in BCPL (for example, TRIPOS and the earliest versions of AmigaDOS).

BCPL was also the initial language used in the seminal Xerox PARC Alto project, the first modern personal computer; among other projects, the Bravo document preparation system was written in BCPL. ]


naye.

Even for the following trivial function:

  #include <vector>
  #include <string>
  std::size_t foo(const std::vector<std::string>& vec)
  {
    std::size_t n = 0;
    for(auto it = vec.begin(), end = vec.end(); it != end; ++it)
    {
        n += vec.size();
    }
    return n;
  }
MSVC 2017 generates different assembly for ++it vs it++, at the highest optimization level.


That's sad. Both gcc and clang at -O3 can optimize it++ to ++it (see [1]).

Hopefully MSVC will improve soon.

[1] https://godbolt.org/g/pxzVdP


That's very surprising, perhaps a compiler bug? Does it perform any differently?

Just tried gcc 7.1, both post and pre-increment compile identically.



For the it++ case:

    std::_Vector_const_iterator<std::_Vector_val<std::_Simple_types<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > >::operator++, COMDAT PROC
            mov      rax, QWORD PTR [rcx]
            mov      QWORD PTR [rdx], rax
            add      rax, 32              ; 00000020H
            mov      QWORD PTR [rcx], rax
            mov      rax, rdx
            ret      0

For the ++it case:

    std::_Vector_const_iterator<std::_Vector_val<std::_Simple_types<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > >::operator++, COMDAT PROC
            add      QWORD PTR [rcx], 32      ; 00000020H
            mov      rax, rcx
            ret


Interestingly, the code you quoted is not called at all. I guess the linking phase eventually removes it as dead code.

That considered, both pre- and post-increment generate identical code, even with VS2017.

This matches my previous experience about pretty much any compiler in last 15 years or so -- there's no difference between ++i and i++, unless, of course, it's in a statement and changes the actual meaning of code.

"it++" case. Note that iterator function is not called.

  foo PROC
        mov      rdx, QWORD PTR [rcx]
        xor      eax, eax
        mov      rcx, QWORD PTR [rcx+8]
        cmp      rdx, rcx
        je       SHORT $LN70@foo
        mov      r8, rcx
        sub      r8, rdx
        sar      r8, 5
        npad     8
  $LL4@foo:
        add      rax, r8
        add      rdx, 32              ; 00000020H
        cmp      rdx, rcx
        jne      SHORT $LL4@foo
  $LN70@foo:
        ret      0
  foo ENDP
Here's the code generated for "++it" case. Iterator function is not called here either.

  foo PROC
        mov      rdx, QWORD PTR [rcx]
        xor      eax, eax
        mov      rcx, QWORD PTR [rcx+8]
        cmp      rdx, rcx
        je       SHORT $LN68@foo
        mov      r8, rcx
        sub      r8, rdx
        sar      r8, 5
        npad     8
  $LL4@foo:
        add      rax, r8
        add      rdx, 32              ; 00000020H
        cmp      rdx, rcx
        jne      SHORT $LL4@foo
  $LN68@foo:
        ret      0
  foo ENDP


Good catch! Thanks for the correction.


Exactly.


> but sometimes it's nice to make a habit of doing things the faster way when you have two choices that are otherwise indistinguishable.

That is virtually never the case.

> For example I try to use ++x instead of x++

Those have wildly different semantics.


Yes, but if used as their own statement and not used as some bigger expression (parameter to function, etc.), any decent compiler will compile them the same way.


Yes, and thus a perfect example of premature optimization; semantics should define default usage, not some possible micro optimization. This is exactly the kind of case Knuth is talking about: going around doing ++x because you think it's faster, when the standard idiom is x++, is premature optimization.


x++ is not the "standard idiom".


In my experience x++ is far more common than ++x when it doesn't matter which you use, "for" statements for example.


Yes, it is.


I meant no more than ++x.


I know what you meant, I don't agree. Scour any large codebase, x++ is far more idiomatic than ++x.


What about top C++ books? Also when it comes to advancing iterators...


Well, there is no reason for any standard idiom there.

Those `++` and `--` operators are something to avoid anyway.


Notice how the asm changes here if changing it++ to ++it: https://godbolt.org/g/W5w7Dc


Looks like the only thing that changes is the label name, which doesn't matter.


nope, there are fewer instructions


Except that those instructions are not actually called in either case from function foo(). So they're dead code and the linker will remove them.


> Not so long ago all I seemed to find there was people saying something like, "well, you shouldn't optimize that anyway

I find this "Why would you even do that?" attitude in StackOverflow to be very irritating. And it's not just in low-level optimization questions. StackOverflow should be a place for questions and answers, not a collection of lessons on corporate best practices where curiosity and experimentation is discouraged.


I've had the same annoyance in IRC channels - getting an unhelpful lecture on the right way to do things from people who don't fully understand the use case or that "best practices" might not always be worth it or matter, especially in a small script/application


As someone who used to hang out in a few language channels on freenode, I completely understand why the lectures exist. You may have a good reason for asking the question you're asking, but the overwhelming majority of the time when a nick I don't recognize asks an esoteric question about low level optimization, it's because they are new to the language (and often programming in general), and have gone far down a rabbit hole they don't need to go down before asking for help.

When I tried to answer these kinds of questions, I always tried to start by asking questions to make sure I did fully understand the use case and the application. In my experience, a large percentage of people, regardless of how valid their question is, refused to give enough detail to allow me to actually understand their specific use case.

After repeating that cycle several times a week for a few years, it becomes very tempting to just shut down the weird esoteric crap with a canned response that might get newbies pointed in the right direction. A small handful of people will stick around to explain why they really do understand their problem and need an answer to the question they are asking. A small handful of newbies will take the advice of the canned response and learn from it. The rest were largely not going to lead to an interesting conversation no matter what.


I agree. Programming forums are full of this crap. I hate asking a pretty straightforward question online and getting that patronizing "But what are you really trying to do?" response.

What I'm really trying to do is get an answer to this specific question. I'm not asking you to re-think my problem statement. Thanks for not helping.


At the same time, we're getting a lot of people with XY problems. They think they know what the problem is and they think they know what they have to do, while the real problem is somewhere else. If someone's intent with a question isn't clear, then "But what are you really trying to do?" is a valid question.

It's impossible for anyone to just know what your intentions are and whether you understand the deeper semantics of a problem. More often than not, people try to do things for the wrong reasons.


I dunno, I've been on both sides of this. I agree it's really annoying when someone waltzes around without answering my question. But I've also seen people asking questions which sounded goofy to me, and when you finally get them to tell you what they are actually trying to do, there's a much easier and more direct approach.


I think it's great for people on the "answering" side to indicate that they're willing to go beyond the literal question and offer more general support.

But as soon as the asker has turned down an offer of that sort once, I really really wish people who aren't going to answer the question being asked would stay out of the conversation.


Just say it's for Project Euler. That way they just accept that any optimizations you're asking for actually need to happen.


> "modern computers are very complex, don't even try to understand what's happening"

I live my life by that adage


TL; DR: > If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code.

Once (maybe 25 years ago?) I came across a book on assembly language programming for the Macintosh.

The authors wrote a circle-filling graphic routine which internally calculated the integer square root in assembly language, drawing the circle using the y = sqrt(r * r - x * x) formula!

What is more, the accompanying description of the function in the book featured sentences boasting about how it draws a big circle in a small amount of time ("only" a quarter of a second, or some eternity of that order) because of the blazing speed of assembly language!

How could the authors have used, say, MacPaint and not been aware that circles and ellipses can be drawn instantaneously on the same hardware: fast enough for interactive drag-resizing?


Bresenham's line algorithms and the adaptation of the general principle to circles and arcs are absolute gems. I've used those over and over in the first two decades of my career and I never ceased to be impressed with the elegance and speed.

Surprising that people writing a book 25 years ago would not have been aware of this work.

https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm

https://en.wikipedia.org/wiki/Midpoint_circle_algorithm
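
For reference, a minimal sketch of the integer-only midpoint/Bresenham circle those links describe (plot() is a stand-in for whatever sets a pixel):

    // Walk one octant with an integer decision variable and mirror it 8 ways;
    // only small constant multiplies (which reduce to shifts), no square roots.
    void midpoint_circle(int cx, int cy, int r, void (*plot)(int, int)) {
        int x = 0, y = r, d = 3 - 2 * r;
        while (x <= y) {
            plot(cx + x, cy + y); plot(cx - x, cy + y);
            plot(cx + x, cy - y); plot(cx - x, cy - y);
            plot(cx + y, cy + x); plot(cx - y, cy + x);
            plot(cx + y, cy - x); plot(cx - y, cy - x);
            if (d < 0) { d += 4 * x + 6; }
            else       { d += 4 * (x - y) + 10; --y; }
            ++x;
        }
    }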


Back in the 90s there was a series of books, "Graphics Gems", that was a great resource for learning these tricks. Most people are further removed from drawing directly into a frame buffer today.

https://www.amazon.com/Graphics-Gems-Andrew-S-Glassner/dp/01...


Yes, it's absolutely brilliant. The line drawing algorithm is also applicable to other problems, whenever you need to interpolate between two integer values.


It's so useful I embedded it in a library module called 'dda' (for digital differential analyzer), and then used that module in all kinds of applications. It's one of those Swiss army knife subroutines, you end up using it in places not even remotely related to rendering bitmaps.

https://en.wikipedia.org/wiki/Digital_differential_analyzer_...

I have two-D and three-D versions of it.


Bresenham is applicable to other conic sections and functions.

I proved this back as an undergrad: I used Bresenham to plot the y = K/x hyperbolic curve.

I had this idea that since 1/x can be interpolated with Bresenham without doing division, somehow that could be applicable to the perspective transformation when walking over texture maps in 3D rendering.


Hehe. That's so cool, this is something I did without knowing any of the formal math behind it when writing a small 3D game engine (after seeing Doom). It seemed to be the shortest path to a solution and it worked very well.

Then, after getting it to work I replaced the interpolator with a bunch of assembly starting from the intermediary representation the compiler output.

I unfortunately didn't date that source file but I do remember I was living in Amstelveen when I wrote it so this was about 23 years ago, summer of '94.

We made the textures with one of the first affordable and commercially available digital cameras:

http://www.digicamhistory.com/1991.html



How are rounded corners on rects harder than circles and ovals?

Is a rounded rect not just a circle sliced with straight lines between the quarter circles?


I get the feeling that what you mention is the precise realisation that Atkinson had when he figured out a way to draw them quickly. It's quite possible that he was initially thinking of ways to describe rounded rects using a single equation. Even one as undeniably brilliant as Atkinson can get stuck in thinking and overcomplicate things. A good night's sleep can bring a fresh perspective and make simplicity much more apparent.


I think he came to this realization while looking at all the examples Jobs was showing him, which is why he suddenly agreed to give it a try.


I needed to create a circle drawing algorithm for some purposes a long ways back. I was in middle school at the time.

While speed wasn't a huge issue, consistency was. Simply using sin/cos to draw the circle works, but it can lead to jagged and inconsistently placed pixels if your degree step isn't perfect. Even with my middle school level math, I realized I could just iterate over the bounding box for the circle and just use r^2 as a threshold on x^2+y^2 to determine whether a pixel should be on or off - no square root required, since it's totally redundant.

It has the bonuses of using only integer math (consistency!) and only requires slight changes to yield filled or unfilled circles. It's also almost as fast as drawing a simple filled rectangle!
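
Roughly what's being described, as a sketch (plot() is again a stand-in for a pixel-setting routine):

    // Filled circle: test every pixel of the bounding box against r^2.
    // Integer math only, no square roots.
    void threshold_circle(int cx, int cy, int r, void (*plot)(int, int)) {
        for (int y = -r; y <= r; ++y)
            for (int x = -r; x <= r; ++x)
                if (x * x + y * y <= r * r)
                    plot(cx + x, cy + y);
    }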

The idea that the author could miss such a simple algorithm baffles me.


> iterate over the bounding box for the circle and just use r^2 as a threshold on x^2+y^2 to determine whether a pixel should be on or off -

Jeff Tupper's GrafEq software does this kind of plotting.

Over the development of this, Jeff Tupper came up with a quine concept: a formula whose f(x,y) thresholded plot reproduces an image that can be interpreted as its math notation.

See here:

https://en.wikipedia.org/wiki/Tupper%27s_self-referential_fo...

The Devil is in that 543 digit integer. :)

(It's been submitted to HN a few times already.)


tl;dr -- the asm author used DIV to divide by a constant 2

More fundamentally: it's theoretically possible to at least match compiled code performance with assembly, because you could just write the code the compiler generates.

BUT, it requires a LOT of experience.

Modern compilers "know" a lot of optimizations (e.g. integer mult by fixed constant --> shifts, adds, and subtracts). Avoiding pipeline stalls requires a lot of tedious register bookkeeping, and modern processors have very complicated execution models.

It's almost always better to start with a compiler-generated critical section and see if there are possible hand optimizations.


In University computer architecture courses, we were challenged to write a quicksort routine in assembly. We were also asked to compare the assembly that we authored with assembly that was compiled from C++ (after we authored our own solutions, of course).

It was an amazing crash-course on just how good compilers have become at optimizing. Not a single student could hand craft assembly that was faster than the compiler output. The teacher of the course was able to generate assembly that was slightly faster, and he stated that in order to do so, he had to greatly exploit his in-depth knowledge of the processor's pipeline system. That was roughly year 2000, and I'm sure compilers have only become better at their job since then.

All in all, excellent learning experience. I've since encountered several instances where developers assert superior assembly skills, and by default I'm silently skeptical of their claims.


Gathering and exploiting in-depth knowledge of a CPU's internals has become more difficult over time, too, I think.

At least for x86/amd64 - with out-of-order execution, branch prediction and whatnot, one not only has to know the architecture, but one has to know the specific implementation the code will run on. And knowledge of the deep internals of CPUs made by Intel or AMD (or Via? are they still around?) is not easy to come by.



In a prior job I worked as a C programmer, and once I tried to rewrite a few heavily used utility functions in assembly.

I measured, but it did not run any faster. Damnit, this is assembly, I said, it has to be faster. So I looked at the assembly code generated by the compiler: turns out it was pretty much identical to the code I had written. At that point, I felt a brief surge of pride (because I was as clever as the compiler) and then disappointment (because I was not more clever than the compiler), and I figured trying to be smarter than the compiler was a waste of my time.


> It's almost always better to start with a compiler-generated critical section and see if there are possible hand optimizations.

Yeap. But once you do, it's almost always easy to get some efficiency gains, because you understand what the code intends, and the compiler does not.

Surprisingly it doesn't take much experience to do this: just a profiler, and a willingness to try different things until you find something that works.


> you understand what the code intends, and the compiler does not

It seems to me that the best approach to this would be to feed more information to the compiler rather than writing assembly yourself. Otherwise you give up portability.


You are right. A lot of times you can rearrange the code in various ways, and the compiler will happily generate more efficient code.

When that's not possible, the way to keep it portable is by writing a generic C function, then writing an optimized version of the same function that will be compiled when that architecture is available.

At work it's rare to need to compile the same code for various architectures, but sometimes it happens.


That's generally better when it's possible, but it's not always possible.

If you do start writing assembly, you usually want to have it exist next to a higher-level version of the code that you can toggle on and off. This lets you maintain portability (just compile the higher-level version on platforms where you don't yet have assembly) and makes it easy to try out new optimizations that wouldn't require assembly.



Your use of the phrase "critical section" threw me off for a second. All good now.


Sorry, "hot spot" would be more accurate.


What did you think of first?


Presumably the usual technical meaning: https://en.wikipedia.org/wiki/Critical_section

i.e. the part protected by a lock() / unlock() in multi-threaded code.


That deserves a place on the CS bad names gallery, just right of "strong type system". There isn't anything even critical about it, just exclusive.

The usage in the context of optimization is much more deserved, although I never heard anybody calling it "section" before, just critical code, critical loop, or stuff like that.


But isn't that exactly what OP meant?


Compelled to chime in here and say that you can just as easily write bad HLL that the compiler can't do much with in the first place.

Also, back in 2016 I participated in a pseudo-challenge on the fasm board [1], and it is trivial to optimise both the C/C++ and the hand-written assembly. IMO, comparing how good a compiler is at optimising is akin to comparing how good it is at figuring out your intentions (and they all suck at that).

1: https://board.flatassembler.net/topic.php?t=19103


The issue is that division by a constant of 2 - the case where it's highly optimisable - must already be embodied in the algorithm at a higher level.

Essentially you must have a piece of code where that constant is just a parameter to the algorithm that suddenly when set to a value of 2 makes it that much more efficient.

Now here's the bummer: if you had already planned for it, then writing custom assembly for it is not going to make the algorithm that much faster. But when you didn't - you're rewriting the whole of the algorithm anyway.


But on x86, division by 2 on an unsigned integer is the same as shifting to the right by 1. So instead of

    even:
        mov rbx, 2
        xor rdx, rdx
        div rbx
the OP should've done something like

    even:
        xor rdx, rdx ; possible to remove?
        shr rbx, 1


    shr rax, 1
div rbx will divide the 128-bit value in (rdx, rax) by rbx, then store the quotient in rax and the remainder in rdx.


Your second snippet will just zero rdx - don't confuse the registers (rbx is in the mix in the original :)


Thanks :) Fixed.


Nice post. It really surprises me that someone who writes assembly for fun doesn't know that dividing by constants can be done faster, nor that dividing/multiplying by powers of two can be done even faster.
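
For reference, this is the shape of what "dividing by constants can be done faster" turns into (a sketch; the magic constant is the standard one compilers use for unsigned 32-bit division by 5):

    #include <cstdint>

    // x / 5 without a div: multiply by ceil(2^34 / 5) and shift right by 34.
    uint32_t div5(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDull) >> 34);
    }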

On the topic of hand-optimizing or not: There always seem to be two camps. The first say: "Never bother with assembly. Let the compiler do optimizing. If anything, try to write code that the compiler can optimize.".

The second camp always states something like "It is not that hard to outperform the compiler.".

I guess the complicated pipeline models are mostly the reason for the complexity. I'm not sure how much compilers do to avoid pipeline stalls though. For example, how detailed the execution model that they use is. It shouldn't be too complicated, since most compilers (as far as I know) don't differentiate between different processors, while different processors have a different execution model.


>> Have you examined the assembly code that GCC generates for your C++ program?

A very polite way of saying, "why are you even using assembly, when you don't understand assembly?"


And why wouldn't they be polite?

The OP clearly does understand assembly enough to start doing project Euler type problems, which is a good way to learn basic programming in any language. They get a solution in assembler which is more than many people here would be able to do I suspect.

And they're looking to expand their knowledge by asking on stack-overflow about something they don't understand.

So why do you think they should be met with rudeness and hostility?


It's ironic that you read so much hostility into this completely neutral comment, and thereby ended up writing a fairly hostile reply. There's nothing in that comment which implies it should have been stated that way.


Because apparently it's a rite of passage / hazing ritual to go on IRC / stackoverflow and be flamed for asking simple questions. It's been that way since the dawn of time.


The ritual isn't really complete until someone has complained about the people complaining about people asking simple questions and someone else has posted the ESR document about how to ask smart questions, a document so astonishingly, vastly patronising that it probably ought to be compulsory reading for every developer starting out, with the addendum, "If you've made it all the way through this soul-crushing drivel, congratulations! It can't get worse than this!"


Some rituals are meant to die.


Not quite the dawn of time - it took a while for that attitude to prevail. Maybe the first 6 months were OK. Unfortunately it only takes one person to show the attitude before the experience turns negative, and I've been guilty of it myself even though I try not to. Possibly even today.


The question was more interesting than the answer.

tl;dr version--the author's hand-written assembly was poor.

I guess the more interesting takeaway is "Just because it's assembly doesn't mean it's good assembly."


That takeaway is more or less uniformly true, though. It also often comes up as saying that $LANGUAGE is slower than C or C++. Your algorithms aren't naturally faster just because you're writing in C++. You don't magically stop allocating and double-buffering all over the place just because you're writing C++. In fact, coming in from the likes of Java you're liable to underestimate just how much (relatively expensive) copying is going on if you're careless.

What C and C++ give you (and Assembly gives you even more of) is control. If you can, and know how to, capitalise on that control, you _will_ get more performance. But those requirements are non-trivial.


Right - you have to have a bit of hardware knowledge to use it effectively


For fun I ported the C++ to Python and Cython without any kind of mathematical or programmatic optimizations. C++ was 0.5 seconds, then Python was 5.0 seconds. Cython, which was the same exact code as Python except sprinkled with "cdef long" to declare C types, was just 0.7 seconds.


General comment and not aimed at this specific instance:

Just because you are writing in assembler does not mean it is going to run faster than the same code in a compiled language. There have been decades of research and who knows how many man-years of effort that have gone into producing efficient compiled code from C, C++, Fortran, etc.

Your assembly skills have to be of quite a decent order to beat a modern compiler.

BTW: The answer to the question on Stack Overflow by Peter Cordes is a must-read. Brilliant.


Quite decent is perhaps a bit pessimistic. Yes, you need to be fairly fluent if you want to write hand-optimised vector code. On the other hand, you only need to know a little assembly to be able to write C code that compiles better; that compilers are frequently not stupid doesn't mean the level below has nothing to teach.

I personally find it a shame that there are so many juicy instructions that compilers have no hope of ever using effectively. How often is a compiler smart enough to solve a problem with PDEP/PEXT, for example? Those functions are versatile as heck, but you need to plan for them if you want them to show up.
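
A small taste of what PEXT buys you, as a sketch (assumes a BMI2-capable CPU, -mbmi2, and the Intel _pext_u64 intrinsic):

    #include <immintrin.h>
    #include <cstdint>

    // Gather the even-numbered bits of x into the low 32 bits in one instruction,
    // e.g. decoding one coordinate of a Morton/Z-order index.
    uint32_t even_bits(uint64_t x) {
        return (uint32_t)_pext_u64(x, 0x5555555555555555ull);
    }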


> Your assembly skills have to be of quite a decent order to beat a modern compiler.

Start with compiler output, add intrinsics where you can see help is needed, benchmark, repeat.

Pathologically-slow ASM is pretty rare from modern compilers in my experience.


> Pathologically-slow ASM is pretty rare from modern compilers

Here are some I've found:

https://stackoverflow.com/questions/45496987/gcc-optimizes-f... (horrific codegen for known-size C++11 loops in member functions, all GCC versions prior to 8 which is not yet released)

https://stackoverflow.com/questions/43651923/gcc-fails-to-op... (SIMD opportunity squandered when C++11 features are used)

https://stackoverflow.com/questions/42263537/gcc-sometimes-d... (failure to inline trivial operators)

https://stackoverflow.com/questions/26052640/why-does-gcc-im... (C isnan() not efficient, for many years)


There's a key corollary: you also either have to be doing this as a one-off or committed to maintaining it over time as hardware changes and those cool tricks become moot or even de-optimizations.


Also as your code changes. It's often not that hard to hand-write a single function a bit better than the compiler.

But when you want to make a small change the compiler will rethink the whole function from scratch, and if you want to keep your advantage you may have to do the same.


That's a very important point: this is especially the case when, say, a data structure grows and clever packing or alignment tricks can no longer be used. I certainly remember that being used as an excuse to delay a change because it was much harder to update some complex optimizations — and at least one case where the original developer tested the generic .c file they'd used to develop the algorithm with a current compiler and found it was faster on the latest hardware we were buying.


The people who write "article answers" like this on SO are the real MVP's of the web.


Apologies if this is somewhat off-topic for the thread, but I suspect this will be a fun puzzle for fans of low-level optimization. The theme is "optimized fizzbuzz".

The classic fizzbuzz will use %3 and %5 operations to test divisibility. As we know from the same source as OP, these are horrifically slow. In addition, the usual approach to fizzbuzz has an annoying duplication, either of the strings or of the predicates.

So, the challenge is, write an optimized fizzbuzz with the following properties: the state for the divisibility testing is a function with a period of 15, which can be calculated in 2 C operations. There are 3 tests for printing, each of the form 'if (...) printf("...");' where each if test is one C operation.

Good luck and have fun!


One common trick is to handle the %3 and %5 with down-counters that you reset to 3 or 5 when they reach zero. Or better, unroll the loop by 3 so you only need one down-counter.
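
A minimal sketch of that down-counter idea (not from the original comment):

    #include <cstdio>

    void fizzbuzz(unsigned n) {
        int c3 = 3, c5 = 5;                    // count down instead of computing % each time
        for (unsigned i = 1; i <= n; ++i) {
            bool fizz = (--c3 == 0), buzz = (--c5 == 0);
            if (fizz) c3 = 3;
            if (buzz) c5 = 5;
            if (fizz && buzz) std::printf("FizzBuzz\n");
            else if (fizz)    std::printf("Fizz\n");
            else if (buzz)    std::printf("Buzz\n");
            else              std::printf("%u\n", i);
        }
    }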

(Also, no, `%5` isn't "horrifically" slow when it's a compile-time-constant modulus: you or the compiler can do it with a multiply and shift for the integer division, and then x%5 = x - (x/5 * 5). https://godbolt.org/g/3HwBrF. It still sucks though.)

I wrote an x86 assembly FizzBuzz for fun a while ago, intended to be an example of how to optimize (https://stackoverflow.com/a/37494090/224132). Some parts of it kind of suck, though. I made some parts efficient (like appending "buzz\n" to a buffer with a mov [rdx], r13 / add rdx, 5), but left in too many conditional branches and a function call instead of inlining everything when unrolling.

I'm glad so many people like my post that the OP linked. It was fun to write. :)


Indeed, llvm is smart enough not to use a DIV instruction for modulo of a constant.

Turns out you can compute x%3 == 0 with even fewer steps, it's (x*0xaaaaaaaab) < 0x55555556 (assuming wrapping unsigned 32 bit arithmetic).
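
Written out as code (a sketch, with the constant truncated to its low 32 bits so the multiply wraps as intended):

    #include <cstdint>

    // x % 3 == 0  iff  x * 3^-1 (mod 2^32) lands in [0, floor((2^32 - 1) / 3)].
    bool divisible_by_3(uint32_t x) {
        return x * 0xAAAAAAABu < 0x55555556u;   // 0xAAAAAAAB == 3^-1 mod 2^32
    }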


Nice. It seems gcc/clang don't know that trick. They still divide and then check that x/3 * 3 == x. https://godbolt.org/g/V4kfuu. (So do ICC17 and MSVC CL19).


Well, there is this approach:

    #include <stdio.h>
    #include <stdint.h>
    
    #define FIZZ(x)         (x & 0x0924)
    #define BUZZ(x)         (x & 0x0210)
    #define FIZZBUZZ(x)     (x & 0x4000)
    #define NONE(x)         (x & 0x34cb)
    
    void fizzbuzz(size_t n_max)
    {
        uint32_t x = 1;
        for (size_t i = 1; i <= n_max; i++)
        {
            if (FIZZ(x))
              printf("fizz\n");
            if (BUZZ(x))
              printf("buzz\n");
            if (FIZZBUZZ(x))
              printf("fizzbuzz\n");
            if (NONE(x))
              printf("%d\n", i);
            x <<= 1;
            x |= (x >> 15);
        }
    }
    
    int main(int argc, char **argv)
    {
        fizzbuzz(100);
    }
It's less costly than integer division, and I'm not sure it's what you had in mind. The output here still duplicates the strings.


Tada! That's very close to what I had in mind. The only difference: you can fizz on 4924, buzz on 4210, none on 34cb, then always \n. Also, you have 3 ops to calculate the rotation, and that can be replaced by (0x8001 * x) >> 1 (for uint16_t x), but with tweaking of the masks because it's rotating right.

Congrats!


Good nerdsnipe.

I decided to reject your conditions and replace them with my own, for no apparent reason. Mine has some of the redundancy you wish to eliminate, but the loop body is completely branchless, aside from the call to printf, of course, which we're apparently ignoring.

    #include <stdio.h>
    
    #define NUM "\x06" "%zu\n\0"
    #define FIZZ "\x07" "Fizz\n\0"
    #define BUZZ "\x07" "Buzz\n\0"
    
    const char *x =
        NUM
        NUM
        FIZZ
        NUM
        BUZZ
        FIZZ
        NUM
        NUM
        FIZZ
        BUZZ
        NUM
        FIZZ
        NUM
        NUM
        "\xa6" "FizzBuzz\n";
    
    void fizzbuzz(size_t upTo) {
        for(size_t i = 1; i <= upTo; i++) {
            printf(x + 1, i);
            x += *x;
        }
    }
    
    int main(int argc, char **argv) {
        fizzbuzz(100);
    }


This is almost exactly the same as dmitryg's answer, using a state-machine the same way. It would compile to nearly the same code, but with an extra instruction to sign-extend the first char into a register before adding.

You did remove a level of indirection for the format-strings, though. You could have done that with

    struct state {
        int next;
        char fmt[6];
    };
Anyway, this has probably a 5 cycle loop-carried dependency chain on Skylake, from x += *x; compiling into a 4-cycle latency movsx rax, byte [rdi], then a 1-cycle add rdi, rax. (Or whatever registers the compiler picks).

If you'd stored pointers, you could have got it down to 4 cycles on Skylake for the load-use latency of a simple addressing mode ([reg + disp] where displacement is 0..2047).


Good candidate for most evil use of the assumption that char is signed. Incidentally, that's not going to be true on arm[0].

[0]: http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-de...


Haha, whoops! At least it's easy to fix.


Let's optimize the if statements away, too (no extra memory accesses or branches, either).

  #include <stdio.h>
  #include <stdint.h>
  
  #define F(n) (1 << (2*(n)-2))
  #define B(n) (1 << (2*(n)-1))
  
  int main() {
    uint32_t m = F(3) | F(6) | F(9) | F(12) | F(15) | B(5) | B(10) | B(15);
    int i;
    static char t[4][16]= { "%d\n", "Fizz\n", "Buzz\n", "FizzBuzz\n" };
    for (i = 1; i <= 100; i++) {
      printf(t[m&3], i);
      m = (m >> 2) | (m << 28);
    }
    return 0;
  }
Transforming that into an if-based version is left as an exercise for the reader. :)


This probably wouldn't be a very good optimization puzzle as the cost of printing would so dominate the arithmetic operations. :(


This is true, and I've thought about it a bit. If you were really optimizing, then maybe the goal would be stated as putting the completed answer into a memory buffer. But for now, shave as many nanoseconds as you can from the logic and pretend we don't have to worry about printing.


  #define FIZZ 1
  #define BUZZ 2
  static const uint8_t divs[] = 
 {FIZZ+BUZZ,0,0,FIZZ,0,BUZZ,FIZZ,0,0,FIZZ,BUZZ,0,FIZZ,0,0};
  static const char* strs[] = {
    [0] = "%u\n",
    [FIZZ] = "fizz\n",
    [BUZZ] = "buzz\n",
    [FIZZ+BUZZ] = "fizzbuzz\n"
  };

  void fizzbuzz(uint32_t upTo)
  {
    uint32_t i, state;

    for(i = state = 0; i < upTo; i++) {
      printf(strs[divs[state]], i);
      if(++state == sizeof(divs) / sizeof(*divs))
          state = 0;
    }
  }


No divisibility tests at all

One level of arrays may be skipped at the cost of extra .rodata size (store string pointers directly in divs). But on a modern CPU the cost of the extra load is small


I should have stated extra constraints; you're not allowed to use either additional memory (memory accesses are slow) or an if statement to update your state (branch mispredicts are slow). Yes, these are arbitrary (and I know the total time would be dominated by the cost of printf), but the point is to shave nanoseconds. Also, you've still got the duplication of strings.


Memory accesses are not slow on modern CPUs. With speculative loads and large L1d caches these loads won't stall even a single cycle

String duplication helps speed here.

Branch misprediction will happen once every fifteen times worst case.

Plus, if shaving nanoseconds, printing is a bad idea


I see your points; maybe I need to work on the storytelling in posing the puzzle. I do stand by the principle that avoiding branch mispredicts is worthwhile if you can replace the branch with 1 or 2 logic/arith C operations. In any case, the point of the puzzle is how to get a not-quite-trivial result from extremely efficient sequence of pure logic/arith operations.


  #define FIZZ 1
  #define BUZZ 2


  struct {
    uint8_t flags;
    uint8_t next;
  } static const nfo[] __attribute__((aligned(YOUR_L1D_LINE_SIZE))) = {
    {FIZZ+BUZZ,1},
    {0,2},
    {0,3},
    {FIZZ,4},
    {0,5},
    {BUZZ,6},
    {FIZZ,7},
    {0,8},
    {0,9},
    {FIZZ,10},
    {BUZZ,11},
    {0,12},
    {FIZZ,13},
    {0,14},
    {0,0}
  };
  static const char* strs[] = {
    [0] = "%u\n",
    [FIZZ] = "fizz\n",
    [BUZZ] = "buzz\n",
    [FIZZ+BUZZ] = "fizzbuzz\n"
  };

  void fizzbuzz(uint32_t upTo)
  {
    uint32_t i, state;

    for(i = state = 0; i < upTo; i++) {
      printf(strs[nfo[state].flags], i);
      state = nfo[state].next;
    }
  }


No divisibility tests at all. No branches besides loop and printf call. Space can be saved by using a bitfield, but masking it will add speed costs.

  .data: 0
  .rodata: 30 + 4 * sizeof(void*) + strings
  .text: depending on arch, but not much
Assuming call to printf has no cost and the caches are hot, a modern x86 cpu could execute one iteration of this loop in 1 cycle (issuing 2 loads, one add, one cmp, one branch)

I have no compiler and am typing this on a phone so please forgive typos, if any


It's pretty ridiculous to pretend that printf is free. The crux of the matter is to concat string constants and string-representations of integers into a buffer. "buzz\n" is only 5 bytes, so you can store it in a uint64_t.

Also, no, a typical x86 CPU would take about 4 or 5 cycles per iteration even if printf (and the cost of moving its arguments into the right register) was free.

state = nfo[state].next is a pointer-chasing loop-carried dependency chain, so you will bottleneck on L1D load-use latency. (For Skylake, 5 cycles for a complex addressing mode: http://www.7-cpu.com/cpu/Skylake.html).

If out-of-order execution could overlap many of these loops then the throughput could be close to 1 iter per clock.


@raphlinus: well, if you can get it better than one iteration per cycle, I would love to see how (truly)


I see your point regarding pure cycle counting, but as I posed the puzzle it's still open :)


Don't want to be _that_ guy but your loop starts with 0 instead of 1


Hm. I don't quite see how to do it, but I think I know one of the tricks you have in mind. If you have the book Hacker's Delight handy, open it up to 10-17, "Test for Zero Remainder after Division by a Constant". With n<=100 you can get an intermediate result with a single 32-bit multiply and mask that will allow testing divisibility by 3 and 5 each with a single operation (and a bit of union trickery to access high bits without counting as an operation). I just don't see how to make the test 'not a multiple of 3 or 5' come out of that in one C operation yet. So maybe this isn't the right technique.


and having allowed myself to read the rest of the replies .. fizz bzzzzt!


Assuming 32-bit unsigned integers (but similarly for other integer sizes), multiplying by 0x11111111 and shifting right 28 bits would give a function of period 15, at least for inputs between 1 and 0x1000000e inclusive. Faster than dividing by 15.


You're absolutely on the right track with a (wrapping) multiply and a shift right, but the constants aren't quite right.


I know it is not the point of the question, but that problem would benefit greatly from memoization. Calculate it recursively and memoize the result of every step. With all the neat trickery that they are doing with assembly they could easily go sub-10ms.

I whipped together a short PoC in Chez Scheme, and it clocks in at about 50ms on my 4-year-old laptop.
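
Roughly the idea, sketched here in C rather than Scheme (LIMIT and the cache layout are arbitrary choices for illustration; intermediate values use 64-bit math since 3n+1 overflows 32 bits for some starting values below a million):

  #include <stdint.h>
  #include <stdio.h>

  #define LIMIT 1000000u

  /* cache[n] = chain length starting at n (counting n and the final 1); 0 = unknown */
  static uint16_t cache[LIMIT];

  static uint32_t collatz_len(uint64_t n)
  {
    if(n < LIMIT && cache[n]) return cache[n];
    uint32_t len = (n == 1) ? 1 : 1 + collatz_len(n & 1 ? 3 * n + 1 : n / 2);
    if(n < LIMIT) cache[n] = (uint16_t)len;
    return len;
  }

  int main(void)
  {
    uint32_t best = 1, best_len = 1;
    for(uint32_t n = 2; n < LIMIT; n++) {
      uint32_t len = collatz_len(n);
      if(len > best_len) { best_len = len; best = n; }
    }
    printf("%u has the longest chain (%u terms)\n", best, best_len);
    return 0;
  }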


When everything is in cache and you're calculating sequentially, definitely, but equally why not just precompute the whole range and do a binary search? Once you're looking for better algorithms the whole problem just falls away!

That said, I do think it's neat that the fastest brute-force variant is within a rough factor of 2 of your naïve memoized version. It might even win if it were fighting a hyperthread for cache space! Just shows how much throughput modern CPUs have... if only they were used this well all the time ;).


> That said, I do think it's neat that the fastest brute-force variant is within a rough factor of 2 of your naïve memoized version.

My naive memoized version is in a GC'd language that is consistently 4-5x slower than my own (pretty bad) C versions of the same Project Euler problems, without any of the tricks the fastest C++ version is doing.

Still pretty impressive, considering most solutions in the Project Euler forums take more than 1s.


> but equally why not just precompute the whole range and do a binary search?

what do you mean?


The code finds the longest Collatz sequence below N, where N is a 32-bit integer. There are only ~1k longest sequences possible, so just list them all. Given N, search for the largest N below that and return its corresponding iteration count.
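
Something like the sketch below; only the first few record entries are shown here for illustration, and the full table would be generated once offline:

  #include <stdint.h>
  #include <stddef.h>

  /* Record holders: starting values where the best chain length seen so far
     increases, paired with that length.  Illustrative prefix only; the real
     table would be generated offline up to 2^32. */
  static const struct { uint32_t start; uint32_t len; } records[] = {
    {1, 1}, {2, 2}, {3, 8}, {6, 9}, {7, 17}, {9, 20}, {18, 21}, {25, 24}, {27, 112},
  };

  /* Longest chain for any starting value below n (assumes n >= 2):
     binary-search for the last record whose start is < n. */
  static uint32_t longest_below(uint32_t n)
  {
    size_t lo = 0, hi = sizeof records / sizeof records[0];
    while(hi - lo > 1) {
      size_t mid = lo + (hi - lo) / 2;
      if(records[mid].start < n) lo = mid; else hi = mid;
    }
    return records[lo].len;
  }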


tldr: compiler replaces /2 with a shift.
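
Which is more or less the whole answer in one line of C (a sketch; the exact asm depends on compiler and flags):

  #include <stdint.h>

  uint64_t half(uint64_t n)
  {
    /* for unsigned operands the compiler knows n / 2 == n >> 1 and emits a
       one-uop shift; a 64-bit div instruction costs tens of cycles.
       (Signed division by 2 needs a small fix-up for negative values,
       but still avoids div.) */
    return n / 2;
  }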


Which makes it sound like the author shouldn't have been trying to hand assemble for speed. Or maybe treat it as a learning exercise.


  Or maybe treat it as a learning exercise.
Is there any other reason to complete a Project Euler question?


I think it was a learning exercise. It's not like there's any productive use for testing the Collatz conjecture.


Which is a common knowledge substitution that the OP overlooked.


> If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code...

Compilers employ multitudes of optimizations that will go overlooked in hand-written ASM unless you, as the author, are very knowledgeable. End of story.


When I started programming on an Apple II+, assembly was important. Today there are likely only a few people in the world who truly understand what any particular CPU family is actually doing sufficiently to beat the compiler in some cases, and they probably are the ones writing the optimizer. But the 6502 was fun to code for, and the tricks were mighty clever yet still something you could understand.


> Today there are likely only a few people in the world who truly understand what any particular CPU family is actually doing sufficiently to beat the compiler

I'm not sure that's true. There are hundreds of compilers in the world, being maintained by thousands of developers. Then there are all the JITs. And the people who make the standard library implementations. And performance critical stuff in game engines, OS kernels, hardware drivers, high-frequency trading. Then there's the embedded space. And then there's Fabrice Bellard.

My previous employer, Cambridge Silicon Radio (one of hundreds of similar companies nobody's heard of) had dozens of people on the staff that worked on this kind of thing. I have friends at ARM, Broadcom, Samsung and Raspberry Pi that mess around with processor designs for a living. This is just my little experience of the industry. There are armies of these people.


I had the same kind of fun with the 8080/Z80. My favorite was the conditional function return instruction; it was only 1 byte, instead of the 3 bytes for a conditional jump.


Not too surprising answer: "your assembly sucks"


Because the compiler has optimized it better than you.


> but I don't see many ways to optimize my assembly solution further

I can't do it therefore it must be impossible!


[flagged]


Would you please not post unsubstantive comments, and especially not flamebait, to HN?

https://news.ycombinator.com/newsguidelines.html

We detached this subthread from https://news.ycombinator.com/item?id=15072499 and marked it off-topic.


I'm a C++ developer on an embedded system where speed counts. I still live by that badge. I know that if I find a particular tiny area of code that is a bottleneck, I can drop to low-level assembly and, after an entire year of doing nothing else, beat the compiler by a tiny amount on ONE of the THREE CPUs we support.

I know of thousands of places where our code does not use the best algorithm (they are not bottlenecks). I know of hundreds of real bottlenecks that I think could be fixed with algorithm work, but none of them is large enough to be worth putting effort into (a decision that is subject to change).


I call that the Python perspective.

'In a non-critical area, bad performance is equivalent to good performance. So optimize for other scarce resources.' (Code readability, development time, simplicity, maintainability, etc)


Standing on the shoulders of those who came before so that we may focus on solving real problems ;)

I tease. I'm also a pretty big Python/JS developer in the robotics industry and I definitely live, to an extent, by that adage. Intentionally so, as I only have so much time in a day and optimization would be a sub-optimal use of company resources. Admittedly I have a comfortable baseline understanding of how computers work, but I do mentally map certain things as black boxes.


I spent my childhood reverse-engineering packed and obfuscated code in assembly and writing matrix multiplication libraries in C++ for fun, and I also did a lot with the first available frameworks for GPU programming in assembly (e.g. applying kernels to images).

I can sit down and estimate how long my code is going to take (including the cache misses and other nonlinearities).

The way I earn money is by programming Python. Why is that? Because it's a cheaper solution, and because there's a multitude of libraries in Python that already do it for you. Optimisation is overhead: you hire a developer solely to sit down and optimise, and quite frankly it is quite commoditised in that sense.


A certain irony about knowing all the tricks and using Python is that you invariably encounter people online who say "well Python was too slow so I dropped it" when the problem they encountered is covered either by one of Python's prolific extensions or by a slight reformulation of the problem.

And, having said that, I also didn't know that some optimizations were possible until later in my programming career. There's just a point where it escapes one's current level of skill.


I have been coding since the mid-80s, so I have also done a lot of low-level stuff in the past, but eventually ended up earning money by focusing on managed languages.

Knowing how everything works helps me understand how to write code with performance in mind when required, write the occasional function/method or VM plugin in C++, or even read the assembly code generated by the AOT/JIT compiler.

But for the types of applications I write, the performance is more than enough for the typical use cases.


I don't know--it just seems like we have lost something as a profession. Lost some of the craft and efficiency. I get that this is a practical attitude and often the right answer. But always working with "managed" languages and moving higher and higher up in the abstractions is one of the reasons we now need clusters of computers to solve problems. We've gone from computers the size of rooms to personal computers that can rest on your lap, all the way back to racks of computers that need an entire room.


Actual high-performance-computing clusters are typically used efficiently, running well-tuned code. BLAS libraries (matrix multiplication) are usually very heavily optimized.

Of course, a lot of code just uses these optimized building blocks and ends up doing multiple passes over the data instead of doing more with each pass while it's still hot in cache. It's disappointing that we still don't have optimizing compilers that know how to produce code for an efficient matrix multiply or something like that, and mix some other work right into it.
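
A toy illustration of the "more work per pass" point (nothing to do with BLAS specifically; the function names are made up):

  #include <stddef.h>

  /* two library-style passes: the data is streamed through the cache twice */
  double scale_then_sum_2pass(double *y, const double *x, double a, size_t n)
  {
    for(size_t i = 0; i < n; i++) y[i] = a * x[i];
    double s = 0.0;
    for(size_t i = 0; i < n; i++) s += y[i];
    return s;
  }

  /* fused: same result, but each element is used while it's still hot */
  double scale_then_sum_fused(double *y, const double *x, double a, size_t n)
  {
    double s = 0.0;
    for(size_t i = 0; i < n; i++) { y[i] = a * x[i]; s += y[i]; }
    return s;
  }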

Your point definitely applies to server farms, though, running a huge stack of interpreted languages and big clunky "building blocks" instead of something written efficiently.


I see it a bit differently.

Developers using Lisp from the 60's all the way to Lisp Machines in the 80's would jump for joy if they could get their hands on Raspberry Pi-level hardware.

So the capabilities are there, but it is up to the developers to actually make use of them.


As an addendum to my previous comment, managed languages have always been a thing outside AT&T.

It was Bjarne's disappointment with BCPL, after tasting the productivity of Simula, that made him design C++ as his way to avoid using plain C.


Haha, nope, C++ and embedded C at the moment.


This was a borderline help vampire question, but it ended up working out well, probably for nerd-sniping reasons.



