I suspect if you were to compare code that uses both the remainder and quotient of a division, you would find a similar trend: the x86 division instructions produce both, but ARM has only a quotient-generating division instruction, and you have to do another multiplication(!) and subtraction to get the remainder.
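For illustration, a minimal sketch of the pattern in question (the ARM64 lowering described in the comment is the typical one, not a guarantee): on x86-64 a single div leaves the quotient in RAX and the remainder in RDX, while on ARM64 you generally get a udiv followed by an msub to reconstruct the remainder as n - q*d.

    #include <stdint.h>

    /* Both halves of the division are wanted; compilers reuse the quotient
       for the remainder rather than dividing twice.
       x86-64 (typical): one div, quotient in RAX, remainder in RDX.
       ARM64  (typical): udiv x2, x0, x1        // q = n / d
                         msub x3, x2, x1, x0    // r = n - q*d (extra multiply+subtract) */
    void divmod_u64(uint64_t n, uint64_t d, uint64_t *q, uint64_t *r) {
        *q = n / d;
        *r = n % d;
    }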
That, along with the fact that earlier ARMs didn't even have a divide instruction, is one of the reasons I'm not at all a fan of the "RISC approach". Division may not have been easy to implement, but microcoding it would have been at least as fast as doing the algorithm in software while taking up far less space. And with hardware advances that move operations from microcode into direct execution, existing software that already uses those instructions automatically becomes faster on newer hardware with no other action needed.
Division hardware almost always generates the remainder along with the quotient, but ARM would either need to add another instruction (no benefit to existing software, only to new software that knows how to use it) or try to detect patterns of simpler instructions doing the same thing in order to replace them with one "fused uop" (much more difficult to do).
There is a good reason why nobody implements multiply and divide like x86 did: it's terrible for register allocation. Especially on 32-bit, where register pressure is very high, having division input and output registers hardwired is obnoxious, and it kills a register for the upper half of the product or remainder or quotient that you almost never need. ARM has the correct design. Even Intel admits this: look at the MULX instruction that retrofits the right design onto x86.
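A quick sketch of what that retrofit looks like from C, assuming a BMI2-capable x86-64 target and <immintrin.h> (compile with -mbmi2): the destinations are ordinary registers of the compiler's choosing and the flags are left alone, though RDX is still the implicit second source.

    #include <stdint.h>
    #include <immintrin.h>   /* assumes BMI2 (-mbmi2) */

    /* Full 64x64 -> 128 product via the MULX intrinsic: low half returned,
       high half written to *hi, no hardwired RDX:RAX destination pair. */
    uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi) {
        unsigned long long h;
        uint64_t lo = _mulx_u64(a, b, &h);
        *hi = h;
        return lo;
    }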
Earlier versions of ARM didn't have a divide instruction because of die space and the fact that it's rarely needed. Integer division is usually by a constant, so a shift and/or a multiplication suffices. There's no need to microcode a divide instruction either, because the kernel can just trap the undefined-instruction exception and perform the division itself, just like soft float works. (Yeah, it's slow, but software division is always slow.) Modern designs like RISC-V do this properly: expose division as an optional extension so as to scale down to microcontrollers.
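To make the "division by a constant" point concrete, here is a sketch of the reciprocal-multiply trick compilers use; the magic constant is the usual one for dividing by 10 (ceil(2^67/10)), but treat the exact value as illustrative.

    #include <stdint.h>

    /* x / 10 without a divide instruction: multiply-high plus a shift.
       gcc/clang generate essentially this from a plain `x / 10`. */
    uint64_t div10(uint64_t x) {
        return (uint64_t)(((__uint128_t)x * 0xCCCCCCCCCCCCCCCDull) >> 67);
    }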
I wrote a simple compiler for mathematical operations recently, https://github.com/skx/math-compiler/ and I have to say I really appreciated the ability to get the two halves of the result in one operation.
I've just provisioned an ARM server so I can experiment with generating ARM assembly as well as Intel. No doubt I'll have fun learning about the different instructions available; until now I've only ever written assembly for Intel and Z80 processors.
I wonder how much that has been changing as an increasingly large percentage of software is compiled far more frequently than in the past (JIT, formats like WASM and Apple’s bitcode, toolchains like LLVM becoming widespread, etc.). It seems like at least Apple could ship a new ARM instruction and be relatively confident that most of the important code would be updated to use it on a time-scale which used to seem impossible.
Java might be a counter-example but I’m not sure you can separate that from all of the other dubious decisions which happened over the years there.
Can someone explain this in a bit more detail? Why does "the computation of the most significant bits of a 64-bit product on an ARM processor requires a separate and expensive instruction"?
ARM64 has separate instructions for computing the low and high halves of a product, in keeping with its single destination register approach. x86-64 has a single instruction that computes both halves simultaneously, writing to two registers.
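In C it is the same source either way; only the lowering differs. A sketch (the instruction sequences in the comment are the typical ones, worth double-checking on godbolt):

    #include <stdint.h>

    /* Full 64x64 -> 128 product.
       ARM64  (typical): mul   x2, x0, x1   // low 64 bits
                         umulh x3, x0, x1   // high 64 bits, separate instruction
       x86-64 (typical): mul   rsi           // both halves at once, in RDX:RAX */
    uint64_t mulhi64(uint64_t a, uint64_t b, uint64_t *lo) {
        __uint128_t p = (__uint128_t)a * b;
        *lo = (uint64_t)p;
        return (uint64_t)(p >> 64);
    }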
It would have been simpler for the author to show us the difference in multiplication performance directly. Here the benchmarks compare two completely different hash functions, and we have to simply take his word that it is this instruction that is causing the performance discrepancy.
On a modern out-of-order uarch that kind of "complex" instruction would be split into two micro-operations, one for each result register.
And Agner's tables [1] confirm that the 64x64 => 128 mul is split into two micro-ops on Skylake.
So it doesn't give any strong advantage.
Yes, but the second uop is not expensive like the first in this case. That is, it seems like the full multiplication is done by the latency-3 uop on p1, and the other uop is just needed to move the high half of the result to the destination (indeed, instructions with two outputs always need two uops due to the way the renamer works). The whole 64x64->128 multiplication still has a latency of only 3 and a throughput of 1 per cycle.
So the 64x64->128 multiplication is still quite efficient compared to ARM, where two "full strength" multiplications are needed. It is odd, though, that there is nearly a 20x difference in relative speeds; I wouldn't expect multiply-high to be that slow.
Note: The test seems to have been done on Skylark (aka: Ampere), which is a non-standard ARM core. I can't find any documentation on Skylark's latency / throughput specifications.
I just wanted to thank you for this website. It is putting an end to so many discussions we have about the quality of the code. As a bonus, it has support for FORTRAN. Previously we were doing the debugging and disassembly ourselves.
Hm, why aren't compilers generating that instruction?
upd: apparently for reasons like:
> So I guess for most of the case loading or storing i128, the data will be used by some library functions running on cores instead of NEON, so storing i128 to two GPR64 is more general.
> Hm, why aren't compilers generating that instruction?
That's polynomial multiply. It's (almost) a multiplication over GF(2), as used for elliptic curves. It's not a "normal" multiply.
"PMULL" is basically a bitshift and XOR. Your traditional "MUL" is bitshift and ADD. It's called "polynomial multiply" because bitshift-and-xor has very similar properties to bitshift-and-add (it distributes over XOR, it's associative, commutative, etc.).
Bitshift-and-xor has a few features that are better for cryptography. But it's NOT the multiplication you were taught in grade school.
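A toy bit-serial version of both, just to show the difference (real PMULL/CLMUL hardware does the carry-less one 64x64 -> 128 in a single instruction; this sketch only keeps the low 64 bits):

    #include <stdint.h>

    /* Ordinary multiply: shift-and-ADD, carries propagate. */
    uint64_t mul_lo(uint64_t a, uint64_t b) {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++)
            if ((b >> i) & 1) r += a << i;
        return r;
    }

    /* Polynomial (carry-less) multiply: shift-and-XOR, no carries. */
    uint64_t clmul_lo(uint64_t a, uint64_t b) {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++)
            if ((b >> i) & 1) r ^= a << i;
        return r;
    }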
--------
EDIT: With that being said... those "better features" for cryptography would probably make PMULL a better function for random-number generation. PMULL will return a different result than the real multiplication, but you'll have an easier time making a field (i.e., reversible, 1-to-1 bijections) out of PMULL than MUL...
I see "pmull" (vmull_p64), but that's polynomial multiply, which is used for Galois Field multiplication. Not "normal" multiplication.
You can search for "uint64", to look for all NEON intrinsics that take a 64-bit integer. (ex: uint64x2_t). I personally didn't see any 64x64->128 bit standard multiply in NEON.
Moreover, different ARM chips have different performance characteristics as well. Apple's implementation, for instance, seems to be easily twice as fast on most problems at the same clock speed as its leading competitors.
No, the compiler is generating good code. If you use a smaller word size you just end up doing more multiplications (e.g. cut your word size in half, do 4x as many multiplications).
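For the curious, this is what "cut your word size in half, do 4x as many multiplications" looks like when a 64x64 -> 128 product is built out of 32x32 -> 64 pieces; a sketch, not a drop-in for what the compiler actually emits:

    #include <stdint.h>

    void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t p0 = a_lo * b_lo;   /* contributes at bit 0  */
        uint64_t p1 = a_lo * b_hi;   /* contributes at bit 32 */
        uint64_t p2 = a_hi * b_lo;   /* contributes at bit 32 */
        uint64_t p3 = a_hi * b_hi;   /* contributes at bit 64 */

        /* sum of the bit-32..63 contributions, including the carry out */
        uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

        *lo = (mid << 32) | (uint32_t)p0;
        *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    }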
Should we be surprised? The __(u)int128_t types are an optional extension not required by the C standard, hence there can be no expectation that operations on them are implemented well or at all, let alone implemented in silicon.
I don't see what specification has to do with this. I mean, a 32 bit 2's complement integer is also a technically optional part of the C standard, and indeed there is hardware that doesn't support multiplications on them with a single instruction.
What's happening here isn't really related to word size. It's that single-word multiplication, as an operation, is lossy: the full product is two words worth of output, not one. Traditionally, most RISC architectures have just skipped the complexity and returned the product modulo the word size. But x86 didn't; it put the product into two registers (specific registers: DX and AX, and their subsequent extensions).
Most of the time this is just a quirk of the instruction set and an annoyance to compiler writers. But sometimes (and this trick has been exploited on x86 for decades for applications like hashing) it turns out to be really, really useful.
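One classic example of that usefulness, a sketch of the multiply-high range reduction (sometimes called "fastrange") that replaces a modulo when mapping a hash into [0, n):

    #include <stdint.h>

    /* Maps a uniformly distributed 64-bit x into [0, n) using only the high
       half of a 64x64 product; cheap exactly because that half is available. */
    uint64_t reduce(uint64_t x, uint64_t n) {
        return (uint64_t)(((__uint128_t)x * n) >> 64);
    }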
Integer multiplication always carries the risk of integer overflow. Integer overflow is undefined behavior in C, so it's the programmer's responsibility to make sure it doesn't happen.
To that end in the example a __uint128_t was used, which is nonstandard, and apparently not implemented all that well with the given combination of compiler and ARM CPU. Given that we're talking about a 64-bit CPU, my argument is that this is not very surprising.
Again, I think you're looking at this backwards. C's undefined behavior rules exist because targeted hardware didn't have common behavior, whereas you seem to be arguing the reverse.
I mean, I can't sit here and tell you what to be surprised about, but to me, as someone interested in how machines behave, it's absolutely interesting and "surprising"[1] that one machine with an otherwise very similar architecture can be 8x slower than another on straightforward code. And trying to wave that away with "but the spec never promised!" seems like it's only going to get you in this kind of trouble (but "unsurprising trouble", I guess) more, and not less.
[1] Not here, precisely, since I already understand what's going on, but in general.
Signed overflow is undefined behavior; unsigned overflow is defined in both C and C++.
Apart from that, I agree with you. It has to do with the fact that OP is using 128-bit variables on a 64-bit architecture.
Come to think of it, it's actually more mesmerizing that x86 is not slowed down by a 128-bit variable. The ARM architecture is behaving as is to be expected, Intel is actually the odd one out.
Someone mentioned cryptography; I can imagine that because of it, Intel has a few instructions to optimize integer arithmetic on wider integers, and that is probably the reason for the anomaly, which is really Intel's and not ARM's.
As mentioned upthread, the mesmerizing instruction in question is "MUL", which debuted in 1978 on the 8086 and, except for register width, behaves identically today.
I'm no expert, but shouldn't x86 then produce two 128-bit results if it multiplies two 128-bit integers, so actually four register entries' worth on a 64-bit architecture? If this were the case, Intel would slow down just as much as ARM on a double-architecture-width integer multiplication, but it doesn't. That's what I find mesmerizing. I'm guessing that Intel simply discards the earlier double-register logic once it goes beyond the architecture width, which would explain the speed-up.
I.e. 64b * 64b = 2x64b register entries, so according to MUL it should be 128b * 128b = 2x64b * 2x64b = 4x64b, but Intel discards this in favor of 128b * 128b = 2x64b * 2x64b = 2x64b.
x86 can't multiply two 128 bit numbers at a time. But it can multiply two 64 bit numbers without losing the high 64 bits of the 128 bit product, which makes the 128 bit multiplication much faster to implement.
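A sketch of why the full-width primitive helps, using a hypothetical u128 pair type: the low 128 bits of a 128x128 product need only one full 64x64 -> 128 multiply plus two truncating ones (the a.hi*b.hi term can never reach the low 128 bits).

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;   /* illustrative pair type */

    u128 mul128_lo(u128 a, u128 b) {
        __uint128_t p = (__uint128_t)a.lo * b.lo;        /* the one full-width product */
        uint64_t hi = (uint64_t)(p >> 64)
                    + a.lo * b.hi + a.hi * b.lo;         /* low halves only */
        return (u128){ (uint64_t)p, hi };
    }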
> x86 can't multiply two 128 bit numbers at a time.
What's happening here then? Are these not two 128-bit integers? One's a 64-bit value recast to 128 bits, the other a 128-bit constant. The code would be doing faulty math if it just decided to drop bits. It may be a coincidence that the upper half of the recast value is 0x0 in this case, but the code must work for 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF as well, and probably does too.
The article is very light on details: it contains zero citations and only a single result of a single benchmark the guy ran, with no details of how it was run. He then states his theory as to why this happens as fact (again with no citations). The author does not even offer us a clue as to what ARM core is used. The claim is:
> The difference is that the computation of the most significant bits of a 64-bit product on an ARM processor requires a separate and expensive instruction.
I see no proof of this anywhere in the ARMv8 spec. You get the lower 64 bits of result using MUL and higher 64 bits using UMULH. Neither of those is that expensive.
Looking at [1] we can see that MUL has a throughput of 1 and a latency of 3, while UMULH has a throughput of 1/4 and a latency of 6. But as long as you do not issue another multiply right after your UMULH, that 1/4 throughput is easily hidden: only the multiplier is busy, and the rest of the CPU can go on. So unless your entire loop is under 6 cycles, or you simply have no non-multiply instructions to schedule in the 3 cycles after the UMULH, it shouldn't matter. Given those large constants that need to be loaded, each needing 4 instrs (mov+movk+movk+movk), there are plenty of instructions to schedule after the UMULH. Either OP's compiler messed up, or something entirely different is going on.
If the author was using a weaker in-order core, say a Cortex-A55, more performance would still be expected than appears to be demonstrated. There [2] the low part is calculated in 2 or 3 cycles, the high part in 4. But comparing an ARM in-order little core to a modern OoO x86 is just not fair.
EDIT: Indeed, looking [3] at what gcc produces for this code is sad. For example, why it bothers to synthesize 0x1b03738712fad5c9 before issuing the first UMULH is unclear, but it IS stupid.
EDIT2: on Skylake [4] MUL has a latency of 3, so faster than on ARM but not by that much. I'd guess the constant loading on ARM, using 4 instructions per constant, hurts more than UMULH does.
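For reference, a sketch of that constant-loading asymmetry; the sequences in the comment are what compilers typically emit, not a guarantee for any particular core:

    #include <stdint.h>

    uint64_t load_k(void) {
        return 0x1b03738712fad5c9ull;
        /* ARM64 (typical):                x86-64 (typical):
             mov  x0, #0xd5c9                movabs rax, 0x1b03738712fad5c9
             movk x0, #0x12fa, lsl #16
             movk x0, #0x7387, lsl #32
             movk x0, #0x1b03, lsl #48                                       */
    }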
EDIT3: in comments on original site, author said the ARM chip being used is a "Skylark" by "Ampere Computing" [5]. Given that I cannot find any info on that microarchitecture, I cannot say more about why it might be slow.
I think Daniel's use of the word "separate" in "separate and expensive" is ill-advised, as it implies a critique of ARM's ISA design in a way that isn't relevant for this case. One might be concerned if you needed all 128 bits in some other use, but not here.
As for loading large constants, if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that these functions as measured are inside hot loops. As such, presumably constant loading is very likely to be hoisted out of these loops on both architectures.
This will make the considerably slower UMULH stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to be done in parallel - the work of the rng is a long dependency chain within the calculation but the update of the seed is quick and independent of that.
I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."
There could be an instruction scheduler impact here as well. Intel processors are known for having an uncommonly deep execution window.
It turns out that the Nth wyhash64_x doesn't depend on any of the multiplies in the (N-1)th iteration. It only depends on the addition of the zeroth-order constant.
So, with a sufficiently deep pipeline, the instruction scheduler can effectively be in the middle of several of those wyhash iterations all at the same time, thus hiding nearly all of the hash's latency by using the other iterations to do it.
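A rough sketch of the generator's shape to make that visible; only the 0x1b03738712fad5c9 constant is quoted upthread, the other two are from memory and may not match the post exactly, but the structure is the point: just the first addition is loop-carried, and the two wide multiplies of one iteration never feed the next.

    #include <stdint.h>

    static uint64_t wyhash64_x;   /* state */

    static uint64_t wyhash64(void) {
        wyhash64_x += 0x60bee2bee120fc15ull;                     /* the only serial step */
        __uint128_t t = (__uint128_t)wyhash64_x * 0xa3b195354a39b70dull;
        uint64_t m1 = (uint64_t)(t >> 64) ^ (uint64_t)t;
        t = (__uint128_t)m1 * 0x1b03738712fad5c9ull;
        return (uint64_t)(t >> 64) ^ (uint64_t)t;
    }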
Indeed. Of course, the idea that this is invalid implies that "real" application code (whatever that is) would be designed to have a sequential dependency on a single wyhash64 result and to sit on its thumbs waiting. Maybe, and maybe not. One can make up any argument one likes.
Sure, sometimes Daniel's posts are a bit light on the details - but he provides his source every time which allows you to reproduce (or fail to reproduce as the case may be) the results on your own hardware, as well as self-answer any questions about the benchmark methodology.
In my experience he's quite willing to answer any questions you leave as a comment, also.
I think the author writes most of his posts as food for thought and a starting point for further discussion rather than the final word on anything.
I was just looking at the Godbolt preview for this, and had the same reaction about the mov/movk pairings. Note that -Os does sink some of them below the umulh.
I'm not sure it would help much, though. The algorithm is over-serialized. There's a bare minimum latency of 3x ALU ops + 2x MUL. It just so happens that the 2x high-result multiplies are particularly high-latency on this core, and he must also pay the pipeline piper of having only one multiplier pipe available.
I don't think that's the case. I read the optimization manual as saying that on entry to the pipe, no other multiply can enter it for 3 more cycles. That would be to ensure that there isn't contention on the functional unit's result bus. Therefore, the preceding mul should be able to enter without being delayed by the next umul.
If you look at the whole section, all of the multiplication results that take more than 3 cycles stall the multiplier pipe for N-3 cycles.
Amendment: Clang's decision to schedule the mul right after the umulh would also appear to be terrible. But in fact, I think that if the umulh enters on cycle 0, that the mul enters on cycle 3, the umulh's result appears on cycle 5 and the mul's result appears on cycle 6. So, it has the same total latency through the pipe as mul followed by umulh: 7 cycles.
Not enough to explain the perf discrepancy. As I explained, both of these values (1/4 and 6) are hideable with proper instruction scheduling. At best the author proved that ARM gcc sucks at instruction scheduling, which is not news to anyone.
I mean 1/4 isn't hideable with "proper instruction scheduling" if you are doing a multiplication-heavy benchmark. No amount of scheduling gymnastics is going to get you more than 1 MULH per 4 cycles.
Since the wyhash function needs two 64x64->128 multiplications, you'll need two high and two low muls on ARM, so this is pretty much a dense multiplication benchmark. No amount of scheduling can save you.
Still, by my calculation that should only put the ARM chip at a 5x disadvantage in multiplication throughput, but it was nearly 18x slower. The frequency difference probably explains some of that, but not all.
A generation of programmers was indoctrinated to believe they couldn't beat the compiler. It is definitely surprising to find a popular compiler leaving this kind of low-hanging fruit. For many use cases, if code can't be written such that the compiler in use generates the desired sequence of instructions, that sequence might as well not exist.
My objection was to the implied claim that this is the cause of the slowdown. It is implied by the sentence structure and the preceding paragraph. I've expanded the quote to make it clearer. Thanks.
Sure, but it may not be intuitive to people that one function could be exactly as fast on two very different processors, but another very similar function would be orders of magnitude different.
Novice here ... but how do these functions produce random results? Is the uninitialized memory considered random? Or is there some other source of randomness? It seems like a deterministic function to me if the variables are initialized to zero.
Why wouldn't one use some sort of pseudorandom seed instead of just uninitialized memory? Couldn't one sample a clock, image sensor, thermometer or some other sensor that would have a random value to use as a seed? Seems like a part of memory allocated by the compiler might always be zero.
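For what it's worth, a minimal sketch of seeding from a clock rather than from uninitialized memory (which is undefined behavior anyway); this assumes a POSIX clock_gettime:

    #include <stdint.h>
    #include <time.h>

    static uint64_t seed_from_clock(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }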