What is the performance impact of using int64_t on 32-bit systems? (stackoverflow.com)
72 points by networked on Jan 9, 2016 | 30 comments



When the discussion was ongoing about what the 'native' length for an int should be in Java, the group did some profiling on machines to compare 32 bit native, 64 bit emulated on 32 bit, 64 bit native, and 32 bit on 64 bit performance. Generally 32 bit on native or 64 bit platforms was pretty much the same. Emulated 64 bit (where the compiler had to generate extra code to do the math operations) was about .1% slower (yes, one tenth of one percent). In investigating it, the dominant factor in program execution was not instructions per second but fetch time from DRAM into the cache. All of the math operations fit in the cache and so executed at full clock rate. As the typical CPU at the time was 1 GHz with an execution efficiency of 1.2 to 1.5 IPC, during the 110 ns where the CPU was waiting for the next bit of data from RAM to get into the cache it could execute 60-80 instructions. So the "extra" instructions were essentially free for the most part. So Java got 64 bit integers regardless of the underlying processor architecture.

Footnote: I have mis-remembered the system performance numbers; the SS10 was a 50 MHz machine, and the R4000 (a native 64 bit machine) was only 100 MHz. That said, the CPU being 'starved' while it waited for the cache to fill was the proximate analysis of why the performance differed very little.
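For readers curious what that "extra code" amounts to, here is a minimal C++ sketch of what a compiler has to do for 64-bit addition on a 32-bit target: treat the value as two 32-bit halves and propagate the carry out of the low half into the high half (on x86 this becomes an add/adc pair). The struct and function names are purely illustrative.

    #include <cstdint>

    // Illustrative only: a 64-bit value as two 32-bit halves, the way a
    // 32-bit code generator has to treat it.
    struct u64_parts {
        std::uint32_t lo;
        std::uint32_t hi;
    };

    u64_parts add64(u64_parts a, u64_parts b) {
        u64_parts r;
        r.lo = a.lo + b.lo;
        // Wrap-around of the low half means a carry into the high half.
        std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;
        r.hi = a.hi + b.hi + carry;
        return r;
    }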


Java has both int (32 bit) and long (64 bit) types. What did you refer to as 'native'?

Lucene, the popular core search library, has always used int for internal document IDs. As a result, the number of documents in one Lucene index is limited to about 2 billion (Java ints are signed). It's even worse if there are many deleted docs in the index, as they are just marked as deleted and still use up IDs. There are a number of reasons why no one is keen on moving internal docIDs to long, and performance is one of them.


Wow, and thanks for that. You are absolutely correct that Java exported all of the various integer sizes. Odd that I can clearly remember the debate and mis-remember the final outcome. Bill Joy was pretty adamant about "fixing" the various-sizes-of-integers problem that C had been going through on x86 with the whole sizeof(int) != sizeof(ptr) code, and we benchmarked it and it hardly affected performance at all. Wow, it sucks getting old.


> fetch time from DRAM into the cache

I suspect Java code on average does far more pointer-chasing and function calls than pure maths, so the difference wouldn't be very visible.


Something not mentioned there is locking. For the specific use case (timestamps) that the OP wants, it might not be important, but for many common use cases, like 64 bit counters, you can suddenly hit consistency problems in programs. I think that older x86 processors couldn't atomically write 64 bit numbers, so you might have to add locks around all the accesses. This could be a huge performance hit.

ISTR this was a problem for Linux kernels? There used to be several kernel counters that were stuck at 32 bits because the overhead of making them 64 bits was just too much. I think one example was network byte counters for NICs? I might be misremembering this, though.
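Whether a given 64-bit counter actually needs a lock on a particular target can be checked from C++ itself; here is a minimal sketch, assuming a C++11 compiler:

    #include <atomic>
    #include <cstdint>
    #include <iostream>

    int main() {
        std::atomic<std::uint64_t> counter{0};

        // On most 32-bit x86 targets this prints 1 (the atomic is built
        // on cmpxchg8b); where no such instruction exists the library
        // falls back to a lock, which is where the overhead comes from.
        std::cout << "lock-free 64-bit atomic: " << counter.is_lock_free() << "\n";

        counter.fetch_add(1, std::memory_order_relaxed);
        std::cout << counter.load() << "\n";
        return 0;
    }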


The x86 instruction for this would be LOCK CMPXCHG8B, which allows atomically writing 64 bits. (The x64 version would be LOCK CMPXCHG16B for writing 128 bits.) The instruction goes back a while; there was even a bug on the Pentium involving it: https://en.wikipedia.org/wiki/Pentium_F00F_bug


Yep, but first you have to load this counter and increment it manually in some CPU register. And you also need to repeat the whole sequence in case CMPXCHG8B fails.

With 32 bit integers you just execute XADD and move on with your life.
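In portable C++ terms, the difference being described looks roughly like this (a sketch, not a drop-in implementation): the 64-bit increment becomes a compare-exchange retry loop, while the 32-bit one is a single fetch_add that the compiler can lower to lock xadd.

    #include <atomic>
    #include <cstdint>

    // 64-bit increment on 32-bit x86: load, add, and retry the
    // compare-exchange (lock cmpxchg8b) until no other thread raced us.
    void increment64(std::atomic<std::uint64_t>& c) {
        std::uint64_t old = c.load(std::memory_order_relaxed);
        while (!c.compare_exchange_weak(old, old + 1, std::memory_order_relaxed)) {
            // 'old' now holds the freshly observed value; try again.
        }
    }

    // 32-bit increment: a single read-modify-write (lock xadd).
    void increment32(std::atomic<std::uint32_t>& c) {
        c.fetch_add(1, std::memory_order_relaxed);
    }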


It's tricky. There are some interesting comments and issues mentioned here:

https://stackoverflow.com/questions/5162673/how-to-read-two-...


> [Using 64-bit registers in 32-bit mode] is not possible. If the system is in 32-bit mode, it will act as a 32-bit system, the extra 32-bits of the registers is completely invisible, just as it would be if the system was actually a "true 32-bit system".

That's not entirely true. Under Linux, with the x32 ABI (see https://lwn.net/Articles/456731/) you have access to the entire register file, but still use 32-bit pointers.

In many ways, it combines the advantages of 32- and 64-bit mode code.
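A quick way to see the difference is to print a few type sizes and build the same program for each ABI; the -m32/-mx32/-m64 flags here are GCC's, and the commented sizes are what an ILP32 x32 build reports versus i386 and x86-64:

    #include <cstdio>

    int main() {
        // Built with -m32 (i386), -mx32 (x32) and -m64 (x86-64) respectively:
        std::printf("sizeof(void*)     = %zu\n", sizeof(void*));      // 4 / 4 / 8
        std::printf("sizeof(long)      = %zu\n", sizeof(long));       // 4 / 4 / 8
        std::printf("sizeof(long long) = %zu\n", sizeof(long long));  // 8 / 8 / 8
        return 0;
    }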


As addressed in the comments on SO[1], x32 mode is actually still long mode. It's just a change in ABI to use the upper and lower 2GB of address space so that you can fit pointers in 32 bits. It doesn't really act like a 32-bit system because it isn't one; it's a 64-bit system with a restricted address space.

That said, it can be useful for certain workloads where you aren't consuming large amounts of memory but are doing lots of calculations.

[1] http://stackoverflow.com/questions/16841382/what-is-the-perf...


x32 mode can be a huge win if your data structures contain lots of pointers/references. It can shrink the memory usage and speed up computation by surprising amounts (compared to running the same process in a 'full' 64 bit mode). It's not just for when you are writing your own intricate code, either. An x32 build of perl sped up some of my programs dramatically, and let me run stuff on a 4GB machine that previously couldn't fit the data in.

Ubuntu has some x32 packages available for their standard 64 bit distribution, but the selection is limited.


Did you compare x32 versus the old fashioned i386?


Nope, other stuff on the machine needed 64 bits so unfortunately it would be difficult to do the comparison.

In theory, x32 should still be faster because the code gets to use all the other x64 features, like a larger register set and so on. I've no idea how big a difference that actually would make though.


You can run 32 bit binaries on a 64 bit system. I bet Ubuntu has several 32 bit packages available because they are needed by Wine and closed-source 32 bit binaries.

The theoretical possibility for speedup is exactly why I asked. The x32 website has benchmarks where it's as fast as i386 and 40% faster than amd64 on pointers or as fast as amd64 and 40% faster than i386 on 64 bit math, but what's missing to really justify x32 is some "20% better than either" case.


Sorry, I see what you mean now! For some reason, I completely forgot about running a plain 32 bit version... sorry but I don't have the chance to easily run the code again right now and compare.


Sounds like a 50-50 mix would do that trick, right?


Not that you need to use x32 to use arena allocation with 32 bit pointers inside it.


I'm surprised it's not more common on small [virtual] machines where the total memory+swap can still be addressed by a 32-bit pointer. 32-bit pointers are much more cache-friendly if you don't need the address space.
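A hypothetical sketch of that idea: keep the nodes in one arena and link them with 32-bit indices instead of pointers, so the link fields stay 4 bytes even inside a 64-bit process. The names (Arena, Node) are made up for illustration.

    #include <cstdint>
    #include <vector>

    // Links are 32-bit indices into the arena rather than 8-byte pointers.
    struct Node {
        std::uint32_t value;
        std::uint32_t next;  // index of the next node; UINT32_MAX means "null"
    };

    struct Arena {
        std::vector<Node> nodes;

        std::uint32_t alloc(std::uint32_t value, std::uint32_t next) {
            nodes.push_back(Node{value, next});
            return static_cast<std::uint32_t>(nodes.size() - 1);
        }

        Node& at(std::uint32_t idx) { return nodes[idx]; }
    };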


I've also seen this happen on POWER. When generating 32-bit binaries, but with compiler tuning parameters telling it what minimum CPU it is on, I've seen it generate 64-bit instructions for certain operations.

Another fun tidbit -- the debugger is blissfully unaware that the CPU and compiler are conspiring to do this. If you debug such a program, it will work fine. If you set a breakpoint on the instruction in question and simply continue, the entire program freaks out because the debugger just chopped off the high 32 bits of whatever it touched. I found that amusing when hitting it while debugging an issue. (Sad face)

In general most tools freak out if they encounter this because they were built under the assumption this was impossible.


There is an opcode prefix in x86 machine code that asserts 64-bit mode for the next instruction, even when compiled for 32-bit and running on a 32-bit OS. It's anybody's guess if your particular compiler is smart enough to use this.


I don't think you can access 64bit registers from 32bit code. I know you could access EAX (32bit registers) from 16bit code with a prefix (0x66), but I do not think there's an equivalent access to RAX for the 32<->64bit case.


Ah right - those prefixes in 32-bit mode are inc/dec register operations. They only exist in 64-bit mode (where reg inc/dec is done with other instructions).


I seem to remember Agner Fog mentioning that the `adc` x86 instruction (add with carry) was considerably slower on many CPUs than `add` (regular addition) because it essentially had 3 dependencies (2 operand registers and carry flag), and had to be decomposed into 2 instructions with 2 inputs each for register renaming purposes. Adding 2 64-bit values on x86 is an `add` followed by a dependent `adc`, so that's considerably worse than 2 regular, independent 32-bit additions.
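For reference, this is roughly what a 32-bit x86 compiler turns a plain 64-bit addition into; the assembly in the comment is illustrative, and register choices will vary:

    #include <cstdint>

    std::uint64_t add64(std::uint64_t a, std::uint64_t b) {
        return a + b;
        // Typical 32-bit x86 code generation (illustrative):
        //   add eax, ecx   ; low halves, sets the carry flag
        //   adc edx, ebx   ; high halves + carry -- the extra dependency
    }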


On the other hand, superscalar/out-of-order execution means that the dependent instruction doesn't block anything, and the CPU will find something else to execute in the meantime, unless everything else is also waiting for it, in which case there's not really anything you can do, since a double-width add is two dependent single-width ones.


IIRC that limitation was lifted in Haswell (the then-new fused multiply-add would otherwise be pointlessly inefficient), though I have no idea how or whether that affects things like adc.


In the case of the carry flag it was only lifted in Broadwell, which also introduced ADCX and ADOX to let you perform two carry chains concurrently. Conditional moves were also improved; you can do 2 per cycle now.


The original question was in the context of storing times, and that makes me wonder: how many freaking operations are you doing on time stamps that you even care about this?


Original poster from SO here: there are some components that are doing schedule calculations, so in some parts there are quite a lot of time-related calculations, and performance can be important in these parts.

Also, I see now that the sentence "Our C++ library currently uses time_t for storing time values" might have been misleading; by "storing" I meant "keeping in RAM and registers, for calculations" rather than "keeping in non-volatile memory, for archival".


Where would one encounter a 32-bit system today?


You probably have one in your pocket.

Most ARM chips are 32bit, so most phones are 32bit. There is little reason for mobile chips to be 64bit.




