The IBM 360 shipped with 32-bit addresses but only 24 bits decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When they wanted the address space IBM found themselves architecturally hamstrung, and the cost to dig out was significant.
The 128K Macintosh used a 68000; it had 32-bit addresses but only 24 bits were decoded. "Hey, there's a whole byte up top that nobody's using today, let's put some stuff there!" When Apple needed the address space they found themselves hamstrung by pieces of MacOS that did use those bits, and many applications that did, too. The cost to dig out was significant.
It is practically guaranteed that "Hey, there's 16 whole bits up there that nobody's using today" will wind up the same, because this industry just never learns.
You can do things with lower bits and usually get away with it; many systems put GC tags and type bits down there. But those upper address bits do not belong to you.
On x86_64 and ARM without those features enabled, the top bits of the pointer must be a sign extension of bit 47. This means that x86_64 by default gives you the top two bytes to play with, as long as you don't use the values 0xffff or 0x0000. Attempting to access a pointer whose top 16 bits aren't a sign extension of bit 47 will fault. You can still safely play this game as long as you fix up the pointer before dereferencing it.
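For anyone who wants to see what that fix-up looks like, here's a minimal sketch for today's 48-bit x86_64 (assuming the usual arithmetic right shift on signed types; the helper names are made up):

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: stash a 16-bit tag in the top bits of a 48-bit-canonical
// x86_64 pointer. This breaks the day the hardware decodes more bits.
static inline uint64_t pack_tagged(void* p, uint16_t tag) {
    uint64_t raw = reinterpret_cast<uint64_t>(p);
    assert((raw >> 48) == 0 || (raw >> 48) == 0xffff);       // must be canonical
    return (raw & 0x0000ffffffffffffULL) | (uint64_t(tag) << 48);
}

static inline void* unpack_pointer(uint64_t v) {
    // Fix-up before dereferencing: sign-extend bit 47 back over bits 48-63.
    return reinterpret_cast<void*>(int64_t(v << 16) >> 16);
}

static inline uint16_t unpack_tag(uint64_t v) {
    return uint16_t(v >> 48);
}
```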
"Attempting to access a pointer whose top 16 bits aren't a sign extension of bit 48 will fault."
Currently. Coming soon to an Intel chip near you is 57-bit virtual addressing and 5-level page tables [1]. That would make for quite a bug: pointer fix-up code that doesn't restore bits 48-56 correctly would only crash on new Intel hardware, and probably only with a fairly full memory map.
x86 large pages are 2 MB (64-bit) or 4 MB (32-bit). The even larger variety is 1 GB. They save a lot of TLB misses, but can be painful to reliably allocate.
> Currently. Coming soon to an Intel chip near you is 57-bit virtual addressing and 5-level page tables [1]. That would make for quite a bug: pointer fix-up code that doesn't restore bits 48-56 correctly would only crash on new Intel hardware, and probably only with a fairly full memory map.
Yeah, I'd hope that code that uses such tricks would have an error check for pointers that use the bits it wants to use. Better a clean crash than corruption.
It'd be neat if a program could tell the OS what address range is acceptable. Linux has mmap(..., MAP_32BIT, ...) but obviously that's pretty limited. Maybe something like mmap(addr, ..., MAP_MAXADDR, ...) which would tell it that addr represents the maximum acceptable address to return. So if you intend to use the top 16 bits, you could tell it this mapping can't use those.
Edit: oh, actually, I see they do something kind of like this. https://lwn.net/Articles/717293/ "An application that needs [virtual address beyond 48-bits], and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible." Still not quite as flexible as I was imagining but not bad.
Back around 2001 I was part of a webdev shop that had its own proprietary application server. It pre-parsed HTML files for <script language="perl"> or <script language="python"> and would run whatever you put there. We linked in slightly patched perl/python libraries so we didn't have to start a new interpreter every request. There were a couple in-house RPN languages too, and other comment-based markup for easy loops/interpolations.

One design goal was to let you round-trip between developers' editors and designers' Dreamweaver, so that even after adding code the designers could still see real HTML and usually even change it. I haven't seen any system so successful at that part since. (I love Haml, but design tools don't. :-)

Anyway, I got to help out with porting the app server from Solaris to Linux, and one of the tricks they were using was to stuff some extra info into unused areas of the Solaris file struct, basically to smuggle it around without needing to rewrite anything. As long as you could get to the struct, you could retrieve what you put there. It was a brilliant-but-horrifying trick, and of course it was a big challenge for the porting effort.
It was pretty cool. The company was founded by MIT folks, so I guess they weren't afraid of mixing C & scripting languages. :-) I was just a junior developer, so I barely understood what was going on. The company tried to build an open source version, but it suffered a lot from second-system effect IMO.
I don't often get to do work as cool as that, but I've really enjoyed pursuing more researchy things in my spare time. Right now I'm working on adding SQL:2011 temporal features to Postgres, so I guess I'm still getting my C fix from somewhere. :-)
I once shrugged this off as paranoid BS and got bitten by it in the very same project. It cost me some all-nighters to fix the results of my arrogance.
Honestly, you may as well use the bits if they're available, because as you say, so many programs already do that reclaiming them would break backwards compatibility anyway.
With x86_64, the top bits have to be all 0, or all 1. You can still reuse them as long as you clear them before actually using them as pointers. Of course that might break at a later date if memory spaces increase in size and those bits are used, but I'll deal with that when it happens.
You'll have to deal with APIs that do return full 64-bit pointers with real information in the top bits. You won't be able to play games with these bits. Stuff is going to break unless you are careful. If you are shipping a platform of some kind then you must also ensure that your customers cannot get into trouble (if you tell them "Hey, these upper bits are up for grabs" then you are doing them a disservice, in several ways).
"We know this is wrong, but we'll deal with it when it happens" is a demonstrably expensive road. I can't think of many clearer cases of refusal to learn from history in computing.
I think as with everything it’s about context. If you play games deep within some internal component that is very limited in scope, easy to rewrite, and it buys you performance that is truly worthwhile, that’s one thing. If you’re doing this with pointers beyond some region of local control, you’re asking for it.
Nothing on x86_64 is returning "full 64 bit pointers", unless they are also playing games, as current CPUs only support 48 bits of address space.
Also, your suggestions are a bit strict, in my opinion. Every VM or interpreter I know about uses this kind of trick, usually to squish things like integers. Not doing this requires adding an extra allocation for every integer.
> Nothing on x86_64 is returning "full 64 bit pointers"
Well, I'm going from the processor reference manuals from Intel and AMD. Existing implementations are 48 bits, and future implementations WILL have more. I trust these folks to make good on their promise. (Similarly, the 68000 had 24 bits, and when the 68020 came out it had full 32-bit addressing, and an MMU, and badly written software broke).
There are pretty safe ways to do tagging; common techniques usually distinguish limited-range integers and pointers with a low bit, and pointers often have extra bits since those are relatively safe to manipulate. I have not seen runtimes that use high-order bits for tags in quite some time, probably not since the late 80s; you haven't seen that software recently either, because it doesn't work anymore.
The x86 segmented architectures (which is all we had from Intel until the 386's flat address space came out) had several variations on pointers, basically a menu of terrible tradeoffs between fast, small code and large address spaces. Far pointers were formed from a segment and an offset, and the game was all about making these multipart pointers efficient for things like comparison. In the face of pointer normalization, you mucked with high bits at your peril.
The one thing I'm sure of in this industry is that software is always going to be getting bigger. :-)
The responsibility for this lies not with just the programmers, but the address decode logic. The hardware should have seg faulted on non-zero upper bits (which would have required the addition of a few OR gates, hardly a problem).
Apple’s market cap is $983B and IBM’s is $125B. So somehow they’ve survived this lowbrow hack.
Seriously, with 64-bit addresses available on the iPhone I’m typing this into, this is an excellent trick, especially for applications. Even ARM’s TBI leaves 56 bits, which is more address space than a data center’s DRAM.
Right, because the quality of every lesson one learns about programming practice should be evaluated by the market cap of the firm where it was learned.
Those examples and companies, IBM (the IBM 360) and Apple (the Macintosh), were picked by the parent post.
DEC was acquired by Compaq not HP. Compaq was later acquired by HP but before that happened Compaq sold the Alpha line to Intel because it would have competed with the Itanium. The Alpha was then licensed by Intel to the Chinese (according to Gustafson) and was used as the basis for the Sunway Taihu Light.
Meh, when you get to the point where you're exceeding your available address space, there's probably a million other refactorings you had to make to address all of the other architectural changes.
Or, you play by the rules and your old code continues to just work. Might not be optimal, but at least it will run. Your customers will thank you . . . well, we all know they won't, but they won't be speculating about your ancestry in public forums.
I don't remember how many Mac applications were '32-bit clean' and ran without modification when VM was turned on. I do know that some apps had to do significant rework, and that there was a whole generation of abandonware where developers didn't think it was worth the effort and just walked away from their customers.
Can't you just make sure that your allocator only uses the bottom x bits? (Of course you still need to be careful when interfacing with code that maps memory in some other fashion)
If your allocator calls mmap or an equivalent, and the OS gives you back a pointer that uses all 64 bits (or more than 48, anyway) then what do you do? I guess the allocator has to retry and pray, or maybe just give up and fail. In any event your clever code is not going to work well, since you have made a choice that is non-portable.
I imagine that the guy on the MacOS team who stuffed some desk accessory bits into the high byte of a pointer parameter thought he was being clever. Four or five years down the line he was paying for it, and it took years to fix his original moment of convenience. He could have added a second parameter on the call in question, or allocated another byte in a struct, and the original system would have been slightly bigger but he wouldn't have hatched a nightmare. Go read about '32-bit clean' Mac applications if you don't believe me. It sucked real hard.
A bunch of companies with a bunch of smart people have made the decision to use "unused" high bits in pointers and have later regretted it, to the tune of a ton of expensive remedial engineering. I see a lot of talk here along the lines of "b-b-b-but we know what we're doing" and "oh, 48 bits is enough, and if it's not then we'll deal with it later" and I'm thinking, "This is precisely how the software industry just never learns."
I figure that I have about 20 years left in my career writing software. I've seen address spaces grow from 16 bits (and we used to think "wow, that's BIG") to 64 bits [okay, 48 bits if you're still in denial :-) ] and wow, that's big. But I'll bet I'll see 64 bits generally considered kind of tight before I retire, and I'll bet there are at least half a dozen people on the planet who are up against a 64-bit wall today. (High probability at least a couple of those folks are on HN. Any hands?).
(The world numbers these from 0 to N, least significant to most significant. IBM numbers them from 0 to 31, with 0 being the most significant. I guess that made sense to accountants).
For everyone that enjoyed this, there's an entire free online MIT course called Performance Engineering of Software Systems[1] where you'll learn plenty more tricks and common pitfalls like these. You'll also learn how to use tools to debug the low level performance of your programs: looking at cache misses, cpu utilization, time spent per assembly operation and so on. It's pretty cool :)
Very good article; the facts looked correct and it had useful advice.
I'd add: keep things local. Don't access memory (or cache) outside the core (L1 & L2), NUMA region, or processor socket boundary unnecessarily.
Keep networking, GPU, etc. code in same NUMA region where the physical adapters are.
Use memory like tape: stream through it. Hardware prefetchers love that kind of access pattern.
Oh, and perhaps most importantly: use a profiler that can access CPU internal performance counters. Do this on different system types, from low power laptops to servers with 2 or more CPU sockets.
One annoying thing, though. Remember that the fastest thing in a microbenchmark might not be the fastest thing on a real system when different code modules fight for shared limited resources, like memory bandwidth, caches and inter-core communication links.
Sequential access patterns, forward or backward. Repeating, predictable strides are OK too, but remember that the minimum unit that can be read from memory is a cache line. So if you read one byte, you'll still read 64 bytes on modern x86.
DRAM latency for a random access read can get into the low hundreds of cycles on modern multi-socket devices. But the streaming bandwidth remains very high. Cache systems will routinely prefetch the next block ahead of an access if they detect that memory is being used sequentially, eliminating a huge chunk of that pipeline stall.
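A quick way to see this on your own machine is the classic row-major vs. column-major traversal; a rough, unscientific sketch (the 64-byte line size and the actual timings are assumptions about typical x86 hardware):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Summing a matrix row by row touches each 64-byte cache line once and lets
// the prefetcher run ahead; summing it column by column touches a different
// line on every access. Same work, very different memory behavior.
int main() {
    const size_t N = 4096;
    std::vector<float> m(N * N, 1.0f);

    auto time_sum = [&](bool row_major) {
        auto t0 = std::chrono::steady_clock::now();
        double sum = 0;
        for (size_t i = 0; i < N; ++i)
            for (size_t j = 0; j < N; ++j)
                sum += row_major ? m[i * N + j] : m[j * N + i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.1f ms (sum=%.0f)\n",
                    row_major ? "row-major" : "col-major",
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };
    time_sum(true);
    time_sum(false);
}
```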
A discussion of systems programming optimization that doesn't start with single-writer ring buffers starts on the wrong foot.
Those other tricks are excellent, and I use all of them, in cases where they work at all. But, e.g., seeking a way not to need to take a lock at all should come before discovering a low-contention locking protocol.
Readers should note that packing spare bits into the bottom bits of suitably aligned pointers is more portable than using high bits. Any page-aligned pointer has at least 12 bits free at the bottom, and any malloc'd pointer has at least 2, more often 4.
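A hedged sketch of that low-bit variant, assuming the pointee is at least 8-byte aligned (true for malloc'd memory on typical 64-bit platforms); the helper names are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: stash a small tag in the low bits of a pointer that is known to be
// at least 8-byte aligned, so bits 0-2 of the raw address are always zero.
template <typename T>
uintptr_t tag_ptr(T* p, unsigned tag) {
    uintptr_t raw = reinterpret_cast<uintptr_t>(p);
    assert((raw & 0x7) == 0 && tag < 8);            // relies on the alignment
    return raw | tag;
}

template <typename T>
T* untag_ptr(uintptr_t v) { return reinterpret_cast<T*>(v & ~uintptr_t(7)); }

inline unsigned low_tag_of(uintptr_t v) { return unsigned(v & 7); }
```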
Ring buffer counters into a power-of-2 sized buffer can be incremented without bound, enabling use of ordinary arithmetic on them, with the high bits masked off cheaply on each use. [But use 64 bits!]
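Something like this, as a sketch (single-threaded; producer/consumer synchronization is deliberately left out):

```cpp
#include <cstddef>
#include <cstdint>

// Free-running-counter idiom: head and tail are 64-bit counters that only
// ever increase; the buffer index is (counter & MASK). Size must be a
// power of two so the mask works.
template <typename T, size_t SizePow2>
struct RingBuffer {
    static_assert((SizePow2 & (SizePow2 - 1)) == 0, "size must be a power of two");
    static constexpr uint64_t MASK = SizePow2 - 1;

    T data[SizePow2];
    uint64_t head = 0;   // next slot to write
    uint64_t tail = 0;   // next slot to read

    bool push(const T& v) {
        if (head - tail == SizePow2) return false;   // full; ordinary subtraction works
        data[head & MASK] = v;
        ++head;
        return true;
    }
    bool pop(T& out) {
        if (head == tail) return false;              // empty
        out = data[tail & MASK];
        ++tail;
        return true;
    }
};
```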
Probably the most neglected primitive data structure is the bitmapped set. A `uint32_t` gives you a universe of 32 elements; a byte is enough for the days of the week. The popcount native instruction is very valuable here, usually expressed as `__builtin_popcount` in source code. C++98's `std::bitset` provides Standard, portable access to it, but C++20 offers `std::popcount` directly.
[I add here that storing things in high bits of addresses is very likely to fail on systems with ASLR, and that I have learned MSVC bitsets have a very slow popcount.]
std::bitset is one. But you can just use an unsigned int, or array of them. In C++ or C, set intersection is &, union is |, complement is ~. Cardinality is __builtin_popcount, or std::bitset<>::count(). Membership test for m is (s & (1<<m)). For a larger set, represented as an array of unsigned, it's (s[m>>5] & (1<<(m&0x1f))). std::bitset encapsulates all this, optimally. There are similarly efficient implementations for lots of other useful operations: consult HAKMEM.
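Wrapped up, the array-of-words version looks roughly like this (a sketch with an arbitrary 256-element universe; __builtin_popcount assumes GCC/Clang):

```cpp
#include <cstdint>

// Minimal sketch of a bitmapped set over a universe of 256 elements,
// stored as an array of 32-bit words.
struct WordSet {
    static constexpr unsigned N = 256;
    uint32_t w[N / 32] = {};

    void insert(unsigned m)         { w[m >> 5] |=  (1u << (m & 0x1f)); }
    void erase(unsigned m)          { w[m >> 5] &= ~(1u << (m & 0x1f)); }
    bool contains(unsigned m) const { return w[m >> 5] & (1u << (m & 0x1f)); }

    unsigned size() const {         // cardinality via popcount
        unsigned n = 0;
        for (uint32_t word : w) n += __builtin_popcount(word);
        return n;
    }
};
```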
Bitsets are useful in more places than you might guess, because they can be used to filter out, with typically a single instruction, a majority of uninteresting cases in searches, leaving fewer cases to be examined with more precise and expensive operations. You record interesting properties of elements of a collection upon insertion as bits in a word stored alongside (or, better, in an index); and then first check the bits during any search. It is easy to speed up an amazing variety of programs by an order of magnitude, sometimes much more, with a tiny change.
For example, you can store an int value for each of a (possibly very large) collection of lines of text, representing the set of less-common letters that appear in the line. Searching for lines that contain a string, you first compute the same set for the string and check each line's set against it. Any line whose set doesn't include all of the string's letters certainly doesn't contain the string.
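As a rough sketch of that filter (simplified to the 26 lowercase letters rather than just the less-common ones; the names are made up):

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// For each line we precompute a 32-bit set of which letters appear in it.
// A line can only contain the query if its set covers the query's set;
// the expensive substring search runs only on the survivors.
static uint32_t letter_set(std::string_view s) {
    uint32_t bits = 0;
    for (unsigned char c : s)
        if (c >= 'a' && c <= 'z') bits |= 1u << (c - 'a');
    return bits;
}

std::vector<size_t> find_lines(const std::vector<std::string>& lines,
                               const std::vector<uint32_t>& line_sets,  // precomputed
                               std::string_view query) {
    const uint32_t q = letter_set(query);
    std::vector<size_t> hits;
    for (size_t i = 0; i < lines.size(); ++i) {
        if ((line_sets[i] & q) != q) continue;                // cheap reject
        if (lines[i].find(query) != std::string::npos) hits.push_back(i);
    }
    return hits;
}
```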
Leibniz (inventor of calculus) used this method very heavily in his own work. Before computers--and even into the 1960s--it was the most important way of automating data processing. Back then, you used cards that had a row of either holes or notches along one edge, and selected matching cards from a stack by inserting a rod through the stack at a selected spot, and lifting.
>Leibniz (inventor of calculus) used this method very heavily in his own work. Before computers--and even into the 1960s--it was the most important way of automating data processing. Back then, you used cards that had a row of either holes or notches along one edge, and selected matching cards from a stack by inserting a rod through the stack at a selected spot, and lifting.
Hey, do you have a reference for that? I've been doing some research into Leibniz's calculators, and I've been finding few sources.
There is quite a lot about Leibniz's calculating machine designs on Wikipedia.
I think I found out about Leibniz's bitwise activities in Neal Stephenson's Quicksilver, but he invented the modern notions of both sets and digital logic, according to Wikipedia. He would have used the cards in cataloguing libraries.
Yup, I used bitsets to write a super duper fast Sudoku solver. For example, you can detect naked singles extremely fast by just AND-ing the bitsets in a given row, column, or house, taking the complement, and checking if a unique bit is set.
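For the curious, the core of that test is tiny; a sketch, assuming 9-bit masks of already-placed digits per row/column/house and GCC/Clang builtins:

```cpp
#include <cstdint>

// A cell's candidates are the complement of the union of digits already used
// in its row, column, and house; a single set bit means the cell is forced.
constexpr uint16_t ALL_DIGITS = 0x1ff;   // bits 0-8 stand for digits 1-9

inline int naked_single(uint16_t row_used, uint16_t col_used, uint16_t box_used) {
    uint16_t candidates = ALL_DIGITS & ~(row_used | col_used | box_used);
    if (__builtin_popcount(candidates) != 1) return 0;        // not forced
    return __builtin_ctz(candidates) + 1;                     // the forced digit
}
```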
But they used to be essential for automating chess. A 64-bit word represents a board, and a set of words represents the positions of each type of piece, one word for white pawns, etc. Another word shows places threatened by the white pawns, and ORing them shows all the threatened places.
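A small sketch of the pawn case, assuming the common mapping of bit 0 = a1, bit 63 = h8 (index = rank*8 + file):

```cpp
#include <cstdint>

// White pawns attack one rank up and one file to either side; the file masks
// stop the shift from wrapping around the board edges.
constexpr uint64_t FILE_A = 0x0101010101010101ULL;
constexpr uint64_t FILE_H = 0x8080808080808080ULL;

inline uint64_t white_pawn_attacks(uint64_t pawns) {
    uint64_t left  = (pawns << 7) & ~FILE_H;   // captures toward the a-file
    uint64_t right = (pawns << 9) & ~FILE_A;   // captures toward the h-file
    return left | right;
}
```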
Good article. Basics that everyone can benefit from knowing.
Just one nit/warning... breaking coarse locks into fine-grained locks can be taken too far. There is a point of diminishing returns where you end up spending increased time acquiring/releasing/waiting-for locks. At some point you want to clump together under a single lock resources that tend to often be used together, even if you often end up locking an extra resource or two unnecessarily. As always, benchmark workloads are your friends.
Ha Ha! Yes. The Gilectomy project collected some interesting data on that a couple of years ago, creating very fine-grained locks and measuring. Not unexpectedly, with a zillion locks performance suffered. The reason, though, had much to do with how much it thrashed the cache on a modern CPU: lots of cache invalidation traffic back and forth between cores. The CPython implementation is architected around the GIL, and that is a tough problem to crack.
My favorite too-many-locks story goes back quite a long time. A former coworker is an old-time Unix guru who was at Sequent back in the day when 8 CPUs was kind of a big deal. He spent about 9 months on a project putting lots of fine-grained locks into their Unix kernel. After shipping that, his next project was 6 months taking about 25% of them out :)
I didn't read it as a joke, you're just operating at a different abstraction level. The CPython interpreter famously uses a single global interpreter lock to protect the language internals and runtime, so it has trouble scaling beyond a single CPU on interpreter-heavy workloads. You're saying that threads in python scripts can be arbitrarily preempted, and so locking is required to protect them against each other, which is also true.
Theoretically not necessarily the referenced GIL, though. There are conceptually alternative models; see Microsoft doc[0] on apartment threading model, or Tcl[1].
In most cases your compiler should do the clever work of turning your division or modulo operation into cheaper bit twiddling... but only if you're operating on a reasonable constant. Powers of two are best, but you can do other constant divisions by stringing together 1-cycle-latency operations in ways that are still far faster than division.
Yes - as long as x is unsigned, then (x % 1024) will be compiled to the same thing as (x & 1023) with any reasonable compiler.
If your hashtable dynamically resizes though, (x % size) will use a full divide. You could keep the log2 of the size around instead, and rely on (x % (1U << size_shift)) being optimised (this works on gcc: https://godbolt.org/z/HbGnJH) but (x & (size - 1)) might be easier at that point.
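Roughly what that looks like in a resizable table (a sketch; whether the % form is actually recognized depends on your compiler, per the godbolt link above):

```cpp
#include <cstdint>

// Keep the table size as a shift so the "modulo" can stay a mask even though
// the size is not a compile-time constant.
struct Table {
    unsigned size_shift;                         // table holds 1u << size_shift buckets

    uint32_t bucket_mod(uint32_t hash) const {   // relies on the compiler spotting the pattern
        return hash % (1u << size_shift);
    }
    uint32_t bucket_mask(uint32_t hash) const {  // always compiles to an AND
        return hash & ((1u << size_shift) - 1);
    }
};
```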
Instead of repurposing top bits you can also repurpose the bits freed up by alignment. E.g. 32-bit integers are aligned to 4 bytes, so you can use the lower two bits of pointers to them instead.
As someone who's worked on old Macs and has also done lots of 32 -> 64-bit porting, this is the sort of trick that works wonderfully...until it doesn't. And then you've got a nightmare on your hands.
I'm not saying never do that (ok, maybe I am...) But definitely think long and hard about how long your code will be around before you do it.
> As someone who's worked on old Macs and has also done lots of 32 -> 64-bit porting, this is the sort of trick that works wonderfully...until it doesn't. And then you've got a nightmare on your hands.
That's why you hide the trick behind a zero-cost abstraction which checks at compile time whether the platform supports this.
Modern optimizing C++ compilers (especially with Link Time Optimization enabled) are pretty amazing and can very often actually achieve that abstraction collapsing.
But, of course, always measure.
While it's true that modern compilers are wondrous things, checking whether they're clever enough to optimize away a particular construct - and to do so correctly, and to continue doing so in the next release - still takes time. If the same optimization can be done at a higher level, such that it will apply for any correct (but not necessarily clever) compiler, that's preferable. In my experience that's practically all the time. The best compiler optimizations IMO are the ones that can't be easily done at the source level.
Yeah, but an even more important system programming trick is to measure what you're doing every time. Performance optimization at this level is never about just trusting tools. If you aren't willing to be reading generated machine code and checking perf counters, you aren't really going to get much benefit.
And if you're willing to check your optimizations by reading disassembly, tricks like stuffing tag bits into the bottom of aligned pointer values is pretty routine.
That does not hold for Intel x86 architecture chips, which are perfectly happy to have unaligned integers.
As for struct members, the alignment is (of course) "implementation defined", which is the fancy way of throwing up your hands and saying "whatever". (Since C++11 we actually have alignas(), which at least gives manual control.)
One can construct an aligned pointer on a chunk of allocated memory by asking for the needed quantity plus the size of the alignment and then "nudging" the pointer you get to the next aligned boundary (and you can do something with the "wasted" bytes before it, like storing the size of the object).
But that sort of trick is only worth it if you are writing a compiler and its runtime, an interpreter, a memory allocator (in particular GCs), or at the very least some sort of high-performance library (you would return your "featured pointer" as an opaque type to the user).
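For reference, a sketch of that over-allocate-and-nudge trick, here stashing the nudge offset (rather than the object size) in the wasted bytes; in modern code you'd usually just reach for C11 aligned_alloc or C++17 aligned new:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Over-allocate by 'align' bytes, round the pointer up to the next aligned
// boundary, and remember how far we moved in the byte just below it so the
// original pointer can be recovered for free().
void* aligned_malloc(size_t size, size_t align) {
    assert(align && (align & (align - 1)) == 0 && align <= 128);  // power of 2, fits in a byte
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(size + align));
    if (!raw) return nullptr;
    uintptr_t nudged = (reinterpret_cast<uintptr_t>(raw) + align) & ~uintptr_t(align - 1);
    unsigned char* p = reinterpret_cast<unsigned char*>(nudged);
    p[-1] = static_cast<unsigned char>(p - raw);   // remember the offset (always >= 1)
    return p;
}

void aligned_free(void* p) {
    if (!p) return;
    unsigned char* q = static_cast<unsigned char*>(p);
    std::free(q - q[-1]);                          // undo the nudge
}
```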
> Part of why the change couldn’t be enabled by default is because various high performance programs, notably various JavaScript engines and LuaJIT, use this repurposing trick to pack some extra data into pointers.
Does any one know if this sentence can be backed up by a citation?
I know that the NaN-tagging trick assumes that pointers have 48 bits (small enough to fit inside a floating point mantissa), but was this ever a factor for deciding whether 5-level page tables should be added to the Linux kernel or not?
5-level page tables have actually been in the kernel for a couple of years now.
The issue listed was definitely a concern, but was worked around by having the kernel only allocate userspace linear addresses that aren't 48-bit-canonical in response to a mmap() call that supplies such an address as the hint argument. See the commit message in this commit for example: https://lore.kernel.org/patchwork/patch/796025/
So for a program to get an address above the 48-bit limit, its memory allocator has to specifically indicate to the kernel that it supports that.
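A sketch of what that opt-in looks like from user space (Linux-specific; the exact hint value is arbitrary, and on a kernel or CPU without 5-level paging you'll just get a normal low mapping back):

```cpp
#include <cstdio>
#include <sys/mman.h>

// A plain mmap() stays within the default 48-bit user-space range; passing a
// hint address above it tells the kernel this process can handle 57-bit
// linear addresses.
int main() {
    const size_t len = 1 << 20;

    void* low  = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    void* hint = reinterpret_cast<void*>(1ULL << 52);   // above the 48-bit range
    void* high = mmap(hint, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    std::printf("default mapping: %p\nhinted mapping:  %p\n", low, high);
}
```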
I guess partially because it can quickly become very dependent on the system you work on. But if you're specifically interested in caching issues, one good keyword to look for is "cache aware algorithms" - or "cache oblivious algorithms".
On the locking side, the counterpart would probably be "lock-free algorithms", though I still don't think their complexity means you shouldn't look at them in most cases :)
The false-sharing macro in the example expands to __attribute__((aligned(/* etc */))) or __declspec(align(/* etc */)). Is there a reason these are preferred over the alignas specifier introduced in C++11?
> Now, to support multiple processors on a single machine reading and writing from the same memory in a coherent way, only one processor on a machine can have exclusive access to a given cache line.
Does this also apply when multiple processors are only reading memory?
No. That's what MESI-based cache protocols are all about. Multiple cores/processors can have the same cache line in a shared/read-only state. Only writes require exclusive access.
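A hedged illustration of the false-sharing point, using plain alignas(64) as a stand-in for whatever the article's macro expands to (64 bytes is an assumption about the cache-line size):

```cpp
#include <atomic>
#include <thread>

// Two counters hammered from two threads. Packed, they share one cache line,
// which ping-pongs between cores in exclusive state; aligned to separate lines
// (alignas, or std::hardware_destructive_interference_size where available),
// the false sharing disappears.
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};               // same cache line as a
};
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};   // pushed onto its own line
};

template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}

int main() {
    Packed p;  hammer(p);   // time this...
    Padded q;  hammer(q);   // ...against this
}
```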
However, for signed division, the C semantics (round toward zero) are different from the semantics when you apply an arithmetic shift (round toward negative infinity).
If you are fine with the latter behavior explicit shifts remove several extraneous instructions dealing with the difference.
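Concretely (a tiny example; it assumes the usual arithmetic right shift for negative signed values):

```cpp
#include <cstdio>

// For negative values, / rounds toward zero while >> rounds toward
// negative infinity, so the two are not interchangeable.
int main() {
    int x = -7;
    std::printf("%d / 4  = %d\n", x, x / 4);    // -1 (toward zero)
    std::printf("%d >> 2 = %d\n", x, x >> 2);   // -2 (toward -infinity)
}
```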