Beating the L1 cache with value speculation (mazzo.li)
157 points by rostayob on July 23, 2021 | 53 comments



How realistic is it to have a linked list in contiguous memory that stays nicely ordered? In my experience, lists are normally cobbled together from nodes allocated widely separated in time, so I have no knowledge I can use to predict the address of the next node.

If you're in a scenario where this optimization works, it seems like it would be better to just use an array.


Are you familiar with Donald Knuth's "Dancing Links"?

It's a covering-problem solver (the problem is NP-complete) that uses linked lists highly efficiently. All nodes are "statically allocated" (at least, with respect to the algorithm), because the covering problem takes up a constant amount of space.

The linked-list "Dances", pointing at other nodes to represent which combination of covers are attempted. So any link could potentially point to any other link later in its list. However, the number of nodes themselves is constant and set before the algorithm runs.

As such, the linked-list is fully X1->X2->X3->X4... at the start of the algorithm. (where X1, X2, X3 are contiguously allocated in memory). If X3 is proven to "not be a good guess" for our covering problem, then it is cut out of the linked list (X1->X2->X4).

All links remain in order, and are ordered in two dimensions (so not only X1->X2->X3... but also X1->Y1->Z1...), where X, Y and Z are the elements trying to cover locations 1, 2, 3, ...
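For the unfamiliar: the "dance" itself is just doubly-linked removal and restoration. A minimal sketch of that core step (names are mine, not Knuth's):

  /* Minimal sketch of the "dancing" step on a doubly-linked node.
     Names are illustrative, not Knuth's originals. */
  typedef struct DlxNode {
    struct DlxNode *left, *right;
  } DlxNode;

  /* Cut a node out of its row: X1<->X2<->X3<->X4 becomes X1<->X2<->X4.
     The removed node keeps its own left/right pointers... */
  static void dlx_remove(DlxNode *x) {
    x->left->right = x->right;
    x->right->left = x->left;
  }

  /* ...so undoing the removal while backtracking is symmetric. */
  static void dlx_restore(DlxNode *x) {
    x->left->right = x;
    x->right->left = x;
  }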

------------

Most simple malloc implementations also keep the free list as a simple linked list that is, in fact, ordered in memory. (The first node in the free list is at the lowest address, while the final node in the free list is at the highest address.)

The FAT32 linked-list across a hard drive is also nicely ordered (and when it isn't, the user will perform "defragmentation" to reorder the list and optimize the hard drive).

-----------

IMO, the "nicely ordered Linked List" is uncommon, but... common enough that its worth discussion. And sure, maybe we don't use FAT32 filesystems today very often, but that's the technology I used growing up lol. I know the methodology works!!


I think you need to be aware of why such an implementation (FAT32) had to exist. I believe (I also grew up with FAT32) that it basically vanished when HDDs with caches in the MB range started to show up. Journaling became a thing; actual caches became a thing, so the HDD could actually lay things out well by utilizing said cache.

NTFS still needed to be defragmented occasionally, but it allocated more of a buffer to mostly avoid having to do so.

I think, in general, you're spot on about utilizing contiguous arrays of memory for linked lists, as they will almost always result in better performance characteristics (unless you really need in-order insertion/removal and it's very write heavy).


It's fairly standard for a compacting GC to rearrange lists this way, because memory latency was a concern even back in the 80s.

If you're using linked lists in a language without a compacting GC, then you almost certainly want something else instead.

If you're using them in a language with one, then you still usually want something else, but at least the cost isn't as high. Also you might be using Haskell or something, where other constraints make data-structures like arrays painful to use.


You can get help from the allocator. For example jemalloc has an experimental API called "batch allocation" where it returns a list of allocations of the same size, and it makes its best effort to make them contiguous (not guaranteed though): https://github.com/jemalloc/jemalloc/blob/12cd13cd418512d9e7...

The idea is that you may have a data structure that you want to be able to modify (in which case you need to free individual nodes) but in most cases you do not.

Then you could use batch allocation, and pair it with this optimization.


That depends on the allocator you use. If you allocate your nodes from an arena that belongs to the list, the resulting layout is usually pretty good.
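For instance, a toy bump arena that hands out list nodes back to back; a sketch assuming fixed-size nodes and no individual frees (names are illustrative):

  #include <stdlib.h>

  typedef struct Node { struct Node *next; long value; } Node;

  /* Toy bump arena: nodes come out contiguous, and the whole arena is
     freed at once when the list is destroyed. */
  typedef struct {
    Node  *slots;
    size_t used, cap;
  } Arena;

  static Arena arena_create(size_t cap) {
    Arena a = { malloc(cap * sizeof(Node)), 0, cap };
    return a;
  }

  static Node *arena_new_node(Arena *a) {
    if (a->used == a->cap) return NULL;  /* real code would grow or chain blocks */
    return &a->slots[a->used++];
  }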


It's extremely common in Lisp. Linked lists are the natural data structure of Lisp and in most modern programs they rarely get modified. Common Lisp provides vectors for greater efficiency but they might be less necessary in many programs if this trick works.


That's a very good use case. I wonder if a good CL compiler would be able to convert many of these lists to arrays anyway.


This works because we've arranged for the nodes to be sequential in memory; if we have the luxury of doing that, then we could use an array instead of a linked list. Is there a scenario where this value speculation trick would work where we /don't/ have this luxury?


Given the nature of branch prediction, I’d expect it would still help dramatically in the case where the list was almost entirely sequential, which seems a bit more plausible.


Agree; if a mispredict costs 15-20 cycles and a predict saves 4 cycles, does that mean we break even around 1/4-1/3 sequential?


I could imagine behaviors like this happening with object pools, though it does still feel a bit contrived.


A quick googling brought me to the following paper:

"Data value speculation in superscalar processors"

https://www.sciencedirect.com/science/article/abs/pii/S01419...

It would appear there is an entire section dedicated to "Value prediction in a realistic scenarios", but I am not curious enough yet to pay money to access this document.

Edit: Found another that is accessible.

"Practical Data Value Speculation for Future High-end Processors"

http://class.ece.iastate.edu/tyagi/cpre581/papers/HPCA14Data...


In the field of high performance processor microarchitecture, one goal is to increase performance by executing a maximum number of instructions in parallel (increase IPC). For this purpose we identify dependencies between instructions. We identify 4 types of dependencies between instruction registers: Read after Write (RaW), Read after Read (RaR), Write after Read (WaR) and Write after Write (WaW). (the same kind of dependencies exist with memory)

Some of these dependencies are caused only by the limited number of registers; with more registers they could be removed, which is why they are called false dependencies. 3 of the 4 are false dependencies: RaR, WaR, WaW. With register renaming techniques we can eliminate these dependencies and therefore increase IPC.

Only the Read after Write (RaW) dependency remains and it really determines what can be executed in parallel or not in modern machines. The whole point of value speculation is to break the RaW dependency to increase IPC.
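Concretely, in the article's linked-list walk the loop-carried RaW dependency is the pointer chase itself. A sketch of the plain loop, where nothing in the next iteration can start until the load of `node->next` completes:

  #include <stdint.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  /* Plain walk: a pure RaW chain. Each iteration's loads can't be issued
     until the previous iteration's load of node->next has produced its value. */
  uint64_t sum_list(Node *node) {
    uint64_t value = 0;
    while (node) {
      value += node->value;
      node = node->next;  /* RaW: the next iteration reads what this load wrote */
    }
    return value;
  }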

The papers you quoted cover value speculation in hardware, while the article covers value speculation in software.

The possibilities and gains of these 2 approaches differ: in software we can predict a whole expression, while in hardware we can only predict a static value (or a pattern of values). But hardware doesn't require any particular effort at compile time and it doesn't increase the code size.

I recently spoke with one of the researchers you mentioned (who deals with value speculation in hardware) and he acknowledged that the performance gain is currently quite low relative to the amount of hardware to be added.

But maybe the research in hardware value speculation will lead to a higher performance gain? Maybe in software there are more realistic gains? Maybe even a mix of hardware and software can make the value speculation powerful?

Imagine an instruction `predict <rd>, <r1>, <r2>` where rd takes the value of r1, but the processor is allowed to use the value of r2 as a value prediction.


  if (node != next) { node = node->next; }

How does this work? Shouldn't this be `if (node != next) { node = next; }`?


That looks weird, why would you need to test if two values are equal if you’re going to assign them to be equal anyway?


That is the trick.

In the happy path you are not assigning (`node = next`).

It is taken care of by `node++`, which removes the loop-carried dependency, so the processor can use the full instruction-level parallelism.
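Putting the pieces from this thread together, the corrected loop looks roughly like this (a reconstruction from the fragments quoted here, not a verbatim copy of the article):

  #include <stdint.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  uint64_t sum_speculative(Node *node) {
    uint64_t value = 0;
    while (node) {
      value += node->value;
      Node *next = node->next;  /* the slow, dependent load */
      node++;                   /* guess: the next node sits right after this one */
      if (node != next)         /* almost always false for a contiguous list, */
        node = next;            /* so the predicted branch hides the dependency */
    }
    return value;
  }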


It looks like a bug. The unhappy path contains both `node++` and `node=node->next`. Note that this is in the code following "Let’s go back to the code we showed for value speculation in C:", which is actually different from the preceding code it's supposed to be a copy of. I guess it's a typo.


The author has fixed this discrepancy now.


It looks as if the difference is in delaying the check for next == NULL.


  while (node) {
    value += node->value;
    next = node->next;

So it is the same thing.


How do they get the cycles/elem and instrs/cycle counts? Is it via the perf utility? Or something they calculate? Or they get it some other way?


Some tools you can do this with:

The perf command line utility, the perf_event_open system call, the libpapi high level API, pmc.h on FreeBSD (IIRC).

Then there are tools for more clever analysis, which can tell you the why rather than the what: they collect the same information but can present it graphically and compute, for example, TMAM statistics (top-down microarchitecture analysis method).

vTune, Intel Advisor, AMD uProf, and a few others. vTune is the best for x86, but only supports Intel fully.
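For reference, a bare-bones perf_event_open sketch that counts cycles and instructions around a region of code (Linux-only, error handling omitted; not the blog post's actual harness):

  #define _GNU_SOURCE
  #include <linux/perf_event.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Open one hardware counter for this process on any CPU, initially disabled. */
  static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }

  int main(void) {
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int instrs = open_counter(PERF_COUNT_HW_INSTRUCTIONS);

    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under measurement, e.g. the list-summing loop ... */

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t c = 0, i = 0;
    read(cycles, &c, sizeof(c));
    read(instrs, &i, sizeof(i));
    printf("cycles=%llu instructions=%llu ipc=%.2f\n",
           (unsigned long long)c, (unsigned long long)i,
           c ? (double)i / (double)c : 0.0);
    return 0;
  }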


Personally, I prefer to just use the RDTSC assembly instruction (Read TimeStamp Counter), which provides the 64-bit number of clocks your core has ticked.

Both Windows and Linux provide more robust timers (in particular: if your thread takes longer than 10ms, there's a chance your thread will sleep to give other threads a shot at the CPU). So if you're timing something longer than 10ms, you probably want to use OS timers instead.
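A minimal sketch of that approach, using the __rdtsc intrinsic (the serialization and frequency caveats discussed downthread apply):

  #include <stdint.h>
  #include <stdio.h>
  #include <x86intrin.h>  /* __rdtsc on GCC/Clang */

  int main(void) {
    uint64_t start = __rdtsc();

    /* ... code under measurement ... */

    uint64_t end = __rdtsc();
    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
    return 0;
  }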

-------------

There were a few programs where I couldn't add rdtsc easily to the code (in particular: I was trying to test something so fast that rdtsc took up the bulk of the time). In these cases, I went into the BIOS, disabled "turbo" on my CPU, locking my computer to 3.4GHz.

From there, I took the Windows timer and measured 1-billion events, and then divided by 3.4-Billion (3.4GHz == 3.4-billion clocks per second).

---------

I don't know the specific methodology that the blogpost used. But there's many easy ways to do this task.


There's a bit more to it than turning a flag off in your BIOS.

Intel have a document about it [0]

[0] https://www.intel.com/content/dam/www/public/us/en/documents...


> , which provides the 64-bit number of clocks your core has ticked.

Not quite


I mean... computers are complex these days. I could type up like 3 paragraphs that more specifically describe what is going on, but is that really helpful?

Yeah, pipelines and out-of-order execution make the definition a bit difficult. If you want to ensure that all previous instructions are done executing, you need lfence, and if you want to prevent future instructions from filling in the pipelines you'll need an mfence.

There are many clocks (even within a core). The turbo-clock is different from the standard clock. I forget exactly which clock rdtsc uses, but I do know that under some processors under certain conditions, you'll get weird results.

Different processors may have different interpretations of "clock" (mostly due to turbo and/or sleeping behavior). Etc. etc. I don't recall the details, but these different clock states could vary as much as 2.2GHz to 4GHz on my processor (P1? Turbo? I forget the exact name...)

---------------

But all in all, you get a 64-bit number that describes the number of clock-ticks --- for some "definition" of clock tick that differs between processors... and for some definition of "now" (in the case of out-of-order execution and/or pipelined execution, the "now" is a bit ambiguous, as previous instructions may have not finished executing yet and future instructions may already be executing).

If you really want to know, read the processor manual specific to the microarchitecture (since different microarchitectures could change these definitions)


> If you want to ensure that all previous instructions are done executing, you need lfence, and if you want to prevent future instructions from filling in the pipelines you'll need an mfence.

LFENCE does not serialize, nor does MFENCE. CPUID, however, is documented as a serializing instruction and is the recommended way to serialize, particularly with RDTSC.

> I don't recall the details, but these different clock states could vary as much as 2.2GHz to 4GHz on my processor (P1? Turbo? I forget the exact name...)

Oh heck, it's way more than that. I've measured ~5x difference in clock cycle count for short loops using RDTSC. Supposedly RDTSC returns "nominal" cycles that advance at the same rate relative to the wall clock, but TBH that doesn't smell right. OSes also try to synchronize the absolute values of the various processors, so jumping between CPUs isn't that bad.


"Invariant RDTSC" has been the norm for a long time now (identifiable by a CPUID feature bit) and it doesn't vary with power states or dynamic frequency. Which means it's just a lightweight, high precision timer at this point. In the Pentium 4 era you had a weaker guarantee called "Constant RDTSC" which could stop ticking in certain low power states.

Anyway, invariant RDTSC's tick rate is completely separate from the core clock. So the main issue you have to worry about with invariant RDTSC is having your process unscheduled or having ticks "stolen" by interrupts (which includes firmware invisible to the kernel or hypervisor).


The program is reading perf events.

The code is at https://gist.github.com/bitonic/78887f5d3238bab5e31f3c5a41d4...


This seems like it would cause a lot of those "I just wish the damn machine did what I told it to" situations.


Not necessarily. After putting a lot of brain grease into writing the code, it will probably work just fine.

But it will lead to a bunch of serious "WTF!?" situations when other people look at the code and/or are supposed to make changes at a later point.

And possibly stuff breaking in spectacular ways after introducing "minor cleanup" patches. Or some weird performance regressions after touching completely unrelated code that happens to affect the temporal order of the memory allocations in question, and so on...

Besides improving performance, doing this kind of trickery in production code will probably also improve your job security.


Why? Most of the time these predictions help you.


I like it as an exercise for understanding the CPU, but is this stuff really useful in the real world? The example given is "optimised" away by GCC/Clang without hacks. My guess is that if this were a good way to iterate over linked lists, compilers would have optimised for it by now, either through cleverness or through hints added to your code.


For a compiler to properly do the optimization detailed here would require it to have knowledge of runtime execution/memory access patterns. This might be done through profile-guided optimization, or by providing hints like __builtin_expect.

That said, compilers are not magic boxes. Relying on your compiler for optimal codegen has diminishing returns as you approach the point where microarchitectural details start making a difference. Most people don't get to that point, so it's rarely an issue.
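For what it's worth, a hint like __builtin_expect only marks the correction branch as cold in the hand-written loop; it won't make the compiler invent the speculation on its own, and as noted elsewhere in the thread gcc/clang may still collapse the guess entirely. A sketch:

  #include <stdint.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  uint64_t sum_hinted(Node *node) {
    uint64_t value = 0;
    while (node) {
      value += node->value;
      Node *next = node->next;
      node++;
      /* Hint that the correction is the cold path; this affects block layout only. */
      if (__builtin_expect(node != next, 0))
        node = next;
    }
    return value;
  }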


The code relies on information not present in the program: "this linked list consists mostly of sequentially arranged nodes".

If you could express that in your programming language, then yes, the compiler could optimize it. One could argue that the "weird" if statement is just that hint.

And is it useful? It depends. There are plenty of situations where doubling your throughput is worth a lot of effort. There've been situations in my career where I've spent a month or two to shave a cycle or two. It's rare. But when you need it, you want the biggest bag of tricks you can bring, because it really matters :)


Compilers 'just' apply graph transformations to code that hopefully lead to improved performance.

A compiler can therefore optimize away unnecessary loads/stores, or recognize implicit parallelism and transform a loop into a SIMD loop.

But a compiler will not change the logic of your code or add extra steps.


> But a compiler will not change the logic of your code or add extra steps.

Compilers can and do delete checks for null pointers, integer overflow... The stuff they do when they're faced with simple aliasing is honestly pretty crazy.


I mean this is quite easy to disprove - what do you think -funroll-loops is doing under the hood if not adding extra steps?


A "loop" is just a cycle in a graph.

A -> B -> C -> A, hey look, it's a cycle. Eventually A->D (where D exits the loop). So really node A has an edge going to B and D.

Knowing that A has an edge to B and D, an equivalent transformation is A1->B1->C1->A2->B2->C2->A1, but also A1->D and A2->D.

Both A1 and A2 still have to point at D. But A1 points to B1, and A2 points to B2. Etc. etc. That's all your compiler is doing: it's reasoning about the graph and transforming it with actually... very simple rules. (Unfortunately, searching over those transformations is NP-complete, so heuristics are used to keep compile times reasonable. NP-completeness seems to keep coming up in transformations built from "simple" rules...)

Then, we might see that some logic could make A2 not necessarily point to D (cutting out one possible jump statement). And that's where we get our optimization from: simple reasoning about the graph which removes one instruction per two iterations. Unroll by 8 and we save 7 instructions (cmp/jump) over every 8 iterations.
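In code terms, that transformation is just the following (a sketch; real compilers also emit a remainder loop for odd trip counts):

  #include <stddef.h>

  /* Original loop: one compare-and-branch (the back edge to A) per element. */
  long sum_plain(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
      sum += a[i];
    return sum;
  }

  /* Unrolled by 2, assuming n is even for brevity: the A1->B1->C1->A2->B2->C2
     shape, with one compare-and-branch per two elements. */
  long sum_unrolled2(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i += 2) {
      sum += a[i];
      sum += a[i + 1];
    }
    return sum;
  }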

An "optimizing" compiler just reasons about this graph and all possible equivalent transformations (or at least, uses a heuristic to reason over a portion of equivalent transformations), and chooses such a graph transformation that minimizes various cost estimates.

-------

That's the thing. Compilers can only transform the code-graph in ways that are "equivalent". Now yes, things get a bit weird with undefined behavior (and pointer-aliasing: assuming a lack of aliasing is needed for C/C++ compilers to make even the simplest of transformations. So that's a bit of a corner case). But that's the gist of what they're doing.


This is not a 'static' optimization. It relies on knowledge that the compiler cannot deduce by looking at the code (whether the linked list is _mostly_ laid out contiguously).

It might still be a very useful optimization in a JIT runtime. If your loop variable keeps increasing the same amount after every lookup, maybe rewrite the loop to add speculation...


> Both gcc and clang easily deduce that the guessing is semantically pointless

Can someone explain this please, because I don't see how: since `node` and `next` could possibly be different values, wouldn't that `if` be needed, and so it's in no way a NOP?


The compiler deduces that `node++` followed by the `if` is equivalent to an unconditional `node = next`: when the guess is right, assigning `next` changes nothing anyway. So it removes both `node++` and the `if`.
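One way to stop that collapse (a sketch, not necessarily the exact trick the article uses) is an empty inline-asm barrier that makes the guessed pointer opaque to the optimizer:

  #include <stdint.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  uint64_t sum_opaque_guess(Node *node) {
    uint64_t value = 0;
    while (node) {
      value += node->value;
      Node *next = node->next;
      node++;                        /* the guess */
      asm volatile("" : "+r"(node)); /* compiler can no longer reason about node's value */
      if (node != next)
        node = next;                 /* so the correction branch survives */
    }
    return value;
  }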


Oh?! So it will aim to generalise even when the semantics aren't 1:1? Mind blown.

Thanks for that.


Interesting, but that kind of code can sometimes be made a lot faster with prefetch instructions, _mm_prefetch intrinsic in C++.
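A sketch of what that could look like in the list walk, prefetching the address we expect some nodes ahead (it leans on the same contiguity assumption; the distance is a made-up tuning knob):

  #include <stdint.h>
  #include <xmmintrin.h>  /* _mm_prefetch */

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  /* Instead of speculating on the value of node->next, hint the cache with
     the address we expect a few nodes ahead. Prefetch of a bad address is
     harmless on x86, so overshooting the end of the list is fine. */
  uint64_t sum_prefetch(Node *node) {
    enum { DIST = 8 };  /* prefetch distance in nodes; tune for your machine */
    uint64_t value = 0;
    while (node) {
      _mm_prefetch((const char *)(node + DIST), _MM_HINT_T0);
      value += node->value;
      node = node->next;
    }
    return value;
  }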


Very well explained... it's a really interesting idea.


Would this introduce spectre like vulnerabilities?


I was under the impression that high end CPUs already perform value prediction while speculating?


There are some restricted forms of value prediction in the wild: indirect branch prediction is fairly common; so is alias prediction, which speculates on whether two data addresses (at least one of which is a store) do not alias; AMD's memory renaming also makes assumptions about stack addresses. Outside of these special cases I'm not aware of any CPU that will actually speculate general values or addresses.

On relaxed memory model machines naive value prediction would break dependency ordering (i.e. memory order consume), so things get very complicated.


What you posted is right, but I wouldn't classify indirect branch prediction as value prediction, because it doesn't supply a value to any other part of the processor (e.g. substitute a predicted value into a register). It just steers the frontend to where instructions should be fetched from, so it is just branch prediction.


Branch prediction and value prediction are similar, but they are really quite different fields.

When a branch prediction is wrong we have to change the execution path and therefore flush the executed instructions (on the wrong path).

With a value prediction the execution path is correct so you don't have to flush the instructions, but you have to replay them with the right value. These two techniques involve very different hardware.

Here the article is about value prediction emulated in software using branch prediction; since it uses the branch-prediction hardware, mispredicted instructions must be flushed.

The only valid example of value prediction you gave here is for prediction of the stack address for memory renaming. In all other cases you have to flush the instructions, while for memory renaming you can just replay it.

And this is a very restricted form of value prediction. There is no real complete value prediction mechanism in modern processors yet.


This works when list nodes are allocated consecutively, without other unrelated allocations intervening. If your case doesn't fit this constraint, value speculation doesn't seem to work. Any comments?


So the solution is to write assembly for an extremely common micro-architecture design. Maybe the compilers should provide more tools to exploit such parallelism. That asm block looks like a weird hack.



