A Journey Through the CPU Pipeline (2013) (gamedev.net)
71 points by nkurz on Oct 29, 2015 | 14 comments



I understand some aspects of modern processors fairly well, but there are still some simple aspects of branch prediction and speculative execution that leave me bewildered. Perhaps someone here can explain to me what actually happens in a specific case for Haswell or some other recent processor?

Assume I have a list of id's, and each id is the index into array of strings. The strings are 0-terminated, and of uniform random length from 1-32 bytes. I want to create an output buffer that consists of the concatenated strings corresponding to the id's in the list.

The pseudo C code would look like this:

  char *stringList[NUM_STRINGS];    // strings of length 1-32
  int idList[LIST_LEN];             // initialized to rand() % NUM_STRINGS
  char *output = outputBuffer;      // points into the output buffer described above
  for (int i = 0; i < LIST_LEN; i++) {
    int id = idList[i];             // id from existing list
    char *string = stringList[id];  // use id to get string
    while (*output++ = *string++);  // copy string to output
  }
Assume that stringList is large enough that the strings are not in cache, and thus there will be a ~300 cycle latency after each is loaded. Presumably, the "copy" loop will be speculatively executed during this latency. But since the string lengths are random, the termination for this loop can't be well predicted, and since the data for the copy isn't there yet, I'm not sure exactly what happens.

Does the processor predict that the string copy loop will never terminate, and keep buffering string copy instructions until the buffers are full? Which buffers? Does it stop speculating at some maximum number of branches? If so, what is this depth? Or does it manage to guess that the inner string copy loop terminates after some reasonable number of iterations so it can issue the outer loop load for the next stringList[id]? If so, how many such loads will it manage to issue?


Completely random from 1-32 -> a first-order branch predictor determines that the branch is usually taken, and generally predicts it as such. There is likely a second-level predictor that notices that after 16 taken branches it becomes more likely to be not taken, so on average it'll predict loops of length 16.

The most relevant buffers are (sizes for Haswell):

- 192 µop reorder buffer (ROB)

- 42 entry store buffer

- 72 entry load buffer

So assuming an average length of 16, the processor should issue far enough ahead to keep ~3 outstanding strings being loaded from memory simultaneously (3 strings * ~16 byte stores = 48, larger than the 42-entry store buffer). Then when a branch mispredict occurs, the recovery is quick since the needed cachelines are likely already in L1 (unless the string is longer than average and crosses another cacheline).

If you wanted to speed this up, you could prefetch several (~19) strings ahead in the list. (300 cycles of latency / 16 stores per string at 1 store per cycle = 18.75)
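A minimal sketch of that idea, assuming GCC/Clang's __builtin_prefetch and an illustrative PREFETCH_DIST constant (neither appears in the original snippet):

  #define PREFETCH_DIST 19  // ~300 cycles of latency / ~16 stores per string

  for (int i = 0; i < LIST_LEN; i++) {
    if (i + PREFETCH_DIST < LIST_LEN) {
      // Start fetching the string ~19 entries ahead so its cache miss
      // overlaps with the copies being done now.
      __builtin_prefetch(stringList[idList[i + PREFETCH_DIST]]);
    }
    int id = idList[i];
    char *string = stringList[id];
    while (*output++ = *string++);  // copy string to output
  }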


Thanks! Would this mean that if the strings were a constant length longer than 42 (or possibly longer than the longest predictable loop, which Agner says is "32 or a little more"), the 42-entry store buffer would be the limiting factor and the next outer load would only be issued after the string begins to be copied? Or does some limit on the number of outstanding speculative branches kick in first?

Also, can you point to any further documentation on the behavior of the second level predictor? Most things I've seen suggest that it matches only exact patterns, and thus can only predict patterns that happen at least twice in a row. I haven't been able to find anything about averaging.


Hm, I assume that with longer strings, such that the branch is always predicted taken, it indeed wouldn't speculate into the next string until the mispredict is discovered, which won't happen until the data gets into registers. If, on the other hand, you had Pascal-style strings with a known length up front instead of null-terminated strings, then the mispredict could be resolved independently of loading every byte of the string, and it could have outstanding loads for two strings simultaneously. But only for strings that span more than one cacheline.
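A rough sketch of that Pascal-style variant, with an assumed length-prefixed layout (the struct, pStringList, and memcpy from <string.h> are illustrative, not from the original code):

  struct PString {
    unsigned char len;  // length stored up front, no scan for '\0' needed
    char data[32];
  };

  struct PString *s = pStringList[id];
  // The copy length depends only on s->len, so the branch out of the copy
  // can resolve without waiting for every byte of the string to arrive
  // from memory.
  memcpy(output, s->data, s->len);
  output += s->len;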

The only limit for speculation is the OoOE buffers. I abbreviated it, but the inner loop is a minimum of 5 µops, so you actually end up hitting the ROB limit at 39 iterations before the store buffer is full. Though some architectures could reduce it to 4 with a compare+branch instruction. (x86 fuses this sequence to one µop but I'm not sure if it reduces the ROB entries)

If you don't hit the ROB limit, then even after you hit the store buffer limit, the load buffer can take some loads. So assuming 4 µops, 42 stores and 48 loads would be actively buffered while their cache misses are resolved.

There's no public documentation on what modern CPUs actually do for branch prediction; that's why I said "likely" - I'm not claiming that any specific CPU does precisely what I described. Just that it's likely to be better than "predict the branch as always taken." The more general "hash pc history as input to branch prediction" is a better way to look at it. How it's hashed and used is a trade secret.

Actually, thinking about it again, for truly random data the only prediction improvement over "always taken" is predicting the 32nd consecutive inner branch as not taken. Predicting not taken before that increases the mispredict rate...


If it's the first time the CPU has seen this code, then the predictor has nothing to go off of, but it will learn "taken" pretty quickly when it sees the `while` loop continue to iterate.

However, branch predictors also learn based on branch-history context, and sometimes with reasonably long history lengths (see Agner Fog's guide for some numbers), so it's possible that on the second time into this snippet, the predictor might actually get the lengths right for each entry in `stringList`. It really depends on the specific chip.

(There have been some research proposals for "branch confidence predictors" that would halt fetch until the conditional branch is resolved in the backend, for efficiency reasons, but I don't think anything like this has actually been built. See e.g. [1].)

[1] http://people.engr.ncsu.edu/ericro/publications/conference_M...


Backwards jumps are generally predicted as taken in the absence of history information or a compiler-provided hint.


Those hints are generally ignored by the CPU anyways.


Are you referring to Intel's recent change to start ignoring the hint prefix opcodes? I thought at least ARM and PowerPC still honor them in latest chips (and have them as bit flags in branch instruction encoding, instead of verbose prefix bytes).


> Does the processor predict that the string copy loop will never terminate, and keep buffering string copy instructions until the buffers are full?

Pretty much. "Textbook" branch prediction isn't really dependent on the actual data, but on the global and local branch history. If it sees a loop is taken 9/10 times and there's no particular pattern, guessing "taken" is a pretty good idea.
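As a toy illustration of that textbook scheme, here's a minimal sketch of a 2-bit saturating-counter predictor indexed by PC (real predictors also mix in global/local history, and the table size here is made up):

  // Counter values 0-1 predict not-taken, 2-3 predict taken.
  static unsigned char counters[4096];

  int predict(unsigned pc) {
    return counters[pc % 4096] >= 2;  // nonzero means "predict taken"
  }

  void update(unsigned pc, int taken) {
    unsigned char *c = &counters[pc % 4096];
    if (taken  && *c < 3) (*c)++;     // saturate upward on taken
    if (!taken && *c > 0) (*c)--;     // saturate downward on not-taken
  }

For the copy loop above, such a counter quickly saturates at "taken", so the only mispredict per string is the final not-taken exit branch.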

> Which buffers?

If you're assuming the strings are not cached, probably the load-store unit (LSU), which keeps track of outstanding loads and stores.

> Does it stop speculating at some maximum number of branches?

Not that I'm aware of. With out-of-order execution you'd use something like Tomasulo's algorithm, which won't stop speculating until the reorder buffer (ROB) is full. In this case, the ROB is usually much larger than the LSU, so I'd think that would be the limiting factor.


Great article.

A similar, but much more in-depth, look at the GPU pipeline is available here [0] for those interested.

[0]: https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...


Memory ordering can get complicated. Here's further discussion from Intel's Software Developer's Manual:

Section 8.2:

http://www.intel.com/content/dam/www/public/us/en/documents/...


Cache management is also quite complicated. I was really surprised to read: https://www.reddit.com/r/programming/comments/3p9idq/false_s... http://stackoverflow.com/questions/29199779/false-sharing-an... Apparently a 'false sharing'-like effect can happen with the cache line next to the one you're using. Huh??
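For context, a minimal sketch of the usual mitigation, assuming 64-byte lines; the adjacent-line effect discussed in those links is why some code pads to 128 bytes rather than 64:

  // One counter per thread. Without padding, neighbouring counters share a
  // cache line, and every write invalidates the other core's copy.
  struct PaddedCounter {
    long value;
    // Pad out to 128 bytes: the 64-byte line itself plus the adjacent line
    // that the spatial prefetcher may pull in alongside it.
    char pad[128 - sizeof(long)];
  };

  struct PaddedCounter counters[NUM_THREADS];  // NUM_THREADS is illustrative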


> Those workflows are relatively rare; Hyper-Threading usually provides consumers with approximately double the speed they would see for their everyday computer tasks.

Not really, IIRC it's more like 30-50%.


Section "An Example" in the middle of the page is an original description of the pipeline, from the point of view of an instruction.



