A few questions I've had for a while: First, if a reasonably high performance pr...

Symmetry · on Oct 18, 2018

A processor small enough that it makes sense to put the integer and floating-point units close enough together to share a register bank will be too small to use register renaming. For anything larger, having separate clusters, each with its own register bank and its own bypass network, is a really big win.

As for the second question: if you have a variable-length instruction encoding, adding an extra operand is going to increase i-cache pressure. If not, you might as well do FMA4, but I think most fixed-length ISAs use FMA3.
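A rough sketch of the difference, assuming 32 architectural FP registers so each register specifier costs 5 bits. The function names and the list-as-register-file model are illustrative, not any real ISA's definition. FMA3 has to overwrite one of its sources, while FMA4 names a separate destination at the cost of one more register field.

```python
REG_BITS = 5  # assumption: 32 architectural FP registers -> 5-bit specifiers


def fma3(regs, dst, a, b):
    """Destructive 3-operand form: dst = a * b + dst (old dst is lost)."""
    regs[dst] = regs[a] * regs[b] + regs[dst]


def fma4(regs, dst, a, b, c):
    """Non-destructive 4-operand form: dst = a * b + c (all sources survive)."""
    regs[dst] = regs[a] * regs[b] + regs[c]


def operand_bits(num_operands, reg_bits=REG_BITS):
    """Bits an instruction spends on register specifiers."""
    return num_operands * reg_bits


regs = [0.0] * 32
regs[1], regs[2], regs[3] = 2.0, 3.0, 4.0

fma4(regs, 0, 1, 2, 3)  # r0 = 2*3 + 4, r3 is untouched
fma3(regs, 3, 1, 2)     # r3 = 2*3 + 4, old r3 is overwritten

print(regs[0], regs[3])                  # 10.0 10.0
print(operand_bits(3), operand_bits(4))  # 15 20
```

In a fixed-length encoding, those extra 5 bits come out of the opcode and immediate space inside the same instruction word rather than making the instruction longer.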

rayiner · on Oct 18, 2018

In a three-address machine, separating the integer and floating-point registers basically saves you three bits per instruction word (one per register operand) compared to a unified register file of the same aggregate size. Also, on a 32-bit machine, you save a few transistors by making the integer rename registers 32 bits wide instead of the 64 bits needed to hold a double-precision float. (And if you have vectors, it really makes no sense to throw away 128, 256, or 512 bits to store a 32-bit or 64-bit integer.)
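A quick check of the three-bit figure, assuming 32 integer plus 32 FP registers versus a unified 64-register file. The helper names and the 64-entry rename pool in the last line are hypothetical, chosen only to show the arithmetic.

```python
import math


def specifier_bits(num_regs):
    """Bits needed to name one register out of num_regs."""
    return math.ceil(math.log2(num_regs))


def register_field_bits(num_regs, operands=3):
    """Register-specifier bits in a three-address instruction."""
    return operands * specifier_bits(num_regs)


# Split files: the opcode already says int or fp, so 5 bits pick one of 32.
split = register_field_bits(32)
# Unified file with the same aggregate size: 6 bits pick one of 64.
unified = register_field_bits(64)
print(split, unified, unified - split)  # 15 18 3

# Storage for a hypothetical 64-entry integer rename pool, 32- vs 64-bit entries.
print(64 * 32, 64 * 64)  # 2048 4096
```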

KMag · on Oct 18, 2018

As I mentioned, though, there aren't many functions that use the full complement of both integer and fp registers, so I think the aggregate register file size is rarely a factor. A larger aggregate register file also makes context switches slower, since there is more state to save and restore.

As long as you defined consistent semantics for switching among integer, floating point, and vector use of the same logical register, there's nothing stopping one from using a 32-bit-wide integer register file, a 64-bit-wide fp register file, and a 512-bit-wide vector register file. From an ISA level, you could (for instance) define all operations as if they worked on vector registers. Your imul could always compute a 32-bit result and sign-extend it to 64 bits as the first vector element, and zero out all but the first element of the vector. You wouldn't actually store it that way, since the top 33 bits would always be identical for the results of integer operations (and subsequent vector elements would always be zero). So, from the outside, it would look like all operations worked on very wide registers, just that the vast majority of operations did very trivial things with most of the output bits in those wide registers. The sign extension and zeroing operations would actually only happen when moving values between the internal register files.
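A minimal model of that scheme, assuming a 512-bit architectural register viewed as eight 64-bit lanes. `IntResult` and `imul32` are hypothetical names. The integer file stores only 32 bits, and the sign extension and zero-filling happen only when the value is viewed through the wide architectural register.

```python
VECTOR_LANES = 8  # assumption: 512-bit architectural register = 8 x 64-bit lanes
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def sign_extend_32_to_64(value):
    """Treat the low 32 bits as signed; return the 64-bit two's-complement pattern."""
    value &= MASK32
    if value & (1 << 31):
        value -= 1 << 32
    return value & MASK64


class IntResult:
    """What the physical integer register file actually stores: 32 bits."""

    def __init__(self, value):
        self.bits32 = value & MASK32

    def architectural_view(self):
        """What software sees: sign-extended lane 0, every other lane zero."""
        return [sign_extend_32_to_64(self.bits32)] + [0] * (VECTOR_LANES - 1)


def imul32(a, b):
    """Hypothetical 32-bit multiply: keep only the low 32 bits of the product."""
    return IntResult(a * b)


r = imul32(-3, 5)
view = r.architectural_view()
print(hex(view[0]))  # 0xfffffffffffffff1
print(view[1:])      # [0, 0, 0, 0, 0, 0, 0]

# Bits 31..63 of lane 0 are all copies of bit 31, so they need not be stored.
top33 = view[0] >> 31
print(top33 in (0, (1 << 33) - 1))  # True
```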

Presumably, you'd use the same tricks used in other processors for actually tracking the amount of vector state that needs to be saved on context switches. You might re-use some of the same techniques for economizing the amount of vector state saved across function boundaries. Or, perhaps you'd define an ABI such that system calls preserve vector state, but all vector state beyond the first double is caller-saved state across function boundaries.
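One such trick is per-task dirty tracking: the hardware marks vector state dirty on any write, and the OS skips the save when nothing changed. The sketch below is loosely modeled on the Off/Initial/Clean/Dirty states of the RISC-V `mstatus.VS` field; the `Task` class, the 32 x 512-bit register file, and the byte counts are assumptions for illustration, not that spec's exact semantics.

```python
from enum import Enum


class VState(Enum):
    # Loosely modeled on RISC-V mstatus.VS; simplified for illustration.
    OFF = 0      # unit disabled: vector instructions trap
    INITIAL = 1  # registers still hold their initial values
    CLEAN = 2    # registers match the copy the OS last saved
    DIRTY = 3    # registers written since the last save


VREG_BYTES = 64  # assumption: 512-bit architectural vector registers


class Task:
    def __init__(self, name):
        self.name = name
        self.vstate = VState.INITIAL
        self.saved_vregs = None


def write_vector_reg(task):
    """Hardware side: any vector register write marks the state dirty."""
    task.vstate = VState.DIRTY


def context_switch_out(task, live_vregs):
    """OS side: save vector registers only if dirty; return bytes written."""
    if task.vstate is VState.DIRTY:
        task.saved_vregs = list(live_vregs)
        task.vstate = VState.CLEAN
        return len(live_vregs) * VREG_BYTES
    return 0


vregs = [0] * 32  # assumption: 32 architectural vector registers
scalar_task = Task("scalar")
vector_task = Task("vector")
write_vector_reg(vector_task)

print(context_switch_out(scalar_task, vregs))  # 0: never touched vectors
print(context_switch_out(vector_task, vregs))  # 2048: 32 registers x 64 bytes
print(context_switch_out(vector_task, vregs))  # 0: clean since the last save
```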

deepnotderp · on Oct 18, 2018

Integer and fp are indeed separate in many modern processors.

KMag · on Oct 18, 2018

Yes. My question wasn't whether processors have split register files. The question was why the split is exposed at the instruction level instead of being hidden away as an implementation detail. Register renaming hardware is very common in modern processor designs and would make it very easy to make split physical register files look like a unified architectural register file.
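A toy version of what the rename stage would do, assuming a unified 32-entry architectural file backed by separate 64-entry integer and FP physical files. `RenameMap` and its methods are hypothetical. Each write allocates in the file of the producing unit, and a read from the other cluster reports that a cross-file copy is needed.

```python
class RenameMap:
    """Unified architectural registers backed by split physical files."""

    def __init__(self, num_arch, int_phys, fp_phys):
        self.free = {"int": list(range(int_phys)), "fp": list(range(fp_phys))}
        self.table = {}  # arch reg -> (physical file, physical index)
        self.num_arch = num_arch

    def write(self, arch, unit):
        """Allocate a physical register in the file of the producing unit."""
        if not 0 <= arch < self.num_arch:
            raise ValueError(f"no architectural register r{arch}")
        old = self.table.get(arch)
        if old is not None:
            # Simplification: free the old mapping immediately; a real core
            # waits until the overwriting instruction retires.
            self.free[old[0]].append(old[1])
        phys = self.free[unit].pop(0)
        self.table[arch] = (unit, phys)
        return self.table[arch]

    def read(self, arch, unit):
        """Return the source mapping and whether a cross-file copy is needed."""
        file, phys = self.table[arch]
        return (file, phys), file != unit


rm = RenameMap(num_arch=32, int_phys=64, fp_phys=64)
print(rm.write(5, "int"))  # ('int', 0): an integer op writes r5
print(rm.read(5, "int"))   # (('int', 0), False)
print(rm.write(5, "fp"))   # ('fp', 0): an FP op now writes r5
print(rm.read(5, "int"))   # (('fp', 0), True): needs a cross-file move
```

Real rename stages also recycle physical registers at retirement and checkpoint the map for branch recovery, which this sketch omits.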

I did a bit more reading, and both the IBM Cell and the Adapteva Epiphany processors expose unified architectural register sets at the instruction level.

Many processors these days already contain the hardware to hide this away as an implementation detail, giving more design flexibility to the hardware designers. Furthermore, the processors that don't have register renaming hardware are likely to be small embedded processors that would benefit from not having split register files.

Symmetry · on Oct 18, 2018

By exposing the split at the ISA level, you save bits in every instruction, because which register file an operand lives in is implied by the instruction type.

KMag · on Oct 20, 2018

That's only true if you replace, say, 32 integer and 32 fp registers with 64 registers. As I mentioned, very few functions require both a large number of fp and a large number of integer registers.

Symmetry · on Oct 22, 2018

You were talking about high-end processors with register renaming, though, right? At that point you have things like L2 caches that take up far more transistors than the register file. And with one register file, the space near it is going to be at a premium as you try to squeeze both the integer and floating-point execution units next to it. With separate clusters, you can surround your integer register file with the integer bypass network and integer execution resources, and surround your floating-point register file with your floating-point execution units and bypass network. It's mostly the same reason processors have split L1 data and instruction caches: you want to put the cache near the structures that use it.