Revisiting the Intel 432 (2008) (dtrace.org)
85 points by jsnell on April 28, 2015 | 35 comments



The Colwell paper really is excellent. And given the feature sizes of chips today, it would be fascinating to see a 432 implemented as envisioned, rather than as was possible given the transistor counts of the day. It was going to be the microprocessor version of the Multics system, and much of what it imagined doing in hardware (capabilities) would make for secure environments that you could reason about more effectively. It would probably make for a great FPGA project now.



Thank you! Those are awesome. Downloaded all the papers to my iPad for perusal.


This is the newest (and my favourite) paper http://www.csl.sri.com/users/neumann/2015oak.pdf

Another chip that better supports privilege separation (though without capability-based addressing) is the Mill. (Disclosure: I'm on the Mill team.)


lowRISC (an open SoC effort for the RISC-V architecture) has a form of tagged memory that, among other things, can be used for capabilities.

[0] http://www.lowrisc.org/docs/memo-2014-001-tagged-memory-and-... [1] http://www.lowrisc.org/blog/2015/04/lowrisc-tagged-memory-pr...


Since learning of it, I have always been disappointed that I was not able to write and run software for a Burroughs machine running the MCP

http://en.wikipedia.org/wiki/Burroughs_MCP

"the first operating system to manage multiple processors, the first commercial implementation of virtual memory, and the first OS written exclusively in a high-level language."

using the Work Flow Language

http://en.wikipedia.org/wiki/Work_Flow_Language

Like the success the 432 could have been, I feel the Burroughs architecture was prematurely abandoned; at modern speeds it would have plenty to offer.


Burroughs MCP still exists, so I am not sure how true your statement that it was "prematurely abandoned" is. Burroughs became (through a series of M&As) Unisys, and Unisys still supports MCP and updates it with new versions. It runs on Unisys ClearPath mainframes. Unisys has moved away from its distinctive physical hardware to software emulation on an x86 platform; x86 has improved so dramatically that, even with the emulation overhead, it is still faster than the old physical mainframe CPUs.

There is also an emulator which runs an old (1970s) version of MCP, not the current (2010s) version - http://www.phkimpel.us/B5500/


Thank you; a response like yours was partly what I was hoping to elicit by saying that.


(As an old time B6700 hack ...) I think there were two main architectural problems with the Burroughs architecture:

- it depended on compilers making safe code for both its system security and system integrity - this meant that you couldn't write your own compiler without being the equivalent of root, and you couldn't test it without potentially taking down the system - there was no hard memory protection between processes

- memory descriptors contained both lengths and pointers - that limited memory size to 2^((wordsize-overhead)/2) words - in the Burroughs' case that was 6MB (6-byte words): huge at the time, tiny today
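
A rough version of that arithmetic (the overhead figure is assumed for illustration, not the actual B6700 descriptor layout):

    #include <stdio.h>

    int main(void) {
        int word_bits = 48;                   /* 6-byte word */
        int overhead_bits = 8;                /* tag/flag bits (assumed) */
        int field_bits = (word_bits - overhead_bits) / 2;  /* ~20 bits each */
        long max_words = 1L << field_bits;                 /* 2^20 words */
        printf("%ld words x 6 bytes = %ld MB\n",
               max_words, max_words * 6 / (1024 * 1024));  /* 6 MB */
        return 0;
    }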

The 432 had its own issues - I think it designed itself into a corner at a time when memory speeds were very low, so tight instruction coding was important and microcode was where you got your speed. The problem with that, of course, is that buggy microcode is hard to fix. Cheaper/faster memory made the RISC revolution possible, and now we look at different bottlenecks.


Oh, and WFL was nothing much to write home about - it was essentially the shell scripting language - most people coded in Algol or Fortran (or, for system stuff, ESPOL, which was just the Algol compiler with a few extra system-related features enabled).


This post could have its publishing year 2008 included in the HN title.

Seven years later, we have some perspective on the author's prediction:

"Indeed, like an apparition from beyond the grave, the Intel 432 story should serve as a chilling warning to those working on transactional memory today."

Intel's transactional memory implementation TSX was famously broken in its initial incarnations in Haswell/Broadwell. [1]

[1] http://en.wikipedia.org/wiki/Transactional_Synchronization_E...


And a few months after that, the gloves really came off.[1]

[1] http://dtrace.org/blogs/bmc/2008/11/03/concurrencys-shysters...


Thanks!

The link to the article in ACM Queue (October 24, 2008), which is a dead link on that page at the moment, is now:

http://queue.acm.org/detail.cfm?id=1454462

"Real-world Concurrency, Bryan Cantrill and Jeff Bonwick, Sun Microsystems"

It's sad that ACM is not able to redirect their old links properly.

The other article mentioned is probably:

http://queue.acm.org/detail.cfm?id=1454466

"Software Transactional Memory: Why is it only a Research Toy?"

I'm actually searching for the analysis of Intel's implementation.


Argh -- my apologies for the dead links! I have updated all of them, with my apologies again for the apparent inability of the ACM to honor old links.


(There is a distinction to be made between STM and HTM)


Are there any parallels worth observing between the Intel 432 history and the Itanium history?

I was always enthusiastic about Itanium when it was announced and sad when it didn't unseat the x86.


I think of Itanium as the i860¹ redux. Andy Glew et al had some interesting discussion on comp.sys.arch² about MPX³ as an emasculated descendant of a capability system.

¹ http://en.wikipedia.org/wiki/Intel_i860

² which this www is too small to contain

³ http://en.wikipedia.org/wiki/Intel_MPX


I immediately made the connection, too. Itanium pushed a lot of complexity up into the compiler. The performance gains were supposed to come from the Itanium's wide-issue instructions, where one would issue multiple instructions at once. But that requires the compiler to do the necessary analysis to know which instructions can be executed at the same time. Typical out-of-order processors figure this out at runtime. My understanding is that in practice, the Itanium compiler was not able to do this well enough to justify the architectural decisions.

That, though, is still far less radical than the design presented in this article.
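
To make that concrete, a rough C-level sketch of the problem (not Itanium code, and not a claim about what any particular compiler actually did):

    /* In f(), the four updates are independent (restrict promises no aliasing),
       so a compiler can pack them into one wide issue group at build time.
       In g(), every load depends on the previous one (pointer chasing), so
       there is nothing to pack, and only runtime behaviour such as cache hits
       and actual addresses decides the pace, which is exactly the information
       an out-of-order core gets to exploit. */
    struct node { struct node *next; int value; };

    void f(int *restrict a, int *restrict b, int *restrict c, int *restrict d) {
        *a += 1;   /* independent ops: statically schedulable */
        *b += 2;
        *c += 3;
        *d += 4;
    }

    int g(const struct node *n) {
        int sum = 0;
        while (n) {            /* dependent chain: hard to schedule statically */
            sum += n->value;
            n = n->next;
        }
        return sum;
    }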


I want to call this text out, because I nearly did a spit-take when I saw it, and someone skimming the post may miss it:

"The mortally wounded features included a data cache (!), an instruction cache (!!) and registers (!!!). Yes, you read correctly: this machine had no data cache, no instruction cache and no registers — it was exclusively memory-memory."

No. Registers. I would love to know what discussions they had, and what arguments were made, to come to that decision. That point alone has made me re-open the paper to take a closer look.


It was a stack machine. No user visible registers.

One of the neat things about stack machines is that you can quietly add registers to a machine in the background without having to recompile code.

I think this would have been a huge advantage over time: being able to add registers and performance without having to recompile code. It took us decades to add a few new registers to the x86 architecture.

Just imagine if every single (286, 386, 486, pentium, pII, etc) generation had more registers and they were automatically used by software.

Pretty neat imho.

IIRC it was also a tagged architecture. So they could define a generic instruction set and fall back to a software implementation for many ops, adding hardware implementations at leisure.

Executive summary: they could have done it right first, then made it fast, rather than making it fast and then trying to clean up the technical debt later.
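
To illustrate the no-user-visible-registers point, here is a toy stack-machine interpreter (purely illustrative; it has nothing to do with the 432's actual instruction set):

    /* Toy stack machine: the "binary" below never names a register, so an
       implementation with more registers can cache the top of the stack in
       them without any recompilation. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        int stack[64];
        int sp = 0;                              /* stack pointer */
        for (;;) {
            switch (*code++) {
            case OP_PUSH:  stack[sp++] = *code++;            break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);    break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,    /* (2 + 3) */
                          OP_PUSH, 4, OP_MUL,                /*  ... * 4 */
                          OP_PRINT, OP_HALT };               /* prints 20 */
        run(program);
        return 0;
    }

Because the code only ever says push/add/print, whether the top few stack slots live in memory or in hidden registers is the implementation's business, which is exactly why the register count could grow per generation without recompiling.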


>Just imagine if every single (286, 386, 486, pentium, pII, etc) generation had more registers and they were automatically used by software.

This is, in fact, exactly what modern processors do. They have upwards of a hundred registers internally.


Although because the x86 only has a handful of architectural registers, it has to spend quite a bit of area and heat working out how to make use of them all.

The SPARC's register-window design was rather more elegant: provide up to 520 architectural registers, arranged in a stack, and shuffle between physical registers and memory as needed:

http://ieng9.ucsd.edu/~cs30x/sparcstack.html

Unfortunately, it seems it didn't actually work very well!


I had a professor in college bag on those, but he never really explained why. Does anybody know?


Oh where to start.

* Primarily, compiler technology leapfrogged it, to the point where you can do at _least_ as well with a fixed set of registers and a global allocator.

* The windows were inspired by SPURS (IIRC), which allowed a much finer granularity, whereas SPARC's window is always exactly 16 registers (8 are global, and 8 overlap with the next or previous window).

* Windows turned out to be a real PITA for superscalar implementations.

* Windows assume a constrained model of computation, making efficient tail recursion hard and co-routines impossible.

etc etc

Give me more time and I could make the list longer, but the crux is that it's another example of a misguided, short-sighted optimization (like branch delay slots, shared with many RISCs).


SOAR, not SPURS.

* Window overflow and underflow are complicated for the OS to handle correctly.

* The fixed window size means a tax for deeply recursive functions that don't need the 16 registers.
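
A toy model of where that overflow/underflow pain comes from (this is only the shape of the problem, not real SPARC trap handling):

    /* Toy model, not real SPARC: a register file carved into fixed-size
       windows. A deep call chain runs out of windows, so the oldest one has
       to be spilled to memory (a trap the OS must handle), and returning far
       enough back forces a refill. */
    #include <stdio.h>
    #include <string.h>

    #define NWINDOWS 4
    #define WINSIZE  16
    #define WIN(d)   ((d) % NWINDOWS)

    static int regfile[NWINDOWS][WINSIZE];
    static int spill_area[64][WINSIZE];   /* backing store in "memory" */
    static int depth = 0;                 /* current call depth          */
    static int resident = 1;              /* windows still valid on chip */

    static void reg_save(void) {          /* analogue of a call's 'save' */
        if (resident == NWINDOWS) {
            int oldest = depth - (NWINDOWS - 1);
            memcpy(spill_area[oldest], regfile[WIN(oldest)], sizeof regfile[0]);
            printf("overflow trap: spilled window of depth %d\n", oldest);
            resident--;
        }
        depth++;
        resident++;
    }

    static void reg_restore(void) {       /* analogue of a return's 'restore' */
        depth--;
        resident--;
        if (resident == 0) {
            memcpy(regfile[WIN(depth)], spill_area[depth], sizeof regfile[0]);
            printf("underflow trap: refilled window of depth %d\n", depth);
            resident = 1;
        }
    }

    int main(void) {
        for (int i = 0; i < 6; i++) reg_save();      /* call 6 levels deep */
        for (int i = 0; i < 6; i++) reg_restore();   /* unwind back out    */
        return 0;
    }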


I really liked the idea of branch delay slots too :(.


I can promise you wouldn't once you've tried going beyond the simplest possible single-issue pipeline. Thankfully branch prediction made them sort of pointless. RISC-V and Alpha are two of the better RISC ISAs in this world, and they don't have them. Read the RISC-V ISA spec's footnotes [1] for excellent rationales for its design decisions.

[1] http://riscv.org/download.html#tab_isaspec


I recall it was a microprogrammed processor - a set of chips, including a ROM, 'executed' the instruction set. More of an emulator than a processor. So maybe there was nowhere in that design for a cache?


Possibly - the design started back in 1975, which is when we were still figuring out a lot of how processors should be designed. That's also far back enough that, frankly, I don't have much of an intuition for what people were thinking.

By the way, I am struck by how much their object model sounds like the JVM's. A major difference, though, is that the JVM can do some extra work at runtime to figure out when it is profitable and safe to throw away the overhead caused by safety and isolation - JITing. Then it can just execute instructions optimized for the hardware, not constrained by the object model. When your hardware maps to your object model, you can't do that.


The IBM 801 project started around the same time but its proto-RISC approach was kind of the anti-432. The 801 papers give a pretty readable explanation for "why we are proposing the opposite of what everyone else is doing".


Do you have links for any of these? I love reading historical rationale papers.


I don't. I asked my brother but he just reminisced:

"I was at Intel as that project ramped up. One of my friends moved to Portland to work on it. I tried to move, but they didn't want any system software people. If I had, my life would have gone a totally different path...

Rich"


http://en.wikipedia.org/wiki/Intel_iAPX_432

Intel's first 32-bit processor. It tried to get away from the 8008 and 8080 and be like the 8800. It was more like a micromainframe in design: a stack machine with no registers.


"every function was implemented in its own environment, meaning that every function was in its own context, and that every function call was therefore a context switch!. As Colwell explains, this software failing was the greatest single inhibitor to performance, costing some 25-35 percent on the benchmarks that he examined."

It's too bad because this is the future of computing. Or more precisely - rather than having a "process" that owns child data and functions, future processors will default to every function running in its own isolated environment, defaulting to no shared state. There will be a more advanced copy-on-write permissions model that determines what's shared (if anything), more like shell processes or channels in languages like Go. To have that kind of scalability (with respect to future goals like multiprocessing) and security at only a 35% performance cost would certainly have been compelling, especially in the 1970s.

It's too bad that they didn't understand the physical limitations that make registers and caching basically a necessity, because it might have saved them. These things are only optimizations, so with today's available chip area and place and route techniques, they could be wedged into a conceptually correct architecture without much trouble.

In short, I would take these findings with a grain of salt and consider how technology has advanced to the point where yesterday's blunders might be tomorrow's breakthroughs.
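
As a rough sketch of the every-function-in-its-own-isolated-environment model using today's primitives (plain POSIX processes and a pipe; the helper names here are made up, not anything from the article):

    /* Each "call" runs in its own forked process: a copy-on-write snapshot of
       memory, no shared state afterwards, and the result comes back over a
       pipe. The names (call_isolated, square) are made up for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int square(int x) { return x * x; }   /* the function being isolated */

    static int call_isolated(int (*fn)(int), int arg) {
        int fds[2], result = 0;
        if (pipe(fds) != 0) exit(1);
        pid_t pid = fork();
        if (pid == 0) {                          /* child: isolated address space */
            int r = fn(arg);
            write(fds[1], &r, sizeof r);
            _exit(0);
        }
        read(fds[0], &result, sizeof result);    /* parent: receive the result */
        waitpid(pid, NULL, 0);
        close(fds[0]);
        close(fds[1]);
        return result;
    }

    int main(void) {
        printf("%d\n", call_isolated(square, 7));   /* prints 49 */
        return 0;
    }

The catch, of course, is that a fork costs vastly more than a function call, so doing this per call only makes sense if the hardware (or something like the 432's environments) makes the isolation cheap.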


Reading about the 432, I didn't expect:

"Instructions are bit variable covering a range from 6 up to and beyond 300 bits in length using Huffman encoding (I, 171)."

Huffman compressed instructions. Wow.

http://www.brouhaha.com/~eric/retrocomputing/intel/iapx432/c...
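
A toy illustration of what bit-granular, prefix-coded opcodes mean for the decoder (just the general idea, nothing like the 432's real code tables):

    /* Toy prefix code: 0 -> ADD, 10 -> LOAD, 11 -> CALL. Opcodes are pulled
       out of the stream one bit at a time, so instructions need not start on
       byte (or even nibble) boundaries. */
    #include <stdint.h>
    #include <stdio.h>

    static const uint8_t stream[] = { 0xB4 };   /* bits: 10 11 0 10 ... */
    static unsigned bitpos = 0;

    static int next_bit(void) {
        int bit = (stream[bitpos / 8] >> (7 - bitpos % 8)) & 1;
        bitpos++;
        return bit;
    }

    static const char *decode_op(void) {
        if (next_bit() == 0) return "ADD";          /* short, common op  */
        return next_bit() == 0 ? "LOAD" : "CALL";   /* longer, rarer ops */
    }

    int main(void) {
        for (int i = 0; i < 4; i++)
            printf("%s\n", decode_op());            /* LOAD CALL ADD LOAD */
        return 0;
    }

With instructions anywhere from 6 to 300+ bits, you cannot even locate the next instruction's boundary without decoding the current one, which makes fast fetch and decode hardware painful.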



