Author here. I've been looking at the 386 if anyone has questions. This post was inspired by userbinator's discussion on HN a couple of weeks ago about how many transistors there are in the 386.
> ..."tapeout", when the chip data is sent on magnetic tape to the mask fabrication company.
That's roughly true in a temporal sense, but it's not where the term "tapeout" comes from. They could have shipped the data on a Winchester disk, and the event would still be called tapeout.
In the earlier days of printed circuit board (PCB) manufacturing, you would literally "tape out" your circuit manually with black tape on a white board, typically in an enlarged form.
"Tapeout" came to mean the point in time when you finished taping out your circuit and it was ready to be sent to be photographed and reduced and boards manufactured from the layout.
There wasn't even any "data" involved here, magnetic or otherwise. Just a physical art board with tape on it.
Great post. The most interesting thing to me was how historically significant the 386SL turned out to be. I had always mentally slotted it as a cheap cut-down part for the emerging laptop market, but it actually was a relatively sophisticated (3X the transistors!) precursor of the modern SoC.
What made the 386SL special was the introduction of System Management Mode (SMM). Intel sued AMD over the Am386's SMM implementation; AMD tried claiming it's not really SMM but just some leftover debugging ICE implementation :D https://ir.amd.com/sec-filings/content/0000898430-94-000804/...
Total amateur here: does a 386 have “cleverness” or optimizations or does it just quite literally chug through a stream of instructions, adjusting registers and memory?
I guess by this I am thinking about how newer processors do all kinds of stuff at the microcode level that mean you cannot anticipate precisely what instructions are being executed in what order.
Well, yes and no. The 386 is chugging through the stream of instructions sequentially, unlike modern processors. There are various clever optimizations, though. First, there is some pipelining, so the microcode for the next instruction can execute while the previous one is finishing. Second, the CPU has 16 bytes of prefetching, so instructions are fetched from memory asynchronously from their execution. So you know what instructions are executed and in what order, but the timing is fairly unpredictable.
I should mention that microcode in the 386 is pretty different from micro-ops that modern processors run. They both break machine instructions down into smaller steps. But microcode runs sequentially, while micro-ops are sort of tossed into the CPU and run independently through a dataflow engine, with everything sorted out at the end to look sequential. Confusingly, modern processors use "microcode" to hold the micro-ops for complicated instructions that can't be handled by the regular instruction decoder; this is sort of like old-style microcode, but not exactly.
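To make the microcode versus micro-op distinction a bit more concrete, here is a toy Python sketch. The step names, the dependency sets, and both function names are made up for illustration; neither a 386 nor a modern core is built this way, it only shows the ordering idea: strict program order versus "run when inputs are ready, then retire in order".

```python
# Toy model only: step names and dependencies are hypothetical.
PROGRAM = [
    {"name": "load_a", "needs": set()},
    {"name": "add_a", "needs": {"load_a"}},
    {"name": "load_b", "needs": set()},
    {"name": "add_b", "needs": {"load_b"}},
]


def run_sequential(steps):
    """Old-style microcode: steps execute strictly in program order."""
    return [step["name"] for step in steps]


def run_dataflow(uops):
    """Micro-op style: each op executes once its inputs are ready,
    then everything is retired in program order so it looks sequential."""
    done = set()
    executed = []
    pending = list(uops)
    while pending:
        # Everything whose inputs are already available runs this "cycle".
        ready = [u for u in pending if u["needs"] <= done]
        if not ready:
            raise ValueError("unsatisfiable dependency")
        for u in ready:
            executed.append(u["name"])
            done.add(u["name"])
        pending = [u for u in pending if u["name"] not in done]
    retired = [u["name"] for u in uops]  # visible order is program order
    return executed, retired


if __name__ == "__main__":
    print("sequential:", run_sequential(PROGRAM))
    executed, retired = run_dataflow(PROGRAM)
    print("executed:  ", executed)
    print("retired:   ", retired)
```

In the dataflow version both loads run before either add, yet the retired order matches the sequential one, which is the "sorted out at the end to look sequential" part.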
That's really cool: having learned most of my digital design from an era where interpreted microcode was the norm, and the PDP-11 or early 68k was the pinnacle of hardware design worth teaching, I've always wanted to learn about the transitional designs from old-style microcode, to how shrunk forms of microcoding lived in superscalar CISC designs. (I'm still trying for lack of resources.)
The 386 is actually somewhat slower clock for clock than the 286 for the same 16-bit code, so no, no optimizations. For example, see https://web.itu.edu.tr/kesgin/mul06/intel/instr/movs.html. The 486 was only slightly optimized; it was with the Pentium that they finally started working on IPC.
Right, the 486 was only slightly optimized for IPC. Most of the 486's performance over the 386DX was from its L1 cache (or just called CPU cache at the time, there weren't different cache levels yet.) You can turn the cache off in a 486's BIOS and it will be barely any faster than a 386 of the same clock speed.
One of the assertions of RISC is that microcode is a performance penalty, to be avoided.
And many have.
"To provide this rich set of instructions, CPUs used microcode to decode the user-visible instruction into a series of internal operations. This microcode represented perhaps 1⁄4 to 1⁄3 of the transistors of the overall design. If... the majority of these opcodes would never be used in practice, then this significant resource was being wasted."
Every time I see something new on righto.com, I get excited! I get to learn :D
Thank you so much, both for constantly sharing what you know and for preserving what's one of the most interesting times in the evolution of computing.
What I like most about these articles is that they show how messy even high-tech can be:
All the details in fabrication techniques (and its effect on what logic designers can/can't do), some opcodes removed to make room on the die for a bugfix (?, see note 26), etc etc. "Ultimately all digital electronics is analog".
The business mistakes, and lucky recoveries.
Keep up the good work, Ken! (btw you typo'd an "8" and "6" in note 25 :-)
Some DOI and Bitsavers links are broken (linking to righto.com or 404s). Also, where can I find "Automatic Place and Route Used on the 80386"? DDG only contains one result: this post.
There was some discussion of the automatic placement in this panel interview.
If I remember correctly, the software that performed the placement was written by a graduate student who debugged it from a terminal at his dormitory. It was one of many project decisions on the i386 that management would have absolutely stopped had they been made aware.
"386 is a complicated processor (by 1980s standards), with 285,000 transistors..."
Interesting that ARM1 was only 25,000 transistors. Did the i386 really have additional features that justify an order of magnitude?
One thing is certain in retrospect: Intel should have bought Acorn, not Olivetti.
Edit: Wow, there is even more detail on the placement software in the Righto article; the software was "Timberwolf" written by Dr. Carl Sechen.
Edit2: It appears that later versions of Timberwolf were sucked into Yale's licensing.
...Sechen served as an expert witness in the Cadence/Avanti trial in 2000 and 2001... "I had a chance to examine a great deal of the code in question when I was an expert witness on the trial. It was amazing – I even saw my own TimberWolf code in their tool, where only a single line of code had been changed. And I don’t mean the earlier, far-inferior version of TimberWolf available from Berkeley. The version I found in Avanti’s suite was a far more state-of-the-art version that had somehow been ‘acquired’ from Yale."
I wouldn’t know. Backwards compatibility certainly is high on the list because, when it was released, many users had fairly high investments in commercial software.
> Interesting that ARM1 was only 25,000 transistors. Did the i386 really have additional features that justify an order of magnitude?
Feature-wise, the ARM1 is probably more comparable to the Motorola 68000 from 1979: both have ~16 32-bit registers, no MMU and no instruction prefetch queue. The ARM1 does have a full 32-bit ALU compared to the 68000's 16-bit ALU, and a full 32-bit barrel shifter, compared to the 68000's 1-bit shifter.
But the 68000 is still 68,000 transistors (hence the name). So not only does the ARM1 achieve about the same level of functionality with under half the transistors, it can also execute instructions significantly faster.
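For anyone who wants the ratios spelled out, a quick Python check using only the transistor counts quoted in this thread (I haven't verified them independently; the `ratio` helper is just for illustration):

```python
# Transistor counts as quoted in this thread (not independently verified).
TRANSISTORS = {"ARM1": 25_000, "68000": 68_000, "386": 285_000}


def ratio(a, b):
    """How many times more transistors chip a has than chip b."""
    return TRANSISTORS[a] / TRANSISTORS[b]


if __name__ == "__main__":
    print(f"386 / ARM1:   {ratio('386', 'ARM1'):.1f}x")
    print(f"ARM1 / 68000: {ratio('ARM1', '68000'):.2f}x")
```

So "an order of magnitude" (386 vs. ARM1) and "under half" (ARM1 vs. 68000) both hold for these figures.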
>interesting that ARM1 was only 25,000 transistors. Did the i386 really have additional features that justify an order of magnitude?
you're asking somebody to answer RISC vs CISC in a subthread? there's no simple answer to that question, but the x86 family had a leg up with MSWindows compatibility and the processors they developed maintained that hegemony till they went astray with Itanium and were saved by amd64
I've been in physical design since 1997. Back then we were using Cadence Gate / Silicon Ensemble.
We tried the various Avanti tools. Aquarius, Astro, Apollo. I don't remember the exact order but all the space theme names are where the MilkyWay database comes from that is still in Synopsys IC Compiler.
I remember that when the court actually compared the source code they found that some of the comments had the same words misspelled in the same way. Two programmers may pick the same variable names but they aren't going to misspell words the same way in comments unless it was stolen code.
Thanks! I've fixed the links, so let me know if you see any other broken ones.
The "Automatic Place and Route Used on the 80386" article is from Intel Technology Journal, Spring 1986, p29-34. I don't think you can find it anywhere; Pat Gelsinger sent me a copy. Email me (ken.shirriff@gmail.com) and I'll send it to you.
Sadly Computerworld 'Intel backs off 80386 claims but denies chip recast needed (1986)' article was devoid of technical details :(
Do you have more detailed information about what the main issues were with multiple OSes in Protected Mode? Was it an implementation bug or a fundamental architectural problem?
Do you know by any chance why Intel missed the POPF trap in Protected Mode?
Pentium Virtual Mode Extension (VME) and its Protected Mode Virtual Interrupts (PVI) solve the performance burden of trapping, but despite being named _Protected Mode_ Virtual Interrupts, this works only in V86 mode, leaving Protected Mode with this bug:
"The protected-mode virtual-interrupt feature — enabled by setting CR4.PVI — affects the CLI and STI instructions in the same manner as the virtual-8086 mode extensions. POPF, however, is not affected by CR4.PVI"
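A rough Python model of the behaviour that quote describes, as I understand it. This is a sketch, not a reference: it models only the IF, VIF and VIP bits of EFLAGS, the function names and the exception class are made up, and everything else about the real instructions is ignored.

```python
# EFLAGS bit positions: IF is bit 9, VIF is bit 19, VIP is bit 20.
IF, VIF, VIP = 1 << 9, 1 << 19, 1 << 20


class GeneralProtectionFault(Exception):
    """Stand-in for #GP, which a monitor could trap and emulate."""


def cli(eflags, cpl, iopl, pvi):
    if cpl <= iopl:
        return eflags & ~IF           # privileged enough: really clears IF
    if pvi and cpl == 3:
        return eflags & ~VIF          # CR4.PVI: clears the virtual flag only
    raise GeneralProtectionFault("CLI")


def sti(eflags, cpl, iopl, pvi):
    if cpl <= iopl:
        return eflags | IF
    if pvi and cpl == 3:
        if eflags & VIP:
            raise GeneralProtectionFault("STI with VIP set")
        return eflags | VIF
    raise GeneralProtectionFault("STI")


def popf(eflags, popped, cpl, iopl):
    """Protected-mode POPF, modelling only IF. CR4.PVI plays no role here."""
    if cpl <= iopl:
        return (eflags & ~IF) | (popped & IF)
    return eflags                     # IF silently unchanged, no fault


if __name__ == "__main__":
    flags = IF | VIF
    after_cli = cli(flags, cpl=3, iopl=0, pvi=True)
    after_popf = popf(flags, popped=0, cpl=3, iopl=0)
    print("CLI cleared VIF: ", not after_cli & VIF)
    print("POPF cleared VIF:", not after_popf & VIF)
```

The point of the sketch is the last branch of `popf`: CLI and STI either fault or get redirected to VIF, while POPF at CPL > IOPL just drops the IF change without giving a monitor anything to trap.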
Great post! Any idea how the turbo button worked? It seems like, given the transistor difference between the 8086 and 386, that merely decreasing the clock frequency wouldn't be enough?
the lack-of-turbo button was included to enhance compatibility with older software that used timing tricks that relied on the clock speeds of the older generations of processors. Those older programs would not have used the enhanced features of the new processors, just the backward compatible features. Turbo was a motherboard OEM feature, not a 386 feature. Otherwise, there was no reason not to run "turbo".
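A minimal sketch of why those timing tricks break, in Python. Only the 4.77 MHz original IBM PC clock is a real figure; the turbo and non-turbo speeds, the cycles per iteration, and the loop count are all assumed for illustration.

```python
def delay_seconds(iterations, cycles_per_iteration, clock_hz):
    """Wall-clock time of a busy-wait loop with a fixed iteration count."""
    return iterations * cycles_per_iteration / clock_hz


ORIGINAL_HZ = 4_770_000     # 4.77 MHz, the original IBM PC clock
TURBO_OFF_HZ = 8_000_000    # hypothetical turbo-off speed
TURBO_HZ = 33_000_000       # hypothetical turbo-on speed
CYCLES_PER_ITERATION = 17   # assumed; in reality this also varies by CPU
ITERATIONS = 28_000         # assumed count tuned for roughly 0.1 s at 4.77 MHz

if __name__ == "__main__":
    for label, hz in [("original", ORIGINAL_HZ),
                      ("turbo off", TURBO_OFF_HZ),
                      ("turbo on", TURBO_HZ)]:
        t = delay_seconds(ITERATIONS, CYCLES_PER_ITERATION, hz)
        print(f"{label:>9}: {t * 1000:.1f} ms")
```

The sketch keeps cycles per iteration constant, which is the generous case. A newer CPU usually needs fewer cycles for the same loop as well, so dropping the clock only gets the delay closer to the original, not back to it.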
Looking at the die, I'd say roughly 1/3 of the die is standard cell. I think some of it was standard cell but with hand layout, rather than automatic place and route. About 1/4 is the datapath, which is highly optimized. Maybe 10% is the microcode ROM.
The SL die photo seems to really show the differences in density that careful layout can produce; one wouldn't think that bus/memory controllers are of the same complexity as the CPU itself, but due to being entirely standard cells, they are almost the same size as the CPU.
Ah, the days of cubicles. Trimble Nav in Sunnyvale had medium-height, solid fabric-backed 6x8' and 8x8' cubicles in 2000. And it was nice to be tucked in a quiet corner of the building by the foosball table room before there was such thing as startup culture. Was almost detained by SGI security by Shoreline Amphitheater (near the 'plex now) doing field radio testing off a coworker's truck that looked like Van Eck phreaking equipment. I should've worn a hi-vis vest. ;D
PS: Raise a paw if you remember the rainbow Apple logo on the triangle building along 280.
One of my greatest treasures as a young computer nerd was a bare 386 (I think) chip that I received after sending away for it via an Intel ad I found in Byte magazine. I just had to cut out part of the page and mail it in. Several months later I got a package back with the naked processor glued to stiff card, along with a low-powered magnifying glass to scope it out with.
I'm always interested in all things (80)386 because this was basically the processor that launched the 32-bit computing revolution, at least as far as popular adoption of computers based on this processor go (there were earlier 32-bit processors -- but no earlier processor became as commercially popular (or as adopted by the mass population) as much as the (80)386).
So this processor is of particular interest to me -- and should be to future computing historians...
This is truly a great article about that processor!
It's information rich (I have not seen a more information rich source about the 386 on the entire Internet, except for perhaps 386 technical manuals and manual fragments, but those documents lack general human readability), and it is of great value to anyone who wishes to study the 386, and it will be of great value to future computer historians...
Personally, I'd say that the IBM System/360 (1964) was the first widespread and influential 32-bit architecture. The Motorola 68000 (1979) also deserves a mention for its use in the Macintosh. (And I'll argue with anyone who says it wasn't a real 32-bit processor :-) But, yes, the 386 started the 32-bit x86 architecture on most (non-phone) computers today.
System/360 is indeed interesting because (if my memory serves me) it was one of the first computational architectures to implement microcode (also, if I recall correctly, the necessity of updating this microcode was one of the reasons that floppy drives were invented...)
I'm also a fan of the Motorola 68000 -- I used to have an Amiga 1000 "back in the day". I'd choose it over any 8 or 16 bit CPU of the time period, but (correct me if I am wrong) it didn't have an MMU -- which would have made it a less-than-ideal candidate for writing a modern-day Unix compatible operating system, although the authors of AmigaOS managed to pull off quite an impressive multitasking OS on it, despite this fact, nonetheless...
Also, as a 32-bit architecture (as opposed to discrete single-package IC CPU) we'd probably additionally want to remember the VAX 11/780 (1977) -- whose CPU was implemented as circuits of multiple simpler TTL IC's...
(Oh sure, IBM might have done something like that earlier -- but the VAX 11/780 brought down the cost of IBM's comparable computing offerings by at least one order of magnitude! -- Although, even so, the 11/780 still would have been ridiculously expensive to the average person of that time period... (and yes, I know it was intended for mid-sized to large businesses, not people! :-))
But anyway, great article, and yes, the IBM System/360 and 68000 were indeed groundbreaking!
My 20/40MHz had an LED display. When I opened the case, I found the display had jumpers to make every combination of LEDs possible, even non-numeric. A bag of jumpers was taped next to it. I'd put HI/LO on it, or 01/99, or make the turbo 20MHz and the slow mode 40.
i remember MWC coherent well! it was where i first learned unix. :)
(except for maybe some vague memories of some unix like os for the atari st that i didn't really understand at the time, but those memories are extremely vague)
It's interesting that, if the 386SX differed from the 386DX mostly in packaging rather than being much simpler electronically, there was never a "386DL" for cost-no-object laptops.
I guess its market window might have been too narrow. It seems like if you wanted the performance back then, you might even go without a battery, so using a desktop CPU without the power-management special tricks is no big deal.