The Intel 8088 processor's instruction prefetch circuitry: a look inside (righto.com)
185 points by matt_d 10 months ago | 43 comments



A 4116 DRAM chip with a 250 ns access time would have a 500 ns cycle time, so it could read or write 2 million words per second. The Acorn people called this a 2 MHz memory. You could get shorter cycle times in "column mode", where you pulsed only the /CAS signal to access words in the same row, and the Acorn designers made good use of that for the Electron (reading 8 bits from a 4-bit-wide memory) and the ARM-based Archimedes (fetching instructions other than jump, load or store, as well as loading the video and audio FIFOs).

From one DRAM generation to the next the cycle time might be different for chips with the same access time. This allowed me to make the 512KB Macintosh faster than the 128KB one (http://www.merlintec.com/lsi/mac512.html).


You could also have shorter cycle times by just doing it and assuming it'll be fine. Acorn's Master Turbo uses 4864-2 DRAMs, which have a 150 ns access time/270 ns cycle time, for, I think, a max throughput of 1e9/270 = 3,703,703 bytes/sec - but the Master Turbo's 65c02 CPU runs at 4 MHz, and therefore requires 4,000,000 bytes/sec. And, somehow, it works...
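
If anyone wants to sanity-check the arithmetic, here is a quick sketch in Python (figures as quoted in these comments, not re-verified against datasheets):

    # One access per DRAM cycle -> bytes per second.
    def throughput_bytes_per_sec(cycle_ns):
        return 1e9 / cycle_ns

    print(throughput_bytes_per_sec(500))  # 4116, 500 ns cycle   -> 2.0e6 ("2 MHz memory")
    print(throughput_bytes_per_sec(270))  # 4864-2, 270 ns cycle -> ~3.70e6
    print(throughput_bytes_per_sec(250))  # what a 4 MHz 65c02 needs: 4.0e6, i.e. a
                                          # 250 ns cycle, so 270 ns is running out of spec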


That is probably similar to how you could use both sides of a single-sided floppy. Both sides were tested, and the factory ended up with a huge pile of floppies that passed on both sides and a small pile that failed on one side (and some that failed both sides and were discarded). The market wanted more single-sided disks, so they simply moved a number of floppies from the first pile to the second. That meant that you most likely got a double-sided product when you paid for a single-sided one. But it was a risk.

Same thing with processor and memory speed bins.


That's a fantastic read! Thanks for sharing. What are you working on these days?


Thanks! I still design Smalltalk computers like I did back then but now I also design my own processors.


> The 8086/8088 do not provide consistency ... self-modifying code can be used to determine the queue length, distinguishing the 8086 from the 8088 in software.

I sincerely doubt I was the first to work that out, but I remember being so incredibly happy when I figured it out and it solved a problem I had.

I can't now recall why the difference was significant; something about installing different routines for bashing serial ports, I think.


In the era before internet, you either figured things out on your own or you couldn't get anything done.

If you were very lucky, some magazine might have mentioned it. Another way out was to just use a disassembler if some other software package did the same thing.


You still do on occasion. Some years ago I tried to write a precisely timed bitbanging loop on an STM8 microcontroller (a 650x/680x-alike with an allegedly 3-stage pipeline and a small prefetch buffer; cents per chip pre-Covid), and the instruction prefetch completely screwed me up. (At least I think it was the instruction prefetch? The thing depended on branch target alignment, whatever it was.) The one relevant question on Stack Overflow hasn't received any answers in years; the manufacturer documentation mentions its instruction timings are "simplified", aka a lie, and gives a more elaborate model of the pipeline that it admits is also a lie and can't reproduce the timings I'm seeing.

Many things are immeasurably easier than what I remember as a middle-schooler with an utterly anachronistic 286 in post-Soviet early 2000s Moscow, so that’s nice. It doesn’t make the blasted loop work, though.

(Many others are also worse. Today's me could work out the motherboard design of the 286 just by looking at it, even without the manuals; my current laptop manufacturer's refusal to release the schematics annoys me enough that I've half a mind to ask some physicists if they have a CT machine they could run the board through.)


Compared to now, back then there was a lot more you had to figure out on your own.

I remember coding a game on the C64 in the eighties. Just figuring out how to print the player's score fast enough was a challenge. Dividing by 10 with modulo to convert numbers to digits was just way too slow.

My method was not to use normal math, but to directly manipulate screen RAM characters when the score increased.

That was a very cheap way to increase the player's score by, say, 1000 – you didn't even have to care about the 3 lowest digits; just increment the thousands place by 1, and if it overflowed past 9, increase the next position to the left, etc.
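
A little Python sketch of that idea, for the curious - keep the score as on-screen digit characters and bump only the affected digit, carrying left, rather than converting a binary score each time (ord('0') stands in for the C64 screen code of '0'; the real thing pokes screen RAM directly):

    ZERO = ord('0')                     # stand-in for the screen code of '0'
    score = [ZERO] * 6                  # six on-screen digit cells: "000000"

    def add_thousands(digits, n=1):
        # Add n*1000 by touching only the thousands digit and carrying left.
        pos = len(digits) - 4           # index of the thousands place
        for _ in range(n):
            i = pos
            while i >= 0:
                digits[i] += 1
                if digits[i] <= ZERO + 9:   # still a valid digit, done
                    break
                digits[i] = ZERO            # wrap '9' -> '0' and carry left
                i -= 1

    add_thousands(score, 3)             # score += 3000
    print("".join(chr(d) for d in score))   # -> 003000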


Looks to me like you, and perhaps others, could benefit from a set of timing tests.

You know, set up a test harness with timers and such, then walk through the cases.

Of course, that is exactly what the manufacturer should have done! I always wondered at the high errata metrics associated with some catalog parts. It is just not enough to work through the circuit and hope for the best!


The public or college library could really come in handy.


Half of what keeps me going is the belief that I can randomly wander into this forum and witness black magic being tossed around like spare change. ;D


Neat! Eventually, Intel added the CPUID instruction so you could determine the processor type without trickery.


There were all kinds of tricks prior to CPUID to figure out what kind of CPU you were running on. I had actually forgotten about that - thanks for reminding me!


Robert Collins' article lists some of these methods, mostly centered around flag bit behavior: http://www.rcollins.org/ddj/Sep96/Sep96.html


Straight from Dr. Dobb's - that felt just about right, thanks.


IIRC it was added on Pentium and maybe late 486. You had to do classic tricks to identify the model before that.


CPUID was first added in the Pentium (66 MHz), in 1993.

Nevertheless, there were some late variants of the 486 that were introduced after the first Pentium, in 1994 or later, and which had CPUID, e.g. the Intel 486DX4 (100 MHz).

AMD had 2 generations of 486DX4 (and of 486DX2): the first did not have CPUID (and had a write-through cache), while the second had CPUID (and had a write-back cache).

Some Cyrix CPUs with properties intermediate between 486 and Pentium had CPUID, but it was disabled by default and it could be enabled in the BIOS.

Measuring the length of the prefetch queue was the standard method to identify 8088 vs. 8086, and it was implemented in several commercial CPU detection utilities for MS-DOS, e.g. Norton Utilities or the like.

At the time, I discovered this by disassembling such a utility program.
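
A toy Python model of the principle, in case it helps: a store patches a byte just ahead of the instruction pointer, and because the queue isn't snooped, whether the old or the new byte executes tells you how deep the queue is. The offsets and opcode names here are made up for illustration; only the queue depths (4 bytes on the 8088, 6 on the 8086) come from the article.

    def detect(queue_len):
        TARGET = 5                         # illustrative offset of the patched byte
        memory = ["NOP"] * 16
        memory[TARGET] = "INC_DX"          # original instruction

        # Bytes already sitting in the prefetch queue when the store runs.
        queue = {1 + i: memory[1 + i] for i in range(queue_len)}

        memory[TARGET] = "INC_AX"          # the self-modifying store (queue not updated)

        executed = queue.get(TARGET, memory[TARGET])
        return "8086-like" if executed == "INC_DX" else "8088-like"

    print(detect(4))   # 8088: 4-byte queue, the patched byte executes -> "8088-like"
    print(detect(6))   # 8086: 6-byte queue, the stale byte executes   -> "8086-like"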


Indeed, I have a 486 DX-50 without CPUID, and a 486 DX2-66 with CPUID support. The latter provides much more detail when viewed in CPU-Z.


I believe in the Pentium the prefetch queue became snooped, which coincides nicely with the introduction of CPUID.


In the Pentium, the cache was split into an instruction cache and a data cache.

This forced the introduction of the snooping workaround; otherwise, stores into the data cache would not have influenced the contents of the instruction cache.


It also may have something to do with why CPUID is strongly serializing? That always really confused me... it's not like the CPU type is going to race with a load or a store.


I suspect the serialising is just a side effect of the CPUID instruction being implemented as a massive microcode routine.


Author here for your 8086/8088 questions...


From the blog:

> However, the 8-bit bus enabled cheaper computer hardware.

I wonder if that can be expanded on. An old (now departed) friend once said to me that this was a mirage, because the only thing that the narrower bus really bought them was that it made the 16kB configuration possible, which no-one actually bought (the minimum for using floppies was 32kB!). He claimed that the narrower bus didn't actually make the configurations with more RAM cheaper because it made them require more support chips. Is there any truth to this?


According to Dave Bradley, one of the creators of the IBM PC, "We chose the 8088 [over the 8086] because of its 8-bit data bus. The smaller bus saved money in the areas of RAM, ROM, and logic for the simple system."

Another thing is that even if almost nobody bought the minimal RAM configuration, having a low-cost configuration can be very important from a marketing standpoint. (By the way, it's kind of amazing that the base RAM for an IBM PC was just 16 kilobytes, and now 16 gigabytes is a base RAM configuration.)

https://www.tech-insider.org/personal-computers/research/199...


It's mentioned at the bottom of this article[0] that Intel had an obsession with using the smallest packaging and pin count possible, while competitors did not see any need to be that conservative. I wonder if that has something to do with it.

[0] https://www.righto.com/2023/07/8086-pins.html


The 8088 and 8086 actually have the same number of pins and almost identical pinouts, despite the difference in data bus width. This is because the data and address are multiplexed onto the same pins, because of Intel's obsession with the lowest pin count (each pin does add quite a bit of cost to packaging).

On the 8086, the bottom 16 address pins (of 20) are multiplexed with data; on the 8088, only the bottom 8 address pins are multiplexed.

So you had to demultiplex the bus no matter which chip you chose. I think the cost savings mostly come from being able to configure systems with only 8 DRAM chips per bank (instead of 16 DRAM chips per bank with the 8086). A 16 bit bus also requires that your ROM chips are in pairs, and it probably increases the motherboard routing complexity. And a small bit of extra decoding logic.

A 16 bit bus would have also required IBM to skip straight over the 8 bit ISA standard with its smaller sockets and forced up the complexity of all PC expansion cards (which would also require double the ROM chips, double the RAM chips and extra logic).

Edit: And now that I think about it, the fact that the upper bits of the address bus aren't multiplexed might actually allow you to simplify the DRAM row/column addressing logic... but only if you put the column into the upper bits of the address.


The cost saving was more of a marketing trick.

Most of the savings were not from memory but from the reuse of the existing 8-bit peripherals without additional hardware and without software changes.

On a 16-bit bus, you could connect an 8-bit peripheral to one of its halves, but then the internal registers that previously were at consecutive addresses were now spread at multiples of 2.

When doing only 8-bit transfers, you could rewrite the software drivers to use the modified addresses. However, you could not do 16-bit transfers, because the bytes were no longer in the same word. This could be fixed with additional buffers, decoders and registers to convert 16-bit transfers into pairs of 8-bit transfers on the same bus half (like the 8088 did internally), but that would increase the cost.

Even if these compatibility problems were not too difficult to solve, most preferred to avoid them, in the quest for minimum cost.


It's quite humbling to look at how much they had to know to implement all these little details, and then realise that this is a 45-year-old design which is several orders of magnitude less complex than modern processors. Even other well-known CPUs of the time like the 6502 and Z80 are only a few years older, yet far simpler.


> Whenever I mention x86's domination of the computing market, people bring up ARM, but ARM has a lot more market share in people's minds than in actual numbers. One research firm says that ARM has 15% of the laptop market share in 2023, expected to increase to 25% by 2027. (Surprisingly, Apple only has 90% of the ARM laptop market.) In the server market, just an estimated 8% of CPU shipments in 2023 were ARM. See Arm-based PCs to Nearly Double Market Share by 2027 and Digitimes. (Of course, mobile phones are almost entirely ARM.)

"ARM has a lot more market share in people's minds than in actual numbers"

haha, that's true, especially on HN


Domination apart from the CPUs that power the main computing devices for the large majority of people (billions) around the world.


I think it's grossly disingenuous to say "computing market" and then ignore the entire smartphone/tablet market.

Hell, smartphones are by far the most personal computers we have ever had.


> I think it's grossly disingenuous to say "computing market" and then ignore the entire smartphone/tablet market.

Note the presence of the sentence: "Of course, mobile phones are almost entirely ARM."


Yes, but the implication is they aren't part of the "computing market" which I think is disingenuous.


How about "user programmable computers," whereas phones are "terminals".

I mean, I agree with your point; but the distinction is there. We need some way to differentiate between computers for hacking and computers that are vending machines for the hacks of others.


Do we? Purists will claim that because I can't run Xcode on an iPhone, it doesn't count as a "real computer", and that's true, to some degree. The thing is, people who aren't programmers these days also use computers, and don't use them to compile binaries. Would we call a laptop that never has a compiler installed a "vending machine for the hacks of others"? (I'm not sure what that even means.) Would a Chromebook be such a machine, up until a compiler is installed? Given that people run businesses off iPads, sending/receiving emails, writing docs and spreadsheets, why do we draw the line for everybody at can-compile-on-machine? If a web developer using web vscode/ssh can do remote development on an iPad, why, again, is can-compile-on-machine our important metric? It seems like an unnecessary purity test that doesn't stack up to the modern era of the Internet.

If I write simple programs on a iPhone using Shortcuts, does it become a general purpose computer? It's programming, just with a UI and its own graphical language. How complex a program do you have to be able to "compile" in order for it to count as writing a program on device? Because there are tons of little programs being written using Shortcuts, (and also Pythonista), so you'll have to be more specific.


> How about "user programmable computers," whereas phones are "terminals".

My install of Termux from F-Droid on my Android phone disagrees with that. While more limited than a PC, most smartphones can still be considered "user programmable computers".


Is the definition of hacking writing code only for yourself? Surely (most of) the point of writing code is for other people to use it.

If you ignore the majority of devices out there that you can write code for then that's not really a sensible definition of market share.


What meaningful distinction is there to be made? Computers are computers.


I wonder: if the queue length is already maintained by the queue counter, is there really a need for the MT flag?

Let's say it is possible to do - would the resulting saving of real estate and power be worth the effort?


Two reasons:

1. The queue is 4 bytes long, which fits neatly in a two-bit counter. You would have to switch to a three-bit counter to store the 5th state, which increases the area and power usage by ~50%.

2. The MT signal is explicitly needed to stall the execution unit when the prefetch queue is empty. If you replaced the flag register with a 5th state of the queue counter, then you would still need combinational logic to generate the MT signal (queue[0] == 0 && queue[1] == 0 && queue[2] == 0).

I'm guessing this two bit counter + MT flag scheme is actually optimal from a transistor count perspective.


The queue length isn’t maintained by the queue counter. For a start, there are two queue counters - the write counter and the read counter, each of which is a two bit counter. Each points to one of the four positions in the queue.

The queue itself though can be in one of five states - its length can be 0, 1, 2, 3, or 4.

The difference between the position of the read and write counters (which is always available through the hardcoded XOR subtraction circuit detailed in the article) is either 0, 1, 2, or 3.

The flag allows you to tell whether the 0 result of that subtraction is a zero length queue or a full queue.
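
To make that concrete, here is a rough Python model of the bookkeeping - two 2-bit pointers plus the MT flag to tell a full queue from an empty one. The real circuit derives the pointer difference with the XOR logic described in the article; this sketch just uses modular arithmetic:

    class PrefetchQueue:
        SIZE = 4

        def __init__(self):
            self.read = 0       # 2-bit read pointer
            self.write = 0      # 2-bit write pointer
            self.mt = True      # MT flag: set when the queue is empty

        def length(self):
            diff = (self.write - self.read) % self.SIZE   # 0..3
            if diff == 0:
                return 0 if self.mt else self.SIZE   # MT disambiguates empty vs full
            return diff

        def push_byte(self):
            assert self.length() < self.SIZE, "queue full"
            self.write = (self.write + 1) % self.SIZE
            self.mt = False

        def pop_byte(self):
            assert not self.mt, "queue empty"
            self.read = (self.read + 1) % self.SIZE
            self.mt = (self.read == self.write)

    q = PrefetchQueue()
    for _ in range(4):
        q.push_byte()
    print(q.length())   # 4: the pointers are equal again, but MT says "not empty"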



