8088 microprocessor IP core fits in 308 LUTs, runs at 180MHz on a Kintex-7 FPGA

mutagen · on Feb 19, 2016

Note that this core runs at ~100 MHz to get equivalent performance (cycle accurate timings) to the original 4.77 MHz 8088. The big deal is the relatively small number of LUTs used, leaving plenty of room for more stuff.

There's more from the creator at http://www.eetimes.com/author.asp?section_id=216&doc_id=1328... including this nice tidbit:

The result is the MCL86, which is basically a 7-instruction, 32-bit micro-sequencer. Some of the micro-sequencer's instructions are specialized so as to allow it to rapidly decode instructions as well as nest function calls. With these seven instructions, I was able to microcode all of the 8086 opcodes in a relatively small number of micro-sequencer clocks.

A video of this running 8088mph would be awesome, they already have a number of videos of this running other stuff on a PC: https://www.youtube.com/channel/UC9B3TaEUon-araO2j7tp9jg EDIT: There is a video of it runnin 8088 mph that polpo linked!

ajross · on Feb 19, 2016

Exactly. This is a "old is new" kind of design. Back when logic was so expensive as to make computers almost impossible, designs were build around a small number of execution units on a small number of buses and a comparatively compact microcode table that could share them sequentially. So you only needed one adder because the IP increment and ALU operation could share it on different clocks.

Then once there was space to put all the stuff needed for a single instruction on the die, we found ourselves clock-limited by long logic depth and started splitting the functions out across "pipeline" stages, which begat RISC, and we've never looked back.

But all that being said: this design is totally cheating. Sure, the logic takes only 308 LUTs. But the microcode is stored in 4 block RAMs, which a quick Google tells me are 36kbit a piece. That's a much more significant chunk of chip resources than is implied in the linked article.

abortz · on Feb 19, 2016

Perhaps in terms of gates, but LUTs represent a much larger fraction of resources on an FPGA than the block RAMs. In other words, if you wanted to pack a lot of these onto an FPGA you'd run out of LUTs before block RAM.

ajross · on Feb 19, 2016

I'm too lazy to look up numbers for the Kintex-7 part in question, but I'm almost certain you're wrong on this. A block RAM is a big chunk of die, and there are comparatively few of them to go around. A LUT is a tiny object (comparable in computation power to 10-50 dedicated transistors) and there are hundreds of thousands of them on the FPGA.

I'm willing to bet lots that 4/n_block_ram > 308/n_lut.

ajross · on Feb 19, 2016

Yeah, I went and looked up. The details are hairy, becuase Xilinx. But a KC7K410T part, which is roughly their mid-range offering has 63550 slices, where IIRC a slice has two LUTs (pay no attention to their "logic cells" number -- that's a normalized thing scaled so as to be linear with the old 4-input LUT design from long ago) and 28620kbit of block RAMs, where each block is 36kbit.

So that design uses 308/(2*63550) =~ 0.2% of the logic resources on the FPGA, but 4/(28620/36) = 0.5% of the RAM.

Not nearly as imbalanced as it sounded to me originally, but still: the LUT numbers are spun by more than a factor of two. The design is more closely equivalent to "640 LUTs". Which interestingly is very comparable to the equivalent transistor count on the original part from Intel.

abortz · on Feb 19, 2016

Yes, you were correct. I was working from the bulk stats on the chip and mixed up Kb and KB ;-)

sklogic · on Feb 19, 2016

Block RAMs are 2-port. You can up it to 4 ports by doubling their clock rate, if the rest is merely crawling on 100mhz. Yet, each core would eat up one port for reading its microcode every clock cycle, so you can only have up to 4 cores for each group of 4 brams with stored microcode.

rasz_pl · on Feb 20, 2016

The same big microcode resulting in small use of logic resources trick is also often used in software emulation. For example this classic Bisqwits NES in C++ with extensive explanation:

https://www.youtube.com/watch?v=QIUVSD3yqqE&t=4m40s

PeCaN · on Feb 19, 2016

Not just "relatively small"—positively tiny. You could fit hundreds of these on a single FPGA.

dingdingdang · on Feb 19, 2016

Yes, I was thinking the same thing: couldn't this be made into say a 500 core 8086? And if so, would this thing then be parallel programmable? I mean.. sort of fascinating.. though probably in an idle sort of way since the speed penalty of using cores that slow would almost definitely wipe out the parallel gain?

cantankerous · on Feb 20, 2016

Routing remains an issue when you scale up to that many IP cores. It's one of those aspects of FPGA design that doesn't really have an analog in software.

sklogic · on Feb 19, 2016

Yes, that's the most impressive bit. My best efforts in building a heavily microcoded tiny 16bit core resulted in a minimum of 700 ICE40 cells. Now I know that it's possible to go even further down.

tzs · on Feb 19, 2016

> Just how many LUTs is 308? The smallest Kintex-7 FPGA is the K70T with 65,600 logic cells (“the logic equivalent of a classic 4-input LUT and a flip-flop” according to User Guide UG474), so we’re talking about a resource consumption of much less than 1% of that very small programmable device.

So...you could put 100 8088 work-alikes on one FPGA?

At the risk of inducing /. nostalgia in the old timers here...can you imagine a Beowulf cluster of these?

cnvogel · on Feb 19, 2016

To put the mind-boggling amount of logic in a contemporary FPGA into perspective, the smallest device in the Kintex-7 family seems to be the XC7K70T...

http://www.xilinx.com/products/silicon-devices/fpga/kintex-7... (<-- HTML overview page at Xilinx)

...which contains 10,250 slices where one slice contains four 6-in-2-out lookup tables and 8 flipflops:

http://www.xilinx.com/support/documentation/user_guides/ug47... (<-- family user-guide) http://imgur.com/viEdQUv (<-- png of page 19 with "schematic" of one logic slice)

On page 19: The four boxes on the left are the lookup tables implementing combinatorical logic (A/W/O5/O6/...), the eight squares on the middle/right (D/CE/CK/SR/...) are flipflops (store one bit of data each). There's a bunch of random multiplexers (the trapezoid ones, they choose one output of X inputs) scattered around. This "schematic" is of course simplified ;-).

So, 10250*6/308=199.7 fits the 8088 "IP" 200 times. Of course this is a very naive calculation ignoring any routing between cores or any peripherals to make them do anything useful, and one would use one of such a 8-bit CPU for easy housekeeping tasks, and not 200 of them. But it shows nicely how incredibly dense current FPGAs are.

Of course, just one bare chip will set you back around $120. https://octopart.com/search?q=XC7K70T (<-- part search)

typon · on Feb 19, 2016

Even more mindboggling: the unreleased Stratix 10 FPGA from Altera claims to have 5,510,000 LUTs in the largest device.

That's a total of 17,889 Intel 8088 Microprocessors.

PeCaN · on Feb 20, 2016

What does one even use that much FPGA for? Any organization that could afford that could probably afford making an ASIC of comparable or better performance.

...That said, that's crazy impressive.

cnvogel · on Feb 20, 2016

One example that comes to mind is data analysis for radioastronomy. With arrays of radiotelescopes, as far as I understand it, you will downconvert an incoming frequency band spanning a few GHz, which gives you a few GByte/sec of data for each antenna. Then you might have an array of 50 telescopes (the ALMA array is planned to have 50 or so, I think).

This stream of maybe a TByte/sec of data will then be filtered and decimated/downconverted in real time by racks full of DSP/FPGA boards. Here's a picture of one board used for an Australian facility:

http://www.atnf.csiro.au/news/newsletter/oct06/CABB.htm

5x Virtex II XC2VP50 (23,616 slices, 2 PowerPC CPU blocks, $1700 each)

5x Virtex 4 XC4VSX55 (15,360 slices, $1300 each)

Yes, an ASIC might be more energy efficient and could be made faster, but FPGAs give you the flexibility to adapt your algorithms and filter topologies. And an ASIC run might cost you half a million dollars whereas with FPGAs you only spend ~15'000$/board.

pslam · on Feb 20, 2016

Funnily enough, one of the uses of these monster FPGAs is for simulating and prototyping ASIC designs.

sklogic · on Feb 20, 2016

If you're going to build and sell a hundred of £20000 devices, for some very specific market niche, it is still more justified than building hundreds of thousands of £10 devices and only selling one hundred.

mng2 · on Feb 19, 2016

Four LUTs per slice, not six, right?

cnvogel · on Feb 20, 2016

You are right. Four LUTs/slice, with 6inputs/2outputs per LUT.

sklogic · on Feb 19, 2016

Interconnect and memory would eat up most of the resources.

aceperry · on Feb 19, 2016

"can you imagine a Beowulf cluster of these?"

OMG,that would be one fast Beowulf cluster!

ChuckMcM · on Feb 19, 2016

That is so freaking awesome. CPLDs are approaching 308 logic blocks :-). And given the small footprint of the core it suggests you could probably build the entire IBM PC architecture on a single FPGA with CGA or Hercules Mono framebuffer support. Then boot Microsoft Flight Simulator and chortle at how much imagination you needed to use to believe you were flying a plane.

acveilleux · on Feb 19, 2016

Lol, I can't even suspend disbelief and play VGA games from the 1990s. I see the pixels and not the scene. Same with games like Civ 1 or Sim City. I remember playing for ages but now there's no way I can do it.

Maybe the size of my monitor is contributing to this. My DOS gaming days were on 12" and 14" CRTs current play is on 27" LCD.

rasz_pl · on Feb 20, 2016

This looks perfectly playable to me https://www.youtube.com/watch?v=qUOEAsCNZiA

FreeFull · on Feb 19, 2016

A good CRT filter helps somewhat

sklogic · on Feb 19, 2016

You can build a rather capable SoC with a 32 bit RISC core, DDR controller, Ethernet MAC, a couple of UARTs and a 800x600 colour VGA and an audio DAC on a cheap FPGA board like DE0 nano. It will run on 100mhz and be much more capable than an entire PC AT system, rather more like the Unix workstations of the early 90s.

coderjames · on Feb 21, 2016

And now you've added a new project to my TODO list. Thanks :-)

sklogic · on Feb 21, 2016

It can be as simple as checking out the orpsoc repo: http://opencores.org/or1k/ORCONF2013_Workshop_ORPSoC_On_DE0_...

samstave · on Feb 19, 2016

Did you just literally reference the Hercules graphics cards of the days of yore??? The only thing missing is DIP switches!!!

To be a kid again!

huangc10 · on Feb 19, 2016

This is sweet. Been a while since I worked in HW after switching to full time SW but this kind of news just makes me chuckle.

Say a modern day FPGA has 10 million gates. 6 gates/LUT. That gives 1.6 million LUTS. Let's say half are used up by other IPs and IOs within the chip. 800k/308 = ~2500.

You could have 2500 of the 8088 running at 180MHz simultaneously. Why? For science.

david-given · on Feb 19, 2016

I wonder how much extra space it would take to emulate all the weird-ass IBM XT peripherals and end up with a true 8088 embeddable-system-on-a-chip? Plug that into a cheapo SD card (or even a serial EEPROM) and you'd have a standalone machine that would run DOS.

That could actually be useful.

danjayh · on Feb 19, 2016

I haven't done any VHDL in about 9 years. Anyone know of a good site/tutorial to run through that starts out relatively basic and goes through to advanced topics? (bonus points if it has step-by-step instructions for a low cost / free for noncommercial dev environment).

analognoise · on Feb 19, 2016

Most vendors have free (as in beer) licenses available for their smaller parts.

Check out FreeRangeFactory. They have some good intro books. Not sure how advanced they get (as I do mostly verilog). Feel free to shoot me a message if you get stuck.

bitwize · on Feb 19, 2016

Let's see 8088mph run on that baby.

polpo · on Feb 19, 2016

Okay: https://www.youtube.com/watch?v=b3GkPGZR4BU

duskwuff · on Feb 19, 2016

Looks like it mostly works, with the exception of a couple of tricky color effects (which have nothing to do with the CPU, and might depend on some composite video tricks).

In particular, the Kefrens bars at 4:42 render completely wrong.

sp332 · on Feb 19, 2016

Looks like the demo leans on a lot of idiosyncrasies of specific CGA cards: https://scalibq.wordpress.com/2015/08/02/8088-mph-the-final-...

ajenner · on Feb 19, 2016

The Kefrens bars effect is the part of 8088 MPH that is most sensitive to the instructions being a cycle faster or slower here or there. The MCL86 doesn't claim to have perfect emulation of timing, so I would not necessarily expect this effect to work reliably on it.

Scali · on Feb 19, 2016

The artifact colours are quite off as well, so I suspect the card used was not a real IBM CGA card. On my ATi Small Wonder and Paradise PVC4 cards, I get the exact same Kefrens bars, with the top chopped off.

Having said that, the fact that the music slows down noticeable during the moire-effect seems to indicate that indeed some instructions are a few cycles off here and there.

fpgaminer · on Feb 19, 2016

I'm curious why it only runs at 180MHz on a Kintex-7. The Kintex-7 can do 32-bit additions at 400-500MHz, so it's odd to see an 8088 running at less than half that. The article mentions that removing the cycle accurate constraint would allow it to run faster, so perhaps that's why.

vvanders · on Feb 19, 2016

In theses solutions you're less interested in raw performance(you could go with one of their hybrid ARM-FPGA cores in that case) then keeping LUTs small so that the FPGA can use them for things that it's good at. Processors like these are kinda like the glue for larger parts.

muterad_murilax · on Feb 19, 2016

I'm sorry... IP? LUT?

EDIT: Thanks for the explanations, guys!

jwise0 · on Feb 19, 2016

For some reason, in the hardware design world, we use "IP" to mean "a self-contained functional block of hardware that we have developed". It probably derives from 'intellectual property' (i.e., 'we bought some IP from $VENDOR to handle HDMI', and then eventually 'we bought an IP from $VENDOR to handle HDMI').

A LUT is, more or less, the smallest logical element on an FPGA (a piece of programmable hardware) -- it's a Look-Up Table. They're not directly comparable from FPGA to FPGA, because some FPGAs have different sizes; in the early days, LUTs had 3 inputs and produced one output, but on modern FPGAs, 'LUT4's (4 input, 1 output) are the smallest that you'll reasonably get, and some FPGAs even use 'LUT6's (6 input, 1 output; sometimes divisible into 5 input, 2 output, or other subdivisions) as their basic logic element. But no matter how you slice it, 308 LUTs is impressively small, especially for 180MHz.

chetjan · on Feb 19, 2016

IP -> Intellectual Property, LUT -> Look Up Table. IP is often used by Intel to describe a "unit" of propriety technology, such a chip or just a adder design. Look up tables are in essence ROM that is the basis for how FPGAs work.

mastax · on Feb 19, 2016

IP = Intellectual Property

LUT = Look-Up Table

LUTs are the building blocks of FPGAs.

jloughry · on Feb 20, 2016

CPLD = Complex Programmable Logic Device

A chip half-way between a PAL (think of a dozen or so gates in a small package, you could choose how they were hooked up; it was fast and it was small) and an FPGA (relatively huge and expensive). Both programmable, but PALs usually only once.

davidvicky · on Feb 24, 2016

i'm curious why it only runs at 180MHz on a Kintex-7. The Kintex-7 can do 32-bit additions at 400-500MHz, so it's odd to see an 8088 running at less than half that. The article mentions that removing the cycle accurate constraint would allow it to run faster, so perhaps that's why.

http://1tour.vn http://en.1tour.vn http://1tour.vn/khach-san http://1tour.vn/tour http://blog.1tour.vn http://1tour.vn/khach-san/ha-noi http://1tour.vn/khach-san/sapa/ http://1tour.vn/khach-san/da-nang/ http://1tour.vn/khach-san/da-lat/ http://1tour.vn/khach-san/nha-trang/ http://1tour.vn/khach-san/vung-tau/ http://1tour.vn/khach-san/phan-thiet/ http://1tour.vn/khach-san/sai-gon/ http://1tour.vn/khach-san/phu-quoc/

bifrost · on Feb 19, 2016

This is pretty cool, and should remind everyone that the 8088 and even the Zilog80 CPUs are still relevant and used to this day.

duskwuff · on Feb 19, 2016

Z80? Yes. 8088? Eh, not really.

You may be confusing the 8088 CPU with the 8051 microcontroller, which is extremely common in embedded designs.

njharman · on Feb 19, 2016

I "learned" assembler (and a lot about CPU/Computer architecture) reading my dads Z80 Microprocessor book. The 80's were cool!

atemerev · on Feb 19, 2016

Incidentally, the smallest model organism with a nervous system, C. elegans, has 302 neurons.

Coincidence? Don't think so.

chriscappuccio · on Feb 19, 2016

Sweet Jesus

guiomie · on Feb 19, 2016

Why? I haven't played with an FPGA since University, so I'm really disconnected from this world. So what does this article mean?

vvanders · on Feb 19, 2016

308 is an incredibly small amount of LUTs, might even fit in a CPLD.

sliverstorm · on Feb 19, 2016

You could fit 200+ 8088's on the smallest version of the FPGA family they used.

cfallin · on Feb 19, 2016

Oh wow, now I really want to see a 200-core-8088-on-FPGA! (Yeah, I know, the necessary NoC and glue would add a ton of overhead. But still!)

sklogic · on Feb 19, 2016

Even better than 88s: http://fpga.org/grvi-phalanx/

davidvicky · on Feb 24, 2016

Robert Chao is what many of us dreamed we’d become when we first started engineering school. We loved technology. We knew we were intelligent. We wanted to build our skills and increase our knowledge. We wanted to apply that education along with our natural passion and talent to solve important problems. We craved that creative rush, that eureka moment when we realized we had a slightly better idea, an improvement, a challenge to the status quo. We wanted to change the world - one transistor at a time.

For most of us, Dr. Moore had other plans.

For the past four decades, Robert Chao has remained indifferent to the cataclysmic vortex of Moore’s Law. His first patent (4215281), filed in February 1978 on behalf of Supertex Semiconductor (of which he was a founder), was for a CMOS integrated circuit - a single-chip solution that provided the platform enabling the ubiquitous home smoke detector.

Gratsby · on Feb 19, 2016

I know at least a dozen of these words.