Note that this core runs at ~100 MHz to get equivalent performance (cycle-accurate timings) to the original 4.77 MHz 8088. The big deal is the relatively small number of LUTs used, leaving plenty of room for more stuff.
The result is the MCL86, which is basically a 7-instruction, 32-bit micro-sequencer. Some of the micro-sequencer's instructions are specialized so as to allow it to rapidly decode instructions as well as nest function calls. With these seven instructions, I was able to microcode all of the 8086 opcodes in a relatively small number of micro-sequencer clocks.
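To make the idea concrete, here is a toy sketch of what a micro-sequencer loop looks like. The MCL86's actual seven instructions aren't documented here, so this invents a hypothetical 3-instruction subset (jump, call/return for nesting, and an ALU step) purely to illustrate the pattern: a tiny fixed engine stepping through a microcode table.

```python
# Hypothetical micro-sequencer sketch -- NOT the real MCL86 instruction
# set, just an illustration of a small engine driven by a microcode table.

JMP, CALL, RET, ALU = range(4)

def run(microcode, start=0, max_steps=100):
    pc = start
    stack = []        # supports nested microcode "function" calls
    acc = 0           # stand-in for shared execution-unit state
    for _ in range(max_steps):
        op, arg = microcode[pc]
        if op == JMP:
            pc = arg
        elif op == CALL:
            stack.append(pc + 1)   # remember return address, then branch
            pc = arg
        elif op == RET:
            if not stack:
                return acc         # top-level return: microprogram done
            pc = stack.pop()
        elif op == ALU:
            acc += arg             # the one shared adder does the work
            pc += 1
    raise RuntimeError("microcode did not terminate")

# A micro-routine at address 3 adds 5; the main program calls it twice.
prog = [
    (CALL, 3),   # 0: call add5
    (CALL, 3),   # 1: call add5 again
    (RET, 0),    # 2: done
    (ALU, 5),    # 3: add5: acc += 5
    (RET, 0),    # 4: return to caller
]
print(run(prog))   # -> 10
```

The point is that the engine itself stays tiny and fixed; all the per-opcode complexity lives in the microcode table, which is exactly why it can fit in so few LUTs.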
A video of this running 8088 MPH would be awesome; they already have a number of videos of this running other stuff on a PC: https://www.youtube.com/channel/UC9B3TaEUon-araO2j7tp9jg EDIT: There is a video of it running 8088 MPH that polpo linked!
Exactly. This is an "old is new" kind of design. Back when logic was so expensive as to make computers almost impossible, designs were built around a small number of execution units on a small number of buses, with a comparatively compact microcode table that could share them sequentially. So you only needed one adder, because the IP increment and the ALU operation could share it on different clocks.
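The adder-sharing idea can be sketched in a few lines. This is a toy model (not any real machine): a single adder "execution unit" time-multiplexed between the ALU operation and the IP increment on alternate clock phases.

```python
# Toy illustration of resource sharing in early microcoded machines:
# one adder serves two jobs on alternate clock phases.

def shared_adder(a, b):
    return a + b          # the one and only adder in the "machine"

ip, acc = 0, 0
program = [3, 4, 5]       # operands for an imaginary ADD-immediate opcode

for phase in range(2 * len(program)):
    if phase % 2 == 0:
        acc = shared_adder(acc, program[ip])   # even phase: ALU op uses the adder
    else:
        ip = shared_adder(ip, 1)               # odd phase: IP increment reuses it

print(acc)   # -> 12
```

Each instruction costs two clocks instead of one, which is exactly the trade: half the throughput for half the adders.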
Then once there was space to put all the stuff needed for a single instruction on the die, we found ourselves clock-limited by long logic depth and started splitting the functions out across "pipeline" stages, which begat RISC, and we've never looked back.
But all that being said: this design is totally cheating. Sure, the logic takes only 308 LUTs. But the microcode is stored in 4 block RAMs, which a quick Google tells me are 36 kbit apiece. That's a much more significant chunk of chip resources than is implied in the linked article.
Perhaps in terms of gates, but LUTs represent a much larger fraction of resources on an FPGA than the block RAMs. In other words, if you wanted to pack a lot of these onto an FPGA you'd run out of LUTs before block RAM.
I'm too lazy to look up numbers for the Kintex-7 part in question, but I'm almost certain you're wrong on this. A block RAM is a big chunk of die, and there are comparatively few of them to go around. A LUT is a tiny object (comparable in computation power to 10-50 dedicated transistors) and there are hundreds of thousands of them on the FPGA.
I'm willing to bet lots that 4/n_block_ram > 308/n_lut.
Yeah, I went and looked it up. The details are hairy, because Xilinx. But an XC7K410T part, which is roughly their mid-range offering, has 63550 slices, where IIRC a slice has two LUTs (pay no attention to their "logic cells" number -- that's a normalized figure scaled so as to be linear with the old 4-input LUT design from long ago) and 28620 kbit of block RAM, where each block is 36 kbit.
So that design uses 308/(2*63550) =~ 0.2% of the logic resources on the FPGA, but 4/(28620/36) = 0.5% of the RAM.
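The arithmetic is easy to reproduce from the figures quoted above (63550 slices, two LUTs per slice as recalled, 28620 kbit of block RAM in 36 kbit blocks):

```python
# Recomputing the resource-fraction percentages from the figures above.
luts_used, slices, luts_per_slice = 308, 63550, 2   # 2 LUTs/slice, as recalled above
bram_used, bram_total_kbit, bram_block_kbit = 4, 28620, 36

logic_frac = luts_used / (luts_per_slice * slices)
ram_frac = bram_used / (bram_total_kbit / bram_block_kbit)

print(f"logic: {logic_frac:.2%}")              # -> logic: 0.24%
print(f"ram:   {ram_frac:.2%}")                # -> ram:   0.50%
print(f"ratio: {ram_frac / logic_frac:.1f}x")  # -> ratio: 2.1x
```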
Not nearly as imbalanced as it sounded to me originally, but still: the LUT numbers are spun by more than a factor of two. The design is more closely equivalent to "640 LUTs". Which interestingly is very comparable to the equivalent transistor count on the original part from Intel.
Block RAMs are dual-port. You can get 4 effective ports by doubling their clock rate, if the rest of the design is merely crawling along at 100 MHz. But each core would eat up one port reading its microcode every clock cycle, so you can only have up to 4 cores per group of 4 BRAMs holding the microcode.
The same trick -- a big microcode table resulting in small use of logic resources -- is also often used in software emulation. For example, this classic, Bisqwit's NES emulator in C++, with extensive explanation:
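The software version of the trick looks like this: the "logic" is one tiny fetch/dispatch loop, and all the per-opcode behaviour lives in a big data table, the analogue of a microcode ROM. A minimal sketch (an invented 3-opcode machine, nothing to do with the NES specifically):

```python
# Table-driven dispatch: a fixed core loop plus a big opcode table.
# In a real emulator the table has hundreds of entries; the loop never grows.

TABLE = {
    0x01: lambda st, imm: st.__setitem__("a", st["a"] + imm),  # ADD imm
    0x02: lambda st, imm: st.__setitem__("a", st["a"] * imm),  # MUL imm
    0x00: None,                                                # HALT
}

def emulate(mem, st):
    while True:
        op, imm = mem[st["pc"]], mem[st["pc"] + 1]   # fetch opcode + operand
        st["pc"] += 2
        handler = TABLE[op]                          # dispatch via the table
        if handler is None:
            return st["a"]
        handler(st, imm)                             # execute

print(emulate([0x01, 5, 0x02, 3, 0x00, 0], {"a": 0, "pc": 0}))   # -> 15
```

Adding an opcode means adding a table entry, not more control logic, which is the same economy the microcoded FPGA core exploits.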
Yes, I was thinking the same thing: couldn't this be made into, say, a 500-core 8086? And if so, would this thing then be parallel-programmable? I mean, sort of fascinating, though probably in an idle sort of way, since the speed penalty of using cores that slow would almost certainly wipe out the parallel gain?
Routing remains an issue when you scale up to that many IP cores. It's one of those aspects of FPGA design that doesn't really have an analog in software.
Yes, that's the most impressive bit. My best efforts in building a heavily microcoded tiny 16-bit core resulted in a minimum of 700 iCE40 cells. Now I know that it's possible to go even further down.
There's more from the creator at http://www.eetimes.com/author.asp?section_id=216&doc_id=1328... including this nice tidbit: