
From their MICRO'15 paper (section 5): "We use an in-house microarchitectural, event-driven, sequential simulator based on Pin".

This is fairly standard for computer architecture research in academia: the results are based on a cycle-accurate model that simulates what the machine would do when running particular code. The field has a standard set of benchmarks (e.g. SPEC CPU2006 for serial code, and a bunch of different ones for parallel code) that are run on these simulators and are well-understood. The same is true for architecture teams in industry: they start with simulators and (a lot of) benchmarks before any silicon ever exists.

The reason for all of that is that actually fabbing a chip is prohibitively expensive and insanely work-intensive. A real chip has probably millions of lines of Verilog RTL, and a lot of custom layout to get anything reasonably performant (then > $1M for a mask set, multiple weeks or months to get the chips back from the fab, etc). In contrast, a good simulation model, worthy of publishable results, of an SoC with out-of-order cores, a cache hierarchy, and DRAM, is somewhere between 10K and 100K LoC of C++; the model for a new proposed feature is maybe a few KLoC of code on top of that. Once the simulator exists, a grad student can try out new features fairly easily. It's also much more analyzable and instrumentable: a chip is mostly a black box, modulo whatever debugging features you build in, while a simulator can easily dump a "pipe trace" of the pipeline state every cycle. The field has invented lots of different tools and ways of visualizing data to get a sense of what goes on inside the machine and what bottlenecks exist.
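(As a purely illustrative sketch of what that kind of instrumentation can look like -- the struct and stage names below are made up, not from any particular simulator -- a pipe trace is often just one line per cycle showing what occupies each pipeline stage:)

    // Hypothetical per-cycle pipe-trace dump: one line per cycle showing
    // which instruction (if any) occupies each pipeline stage.
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct StageSlot {
        std::string stage;  // e.g. "IF", "ID", "EX", "MEM", "WB"
        std::string insn;   // disassembled instruction, or "-" if empty
    };

    void dump_pipe_trace(uint64_t cycle, const std::vector<StageSlot>& slots) {
        std::printf("%10llu |", (unsigned long long)cycle);
        for (const auto& s : slots)
            std::printf(" %s:%-16s", s.stage.c_str(), s.insn.c_str());
        std::printf("\n");
    }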

So basically, it's a software model, and the software model is much more informative, and easier to tweak and iterate on, than silicon while being "good enough" for trustworthy results.




I don't completely agree that fabbing a chip is "prohibitively expensive and insanely work-intensive." While it's not cheap, lots of university projects fab chips using multi-project wafer services. For example, the Rocket core from Berkeley has been taped out 11 times: https://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1... and can boot Linux.

But, I agree with your point that a simulator is a much easier step that gets most of the data.


Another example is the KiloCore CPU that was in the news last month: https://www.ucdavis.edu/news/worlds-first-1000-processor-chi...


Fair point -- I forgot about MOSIS. Agreed, pretty reasonable for small test runs. I guess I've mostly heard of tapeouts when proving out a full design, though, e.g. TRIPS from UT-Austin (and they had a team of ~20 students, so the chip was worth a number of dissertations). I can't imagine doing a tapeout while working on caching or branch prediction or other core comparch subfields like that -- the deliverable is an algorithm and/or a new idea, not a physical proof of concept, and the usual effort budget (a grad-student-year or two, at most) isn't enough to justify one.

(All that said, seeing a real chip at the end of a project must be a heck of an exhilarating feeling of success...)


Even companies worth billions, with their own fabs, simulate and benchmark a chip before ever fabbing it.


Exactly. For a primary source, see this talk from 32C3 where an AMD CPU designer explained their process in detail: https://media.ccc.de/v/32c3-7171-when_hardware_must_just_wor...


Awesome! I guess that's what I meant by "compiled into VMs", but I thought that Verilog or some sort of declarative language could be used to both specify the design AND generate a VM on which to simulate code execution.

But as you imply, that's mostly done by hand in C++ with frameworks?


You definitely wouldn't want to use Verilog to iterate quickly on a simulation model. It's a lot more work: in a software model, you can (for example) say the equivalent of "this instruction is a divide and takes 32 cycles", then mark the divider busy for the next 32 cycles in a reservation table and increment the "total cycles taken by this benchmark" counter. In the RTL you'd actually have to build a divider and integrate it into the pipeline.
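(Roughly, the software-model version of that bookkeeping is only a few lines; this is a sketch with invented names, not code from any real simulator:)

    // Sketch of timing-only accounting for a divide: no divider logic is
    // modeled, we just mark the unit busy and charge the latency.
    #include <algorithm>
    #include <cstdint>

    constexpr uint64_t kDivLatency = 32;

    struct DividerState {
        uint64_t busy_until = 0;  // cycle at which the divider frees up
    };

    // Returns the cycle at which the divide completes, and adds the stall
    // plus latency to the benchmark's total cycle count.
    uint64_t issue_divide(DividerState& div, uint64_t now, uint64_t& total_cycles) {
        uint64_t start = std::max(now, div.busy_until);  // structural hazard stall
        div.busy_until = start + kDivLatency;
        total_cycles += div.busy_until - now;
        return div.busy_until;
    }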

This shortcut is possible because of a common technique of splitting functional and timing details apart: the functional emulator simply runs the instructions in a big interpreter loop and tracks the machine state, the way QEMU or Bochs would, while the timing model just does cycle accounting over the resulting instruction stream. In contrast, when you build a model in RTL, you're actually doing all the work that industry microarchitects do: you need to get every detail of (say) speculative execution or cache tag matching right, because your microarchitecture is implementing the code execution directly. That's a lot harder to do!
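(A minimal sketch of that split, with made-up types and latencies just to show the shape: the functional side retires instructions and hands the timing side one record per instruction; the timing side only counts cycles.)

    // Sketch of a functional/timing split. The functional emulator would fill
    // in InsnRecord as it retires each instruction; the timing model never
    // touches architectural state, it only accounts cycles.
    #include <cstdint>

    enum class Op { kAlu, kLoad, kStore, kBranch, kDiv };

    struct InsnRecord {         // what the functional side hands to the timing side
        uint64_t pc;
        Op       op;
        uint64_t mem_addr;      // valid for loads/stores
        bool     branch_taken;  // valid for branches
    };

    struct TimingModel {
        uint64_t cycles = 0;

        void consume(const InsnRecord& r) {
            switch (r.op) {
                case Op::kDiv:    cycles += 32; break;                            // long-latency unit
                case Op::kLoad:   cycles += cache_hit(r.mem_addr) ? 3 : 200; break;
                case Op::kBranch: cycles += predicted(r) ? 1 : 15; break;         // mispredict penalty
                default:          cycles += 1; break;
            }
        }

        // Stand-ins; a real model would query a cache hierarchy and a predictor.
        bool cache_hit(uint64_t) const { return true; }
        bool predicted(const InsnRecord&) const { return true; }
    };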

People do sometimes write RTL for their proposed microarchitectures, but that's usually done for power or timing (clock speed / critical path) results. And they usually model just whatever new thing (prediction table, synchronization widget, cache eviction logic) they propose, rather than the whole chip.


Being a little pedantic here, but you could use Verilog to write a high-level model like the one you describe. The language certainly doesn't restrict you to its synthesizable subset.

That being said, it's generally easier and cheaper (good Verilog simulators aren't free) to use a general-purpose language for what you describe.





