
From their MICRO'15 paper (section 5): "We use an in-house microarchitectural, event-driven, sequential simulator based on Pin".

This is fairly standard for computer architecture research in academia: the results are based on a cycle-accurate model that simulates what the machine would do when running particular code. The field has a standard set of benchmarks (e.g. SPEC CPU2006 for serial code, and a bunch of different ones for parallel code) that are run on these simulators and are well-understood. The same is true for architecture teams in industry: they start with simulators and (a lot of) benchmarks before any silicon ever exists.

The reason for all of that is that actually fabbing a chip is prohibitively expensive and insanely work-intensive. A real chip has probably millions of lines of Verilog RTL, and a lot of custom layout to get anything reasonably performant (then > $1M for a mask set, multiple weeks or months to get the chips back from the fab, etc). In contrast, a good simulation model, worthy of publishable results, of an SoC with out-of-order cores, a cache hierarchy, and DRAM, is somewhere between 10K and 100K LoC of C++; the model for a new proposed feature is maybe a few KLoC of code on top of that. Once the simulator exists, a grad student can try out new features fairly easily. It's also much more analyzable and instrumentable: a chip is mostly a black box, modulo whatever debugging features you build in, while a simulator can easily dump a "pipe trace" of the pipeline state every cycle. The field has invented lots of different tools and ways of visualizing data to get a sense of what goes on inside the machine and what bottlenecks exist.
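(As a purely illustrative sketch of what that kind of instrumentation can look like -- the struct and stage names below are made up, not from any particular simulator -- a pipe trace is often just one line per cycle showing what occupies each pipeline stage:)

    // Hypothetical per-cycle pipe-trace dump: one line per cycle showing
    // which instruction (if any) occupies each pipeline stage.
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct StageSlot {
        std::string stage;  // e.g. "IF", "ID", "EX", "MEM", "WB"
        std::string insn;   // disassembled instruction, or "-" if empty
    };

    void dump_pipe_trace(uint64_t cycle, const std::vector<StageSlot>& slots) {
        std::printf("%10llu |", (unsigned long long)cycle);
        for (const auto& s : slots)
            std::printf(" %s:%-16s", s.stage.c_str(), s.insn.c_str());
        std::printf("\n");
    }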

So basically, it's a software model, and the software model is much more informative, and easier to tweak and iterate on, than silicon while being "good enough" for trustworthy results.




I don't completely agree that fabbing a chip is "prohibitively expensive and insanely work-intensive." While it's not cheap, lots of university projects fab chips using multi-project wafer services. For example, the Rocket core from Berkeley has been taped out 11 times: https://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1... and can boot Linux.

But, I agree with your point that a simulator is a much easier step that gets most of the data.


Another example is the KiloCore CPU that was in the news last month: https://www.ucdavis.edu/news/worlds-first-1000-processor-chi...


Fair point -- I forgot about MOSIS. Agreed, pretty reasonable for small test runs. I guess I've mostly heard of tapeouts when proving out a full design, though, e.g. TRIPS from UT-Austin (and they had a team of ~20 students, so the chip was worth a number of dissertations). I can't imagine doing a tapeout while working on caching or branch prediction or other core comparch subfields like that -- the deliverable is an algorithm and/or a new idea, not a physical proof of concept, and the usual effort budget (a grad-student-year or two, at most) isn't enough to justify one.

(All that said, seeing a real chip at the end of a project must be a heck of an exhilarating feeling of success...)


Even companies worth billions, with their own fabs, simulate and benchmark a chip before ever fabbing it.


Exactly. For a primary source, see this talk from 32C3 where an AMD CPU designer explained their process in detail: https://media.ccc.de/v/32c3-7171-when_hardware_must_just_wor...


Awesome! I guess that's what I meant by "compiled into VMs", but I thought that Verilog or some sort of declarative language could be used to both specify the design AND generate a VM on which to simulate code execution.

But as you imply, that's mostly done by hand in C++ with frameworks?


You definitely wouldn't want to use Verilog to iterate quickly on a simulation model. It's a lot more work: in a software model, you can (for example) say the equivalent of "this instruction is a divide and takes 32 cycles", then mark the divider busy for the next 32 cycles in a reservation table and increment the "total cycles taken by this benchmark" counter. In the RTL you'd actually have to build a divider and integrate it into the pipeline.
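(Roughly, the software-model version of that bookkeeping is only a few lines; this is a sketch with invented names, not code from any real simulator:)

    // Sketch of timing-only accounting for a divide: no divider logic is
    // modeled, we just mark the unit busy and charge the latency.
    #include <algorithm>
    #include <cstdint>

    constexpr uint64_t kDivLatency = 32;

    struct DividerState {
        uint64_t busy_until = 0;  // cycle at which the divider frees up
    };

    // Returns the cycle at which the divide completes, and adds the stall
    // plus latency to the benchmark's total cycle count.
    uint64_t issue_divide(DividerState& div, uint64_t now, uint64_t& total_cycles) {
        uint64_t start = std::max(now, div.busy_until);  // structural hazard stall
        div.busy_until = start + kDivLatency;
        total_cycles += div.busy_until - now;
        return div.busy_until;
    }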

This shortcut is possible because of a common technique of splitting functional and timing details apart: the functional emulator simply runs the instructions in a big interpreter loop and tracks the machine state, the way QEMU or Bochs would, while the timing model just does cycle accounting over the resulting instruction stream. In contrast, when you build a model in RTL, you're actually doing all the work that industry microarchitects do: you need to get every detail of (say) speculative execution or cache tag matching right, because your microarchitecture is implementing the code execution directly. That's a lot harder to do!
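(A minimal sketch of that split, with made-up types and latencies just to show the shape: the functional side retires instructions and hands the timing side one record per instruction; the timing side only counts cycles.)

    // Sketch of a functional/timing split. The functional emulator would fill
    // in InsnRecord as it retires each instruction; the timing model never
    // touches architectural state, it only accounts cycles.
    #include <cstdint>

    enum class Op { kAlu, kLoad, kStore, kBranch, kDiv };

    struct InsnRecord {         // what the functional side hands to the timing side
        uint64_t pc;
        Op       op;
        uint64_t mem_addr;      // valid for loads/stores
        bool     branch_taken;  // valid for branches
    };

    struct TimingModel {
        uint64_t cycles = 0;

        void consume(const InsnRecord& r) {
            switch (r.op) {
                case Op::kDiv:    cycles += 32; break;                            // long-latency unit
                case Op::kLoad:   cycles += cache_hit(r.mem_addr) ? 3 : 200; break;
                case Op::kBranch: cycles += predicted(r) ? 1 : 15; break;         // mispredict penalty
                default:          cycles += 1; break;
            }
        }

        // Stand-ins; a real model would query a cache hierarchy and a predictor.
        bool cache_hit(uint64_t) const { return true; }
        bool predicted(const InsnRecord&) const { return true; }
    };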

People do sometimes write RTL for their proposed microarchitectures, but that's usually done for power or timing (clock speed / critical path) results. And they usually model just whatever new thing (prediction table, synchronization widget, cache eviction logic) they propose, rather than the whole chip.


Being a little pedantic here, but you could use Verilog to write a high-level model like the one you describe. The language certainly doesn't restrict you to its synthesizable subset.

That being said, it's generally easier and cheaper (good Verilog simulators aren't free) to use a general-purpose language for what you describe.





