MIT Takes Multicore in a Different Direction (top500.org)
112 points by jonbaer on July 6, 2016 | 23 comments




FTA: "As far as we can tell, there is no actual hardware yet, so any claims of ease of programming and performance advantages are based on simulations and perhaps the enthusiasm of the researchers."

Take it for what it's worth.


I've been assuming that these kinds of architecture thought experiments could be compiled into VMs that then run compiled code against them, while counting simulated wall time - so that one could make relatively true claims about the architectures?

I'm not versed in the topic, but if that's true, then why wouldn't those numbers be worthy?


From their MICRO'15 paper (section 5): "We use an in-house microarchitectural, event-driven, sequential simulator based on Pin".

This is fairly standard for computer architecture research in academia: the results are based on a cycle-accurate model that simulates what the machine would do when running particular code. The field has a standard set of benchmarks (e.g. SPEC CPU2006 for serial code, and a bunch of different ones for parallel code) that are run on these simulators and are well-understood. The same is true for architecture teams in industry: they start with simulators and (a lot of) benchmarks before any silicon ever exists.

The reason for all of that is that actually fabbing a chip is prohibitively expensive and insanely work-intensive. A real chip has probably millions of lines of Verilog RTL, and a lot of custom layout to get anything reasonably performant (then > $1M for a mask set, multiple weeks or months to get the chips back from the fab, etc).

In contrast, a good simulation model, worthy of publishable results, of an SoC with out-of-order cores, a cache hierarchy, and DRAM, is somewhere between 10K and 100K LoC of C++; the model for a new proposed feature is maybe a few KLoC of code on top of that. Once the simulator exists, a grad student can try out new features fairly easily.

It's also much more analyzable and instrumentable: a chip is mostly a black box, modulo whatever debugging features you build in, while a simulator can easily dump a "pipe trace" of the pipeline state every cycle. The field has invented lots of different tools and ways of visualizing data to get a sense of what goes on inside the machine and what bottlenecks exist.

So basically, it's a software model, and the software model is much more informative, and easier to tweak and iterate on, than silicon while being "good enough" for trustworthy results.
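
To make the "pipe trace" point concrete, here's a toy sketch (entirely made up, not from the MIT work or any real simulator) of the kind of per-cycle dump a software model can emit:

    // Toy per-cycle "pipe trace" instrumentation in a software simulator.
    // Stage names and struct layout are invented for illustration only.
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    struct PipeSlot {
        uint64_t pc = 0;
        std::string mnemonic = "-";   // "-" means the stage holds a bubble
    };

    int main() {
        std::array<PipeSlot, 5> stages{};            // IF/ID/EX/MEM/WB
        const char* names[5] = {"IF", "ID", "EX", "MEM", "WB"};

        for (uint64_t cycle = 0; cycle < 3; ++cycle) {
            // ... advance the model by one cycle here ...

            // One line per cycle, one column per pipeline stage.
            std::printf("cycle %4llu |", (unsigned long long)cycle);
            for (int s = 0; s < 5; ++s)
                std::printf(" %s:%-8s", names[s], stages[s].mnemonic.c_str());
            std::printf("\n");
        }
    }

Getting the equivalent view out of real silicon means designing the debug hooks in ahead of time.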


I don't completely agree that fabbing a chip is "prohibitively expensive and insanely work-intensive." While it's not cheap, lots of university projects fab chips using multi project wafer services. For example, the Rocket core from Berkeley has been taped out 11 times: https://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1... and can boot Linux.

But, I agree with your point that a simulator is a much easier step that gets most of the data.


Another example is the KiloCore CPU in the news last month: https://www.ucdavis.edu/news/worlds-first-1000-processor-chi...


Fair point -- I forgot about MOSIS. Agreed, pretty reasonable for small test runs. I guess I've mostly heard of tapeouts when proving out a full design though, e.g. TRIPS from UT-Austin (and they had a team of ~20 students, so that was worth a number of dissertations). I can't imagine doing a tapeout while working on caching or branch prediction or other core comparch subfields like that -- the deliverable is an algorithm and/or a new idea, not a physical proof of concept, and the usual effort (a grad-student-year or two, at most) isn't high enough.

(All that said, seeing a real chip at the end of a project must be a heck of an exhilarating feeling of success...)


Even companies worth billions that have their own fabs simulate the chip first, to test and benchmark it, before fabbing it.


Exactly. For a primary source, see this talk from 32C3 where an AMD CPU designer explained their process in detail: https://media.ccc.de/v/32c3-7171-when_hardware_must_just_wor...


Awesome! I guess that's what I meant by "compiled into VMs", but I thought that Verilog or some sort of declarative language could be used to specify AND generate a VM on which to simulate code execution.

But as you imply, that's mostly done by hand in C++ with frameworks?


You definitely wouldn't want to use Verilog to iterate quickly on a simulation model. It's a lot more work: in a software model, you can (for example) say the equivalent of "this instruction is a divide and takes 32 cycles", then mark the divider busy for the next 32 cycles in a reservation table and increment the "total cycles taken by this benchmark" counter. In the RTL you'd actually have to build a divider and integrate it into the pipeline.
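
A minimal sketch of that bookkeeping, with invented numbers and structure (this isn't any particular simulator's code):

    // Toy cycle accounting for a long-latency instruction: mark the divider
    // busy and bump the benchmark's cycle counter. Latencies are made up.
    #include <algorithm>
    #include <cstdint>

    struct TimingState {
        uint64_t total_cycles = 0;     // "total cycles taken by this benchmark"
        uint64_t divider_free_at = 0;  // reservation: cycle when the divider frees up
    };

    // Charge a divide issued at cycle `now`; returns the cycle it completes.
    uint64_t account_divide(TimingState& t, uint64_t now) {
        const uint64_t kDivLatency = 32;                    // "a divide takes 32 cycles"
        uint64_t start = std::max(now, t.divider_free_at);  // stall if the divider is busy
        uint64_t done  = start + kDivLatency;
        t.divider_free_at = done;                           // keep the unit reserved
        t.total_cycles   = std::max(t.total_cycles, done);  // advance the benchmark clock
        return done;
    }

No divider logic anywhere, just bookkeeping - which is the whole point of the shortcut.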

This shortcut is possible because of a common technique of splitting functional and timing details apart: the functional emulator simply runs the instructions in a big interpreter loop and tracks the machine state as, say, QEMU or Bochs would, while the timing model is just cycle accounting given the instruction stream. In contrast, when you build a model in RTL, you're actually doing all the work that industry microarchitects do: you need to get right all the details of (say) speculative execution, or cache tag matching, or whatever, because your microarchitecture is implementing the code execution directly. That's a lot harder to do!
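
As a sketch of that split (all names invented; QEMU/Bochs are the real-world analogues for the functional half): the functional side executes instructions and emits a record of what it did, and the timing side only charges cycles for each record.

    // Toy functional/timing split. The functional emulator runs the program
    // and produces an instruction stream; the timing model never executes
    // anything, it only does cycle accounting. All details are illustrative.
    #include <cstdint>
    #include <vector>

    enum class Op { Add, Load, Divide };

    struct InstrRecord {                 // what functional hands to timing
        Op op;
        uint64_t pc;
    };

    struct FunctionalEmulator {          // the big interpreter loop would live here
        std::vector<InstrRecord> trace;  // records appended as instructions retire
    };

    struct TimingModel {
        uint64_t cycles = 0;
        void account(const InstrRecord& r) {
            switch (r.op) {              // pure cycle accounting, no semantics
                case Op::Add:    cycles += 1;  break;
                case Op::Load:   cycles += 4;  break;  // pretend it's a cache hit
                case Op::Divide: cycles += 32; break;
            }
        }
    };

    int main() {
        FunctionalEmulator fe;
        fe.trace = {{Op::Add, 0x100}, {Op::Load, 0x104}, {Op::Divide, 0x108}};
        TimingModel tm;
        for (const auto& r : fe.trace) tm.account(r);  // replay through the timing model
        return static_cast<int>(tm.cycles);            // 37 in this toy run
    }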

People do sometimes write RTL for their proposed microarchitectures, but that's usually done for power or timing (clock speed / critical path) results. And they usually model just whatever new thing (prediction table, synchronization widget, cache eviction logic) they propose, rather than the whole chip.


Being a little pedantic, but you could use Verilog to write a high-level model as you describe. The language certainly doesn't restrict you to its synthesizable subset.

That being said, it's generally easier and cheaper (good Verilog implementations aren't free) to use a general-purpose language for what you describe.


thanks for that


downvoted for saying thanks... what a toxic community




So... it's hardware transactional memory with explicit priorities on the transactions so the hardware can retry and/or complete the highest priority task first if there's a collision?
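
If so, a toy software analogue of that policy would look something like this (purely illustrative, not how the Swarm hardware is specified): on a collision, the earlier-timestamp / higher-priority task proceeds and the other aborts and retries.

    // Toy priority-ordered conflict resolution between two "transactions".
    // This illustrates the policy described above, not the actual Swarm design.
    #include <cstdint>
    #include <cstdio>

    struct Txn {
        uint64_t timestamp;   // lower timestamp = higher priority
        bool aborted;         // aborted transactions would be retried later
    };

    // On a read/write collision, the higher-priority txn wins; the other aborts.
    void resolve_conflict(Txn& a, Txn& b) {
        if (a.timestamp <= b.timestamp) b.aborted = true;
        else                            a.aborted = true;
    }

    int main() {
        Txn high{10, false}, low{42, false};
        resolve_conflict(high, low);
        std::printf("high aborted=%d, low aborted=%d\n", high.aborted, low.aborted);
        // prints: high aborted=0, low aborted=1
    }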



"... parallel execution of rather small tasks"

Feels like another hardware direction that will require a lot of compiler work to truly take advantage of. Anyone else getting Itanic vibes?


A lot of compiler people thought Itanic was dumb because it put things into hardware that the compiler could handle efficiently without hardware help. Swarm doesn't appear to make that mistake at all. It's unclear to me whether Swarm can be taken advantage of without explicit compiler directives, but that's a different issue.


> If the programmer wants to parallelize a function, he/she must designate it as such and assign some weighted value to it that corresponds to an execution priority.

It looks like most of that ability is meant to be handed straight to the programmer, not the compiler.
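
Purely as a hypothetical illustration of that kind of programmer-facing hand-off (the names and signatures below are invented, not the actual Swarm API from the paper), it might look something like a spawn call that takes an explicit priority/timestamp:

    // Hypothetical "programmer assigns a priority to a parallel task" API,
    // invented for illustration; not the real Swarm interface.
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <vector>

    struct Task {
        uint64_t timestamp;              // programmer-chosen execution priority
        std::function<void()> body;
    };

    struct EarlierFirst {
        bool operator()(const Task& a, const Task& b) const {
            return a.timestamp > b.timestamp;   // smallest timestamp runs first
        }
    };

    std::priority_queue<Task, std::vector<Task>, EarlierFirst> task_queue;

    // The programmer explicitly marks work as a task and gives it a priority.
    void spawn_task(uint64_t timestamp, std::function<void()> body) {
        task_queue.push({timestamp, std::move(body)});
    }

    int main() {
        spawn_task(2, [] { /* lower-priority work */ });
        spawn_task(1, [] { /* runs first */ });
        while (!task_queue.empty()) {
            task_queue.top().body();
            task_queue.pop();
        }
    }

On the real hardware the ordering and retry would happen in the chip, not in a software queue like this; the point is just that the priority comes from the programmer.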


I wonder if this could be handled more easily by making memory smarter. The timestamps they talk about are really just Lamport clocks, I think. If RAM used them on writes for atomic updates, then you could throw a fault (similar to a numerical overflow). Then compilers/code could just give devs a way to directly handle this. An OS could be designed to use this to reschedule threads, etc.

Seems like this would be a more incremental option than requiring a full new architecture to be designed.
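
A rough software sketch of that idea (invented names; there's no real ISA or OS interface behind this): each word carries a timestamp, and a write arriving with a stale timestamp trips a handler instead of silently clobbering the newer value.

    // Toy "memory word with a Lamport-style clock": a stale write triggers a
    // conflict handler, analogous to the fault described above. Entirely
    // illustrative; no real hardware or OS support is implied.
    #include <cstdint>
    #include <stdexcept>

    struct TimestampedWord {
        uint64_t value = 0;
        uint64_t clock = 0;   // timestamp of the last writer
    };

    // Succeeds for in-order writes; "faults" (throws) on a stale one so a
    // runtime or OS could catch it and reschedule/retry the offending thread.
    void timestamped_store(TimestampedWord& w, uint64_t value, uint64_t ts) {
        if (ts < w.clock)
            throw std::runtime_error("ordering conflict: stale write");
        w.value = value;
        w.clock = ts;
    }

    int main() {
        TimestampedWord w;
        timestamped_store(w, 7, /*ts=*/5);      // ok
        try {
            timestamped_store(w, 9, /*ts=*/3);  // older timestamp: conflict
        } catch (const std::runtime_error&) {
            // this is where the compiler/runtime hook or OS rescheduling would go
        }
    }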


Can this circuitry be implemented in an FPGA?



