Dataflow computers: Their history and future (2008) [pdf] (unt.edu)
47 points by luu on Oct 25, 2015 | 15 comments



Cool review! We have been looking at custom dataflow for floating-point computations -- http://dx.doi.org/10.1109/TCAD.2011.2173199 and http://dx.doi.org/10.1109/DFM.2014.21 -- but the big challenge has been handling dataflow graphs larger than the chip can hold. Caching doesn't really work well, because the dataflow wavefront proceeds along multiple separate paths.


It's from 2008; the author has quite a few more specific papers in the field as well (just remove the filename from the URL to see a list).


I really wonder whether dataflow machines could genuinely be feasible as commercial processors.

Software is moving more and more towards this type of programming model, from reactive frameworks to even HPC. OoO cores are already small dataflow processors, but their instruction window does not seem to scale. Is x86/ARM just too much of a semantic bottleneck? All the dependency information available to a compiler/runtime is lost when code-generating to x86 (same for ARM), yet a processor core needs to extract the same information again in hardware.


> but their instruction window does not seem to scale.

The issue with scaling instruction windows really has more to do with (i) the scheduling logic and (ii) the physical register file than it does with ISA semantics. In particular, the scheduler is usually quite expensive to scale, because it needs to compare each completed result against the not-yet-fulfilled sources on waiting instructions.
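
To make that concrete, here's a toy Python sketch (names hypothetical, structure only) of the wakeup half of a scheduler: every completing result tag is broadcast against every waiting entry, so the comparator count grows roughly with window size times completion width.

    from dataclasses import dataclass

    @dataclass
    class Entry:
        op: str
        srcs: set   # result tags this instruction still waits on

    def wakeup(window, completed_tags):
        # In hardware this loop is a CAM broadcast: every completing tag is
        # compared against every source slot of every waiting entry, every
        # cycle -- which is what makes big windows so expensive.
        ready = []
        for entry in window:
            entry.srcs -= completed_tags
            if not entry.srcs:          # all operands captured -> can issue
                ready.append(entry)
        return ready

    window = [Entry("add", {"p3", "p7"}), Entry("mul", {"p7", "p9"})]
    print(wakeup(window, {"p3", "p7"}))  # only the add becomes ready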

> All the dependency information available to a compiler/runtime is lost when code-generating to x86 (same for ARM), yet a processor core needs to extract the same information again in hardware.

I think this is a bit different from what you describe above w.r.t. software dataflow. When you talk about reactive programming in software, the dependences are things like event streams or promises that become fulfilled. That's very different from (say) an individual 32-bit register value flowing from producer instruction to consumer instruction, and the compiler usually only knows about the latter type of dependence. Perhaps you're envisioning a different type of reactive/dataflow programming framework, though, like chaining together arithmetic transformations or something similarly lower-level?

---

All of that said, there has been some interesting work in compiling conventional software down to a more "dataflow-like" architecture. See e.g. the TRIPS project at UT Austin [1] from about 10 years ago -- the idea there was to build a fabric of execution units connected by a mesh interconnect on the chip, and then stream data through.

[1] https://www.cs.utexas.edu/~trips/


I was indeed thinking more of fine-grained dataflow or event-driven tasks, in the style of VAL/SISAL or, very recently, OCR (https://xstack.exascale-tech.com/wiki/index.php/Main_Page).

I heard there were some fundamental problems with TRIPS, but never got any detailed explanation. Any idea what they were? Is fine-grained dataflow still too much overhead to do in hardware, or was there something else?

There were some dataflow-critique papers back in the '80s and '90s, but most of their points seem moot these days: low transistor occupancy (compared to the 90%+ of simple RISC microprocessors) and the high cost of the associative memory needed for the token matchers. These days we are working with billions of transistors instead of tens of thousands, and we commonly ship processors with megabytes of associative memory in the form of SRAM cache.
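
For readers who haven't seen a token matcher, here is a minimal Python sketch of the structure those critiques were pricing out (names and details hypothetical): each token carries a tag naming its destination instruction, waits in an associative store until its partner arrives, and the matched pair fires.

    waiting = {}   # the associative store: tag -> first-arrived operand

    def arrive(tag, value, fire):
        # A token either finds its partner already waiting (match -> fire)
        # or parks itself until the partner shows up. The expensive part in
        # 80s hardware was exactly this associative lookup.
        if tag in waiting:
            fire(tag, waiting.pop(tag), value)
        else:
            waiting[tag] = value

    arrive("add1", 2, print)                             # parks, no match yet
    arrive("add1", 3, lambda t, a, b: print(t, a + b))   # matches: "add1 5"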


Cool, I have some more reading to do -- thanks for that link!

My understanding of TRIPS is that compiler complexity is an issue -- it depends on being able to chunk the program into hyperblocks with fairly high control locality (i.e., execution stays in one hyperblock for a while), because the cost of transitioning between hyperblocks is higher than a simple branch on a conventional superscalar. In the best case this can be great (32 ALUs' worth of ILP, IIRC), but in the worst case (e.g., a large code footprint) it can degrade terribly.

IIRC there are some more practical issues too -- e.g., the ISA is specific to the microarchitecture (because it explicitly maps to ALUs in the fabric), so binary compatibility is an issue.

All that said, still a really interesting idea, I think.


Morrison has been building and promoting the software side of this (flow-based programming) for a while:

http://www.jpaulmorrison.com/fbp/

The hardware has been done repeatedly for special-purpose applications. Maxeler has a general-purpose one used as an accelerator.

https://www.maxeler.com/technology/dataflow-computing/

Not sure about pure dataflow by itself for general-purpose systems. The closest things I've seen are VLIW-like modifications of regular processors. Not doing so well, though.

"All the dependency information available to a compiler/runtime is lost when code-generating to x86 (same for ARM), yet a processor core needs to extract the same information again in hardware."

You could embed it into the code or make it implicit in the ISA.


>> I really wonder whether dataflow machines could genuinely be feasible as commercial processors.

Most workloads that fit dataflow diagrams well already run pretty fast on modern processors and even GPUs. That's not to say custom hardware wouldn't run them faster, but those programs are already some of the fastest we've got. Why optimize hardware for stuff that already runs well?


It works well if it is code-generated down to efficient sequential code. It does not work well if you actually perform fine-grained dataflow execution, where all individual operations can fire as soon as their inputs become available. The former works well on our multicore processors, but the latter is essential to expose and exploit the large amount of parallelism (billion-way parallelism) needed to scale up. There is quite a bit of overhead associated with managing (storing/matching/synchronizing) data tokens, which is best handled in hardware. For instance, the Cray XMT (Threadstorm processor) could do a thread context switch in one clock cycle. How many cycles is a thread switch on a modern Intel core these days? Thousands?
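
To make "operations fire when their inputs become available" concrete, here's a toy availability-driven evaluator in Python (purely illustrative; the dict lookups hand-wave away exactly the token store/match/sync overhead mentioned above):

    import operator

    # node -> (operation, input nodes); "a" and "b" are the initial tokens
    graph = {
        "x": (operator.add, ("a", "b")),
        "y": (operator.mul, ("x", "b")),
    }
    values = {"a": 2, "b": 3}

    # Fire any node whose inputs are all available, in no particular order,
    # until nothing new can fire.
    fired = True
    while fired:
        fired = False
        for node, (op, ins) in graph.items():
            if node not in values and all(i in values for i in ins):
                values[node] = op(*(values[i] for i in ins))
                fired = True

    print(values["y"])   # (2 + 3) * 3 = 15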


> extract the same information again in hardware

It isn't necessarily the same information. OoO engines happily speculate through branches without waiting for a test value to become available, forward values based on exact addresses rather than a static approximation thereof, etc. There are real advantages to doing things dynamically.
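
For instance, here is a rough sketch of store-to-load forwarding, one of those dynamic tricks (structure illustrative only): the core compares a load's runtime address against older in-flight stores, which a compiler can only approximate with static alias analysis.

    store_queue = []   # in-flight (address, value) pairs, oldest first

    def execute_load(addr, memory):
        # Search youngest-to-oldest for an exact runtime address match --
        # impossible to do statically when addresses are data-dependent.
        for st_addr, st_val in reversed(store_queue):
            if st_addr == addr:
                return st_val          # forward the value, skip memory
        return memory[addr]

    memory = {0x40: 1}
    store_queue.append((0x40, 99))       # an older store still in flight
    print(execute_load(0x40, memory))    # 99 -- forwarded, not read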


So does dynamic compilation, such as LuaJIT and PyPy ;)


Pretty much all modern commercial processors are essentially dataflow machines. Any OoO core is dataflow inside.


Write purely functional programs in continuation-passing style and you already have something that looks very close to dynamic-graph dataflow programming, no?


If you mean a lazy functional program, then the difference is that those are generally demand-driven. You execute a thunk, and it executes the thunks that it needs. In most dataflow models, computation is availability-driven, in that thunks which can run will do so, and when they complete they enable other thunks which needed those values to run in turn.
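
A tiny sketch of the demand-driven side, for contrast with the availability-driven loop upthread (toy Python, memoizing thunks): nothing runs until the final result is forced, and forcing pulls in exactly the thunks it needs.

    def thunk(f, *deps):
        # Demand-driven: evaluate only when forced, then memoize the result.
        cache = []
        def force():
            if not cache:
                cache.append(f(*(d() for d in deps)))
            return cache[0]
        return force

    a = thunk(lambda: 2)
    b = thunk(lambda: 3)
    s = thunk(lambda x, y: x + y, a, b)   # no work has happened yet
    print(s())                            # forcing s demands a and b: 5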


You're right, I meant 'strictly' rather than 'purely' there (hate that bit of jargon, btw). Unfortunately 'continuation-passing style' was probably the wrong choice of words too: I meant 1) the restricted language with only tail calls, no evaluable arguments, etc., not 2) the method (threading the continuation through the code, etc.) used to encode arbitrary functional programs into 1). Naturally 2) produces programs which run in 1), but not usually idiomatic programs in 1); on the other hand, idiomatic/arbitrary programs in 1) may be side-effect-free, but they're not usually really functional in the sense of equational reasoning/term-rewriting.

AFAIK you can encode either the lazy or eager variants of functional programs into 1), while 1) itself is effectively "super-eager" as it just doesn't permit any arguments that need evaluation except where explicitly delayed inside a lambda, right?
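
To illustrate what 1) looks like (a toy in Python, which of course doesn't actually eliminate tail calls): every call is a tail call and every argument is a value or a primitive operation on values, so each step just hands its results forward, token-style.

    def fact(i, acc, k):
        # Only tail calls; no argument is itself a function call that
        # needs evaluating before this one can proceed.
        if i == 0:
            return k(acc)                 # hand the finished value onward
        return fact(i - 1, acc * i, k)

    print(fact(5, 1, lambda r: r))        # 120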



