Forth Inventor Chuck Moore's $20 144 core CPU now in full production (greenarraychips.com)
271 points by csmeder on Nov 22, 2011 | 93 comments



Well, if you ever wanted to be 'out there', Chuck's processor would be a great place to start. Chuck Moore invented Forth and has been building machines that run Forth efficiently for years. Think of it as a Turing machine that can do useful work. He has pushed the edge of computation per watt for a long time.

That being said, I've heard him talk about these chips for years, and it's great to see them finally see the light of day. If you're familiar with Transputer technology, this is a better take on that; if you're familiar with FPGAs, you can think of it as an FPGA with processors instead of CLBs.

Things that it can do are similar to what Intel is doing with Larrabee or some of the CUDA stuff that nVidia has done. It doesn't have a GDDR3 interface to a GB of memory so you can't custom build your own GPU, but you could do your own PhysX type engine with it.

It also makes a helluva differential cryptanalysis tool, or a signals analysis tool in general.


CLB = configurable logic blocks


I have done my Master's work with the XCore (xmos.com), which is a direct descendant of the Transputer. It's programmed in a C dialect with CSP constructs called XC.

I found it very easy to write XC correctly. I think that the CSP/transputer idea is really the best way for most parallel programming to happen. MPI and pthreads allow more flexibility, but in most (say, 80%) of cases, that's not needed for your application.


I believe the Go model is based on CSP.

A cursory look at Go reminded me of some time I spent writing OCCAM code.
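
For anyone who hasn't seen it, a minimal Go sketch of the CSP style that Go and Occam share -- an unbuffered channel behaves like an Occam/XC channel in that a send blocks until the receiver is ready:

    package main

    import "fmt"

    // Two processes connected by an unbuffered channel: a pure rendezvous,
    // much like an Occam or XC channel.
    func producer(out chan<- int) {
        for i := 0; i < 5; i++ {
            out <- i * i // blocks until the consumer is ready to receive
        }
        close(out)
    }

    func consumer(in <-chan int, done chan<- struct{}) {
        for v := range in {
            fmt.Println("got", v)
        }
        close(done)
    }

    func main() {
        ch := make(chan int) // unbuffered: sends and receives rendezvous
        done := make(chan struct{})
        go producer(ch)
        go consumer(ch, done)
        <-done
    }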


It might make a good bitcoin mining platform?


Not really; it's underpowered, and mining lends itself better to vector-oriented processors (like GPUs) than to CPUs.


At $20 you'd only need a single decade to make up the costs :)


Not being much of a hardware geek, I'm having a hard time evaluating how fast each core is.

From the site: "With instruction times as low as 1400 picoseconds and consuming as little as 7 picojoules of energy, each of the 144 computers can do its work with unprecedented speed for a microcontroller and yet at unprecedentedly low energy cost, transitioning between running and suspended states in gate delay times. When suspended, each of the computers uses less than 100 nanowatts."

How does this instruction time compare to other modern processors?

It sounds to me like some of the benefit to be had here may be from the low amount of power necessary per core. Power and associated cooling are a MAJOR source of cost for datacenters, so if this really does offer significantly lower power consumption (again, I don't know how much power the alternatives use), then it could have a big impact on the cost of commodity computation power.


If 1400 picoseconds is the time it takes for a clock cycle (and therefore the minimum instruction time for 1-cycle instructions), that'd be about 700 MHz, which actually seems pretty high compared to what I would've guessed.


As far as I know, the cores aren't clocked. Instead Chuck designed his own transistors using his own OKAD II VLSI tools to be efficient on the CMOS process node with switching speeds being dictated by the transistor type and interconnect electrical properties, then designed the cores so that they're only switching those transistors when they do actual work.


From the site's Green Arrays Architecture paper [1]:

Our architecture explicitly omits a clock, saving energy and time among other benefits.

Also, they quoted instruction execution time as being as low as 1400 picoseconds, so 700MHz equivalence is probably the best case, not the average case.

Edit: They also say in the architecture paper that their goal is one billion operations per second, or 1GHz.

[1] http://www.greenarraychips.com/home/documents/greg/PB002-100...


Ooh, awesome. Clockless computing seems like a really nifty idea, but it's difficult. I imagine it might go mainstream when Moore's law finally runs out.


"difficult" does not even begin to encompass the half of it. It is really cool in theory, but in reality a ridiculously difficult challenge that existing tools are in no way whatsoever up for.


Take a look at the Balsa design system: http://apt.cs.man.ac.uk/projects/tools/balsa/

My colleague uses it for research purposes and he said that it is pretty mature.


I suspect that if the only issue were translating architecture to logic, we'd be doing it already.


There are other asynchronous CPUs, including implementations of popular architectures like MIPS and ARM: http://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchrono...


instruction times as low as 1400 picoseconds

As a random comparison, that's 1.4 nanoseconds per cycle. The PIC24H 16-bit PIC microcontroller from Microchip runs at 40 MHz, making it 12.5 nanoseconds per cycle. Now, I don't know enough about Chuck Moore's processors and haven't read the docs yet, so I don't know, for example, how many cycles a typical instruction would take, or how much work it takes to synchronise cores, but assuming 1 cycle per instruction, that would make these 144-core $20 processors about 8.9 times faster per core. I know it's not realistic to assume linear performance speedup from additional cores, but if we do 144 things at once, that makes them roughly 1285 times faster than the PIC24H microcontroller, yet these processors only cost 4 to 6 times as much as a PIC24H microcontroller (depending on the amount of RAM).

Note that the PIC24H microcontrollers are the more expensive, high-performance (but not the highest) model of Microchip's 16-bit PIC microcontrollers.
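
A quick back-of-the-envelope check of those ratios in Go (instruction-time arithmetic only, not a benchmark; it assumes one instruction per cycle on both parts):

    package main

    import "fmt"

    func main() {
        ga144 := 1.4e-9   // seconds per instruction, GA144 best case
        pic24h := 12.5e-9 // seconds per cycle, PIC24H at 40 MHz

        perCore := pic24h / ga144
        fmt.Printf("per-core speedup: %.1fx\n", perCore)     // ~8.9x
        fmt.Printf("all 144 cores:    %.0fx\n", perCore*144) // ~1286x (the comment above rounds to 1285)
    }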


Glad to see this thread. The GA144A12 does better than you'd expect for 18-bit ALUs doing 32-bit circular shifts and adds, but that costs enough extra instructions that, for this particular algorithm, at any Bitcoin-useful combination of throughput/energy/cost we can't compete with the genuine 32-bit ALUs of the bigger ATI GPUs. Ya can't be perfect for all problems all the time :) Nevertheless we'll eventually be posting an app note on SHA256 as an illustration of techniques in pipelining.

The $20 price is for small quantities. Standard exponential decay curves apply for production quantities; we want to see our chips in people's products and are priced to encourage that.

As for 20-somethings: nobody in our company gets a paycheck (yet), so someone has to be willing to work for nothing. But if you have a practical idea for an app note and want to work with us to get it done and published, please email greg at greenarraychips dot com and let's discuss it. Thanks for your interest, folks - Greg Bailey, GreenArrays, Inc.
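
To make the ALU-width point concrete, here's a hedged Go sketch (not GreenArrays' arrayForth, and the function names are made up) of what a 32-bit rotate and add cost when a value has to live as two 16-bit halves, each small enough to sit in an 18-bit word:

    package main

    import "fmt"

    // Illustration only: SHA-256 wants 32-bit rotates and adds, but a narrow
    // ALU has to juggle each value as two 16-bit halves. Every line below is
    // roughly extra work that a genuine 32-bit ALU gets in one instruction.

    // rotr32 right-rotates the 32-bit value (hi<<16 | lo) by r bits, 0 < r < 16,
    // operating only on the 16-bit halves.
    func rotr32(hi, lo uint32, r uint) (uint32, uint32) {
        mask := uint32(1)<<r - 1
        newLo := (lo >> r) | ((hi & mask) << (16 - r))
        newHi := (hi >> r) | ((lo & mask) << (16 - r))
        return newHi, newLo
    }

    // add32 adds two 32-bit values held as 16-bit halves, propagating the carry.
    func add32(ahi, alo, bhi, blo uint32) (uint32, uint32) {
        lo := alo + blo
        hi := (ahi + bhi + lo>>16) & 0xFFFF
        return hi, lo & 0xFFFF
    }

    func main() {
        x := uint32(0x12345678)
        hi, lo := rotr32(x>>16, x&0xFFFF, 7)
        fmt.Printf("%08x\n", hi<<16|lo) // f02468ac, i.e. x rotated right by 7
    }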


I certainly should voice my opinion here.

I've done analysis of GA144 before: http://news.ycombinator.com/item?id=1810641

Most of Chuck Moore's designs (I reviewed several, starting from the M17) can be described by a quote from The Devil Wears Prada: "the same girl- stylish, slender, of course... worships the magazine. But so often, they turn out to be- I don't know- disappointing and, um... stupid". Chuck Moore's designs are all slick, slender, stylish, and worship Forth, but they turn out to be disappointing and stupid in the end. Often the only beneficiary is Chuck Moore himself. You just cannot apply his experience to other places in the world of computing.

Let's see what we have here in GA144.

The memory inside each core is way too small for general-purpose programs, even if you split your program into 144 parts and spread them across cores: 64 words of 18-bit RAM per core, i.e. about 144 bytes, or roughly 20 KB across all 144 cores. The same goes for the (program) ROM, and you have to factor communication code into that budget; communication cost eats into RAM as well.

They offer no compiler for a high-level language like C. You have to learn a specific dialect of Forth and a bizarre (albeit small) assembly language.

The only benefit to the general population from this affair is the demonstrated relative ease of designing asynchronous hardware.

http://en.wikipedia.org/wiki/Asynchronous_system


You are focusing on classic applications only.

The GA144 is so different from the classic way of computing that it requires new approaches to development. One very interesting feature of the GA144 is that all 144 cores can share instructions over I/O lines. That means every core can send instructions to its neighbors, which execute them directly without conversion.

I/O is fast enough. I guess it should be possible (someday) to have an external interface to SRAM which circumvents the low-memory problem.

> They offer no compiler for a high-level language like C.

That's right, and that's the real weak spot of the GA144. Almost every microcontroller board today comes with C or something like it. Mr. Moore loves Forth, but I doubt there are many developers out there who would like to be forced to learn Forth just for this single platform. I know Forth; it was one of the first languages I ever learned. It is perfectly suitable for embedded systems, but you have to learn a lot to master it.


I prototyped a dynamic dataflow machine which (in theory) could be scaled to hundreds of cores (corelets: something very small that doesn't even have a jump instruction). In my experiments, readying information to be sent accounts for a hefty 30%+ of the code.

http://thesz.mskhug.ru/svn/hhdl/previous/HSDF/CoreletTest.hs

The link above contains a simple "Hello, world!" program in five "big instructions", which contain 21 corelet instructions in total. 8 of those 21 instructions are send and front-advancing instructions; their only purpose is to establish communication between program parts. My machine sends pointers to "big instructions" up to 32 bytes long (up to 32 instructions if you're lucky), while the GA144 can send only 4 instructions max.

30% of 4 instructions is about 1 instruction. Another of those 4 is a loop or jump or something like that. So you have two instructions left to perform program logic, and that's the best case.

So again I express my dislike of the GA144 as a computing machine, and again my gratitude to Chuck Moore for proving that clockless design works.


> In my experiments, readying information to be sent accounts for a hefty 30%+ of the code.

Unfortunately I don't have time to dive into your design but AFAIK the GA144 doesn't need 30% preparation code because every instruction can be executed immediately by neighbor nodes.

That means (correct me if I am wrong) that if core X has to evaluate a Forth function of, say, five arguments, then it could pass all five arguments to its neighbors (without any preparation) by sending them the code addresses of the arguments, wait until they have finished, and then use their results to compute the function result. These neighbor nodes could themselves evaluate (or delegate) subexpressions to other (free) nodes, and so on.

This form of parallelization would require efficient shared-memory access. That problem needs to be solved because, AFAIK, I/O ports are accessible from the edge cores only; it doesn't make much sense to transport every piece of shared data through several columns or rows of cores.
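
As a thought experiment, here's a toy Go sketch of that delegation pattern -- "neighbour" nodes receiving work over channels and returning results. The names and structure are invented for illustration; it says nothing about how the GA144 comm ports are actually programmed:

    package main

    import "fmt"

    type work struct {
        arg    int
        result chan int
    }

    // neighbour evaluates whatever subexpression it is handed (here: squaring).
    func neighbour(jobs chan work) {
        for j := range jobs {
            j.result <- j.arg * j.arg
        }
    }

    func main() {
        jobs := make(chan work) // unbuffered: a send waits for a ready node
        for i := 0; i < 4; i++ {
            go neighbour(jobs)
        }

        // "Core X" delegates five arguments, then combines the results.
        args := []int{1, 2, 3, 4, 5}
        results := make([]chan int, len(args))
        for i, a := range args {
            results[i] = make(chan int, 1)
            jobs <- work{arg: a, result: results[i]}
        }
        sum := 0
        for _, r := range results {
            sum += <-r
        }
        fmt.Println(sum) // 1+4+9+16+25 = 55
        close(jobs)
    }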


I think you're wrong about passing many arguments with a command sent to another core.

You can send a word, i.e. a "big command" composed of four MISC commands. One of the MISC commands in a "big command" can retrieve data from another core.

So most of the time you will wait to send a command or to receive some data.


In past Forth threads I've complained that the Forth model for a computer seems just too at odds with how modern CPUs actually work for me to want to learn it. Well, I look at this thing and it's like the soft draft of the future slipping underneath the door, whispering that maybe I should learn Forth after all.


> maybe I should learn Forth after all

http://factorcode.org/

Make it easy on yourself.


That probably makes it harder on yourself, as the recommended route for learning Factor is to read a Forth book and then the Factor docs (unless something has changed and a book or big tutorial has been written).

With that said, Factor is really cool.


It's worth noting that a number of popular VMs, e.g. the JVM, have Forth-like stack-based instruction sets.


OTOH, LLVM is, if anything, a register machine, so as to better map to the register machines we're implementing in hardware right now.


In practice, the difference between LLVM and a stack machine like CLR or JVM is quite small.

LLVM SSA, modulo phi nodes, encodes expressions as DAGs in a completely straightforward sense: every operation names its arguments, and those names serve as unique references to a sub-DAG, since they cannot be reassigned. So given the final operation, you can follow the arguments recursively all the way through, completely transparently. Stack machines like the CLR and JVM encode expressions as trees serialized in a post-order - and to the extent that they use 'dup', they also encode expressions as DAGs.

A symbolic interpretation of a CLR or JVM instruction flow, or LLVM SSA instruction flow, can reconstruct the source expression DAG/tree; at which point you can re-encode it using the other approach.

And producing output targeting LLVM is even easier than it appears, because you don't need to worry about SSA yourself; just allocate stack locals (with alloca) and use the mem2reg pass to turn it into valid SSA.
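
A toy illustration of the two encodings, written in Go rather than real JVM bytecode or LLVM IR (the instruction names are made up):

    package main

    import "fmt"

    // evalStack runs a post-order instruction stream on an operand stack, the
    // way a JVM/CLR-style machine encodes an expression tree; 'dup' turns the
    // serialized tree into a DAG, as described above.
    func evalStack(code []string, env map[string]int) int {
        var stack []int
        pop := func() int {
            v := stack[len(stack)-1]
            stack = stack[:len(stack)-1]
            return v
        }
        for _, op := range code {
            switch op {
            case "add":
                b, a := pop(), pop()
                stack = append(stack, a+b)
            case "mul":
                b, a := pop(), pop()
                stack = append(stack, a*b)
            case "dup":
                stack = append(stack, stack[len(stack)-1])
            default: // load a named value
                stack = append(stack, env[op])
            }
        }
        return pop()
    }

    func main() {
        env := map[string]int{"a": 3, "b": 4}
        // (a+b)*(a+b) as a post-order stream:
        fmt.Println(evalStack([]string{"a", "b", "add", "dup", "mul"}, env))
        // The SSA view of the same DAG: t1 = a + b ; t2 = t1 * t1
        t1 := env["a"] + env["b"]
        t2 := t1 * t1
        fmt.Println(t2)
    }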


Yes, LLVM is a register machine and CPUs are register machines. My point really was that one shouldn't avoid stack-based languages or virtual machines because CPUs don't work that way, since some of the most used and most popular virtual machines are stack machines and get by just fine. (AFAIK both JVM and CLR are stack-based)


The big difference between FORTH and other stack machines (e.g., the JVM) is the runtime.

FORTH doesn't have one to speak of. No dynamic memory management, no garbage collection. It's about as bare-bones as you can get.

PostScript, the JVM, the CLR and so forth are all backed by runtimes with powerful functionality.


Actually, LLVM isn't a virtual machine, it's a compiler intermediate representation.


And when JVM bytecodes are compiled to machine code by the hotspot optimizer, what is the JVM?


JVM bytecode is runnable on the JVM. LLVM bytecode is runnable on what?


Love the guy and his passion for his chips, but this 144-computer chip looks like a solution in search of a problem: what is it trying to solve? How's the inter-core communication handled? On the other hand, creating weird chips like these, just because you can, is awesome and stimulating to a hacker's mind.


Some of the industry's greatest advances began as a solution in search of a problem, such as the microprocessor (which its inventor, Intel, didn't think much of compared with memory chips, where the real money was). Most fail, of course.

If someone can make many-core, in this form, do something useful that can't be done elsewhere (unlike DSP and GPUs, which are already many-core), it will fundamentally upend computing.


> greatest advances began as a solution in search of a problem

.. and the laser(!)


"Between nodes there are 18 bit, bidirectional parallel buses called Comm Ports. The handshaking on these ports is trivially simple and the latency and jitter are low, on the order of gate delays, tens of picoseconds. When a node is reading or writing a Comm Port, it proceeds with no delay if its partner is ready to transfer data. If the partner is not yet ready, the node waits until it is. While waiting, a node’s power consumption drops, within picoseconds, to nothing but leakage. When the partner becomes ready the idle node resumes its work within picoseconds."

http://www.greenarraychips.com/home/documents/greg/WP002-100...


It appears that CM's intent with the F18A core is to create an extremely low power processor that scales the number of cores to the complexity of the problem being addressed -- look at the reference sheets about what the requirements are to maintain core states.

The design makes more sense when you stop thinking of it as an ARM or AVR competitor and think of it more as a replacement for FPGAs.


> what is it trying to solve?

Don't picture each core running C programs. Think of the whole chip replacing something you would use an FPGA for, but more flexible, capable of adjusting its own behavior according to the environment.


Not sure if it's a legitimate "problem" or not, but how about the availability of single chips, or 10-packs?

Assuming someone wanted to produce a prototype board but didn't want to purchase thousands of CPU chips, are there many CPU chip makers who offer low quantities like this?


You can buy a 10-pack of evaluation chips from GreenArrays' web site, or a single chip from Schmartboard's website http://www.schmartboard.com/index.asp?page=products_csp&...


I think this sounds ridiculously cool and at $20 I could probably afford to play around with it. But I have no knowledge of hardware, just programming. To me chips just look like thin green rectangular prisms. How would I get it to actually, you know, do stuff?


You would probably want to buy the Evaluation Board which lets you control the machine using USB ports and an ASCII console.

The thing is, it's just a computer. The real value of this is hooking it to hardware to actually do something. So I suggest starting here instead: http://www.greenarraychips.com/home/documents/budget.html and learning how to work with hardware as well.


you can do this for about $30: http://www.greenarraychips.com/home/documents/budget.html

"suggestion for a complete system"


Call up Woz, I hear he was good at that stuff.


I think the only thing Chuck is missing is a few 20-something hackers bent on solving an 'impossible' problem with his chips.


That's why I'm posting it here.

Some possible ideas

====================

- Sound processing (think 144 cores performing Fourier transforms)

- Pico satellites (this chip has a surprisingly low power requirement).

- Wireless communication (small power requirement means small batteries)

- Computer Vision processing. Imagine toys or tools that can process visual information faster than an Xbox.

- Basically anything that can be done in parallel, that is suited for small size and low power, but doesn't require much on chip memory.


Had a rough look at its instruction set and didn't see anything resembling DSP instructions, so I'm not sure it would be great for all those uses.

On the other hand, Picochip sells a 200-300 core DSP that is being used in the wireless industry.


From this PDF http://www.greenarraychips.com/home/documents/greg/PB001-100... GreenArrays seems to think it would support DSP

"SUITABILITY: The GA144 is designed to support the largest and most demanding computing challenges that can be addressed with a modest sized die in a relatively inexpensive and easy to use package while still using well less than 650 mW in most practical applications. The geometry allows for generous numbers of parallel paths and/or pipeline stages, or for complex flowgraphs in control, simulation, or DSP applications. Clusters of nodes devoted to functions such as cryptographic algorithms are easily placed in the large array, and the cluster needed to control external memory and run a high level language from it is well out of the way but has good surface area for interaction with other functions. Use it also as a universal protoytping platform for applications destined to run on our smaller chips. "


There is a huge difference between supporting DSP and being good at DSP. My $0.50 MSP430, clocked at 32kHz, "supports" DSP.


At the very least, I'd expect instructions for things like saturated addition.
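
For anyone wondering why a dedicated instruction matters: saturating addition is trivial in software, but it costs a widen-compare-clamp sequence per operation that a DSP does in one instruction. A minimal sketch in Go (illustrative, not tied to any particular ISA):

    package main

    import "fmt"

    // satAdd16 adds two signed 16-bit values and clamps the result to the
    // int16 range instead of wrapping around.
    func satAdd16(a, b int16) int16 {
        s := int32(a) + int32(b)
        switch {
        case s > 32767:
            return 32767
        case s < -32768:
            return -32768
        }
        return int16(s)
    }

    func main() {
        fmt.Println(satAdd16(30000, 10000)) // 32767, not the wrapped -25536
    }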


I really hope these have a good application for DSP. I can imagine the sweet homebrew synths that people will be coming up with.


Video-games?

Can this replace a dedicated physics processor? What about latency?


GNU radio


Let me get this straight: it's basically a microcontroller times 144?

Can anyone explain to a nonhardware geek?


It's the future, representing a convergence between Field Programmable Gate Arrays (FPGAs) and the microprocessor.

Gate arrays are vast arrays of logic gates, which can be wired together in almost arbitrary patterns by a sea of "fuses", typically controlled by state stored in on-board SRAM. They are real time and blindingly fast due to their massive parallelism, achieving supercomputer type speeds when applied to the right type of problem and programmed well. They are more difficult to program than a microprocessor. One way of looking at an FPGA is as an array of tens of millions of very simple computing engines.

Over the years, the number of transistors on an FPGA has been rocketing up. Generally these transistors have been put to use by providing more and more simple logic blocks. We are now to the point where we have almost more gates than we know what to do with, and the chip is being dominated by interconnects. This has seen a move towards including a limited number of elaborate hard wired blocks, such as CPUs and multipliers, in addition to the array of logic.

The logical evolution is to stop simply providing more blocks and instead make each block more complex as transistor counts go up. Eventually we will see arrays of tens of millions of microprocessors, rather than tens of millions of logic blocks. There will be no distinction between a multicore CPU and an FPGA.

It's worth noting that the first Xilinx FPGAs, thirty years ago, provided arrays of around 144 logic blocks, similar to the processor count in this chip. Extrapolate 30 years and we will have an array of 10 million microprocessors.


>We are now to the point where we have almost more gates than we know what to do with,

We're hitting the limits of the von Neumann architecture (and its underlying mathematics of recursive functions) as a mental framework for our thinking about computing.


You've piqued my curiosity; could you elaborate please?


I shall try, but this stuff gets really hairy, really really fast.

Von Neumann architecture is what almost all computers use today: you have (very roughly) an ALU (arithmetic logic unit) hooked up to a memory bank which stores both program data and the instructions the program consists of.

Now you can add a couple of cores to that, but you pretty soon start to run into problems -- threads which try to access the same data, race conditions, etc.

But the biggest problem is that under the von Neumann architecture all memory is shared, so any thread can access any other thread's memory. This puts rather drastic limits on how much benefit you can get from new cores.

You also run into issues like the limited speed at which the main memory banks can be accessed, etc. It's possible to compensate for this to some degree with caches, but they have their own problems.

But the fundamental problem is that these architectures are from, and of, an era where the clock speed kept increasing and increasing.

Today we have a situation where transistors get smaller and smaller. But if you try to use these new transistors to make a traditional CPU, all it gets you is a really small chip.

What we need is an architecture inspired by something else. Personally I am kinda hoping it will be some form of message sending -- you run a lot of small (green) threads which each have their own memory, as well as the ability to send and receive packages of information to/from the other cores on the CPU.

You can have access to a (comparatively large but slower) shared memory bank too (like RAM today).

I like it because it works well with how you would design a cluster of computers (where you cannot afford the illusion of shared memory) and with how computation is organized under the actor model (which I prefer to threads), and it would be possible to implement without that many new changes in the CPU.
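
A minimal Go sketch of that message-passing picture (names invented for illustration): each "core" is a goroutine that owns its local memory and is only reachable through messages.

    package main

    import "fmt"

    type msg struct {
        value int
        reply chan int
    }

    // core keeps a running total in local memory and answers requests.
    // Nothing else can see or race on its state; all interaction is messages.
    func core(inbox chan msg) {
        local := 0
        for m := range inbox {
            local += m.value
            m.reply <- local
        }
    }

    func main() {
        inbox := make(chan msg)
        go core(inbox)

        reply := make(chan int)
        for i := 1; i <= 3; i++ {
            inbox <- msg{value: i, reply: reply}
            fmt.Println("running total:", <-reply) // 1, 3, 6
        }
        close(inbox)
    }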


That wasn't a "try" - that was a success. Thank you!

If I may attempt a paraphrase: CPU caches stop being a bandage for slow access to RAM and become a valuable first class citizen for each core of the CPU when coupled with the actor model.

Did I understand you correctly? Again, thank you.


Well you could do that today if you as a programmer could manually tell the system "please load addr x, y and z into the cache".

But if the cores of the CPU start to communicate with the actor model, then you wouldn't be using the memory close to the cores as a cache but as a storage area for messages that haven't been sent/processed yet, as well as possibly for thread-local storage.


Unfortunately for the FPGA guys, it seems like in general the closer they get to CPUs the less compelling their product becomes.


I hesitate to divide the world into "FPGA guys" and "non-FPGA guys". It's a continuum of computing power; CPU<->DSP<->FPGA, and one moves up and down it as the task varies and technology changes.

If anything defines an "FPGA guy", it's choice of language. Historically FPGAs have been programmed with an HDL, such as VHDL or Verilog, because these languages are concurrent, meaning they handle parallel systems.

My prediction is that as mainstream multicore CPUs develop we will see the rise of concurrent versions of mainstream languages, which will supplant HDLs. Just as DSPs are now programmed in (almost) vanilla C/C++/..., we will see FPGAs being programmed using the same languages as micros and DSPs. It will then be possible to write code and (almost) seamlessly run it on any platform, depending on how fast it needs to run.


As an FPGA guy, I want to ask this - what software programming languages, models, etc. exist to support the description of fundamentally concurrent processing, whether task-parallel, data-parallel, hybrid, or "other" (whatever that may be) that will allow for the supplanting of HDLs? I know that there's been long-standing efforts to do C-to-HDL but to my knowledge the successes of this approach have been limited to relatively constrained solution spaces.

I guess my feeling is that there's a fundamental difference between the sequential execution inherent in something like C and the ability to describe concurrency that is fundamental to HDLs, and it's going to be a hard bridge to build.

Side question: are "SW guys" satisfied with the tools/languages available for multi-core development? I've always thought things like OpenMP and CUDA were steps in the right direction but still hacks. It seems to me like there's still problems to solve there as well.


There are interesting things being done with Haskell: http://www.dougandjean.com/hngn.html http://raintown.org/lava/

For example, I've seen a meaningful subset of Haskell in which the compiler would only accept programs which would provably terminate. The compiler output could then be sent down the FPGA synthesis toolchain.

Side question: are "SW guys" satisfied with the tools/languages available for multi-core development?

Yes and no. The new C++11 memory concurrency model and the atomics and multithreading library support are a huge step in the right direction. But it's nothing particularly slick and it still feels clunky. The fancy tools still seem to be not-perfectly portable and open source.

I've always thought things like OpenMP and CUDA were steps in the right direction but still hacks.

I'm guilty of not having used OpenMP. I should try it.

CUDA seems well-designed and has a decently well-supported toolchain. But it's a vendor-specific technology. OpenCL seems to be almost as fast and supported by more vendors.

It seems to me like there's still problems to solve there as well.

I remember as a kid hearing my father talk about the computer they were building at the university for research on high-level parallelization. That was the Illiac IV, completed in 1976.


I had some fun with MyHDL several months ago, which lets you compile Python code (using MyHDL) into Verilog (or VHDL). http://www.myhdl.org/doku.php (Some flip flop examples: http://www.myhdl.org/doku.php/cookbook:ff )

There's nothing inherently special about Verilog/VHDL as concurrent languages--I imagine a very slick-looking Clojure module could be built as well. MyHDL let me reason about the program much more easily than a Verilog equivalent, particularly around synchronous execution, blocking/non-blocking statements and integers ( http://www.myhdl.org/doku.php/why ). Also, (mostly) painless simulation and tests!

Though I would tend to agree that a C to Verilog system would be a step back; MyHDL makes heavy use of decorators and generators and other goodies that come from the functional programming world. C doesn't really work well there; you can hack it in but at that point you might as well just use Verilog.


Sequential execution isn't necessarily inherent. C++, Smalltalk, VHDL and others share Simula as a common ancestor. Simula had elements of concurrency and event-driven simulation in it. One could envisage an object-oriented language, such as C++ or Smalltalk, being extended to allow objects to execute in parallel and communicate via methods. Maybe every object would be derived from a root object that fundamentally supports concurrency?

I'm not saying that such languages currently exist, but I do think they will come into being, as programming comes to grips with multicore CPUs. It will be essentially HDL synthesis, with the synthesiser/compiler being smart enough to hide all scheduling issues from the user.

As someone with a foot in both camps, I'm not happy that I have to choose early in the design process whether to run my code on a CPU or FPGA.


Pi calculus/actor model/'Erlang-style' concurrency.

In (synthesizable) verilog, you have tiny state machines communicating via explicit channels (clock+wires+buses), and functions that get turned into gates. A higher level language would give you ideal channels (mapping onto fixed hardware channels or synthesized to HDL). Depending on their complexity, the functions at each state-node would also be transformed to use more general blocks and intermediate states.


Do you have links to any tutorial, intros or such for the list that you enumerated above? I am very much in appreciation of your (and everyone else's) feedback - learning of a number of new options and approaches that are worth further research.


I haven't been in the hardware world for a while, so I don't have a great answer to your question. For concrete embedded systems tools, I'd look to see what springs up around chips like the OP (I think Parallax made a massively parallel chip a while back, too?). Hardware system design is already all about explicitly decomposed parallelism. The problems to be solved are really from the software perspective of not caring about how concurrent resources are automatically allocated, as long as the constraints are met. If you're looking to learn how to think about parallelism in that manner, I would learn to program in a popular high level language such as Erlang.


Side question: are "SW guys" satisfied with the tools/languages available for multi-core development?

I really like Intel's Threading Building Blocks for multicore programming in C++ and I really like Clojure's take on time and concurrency, but overall I feel language support is very limited and library support doesn't mesh as well with the languages as hoped. I would like to see a practical dataflow-centric language.


Communicating Sequential Processes, perhaps? http://en.wikipedia.org/wiki/Communicating_sequential_proces...

It's like Erlang, maybe not as HDL-like as you were thinking. But, I don't see why you couldn't model every little gate as a process (obviously impractical), so it theoretically could fit.


The CSP model spawned the Occam language, used by the transputer. A modern version is available at occam-pi.org.


what software programming languages, models, etc. exist to support the description of fundamentally concurrent processing, whether task-parallel, data-parallel, hybrid, or "other" that will allow for the supplanting of HDLs?

Not a SW guy, but I'm pretty sure the options can be boiled down to "pthreads or HDLs"


By "FPGA guys" I meant the guys who make FPGAs, not the guys who use them. There is still a pretty strong dividing line between FPGAs and DSP chips.


Sorry for the misinterpretation. If I were an FPGA manufacturer, I'd be trying to buy a company like GreenArrays and beat the CPU manufacturers to the middle ground. Agree about the current divide, though I think it will weaken and eventually disappear.


Correction: I've been lax in my terminology with logic gates and cells. Current big FPGAs contain around 2 million "logic cells", which typically represent between 1 and 10 logic gates each, depending on device programming.


I've done many write-ups on this if you look under my profile. Basically the idea is that most of the big problems in software engineering can't be solved easily in a serial fashion with current methodologies, and many of the existing parallel solutions are either too expensive or too complicated to advance the state of the art. I wish I had this chip 15 years ago...


This example gives something of an idea:

http://www.greenarraychips.com/home/documents/pub/AP002-OSC....

The picture I get is that the boundary between computing components and control/sensor electronics is dissolved.


The Propeller chip http://www.parallax.com/propeller/ is another embedded-market chip in the same vein, although with nowhere near as many cores. Having something like this makes for some interesting embedded designs.

Also, check the site for the arrayForth stuff.

On another note, it would be interesting if something like these existed in the 64-bit size. Larrabee is interesting, but if it was a simpler stack machine at around this price point then perhaps some work could be done on different ways to do parallelism.


Tilera does a 64-bit CPU with 64/100 cores. If they offered access to it using a pay-per-hour model, that would be interesting.


I spoke to one of the Tilera guys a few years back about evaluating their boards for use in a telco project I was working on at the time, but gave up because I wasn't prepared to drop $15K on a development board just to evaluate it.

The technology itself was extremely interesting, though, and would probably have been a good fit for what I was doing. Also, the cost of the dev board included technical support and from speaking to the guy, I got the impression that they actually help you port your code to their platform.


Maybe if you use 100+ of these (~14,400 'computers') you could build a machine with emergent, brain-like behavior?


If 15,000 computers was all you needed for AI, we'd have it by now. Major datacenters have way more machines than that, which are far more powerful than a milliwatt microcontroller to boot.


I knew it. The name "Moore" sounded suspicious to me immediately. Now I know his real name is Chuck Testa!!!

On a more serious note, what we need much more of is not the processing speed. What we need much more of is what I call "memory processability", which roughly means "how many times per second can you process the whole memory" - or something like that. Basically how much CPU is there per RAM. Indexing is a great hack, but a hack it still is. Memory and processing cells must be merged into a single, massively parallel chip.


That was done in the 80's: http://en.wikipedia.org/wiki/Connection_Machine . Richard Feynman had to help the designers with the hard bits.

Users recall them fondly. In the end, clusters became too cheap for custom hardware like this to compete.


>What we need much more of is what I call "memory processability", which roughly means "how many times per second can you process the whole memory"

I might be misinterpreting, but I think the phrase you're looking for is "memory bandwidth".


What are the software stacks that are available on the chip? Forth must be there. Anything else?


Is this something I would use in the place of something like an Arduino (for example, putting a brain on an RC car)?


Wondering if it could emulate an Apple II in real time and, within it, run GraFORTH.


This is all new to me. Does anyone know if these chips could be used for bitcoin mining?

If it is possible, though, I believe you would have to write completely new programs for it.


Imagine a Beowulf cluster of those.



