The real question is "what do we do with a lot of CPUs without shared memory?" Such hardware has been built many times - Thinking Machines, nCUBE, the PS3's Cell - and has not been too useful for general purpose computing.
One of the simplest cases for broad parallelism is search where most of the searches fail. The required intercommunication bandwidth is low - you start up all the machines and wait for somebody to report success. Bitcoin miners and crypto key searchers are extreme examples. Problems which can be hammered into that form parallelize well.
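For a sense of the shape, here is a minimal sketch in Erlang (picked since it comes up repeatedly below) of that fan-out-and-wait pattern. SearchChunk is a hypothetical predicate that almost always returns not_found:

    -module(needle).
    -export([search/2]).

    %% Fan the search out over many workers and wait for the first success.
    %% Each worker gets its own chunk of the search space; almost all of them
    %% fail, so the only communication is the single success report.
    search(Chunks, SearchChunk) ->
        Parent = self(),
        [spawn(fun() ->
                   case SearchChunk(Chunk) of
                       {ok, Result} -> Parent ! {found, Result};
                       not_found    -> ok
                   end
               end) || Chunk <- Chunks],
        receive
            {found, Result} -> {ok, Result}
        after 60000 ->
            not_found
        end.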
Some problems that distribute spatially parallelize well on such hardware. Weather prediction, wind tunnel simulation, nuclear explosion simulation - hydrodynamics problems like that can be handled in sections, with communications only for the edge elements. That's what supercomputers are used for.
Neural nets and deep learning, though - finally, there's a broad use case for such architectures. Google is using lots of GPU-type boards that way now. This is a key concept. Massive parallelism on a single problem is really hard to apply to the things people used to do on computers. Instead, it makes possible new things to do.
An important question is how much memory each CPU should have. Thinking Machines, NCube, and the Cell (256K each) clearly had too little. You need enough memory per CPU that the intercommunication doesn't dominate the problem.
The other big problem is how to organize software for this. We have two main models of parallel software - shared memory, and clusters on a network. Those are the two extremes for parallelism. Systems where the CPUs are more tightly coupled than in a cluster but less tightly coupled than in a shared memory multiprocessor fall outside common programming paradigms. This is a very hard problem, with no really good general-purpose solutions.
One interesting thing going forward (over the next decade) regarding deep learning and high-performance computing will be whether we can circumvent the big serial bottleneck - that is, depth. The best benchmark results have generally been set by networks significantly deeper than the previous record holders. Gradient descent, however, requires propagating residuals back through every layer of the network (to tune the lowest layers, at any rate). This is inherently a serial task. Recurrent networks can be considered infinitely deep - in practice they're truncated to some finite depth during training.
There are a couple of solutions to these problems that are being worked on - including various "learning-to-learn" approaches. One approach (DeepMind published, I believe) is to locally train an auxiliary network to predict the back-propagated residuals coming down from the top. It will be interesting to see if these approaches pan out, or if eventually we start to run into real serial bottlenecks again as we build ever-deeper networks.
Everything at web scale is embarrassingly parallel, at least on some fronts. There's still often the bottleneck of maintaining state (in a database, probably), but a heavily loaded server will have hundreds or thousands of processes or threads or some kind of async (probably epoll or something reasonably efficient on modern systems) to serve many clients. Which is why Erlang (and Elixir) has some very enthusiastic proponents.
Serving the web is, I think, a very big part of the "general computing" space, today.
Serving the web seems to split between a) IO-bound problems and b) databases. It is certainly not embarrassingly parallel in the "would benefit from GPUs" sense.
Those thousands of threads just sit there waiting for the database to respond a lot of the time. The database is busy trying to keep things synchronized across nodes. It is just that the hard stuff is pushed down to the database layer.
Both can be true. Database development, both the way they're built and the way people use them, has been focused on scaling across machines and storage for most of the history of the web; it's possible it'll never quite catch up with CPUs, but it will continue to move that way. I don't know what the next evolution of databases looks like (I'm definitely not a database expert), but if Google and Facebook and Amazon and Microsoft and many others need it to scale beyond what is currently possible to keep up with CPU parallelization, they'll figure out how to make it scale.
Yeah, the Erlang ecosystem is so much better (for web) than everything else available today, it's not even funny.
Though it seems like there is some major obstacle to people understanding it. It's so frustrating that I've already stopped advocating for Erlang/Elixir. Currently I'm concentrating on just reaping the benefits and doing my own stuff.
Yep, I've built two very nice servers in Erlang, but the main problem is that only I can extend and modify them in my company - my coworkers can't learn this. One made some inroads into it, but there are just too many cheap .NET and Java developers around for Erlang to be successful.
Upside for me - I'm irreplaceable. Downside - there's no one to help when something is wrong and company progress is throttled by my performance.
Elixir is partly fixing this problem. Though it's still not enough because the hard part is not the syntax (as most newcomers tend to blame their failure on), but the things you write with the language, which are the same for Elixir and Erlang.
If you are using immutable data structures/pure FP, then you can get a lot of parallelism without the need for shared memory because the state is on the stack. As the article posits, you can just pass messages to the functions/objects and receive results back. Again, as the article suggests, the major problem is memory bandwidth. The trick is building a bus (buses?) that allows you to aggregate partial solutions.
This is why I'm very bullish on languages like Haskell in the future (say 10-15 years from now). It's interesting that the article suggests OO, because I think the key to allowing massive parallelisation is immutability. Immutable OO and pure FP are essentially equivalent AFAICT, but the latter currently has nicer language support.
Having spent quite a bit of time designing massively parallel algorithms (concurrency starting at several thousand cores on up), I find computer scientists are often baffled when I tell them that FP and immutability don't help. In practice, they solve the wrong problem and create a new one, because massively parallel systems are almost always bandwidth bound (either memory or network) and making copies of everything just aggravates that. You can't build hardware busses to compensate, or companies like Cray would have long ago.
If you look at effective highly scalable parallel codes of this nature, you see two big themes: pervasive latency hiding and designing the topology of the software to match the topology of the hardware. The requisite code to implement this is boring single-threaded C/C++ which is completely mutable but no one cares because "single-threaded". Immutability burns much needed bandwidth for no benefit. This among other reasons is why C and C++ are ubiquitous in HPC.
The challenge is that CS programs don't teach pervasive software latency hiding nor is much ink spilled on how you design algorithms to match the topology of the hardware, both of which are fairly deep and obscure theory.
We don't need new languages, we need more software engineers who understand the nature of massive parallelism on real hardware, which up until now is largely tribal knowledge among specialists that design such codes. (One of the single most important experiences I had as a software engineer was working on several different supercomputing architectures with a different cast of characters; there are important ideas in that guild about scaling code that are completely missing from mainstream computer science.)
Having spent a good bit of time in HPC from a practical and academic perspective, I'll second that claim.
One example I like to bring up is the textbook hotplate, where the temperature for a single cell is equal to the average of its temperature and the temperatures around it. Each iteration is embarrassingly parallel, but the result of the iteration is necessary to proceed. Instead of synchronizing the entire cluster between iterations, each node computes the results for an extra N cells in each direction, allowing N iterations to occur before synchronization is necessary (after the first iteration, the outermost cells are inaccurate, after the 2nd iteration, the outer two layers are inaccurate, etc. Eventually, you want to sync before the cells we care about are affected). This leads to massively diminished returns as you break the problem into smaller chunks. But even for smaller chunks, doubling or tripling the work per node is still faster than synchronizing.
My sole contribution to HPC was inventing a graph algorithm (circa 2009) for breadth-first search that didn't require the usual iterated barrier synchronization super-step you have with BSP algorithms. Instead, every node would run free with almost no synchronization but a clever error correction mechanism backed out all the incorrect traversals for a completely negligible cost (both space and time) relative to the ability to let the computation run as though it was embarrassingly parallel. It basically allowed every node to always be doing (mostly) constructive work without waiting on another node.
But yeah, burning a bit of CPU to eliminate synchronization is frequently a huge win. This is where knowing how much compute you can do within a given latency window is helpful.
I'm not about to join the FP circle jerk, but well-executed immutability can reduce memory bandwidth requirements rather than increasing them. This is because often in large C (and Java) codebases, the uninitiated will copy large structures and buffers when they don't need to, because they don't trust the other code[rs]. If you have a tree-like data structure such as a nested map, immutability lets you copy the structure by reference even when modifying the contents.
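The same point in Erlang terms (where everything is immutable): updating a nested map rebuilds only the spine along the changed path, and a large untouched sub-term is shared rather than copied. A minimal sketch, assuming a state map with a big stats term hanging off it:

    %% Hypothetical state: #{config => #{threads => ..., ...}, stats => BigTerm}.
    %% The "copy" rebuilds only the config entry; the stats value is shared,
    %% not duplicated.
    bump_threads(#{config := Config} = State) ->
        State#{config := Config#{threads := maps:get(threads, Config) + 1}}.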
That said, the added complexity in memory management means that if you can afford good programmers, it's better that they just try to avoid this practice in the first place. So HPC folks take the mutable approach.
I guess it's hard to say if an average, middle-of-the-road programmer will ever be able to productively work on parallel systems.
Very well put, I just have one addendum: add Fortran to the list. If you program multidimensional-array-like data structures as flat arrays in C or C++, use Fortran for the kernels and Python for the glue instead. It's just as performant, has more batteries included, and is much more comfortable for achieving high performance.
>> If you look at effective highly scalable parallel codes of this nature, you see two big themes: pervasive latency hiding and designing the topology of the software to match the topology of the hardware
Do you have any good reference to learn general HPC and those topics? Best I could find when I searched last time was Udacity's High Performance Computing mooc and its reference book "Introduction to Parallel Computing".
In certain circumstances, new languages can help software engineers get high performance. For example, the Lift project at the University of Edinburgh. It's a skeletal (think functional, but leaning on immutability and higher-order functions rather than category theory) domain-specific language for writing GPU kernels. The team there have managed to use it to implement matrix-multiplication kernels that beat everything apart from NVIDIA's own assembly-language implementations, just by using the safety and separation of concerns that a functional approach brings.
If you have high bandwidth, you can do many of the same things as with shared memory. HPC users have historically used message passing on moderately communication heavy codes.
True, it has not been used for "general purpose computing", but historically shared-memory multiprocessors weren't either. We are just now starting to see things like parallelism in web browsers, 10-20 years after shared-memory MP became common on the desktop. Software is just slow to change.
HPC is usually a balancing act between the performance you gain by adding another node and the time you lose passing messages around. You can't just keep adding nodes without introducing some form of shared memory because you will very quickly become limited by your bandwidth.
What is shared memory, if not message passing (with some particular semantics) implemented at the hardware level? I.e. messages == cache lines, semantics == cache coherency protocol. Particularly so when you go to (nowadays) multi-socket systems, where everything is NUMA.
There is a smallish sub-segment of the HPC market where customers really want large CC-NUMA systems, AFAIK SGI (whoever owns them these days, HPE?) is more or less the only supplier. But the vast majority of HPC is clusters using message passing.
If I were to choose some awfully general statement about HPC, rather than yours I'd choose "supercomputers are not tools for solving small problems faster, but rather tools that make solving larger problems possible". I.e. the entire weak vs. strong scaling thing etc.
This question, what to do with a 'shared nothing' machine, is the real heart of the question.
The original article talked about putting ARM9 cores on a chip, which is also overkill: your typical object neither needs nor cares about the complex memory architecture of an ARM9. Had they chosen a Cortex-M, they could have put twice the number of cores on the chip with no loss of computability.
But as John points out, the challenge is how you actually program these things. One interesting proposal I heard was to compile code with every loop unrolled and every branch hard-coded one way or the other. For every "branch" in a piece of code you would generate two binaries, one where it was taken and one where it wasn't - so four if statements means 16 different binaries (pruned for cases that couldn't happen, though). Then you ran your input on every copy of the program simultaneously. [1]

The issue, though - and it became crystal clear to me as I was working my way through how Google processed logs and log data - was that feeding the output of one computation into the input of another became the limiter; the interconnect of the fabric was the real bottleneck. That analysis got tremendous validation when the underlying network upgrades started being rolled out. It became possible to run the same sort of test on a group of machines connected the 'old' way and the same machines connected the 'new' way, and that demonstrated just how important inter-machine communication can be in making them productive.
So a model for optimizing both parallelization (interdependencies in state) and scheduling (interdependencies in time) is essential for making a shared-nothing cluster perform better than a very large shared-memory machine.
[1] Branch prediction is good enough these days to make this not much of a win.
> Today, loop unrolling is usually a lose if it makes the loop much bigger.
Is this true?
For example, suppose I have a loop that does 1 million iterations and an increment of `i += 1`. If I unroll that to have an increment of `i += 10`, now I only have to do 100,000 iterations. That is 900,000 eliminated branches.
Modern superscalar CPUs are computing the branch at the same time they're doing the operation. They're also renaming registers, so that the reuse of the same register on the next iteration may actually use a different register internal to the CPU. It's entirely possible for five or so iterations of the loop to be running simultaneously.
Classic loop unrolling is more appropriate to machines with lots of registers, like a SPARC. There, you wrote load, load, load, load, operate, operate, operate, operate, store, store, store, store. All the loads, all the operates, and all the stores would overlap. AMD-64, unlike IA-32, has enough registers that you could do this. But it may not be a win on a deeply pipelined CPU.
The key thing you're missing is the "if it makes the loop much bigger" qualifier.
That example does not make the loop much bigger, since it's trivial to combine multiple of these additions to one bigger one (even assuming that extra checks for overflow behavior might be needed, depending on the data type). Many other operations do not compress like this, e.g. those that iterate over many elements in a large data structure, need temporary results, ... Then loop unrolling likely causes costly cache misses.
> The real question is "what do we do with a lot of CPUs without shared memory?"
At the extreme we would go biological, that's what. But then we're talking 10^n in the billions ranges (say n = 15) and unlikely that we would directly "program" these tiny computing engines.
I think hierarchical models (in terms of system organization, with specialized sub-systems or 'organs' if you will) are inevitable unless these little machines have recourse to some sort of breakthrough in communication that would permit arbitrary [efficient and non-local] meshing.
Nothing happens to object-oriented programming because it is entirely unrelated to multicore processors, except for the fact that object-oriented programming makes for terribly inefficient data structures that cannot handle threading very well.
If you've written software before, you would know that most of your codepaths require prior data to be processed, and thus having multiple cores won't make software any faster than before. Additionally, there is a cost to starting threads, so the benefit of making your software multi-threaded has to outweigh the cost of launching a thread, sending data to that thread, and getting results from that thread.
Pooling and channels are often used to mitigate the initial cost of creating threads, but it still remains that it takes time to send data to a thread and receive data from that thread, and there will always be that one main thread that needs to handle the majority of the logic and keep results in sync.
Basically, it would be easier to see powerful cores sitting adjacent to miniature parallel cores that are able to respond quickly to requests.
> Nothing happens to object-oriented programming because it is entirely unrelated to multicore processors, except for the fact that object-oriented programming makes for terribly inefficient data structures that cannot handle threading very well.
OO doesn't specify the data structures, though; if you model your objects as shared-nothing, message-oriented entities it does concurrency pretty well (by not doing threading at all).
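A minimal Erlang sketch of that style, where the "object" is a process, its state is reachable only from inside, and every interaction is a message:

    -module(counter).
    -export([new/0, increment/1, value/1]).

    %% The object: a process whose only state lives in its loop argument.
    new() -> spawn(fun() -> loop(0) end).

    %% The "methods": plain messages, one fire-and-forget, one with a reply.
    increment(Counter) ->
        Counter ! increment,
        ok.

    value(Counter) ->
        Counter ! {value, self()},
        receive
            {counter_value, V} -> V
        after 5000 -> timeout
        end.

    loop(N) ->
        receive
            increment     -> loop(N + 1);
            {value, From} -> From ! {counter_value, N},
                             loop(N)
        end.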
Exactly. In fact, Alan Kay very specifically wrote his model was that of little computers talking to each other. Having heap-allocated chunks of memory multiplexed onto a single processor, whether sequentially or by time-sharing, is/was an implementation technique that was necessary because massively concurrent hardware wasn't possible at the time.
The question is whether this is still true, and if it is not, what the consequences would be.
AFAIUI from a recent talk by Alan Kay his original intention was that communication would be async (just as it sort-of is in real life)[1]. So it would basically be actors. Which sounds nice until you realize that this model means that everything needs some sort of flow control (or unbounded queues) to even be semi-reliable.
Don't get me wrong... it's a good model for semi-independent computing units (because sometimes cables really do get severed randomly), but it's definitely not good for simple procedure calls (which is what OOP methods really are in practice).
(I have exactly the same objection to the actor model as a foundational unit of computing. Theoretically, it's fine, but in practice it doesn't really work if it's all you can do. Similar to "pure" lambda calculus.)
[1] I must say I was really surprised by this, but then I haven't programmed any Smalltalk. Maybe it makes sense there. Or maybe he intended OOP at a higher level than just method calls (as in Smalltalk).
> (I have exactly the same objection to the actor model as a foundational unit of computing. Theoretically, it's fine, but in practice it doesn't really work if it's all you can do. Similar to "pure" lambda calculus.)
With these models, it's all about making compilers ever better at their specialized job of translating "abstract machinery" into the current physical-machinery paradigm of the day.
No idea about "actors" but "pure lambda calculus" works terrifically well thanks to 25 years of top-notch efforts invested in the Glasgow Haskell Compiler.
OO is really in the same boat in that it's not exactly ASM either, just has even more "compiler investment" behind it broadly speaking.
GHC isn't even close to "pure lambda calculus", sorry. For one thing, there's a lot of fuzziness around e.g. "seq", strict pattern matching vs. lazy pattern matching, etc. which leads far afield from LC. For another, there's "IO" which doesn't have any kind of formal semantics in Haskell, nevermind LC, so...
EDIT: I'm not saying it's not useful as a model, I'm just questioning whether it's practical. I think both LC and the Actor model are both interesting and valuable, I just don't think they should be the foundational model for ABSOLUTELY EVERYTHING as e.g. Alan Kay seems to be advocating.
EDIT#2: I should say... I realize that LC actually is Turing-Complete, as I believe Hewitt has also demonstrated that the Actor model is. However, I believe Hewitt assumes infinite queues. As always, infinity gives quite a bit of leeway.
Well I reckon you're using a "strict" definition of pure and I'm using a "lazy" one.. ;) Let's say Haskell-the-language is on the purer side of things and GHC-the-compiler by necessity closer to the messy not-so-pure real-world-er side of programs that run acceptably speedily reliably and sturdily most-of-the-time..
(IO monad "while used by the coder" seems plenty pure to me other than the background knowledge that eventually with a `main` entry point it will become hooked up with actual I/O.. but that's "semantics" hehe)
Alright, then -- I think we agree for all practical purposes :).
Still, I think it is important to realize that while "IO" is 'sort-of-a-monad'... the truth isn't quite that simple :). There's a lot of nastiness hidden behind it, especially FFI and exception-related things... ugh.
EDIT: My original point was mostly that maybe he hadn't thought this whole "async every method call!" thing through. That... leads to a mess, especially if every single "thing" you do is a "method call".
Well, as people using Erlang can tell you, this is not a big problem. The real question is how much of your code really needs to be simple procedure calls. The answer, interestingly, is: not that much.
> but it's definitely not good for simple procedure calls
From TFA:
"With each object having around 96KB of private memory to itself, we would probably be looking at coarser-grained objects, with pure data being passed between the objects (Objective-C or Erlang style) and possibly APL-like array extensions (see OOPAL)."
So coarser-grained objects/actors. And yes, Alan Kay very definitely intended OO at a higher level than "method calls".
Alan Kay:
"Java and C++ make you think that the new ideas are like the old ones. Java is the most distressing thing to hit computing since MS-DOS."
"I invented the term Object-Oriented and I can tell you I did not have C++ in mind."
In fact, he would probably object to the terminology "method calls", because it is just thinly wrapped procedural programming.
"Just a gentle reminder that I took some pains at the last OOPSLA to try to
remind everyone that Smalltalk is not only NOT its syntax or the class
library, it is not even about classes. I'm sorry that I long ago coined the
term "objects" for this topic because it gets many people to focus on the
lesser idea.
The big idea is "messaging" -- that is what the kernal of Smalltalk/Squeak
is all about (and it's something that was never quite completed in our
Xerox PARC phase). The Japanese have a small word -- ma -- for "that which
is in between" -- perhaps the nearest English equivalent is "interstitial".
The key in making great and growable systems is much more to design how its
modules communicate rather than what their internal properties and
behaviors should be. Think of the internet -- to live, it (a) has to allow
many different kinds of ideas and realizations that are beyond any single
standard and (b) to allow varying degrees of safe interoperability between
these ideas.
If you focus on just messaging -- and realize that a good metasystem can
late bind the various 2nd level architectures used in objects -- then much
of the language-, UI-, and OS based discussions on this thread are really
quite moot. This was why I complained at the last OOPSLA that -- whereas at
PARC we changed Smalltalk constantly, treating it always as a work in
progress -- when ST hit the larger world, it was pretty much taken as
"something just to be learned", as though it were Pascal or Algol.
Smalltalk-80 never really was mutated into the next better versions of OOP.
Given the current low state of programming in general, I think this is a
real mistake.
I think I recall also pointing out that it is vitally important not just to
have a complete metasystem, but to have fences that help guard the crossing
of metaboundaries. One of the simplest of these was one of the motivations
for my original excursions in the late sixties: the realization that
assignments are a metalevel change from functions, and therefore should not
be dealt with at the same level -- this was one of the motivations to
encapsulate these kinds of state changes, and not let them be done willy
nilly. I would say that a system that allowed other metathings to be done
in the ordinary course of programming (like changing what inheritance
means, or what is an instance) is a bad design. (I believe that systems
should allow these things, but the design should be such that there are
clear fences that have to be crossed when serious extensions are made.)
I would suggest that more progress could be made if the smart and talented
Squeak list would think more about what the next step in metaprogramming
should be -- how can we get great power, parsimony, AND security of meaning?"
> With each object having around 96KB of private memory to itself
For reference/comparison, by default with SMP and HiPE enabled an Erlang process is initialised at 338 words, including 233 words of heap memory (which contains the stack and will increase as needed) and the rest being process meta-information.
96KB private memory means you'd probably want to have "nursery" cores and process migration.
> Erlang process 338 words when spawned, including a heap of 233 words.
However note that this is with HiPE and SMP enabled (which afaik are now the default) as the "processes" guide notes[0] without both of these a process should be 309 words. Without SMP, I get 320 words (HiPE is a compilation flag and I don't feel like recompiling Erlang right now).
>if you model your objects as shared-nothing message-oriented entities it does concurrency pretty
That's the actor model, but it still doesn't feel as natural as in functional languages - dealing with immutable data in OO languages is tedious and error prone.
aka the original OO. OO as taught today is nothing like how Alan Kay imagined OO [1]. Alan Kay originally imagined OO as biological cells only able to communicate via message-passing. That's the actor model more or less.
> dealing with immutable data in OO languages is tedious and error prone
Again, pretty different OO. Still, I'm curious what you find error prone? Tedious, perhaps. Scala and Java+Lombok make immutable data pretty decent to deal with ergonomically if not quite as good as Clojure or other languages out of the box. Even PCollections for Java reduces much of the waste and is fairly nice ergonomically.
> aka the original OO. OO as taught today is nothing like how Alan Kay imagined OO [1]. Alan Kay originally imagined OO as biological cells only able to communicate via message-passing. That's the actor model more or less.
Async message-passing has a huge problem when it comes to practicality: In any resource-constrained system (i.e. no unbounded queues) messages can get lost. I don't know if you feel comfortable with method calls having a "best-effort" semantic (which is what happens in the actor model and biology[2]), but it makes me deeply uncomfortable as a foundational model. At the very least you need some way to do retries, but if even looping is a method call, then how do you ensure even the reliability of retries? You could do it to arbitrary precision, I think, given enough duplication + repetition of code[1], but I don't understand why we'd want to do something so messy at a language level when we already have hardware that's reliable to 99.999%+ levels already.
[1] This also assumes that the interpreter/compiler reading the code is 100% reliable... which it wouldn't be if everything is actor-based. You can see the problem.
EDIT: [2] This is a perhaps-interesting aside: In biology "best effort" is typically the best you can do, given the limitations of fluids/energy/etc. You send out enough molecules of the right type to hope that the intended recipient has a decent chance of detecting them. These things have been fine-tuned through a billion+ years of evolution, but in computer systems we typically don't have that kind of (even simulated) time to do "just enough".
Looking at the Erlang model, there's no view that code is reliable.
In fact the prevalent view is that code is unreliable and you should structure things to recover from failure at all points of the architecture. Messages can go missing. Calls can time out. Processes can crash for reasons you will never know about, and your architecture should be resilient against that.
Whilst this sounds extremely complex, in practice it boils down to letting everything low level fail, and having supervisors who can handle the failure.
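In OTP terms that structure is just a supervisor. A minimal sketch (my_worker here is a hypothetical worker module with a start_link/0):

    -module(my_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% Let the worker crash; restart it, up to 5 times in 10 seconds,
        %% after which the supervisor itself gives up and escalates.
        SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
        Child = #{id => my_worker,
                  start => {my_worker, start_link, []},
                  restart => permanent},
        {ok, {SupFlags, [Child]}}.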
Resource issues are a problem everywhere. For example, you want to save a file.
What happens if the file system crashes after it claims it has saved?
What happens if its a network connected drive, and it times out?
What happens if its a file on Amazon S3, and your network cable just got cut by a tractor?
You're going to have to deal with this at a higher level than the line of code which tried to save the file.
> You're going to have to deal with this at a higher level than the line of code which tried to save the file.
Exactly, but that's the point I'm trying to make, perhaps badly. My point is: Can a function call fail in Erlang, i.e. can it just "not return"? AFAIUI as long as you're inside a message handler, this cannot happen -- any function call will receive a return value. In other words: Function calls inside message handlers are synchronous.
In the Actor model conceived by Hewitt, this is not the case. Now, this would be a huge problem for any foundational theory, but he seems to get around it by just assuming infinte resources (queues), so it all works out in theory.
> My point is: Can a function call fail in Erlang, i.e. can it just "not return"?
Kinda? Function calls in Erlang are intra-process and synchronous, but the function could send and receive messages internally and never return because it never got a response and did not have a timeout setup.
When you're talking about actual messages between processes though, they are asynchronous and they can get lost (if only because the process you're sending a message to is on an other physical machine).
I think there are two separate issues here: unbounded queues vs finite resources and also the overhead of asynchronous message passing (or "don't make every particle its own actor").
Regarding applying backpressure, you're correct that Erlang doesn't have a silver bullet. Process mailboxes are unbounded and can exhaust resources. Or you can implement a buffer and drop messages based on some strategy.
As for making every 'operation' asynchronous, you could do it and not run into any additional unbounded queue or error handling problems, but it would add overhead without any advantage over concurrency via preemption.
Erlang has two kinds of 'function calls' (the terminology there is message passing):
- Synchronous calls, where the caller will receive a response and waits for X seconds before considering the message lost. Synchronous messages block on the caller side, and block the receiver while they are being processed.
- Asynchronous calls, where the caller fires the message off and further semantics are undefined - the message could be ignored, lost, or received and acknowledged through a side channel or a later message back to the caller.
The Erlang runtime provides a framework - OTP, as in Open Telecom Platform, referencing its always-on, phones-must-work-at-all-times roots - that lets you set up supervisors, watchers, and handlers to deal with the problem of trying to send messages to processes that aren't there, child processes disappearing on you, and so forth.
Erlang's ecosystem aims for a 'soft realtime' setup, where usefulness degrades but is not totally lost---so the default option on failure is to retry until some kind of success threshold is reached.
That's more or less how Erlang works: by async message passing, and it's used to build extremely (99.9999999%) reliable software. A "best-effort" semantic is more than enough: when you really care whether the message has been received (that is to say, most of the time), you make a "synchronous" call, which is really just two async calls (one in each direction) with a timeout - even if the message back is just an acknowledgement that the message was received.
In Erlang/OTP it's wrapped in an actor behaviour called "gen_server", and the default timeout is 5000 milliseconds. If it times out, an exception is raised, and a common strategy is to simply let the process crash and restart to try again [0].
On the other hand, it's not async calls all the way down: only the calls between actors are asynchronous. If you write a loop [1] within your actor then it's synchronous. To write asynchronous code you need to spawn new actors (which is trivial, and will always by default result in an exception being raised if something gets lost or times out after 5 seconds).
Note that when you write Erlang code the granularity of an actor is absolutely not the same as when you write OOP code: you don't wrap every data structure in a process, rather processes tend to be about processing data, or abstracting an IO point (a file, a socket, etc.) since the semantic is perfect.
[0] The implementation of all of this takes more or less one line of code; these really are the defaults.
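For concreteness, here is roughly what that gen_server wrapping looks like; kv is a hypothetical key-value actor, and the gen_server:call in get/1 is the two-async-messages-plus-timeout pattern described above:

    -module(kv).
    -behaviour(gen_server).
    -export([start_link/0, get/1]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Under the hood this is two asynchronous messages plus a monitor;
    %% if no reply arrives within the default 5000 ms it raises, and the
    %% usual strategy is to let the caller crash and be restarted.
    get(Key) ->
        gen_server:call(?MODULE, {get, Key}).

    init([]) -> {ok, #{}}.

    handle_call({get, Key}, _From, State) ->
        {reply, maps:get(Key, State, undefined), State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.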
Erlang != OOP as imagined by Kay... as I understand Kay's intent.
Erlang == "independent reliable units of computing". (I think Erlang programmers call them "functions"? Or maybe "processes"? AFAIR only the "process" level is subject to message loss, yes?)
I'm sure you're right about most of what you wrote, but that's not the point I was getting at.
EDIT: Just to expound: In Kay's conception even e.g. "==" inside a function would be a method invocation and it would be asynchronous (presumably taking two continuations, one for "equal" and another for "not equal"). This absurdity is why I fundamentally disagree with the Actor model (as a model of computation), but also with extremism such as Kay's.
EDIT: Btw, I call bullshit on the 99.9999999% reliable service unless you can come up with a reference. I know that Erlang/OTP is known for legendary reliability, but I'm unsure what that means: does it mean "no interruption" on any given stream (e.g. phone call)? Does it mean "no more than X connections fail"?
It's hard to put words in Alan Kay's mouth for him on this point but what he usually talked about was the biological model, specifically in regards to cellular units. In computing this does translate to independent, reliable, and redundant units. I'm not sure Erlang really takes it to the logical extreme but it certainly goes that direction.
It's interesting that detailed implementations of the idea become people's definitions (i.e. taking method-based dispatch as the analog to intercellular interaction). With Erlang, the method dispatch becomes modal via the currently active (or lack thereof) receive block in any given process, and I wonder if this captures something too specific or if this difference is worth making at the level of Alan Kay's OOP.
With that in mind, I think you're overstating the orthodoxy. The symbolic representation and execution semantics of messaging models aren't exactly the same. There is no reason == needs to be a "method invocation". If we step back, there isn't really a good definition of what it really means in todays runtimes and compilers anyway. Things get inlined, fast paths get generated for common cases, &c.. His point was more that late binding allows the system to avoid upfront knowledge or hard coding of all useful configurations in a piece of software. Leaving optimization to compilers and runtime systems seems to be the status quo these days.
I definitely agree that it's pretty hard to guess his precise intent, but that's exactly why I'm extremely skeptical of regarding him as some sort of prophet[1] on what exactly OOP "is". If you're always vague about what you mean, it's pretty easy to be a guru/prophet/visionary. AFAICT if one is going to claim to be a visionary (indirectly or otherwise) on the Actor model, I think Hewitt actually has the better claim -- much as I dislike him.
EDIT: Just on a minor point:
> There is no reason == needs to be a "method invocation".
Well, except that's what the Actor model and, indeed, Smalltalk (Kay et al.'s project) wants us to believe. I, of course, agree completely with you on this point. It's just a function -- simple as that.[2]
[1] Yeah, yeah, I'm exaggerating, but he did kind of bring it on himself by continuing on his quest :). I bear no particular ill will towards him, but I do think he's caught in a web of having-to-be-brilliant-based-on-prior-performance-but-does-not-have-much-to-contribute... as befalls many academics and 'visionaries' (e.g. Uncle Bob with his recent rants). (Disclosure: I used to be an academic... and dislike Uncle Bob because he's far more dogmatic than he deserves to be based on documented experience.)
[2] Weeeeelllll, except to be able to decide equality you need to break the abstraction barrier. Which leaves an abstraction-committed OOP programmer with a bit of a dilemma. Me... I just choose the "Eq" type class. Problem solved and I didn't even have to expose any internals. Win win!
Fair point. I guess I never see the prophet thing but his ideas are interesting so I tend to follow Kay's work, if only to blend it with many other things, not so closely aligned with his views (ex: Lamport being close to opposite of Kay).
I think the whole pedantic "Alan Kay's OOP" thing comes from his coinage of the term vs. the practical application stemming from the Simula school of object orientation. I have no horse in the race so I really wish one sense of the term would be renamed.
The thing is, in Alan Kay's talks he points more to independent interpreters that can be sequential locally. But the idea is that you can, at any point in time, swap that locality for a non-local one if you need to.
That is the point behind objects/actors/processes all the way down. As you only talk through message passing, no one has to know whether you are synchronous or not inside the actor. You have a contract.
It's kind of weird to say that you have a contract in a dynamically typed language. Not that it would help - types say nothing about performance or reliability.
Moving a local computation to a remote server often fails because clients have implicit assumptions about performance, but nobody is explicit about deadlines except in hard real-time programming. So in practice you get timeouts at best.
> Erlang != OOP as imagined by Kay... as I understand Kay's intent.
You should check out the video[0] of Joe Armstrong interviewing Alan Kay - it seemed to me that Erlang was in some respects closer to what Alan Kay intended for OOP than even smalltalk was.
Indeed I was not commenting on Kay's intent, that I am not qualified to talk about. I was just adding the IMHO important precision that Erlang certainly didn't implement "==" as message passing. After all, Erlang was never meant to be OOP as imagined by Kay, it just happens to be similar in some important ways. AFAIK Erlang wasn't even modelled after the actor model or anything: it's "just" the result of an endeavour to reach maximum fault tolerance.
Regarding the 99.9999999% reliability, it's from Joe Armstrong about a system called AXD301. Here is the quote [0]: "Erlang is used all over the world in high-tech projects where reliability counts. The Erlang flagship project (built by Ericsson, the Swedish telecom company) is the AXD301. This has over 2 million lines of Erlang. The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable ... but we did 9."
To be precise: Erlang was modelled on Smalltalk and Prolog, with a bit of Lisp. They did not know about Hewitt and the actor model when they created it, and only learned about it later.
About the nine nines: it was a British telecom company that claimed it, by extrapolating from a 14-node network over 8 months. I prefer to say that Erlang makes it easier to reach five nines, then enables you to go to seven nines if you really need it.
That's a matter for a compiler. If it's possible and efficient, inline everything so "object" just becomes dispatch tables with data pointers.
The actor model is a great way to unify remote and local processes with identical syntax, and solves a ton of syntax related issues such as overriding the default '==' operator, etc.
If you are predominantly working with immutable data, you are essentially doing functional programming. In which case you'd be much better off in a functional-first language. Using Scala entirely for FP has not worked out well in practice. Even the authors of "Functional Programming in Scala" eventually abandoned it for Haskell. Scala has certainly raised awareness of FP, which is a big positive in its favour, and the support it does have for FP is an improvement over Java.
I think that OO languages are more natural, although most people are writing their objects "wrong".
OOP more naturally models a multicore system, because it bundles state and behavior. That more closely models the hardware: multiple CPUs, each of which addresses a non-uniform portion of memory. That is like methods on an object that operate on local/private state.
What's missing in some languages is an abstraction for "thread of control", i.e. your program counter.
But that can be added, and has been -- people are just not using it as widely as they should. Whereas I think functional languages are great for some things but awkward for others. I like algebraic data types but I'm beginning to think they are orthogonal to functional programming, as in Rust, which has them, but is an imperative language.
I expanded on this style of treating OO languages in a more functional way here:
If OO is a better fit for parallel programming, why are functional programming models such as Hadoop or Spark, so much easier than message passing alternatives such as MPI? My answer is that they bring abstraction and composition, which classical OO does not easily allow, thanks to mutable state.
I think the key point is it's not either/or. The post I linked was saying that you can use a functional ARCHITECTURE (immutable data, explicit dependencies like monads) inside an OO language.
Hadoop is written in an OO language -- Java. But it has a functional architecture (MapReduce).
So really the two camps aren't that far apart. They both have something to bring to the table.
Opponents of OOP are thinking of all the spaghetti imperative code with globals dressed up as classes, which people have realized is a mistake, but there's still hundreds of millions of lines of code out there like that. And there are tons of "frameworks" which actually FORCE you to write in that style.
But yeah there is no real contradiction that I see. Thinking that you can write a for loop to update a hash table is as big a mistake as writing spaghetti objects.
I guess it depends heavily on what you define as an "OO language", but e.g. Scala has pretty decent support for Sum and Product types + pattern matching.
(If you don't do lenses or similar, it's still pretty tedious to deal with "deep updates", but that's coincidental. Lenses do solve this problem, although they're awful in practice in Scala.)
Totally agree if you're talking about e.g. Java or Python or even C#.
Scala does not have true sum types though, it uses subtyping to encode them. When I tried it, I found that pattern matching was often unsafe.
I am glad that Scala has raised awareness of FP, but its implementation of FP is heavily compromised by its focus on OO.
>dealing with immutable data in OO languages is tedious and error prone
I don't think this is a problem ingrained in OO. You could very easily conceive of an "OO" language that has separate constructs for objects (mutable) and data (immutable) and there would be no problem at all. Immutable data just happens to be common in functional languages but there's no reason you couldn't introduce it to an "OO" language.
I do Erlang full time, and I would say the actor model is better than OO, or rather it is like OO distilled. Probably what Kay had in mind.
It is also functional with immutable data and variables. So it requires a different mindset, but I find it fits well with the actor model. But it certainly is not coupled to it in general.
I think the author was specifically meaning types of OO, like Smalltalk, where each object sends and receives messages. That model works as-if each object is a separate thread.
But most implementations implement message sends/receives under the hood as method/function calls because it's far more efficient. The conceptual difference is small, and easy to support as method/function calls. The cost in overhead of putting them even in a different thread, much less on a different core or a different CPU, would be a huge problem for processes that aren't inherently possible to parallelise efficiently.
Performance isn't the only reason to dislike Smalltalk-style message passing. A far more important one is the ability to actually reason about your programs (i.e., make inferences about what they will do when run, without actually having to run them), besides just testing objects as black boxes. Traditional imperative programmers can use Hoare triples and predicate transformers. Functional programmers can use any of the various semantic models that have been constructed for all sorts of lambda calculi. What on Earth does the poor object-oriented programmer who wants to understand his program have?
Most interesting invariants span multiple data structures. For example, when you parallelize matrix multiplication, your invariant is the product of the sequence of matrices, and what varies throughout the process is the length of the sequence. Presumably, for your sanity's sake, you want to consider each matrix a separate data structure.
In case you didn't understand the point to my first reply: Object invariants aren't powerful enough to describe how objects relate to one another, precisely because objects don't tell anything about their internal data structure to the rest of the world.
FWIU Swift is planning to implement actor-style concurrency in a year or two. Which, if done properly, would be much more interesting than what Objective-C dynamic dispatch offers (see OTP for an example of what a full-blown actor model offers).
And then we arrive at the conclusion that, on a machine where any active object that's executing something can be partitioned to a single CPU core, implementations do not need to compromise anymore.
The vat model addresses this directly by encouraging partitioning of algorithms into "plans", which interleave in execution while still being easily separable onto different threads.
But usually there is a runtime that manages the messages between the actors/objects (e.g. Erlang). Where does the runtime exist in such a machine? Would it even be possible to have a runtime, given that the processor units exist at the lowest level?
For example, Erlang queues messages into a 'mailbox' that is sort of 'outside' of the actors.
I don't see where such message queues would reside in these microprocessors.
> object-oriented programming makes for terribly inefficient data structures that cannot handle threading very well.
The best approach I have seen to address this is the idea of transactional objects as done in Realm (yes, the mobile database).
The idea that objects can freely be accessed on any thread without locks, with all changes protected by transactions, is so freeing, and the internal representation that Realm uses for the objects is apparently far more efficient than how they would natively be represented.
I really wonder why we are not seeing more of these kinds of transactional data structures. Seems like a natural next step for object oriented programming now that everything is turning multi-core.
That's sort of like software transactional memory, which has a reputation for being slow. However, limiting transactions to one object at a time may avoid most of the overhead; in fact I suspect it would often perform better than locking. On the other hand, that sounds tough to reason about… do you just always use persistent (immutable) data structures? Because those have quite a bit of overhead (if you don't need persistence anyway for some reason). If not, how do you prevent one thread from seeing another thread's transaction in progress?
> how do you prevent one thread from seeing another thread's transaction in progress?
In Realm this is done by each thread having its own immutable snapshot of the data which is only advanced to the latest state at very specific "safe" points in the code. The result of this is that data never changes underneath you (i.e. You never see changes from transactions in progress. You only see after they have been atomically applied).
I guess you could think of Realm as an immutable data structure (it is immutable outside of very specific conditions, like inside write transactions, and state refreshes at "safe" points in the code), but the experience is totally different. It basically feels like you have plain old (mutable) objects, but they can be safely shared between threads without any of the overhead (mental and performance-wise) of locks.
My point wasn't about memory usage, it was about development cost. The transactional behaviour emerges by default from use of immutable structures.
Any transactional system will have memory performance characteristics that are very similar to using immutable structures, if not worse, so your point was, well, pointless.
This article relies on the assumption that OO has been designed to work with single core or "few" core processor.
It's a false assumption. OO has been designed to model the world in the hope that this would make expressing business logic easier.
It's not hard to imagine what would happen with these many cores, because we already have specialized systems with many cores: GPUs and CUDA.
And as every single time people talk about parallelism as a holy grail of performance, the truth is obvious to whomever actually uses this hardware: not many problems are highly parallelizable and parallelism is always harder than single threaded code -- it's harder to think about, and this has nothing to do with the programming paradigm.
>OO has been designed to model the world in the hope that this would make expressing business logic easier.
OO was designed for multiprocessing by sending messages between multiple processes[0], like nature does. However, I do not find it suitable for most parallel computing work because of race conditions.
I think you are incorrect in your last statement. All functors are inherently parallelisable. This means that any map operation is parallelisable (see the sketch below). Also, as long as the operations are associative (which is a surprisingly large number of cases), your reduce operations will also be parallelisable. That means that any computation that can be represented by a map/reduce is usually fully parallelisable and always parallelisable in the map operation. The only operations that aren't parallelisable are those that depend on the previous values (and even then you can make judicious use of partial function application to cut down on the dependencies). Basically this is where you have effects where order matters and recursive calls.
I think the main reason we have difficulty imagining massively parallel computing systems is that we hold onto mutable state and imperative syntax. This pretty much forces us into a situation where order is significant. It seems to me that abandoning that idea is the key to freeing ourselves up, programming-wise. Personally, I don't find pure fp, or immutable OO that much harder than imperative. Different, for sure, but both have advantages and disadvantages.
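To make the map part concrete, the classic Erlang parallel map is only a few lines - each application of F is independent, so each can run in its own process (a sketch, ignoring error handling):

    %% Minimal parallel map: spawn one process per element, then collect
    %% the results in order by tagging each reply with a unique reference.
    pmap(F, List) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn(fun() -> Parent ! {Ref, F(X)} end),
                    Ref
                end || X <- List],
        [receive {Ref, Result} -> Result end || Ref <- Refs].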
No. As someone who has spent five years programming computing clusters, I have seen many problems that are fundamentally hard to parallelize. If it falls within mapreduce you are basically "done" no matter what language you use (no need to argue FP vs imperative), but very many interesting problems don't.
I am a big believer in FP too, but for other reasons.
I can't edit my previous statement, but after spending more than a couple minutes thinking about it, I can see I was wrong. It's quite convenient to have downvotes to clue you in when you get things wrong, even if it is embarrassing ;-)
Having a dedicated processor for each object is a terrible idea in terms of energy.
Consider current high-core count Intel processors on Windows. Windows shuts down entire cores to save energy. Windows only powers enough cores to cover your current workload plus some margin. I assume other OSs work the same.
If each object has its own processor, then either (1) you must keep the core powered for the lifetime of the object or (2) the core must be energized each time the object receives a message, and re-energizing is not free.
I guess if the core could auto-detect its load (as current Intel processors do) that would help. But I think dynamically shuttling threads between cores and shutting down as much as possible (which is exactly what Windows does) is likely to be much more efficient.
So go ahead and have a thread-per-object for this hypothetical architecture, but let the OS decide when and where to assign cores.
You don't want this at the object level anyway. You want to have objects clumped into vats, which are related object graphs that are isolated from other vats but freely intermingle otherwise. Then, messages between vats invoke turns of delivery inside vats independent of other vats, and we can schedule one vat per processor.
Slightly related is the video game "TIS-100"[1] which has you programming (incredibly) constrained processors that are arranged in a grid, with communication between them. The programming puzzles are not very difficult (so far! I'm about halfway through) and pretty entertaining to puzzle out. There's a "meta game" of comments by "your uncle" who was trying to get it working again.
I think it's closer to the GreenArrays, Inc chips, really.
Setting aside for a moment the wisdom of using thousands of cores to enable object oriented programming (which I'm dubious of), it seems to me that the way modern processors handle communication and synchronization is mostly a historical accident, and we could probably do a lot better than just letting processes share memory willy-nilly and having the processor cache sort it out. Like maybe we could give each core a set of registers dedicated explicitly for communicating with its neighbors, bypassing the memory hierarchy entirely for small messages.
The point of the article was that there is so much extra parallel horsepower available that you would come out ahead even if overhead is something staggeringly ridiculous like 100x. That's what makes the idea interesting (though obviously not necessarily correct, it's just a thought-experiment).
The Connection Machine was able to handle arbitrary n-processors-to-n-processors "sends" in (log2 n) time, but only by interconnecting the processors in a (log2 n)-dimensional hypercube. That's a lot of interconnects...
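For anyone who hasn't seen it, the routing trick is pleasantly simple: number the n processors 0..n-1, wire together every pair whose addresses differ in exactly one bit, and route by fixing one differing bit per hop, so any message arrives in at most log2(n) hops. A toy sketch (Go, obviously not CM microcode):

    // Toy hypercube routing: number the n processors 0..n-1 (n a power of
    // two), link every pair of nodes whose addresses differ in exactly one
    // bit, and route by flipping one differing bit per hop.
    package main

    import "fmt"

    func route(src, dst uint) []uint {
        path := []uint{src}
        for cur := src; cur != dst; {
            diff := cur ^ dst
            bit := diff & -diff // lowest differing bit = one hypercube dimension
            cur ^= bit
            path = append(path, cur)
        }
        return path
    }

    func main() {
        // 16 nodes form a 4-dimensional hypercube: at most 4 hops between any pair.
        fmt.Println(route(0b0000, 0b1011)) // [0 1 3 11]
    }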
Also, the Connection Machine does not have independent processors that execute arbitrary code and communicate in an uncoordinated way. In the traditional CM, everything happens synchronously under the control of a single global microcode sequencer, so communication can be planned by an offline algorithm that finds optimal paths for any given N-to-M communication operation.
Also CM "processors" are not CPU's in any meaningful sense. Each CM's processor is essentially multiplexer and few bits of registers. Instruction is presented to the data inputs of multiplexer and processor state goes to the selection inputs with mux output being the next CPU state. In essence it's FPGA turned inside-out.
That aside, I wonder how hard proper software development would be for such a machine. It seems like multi-threaded software is something that we struggle to get right in our current low-core-count systems, so...
The author has reinvented the architecture used by Cell (PS3) and Intel IXP. Both of these architectures are dead specifically because they are too damn hard to program for compared with a multicore ARM/x86 chip.
GPUs would be the most successful modern implementation of this idea. There are opportunities with FPGAs, but GPU silicon is so far ahead (and still advancing fast) that you're usually better off designing for GPUs.
You could also consider Cavium parts (16-64 way ARM chips) which ship today in high-end network hardware.
The common lessons across all of these are that:
* Memory is slow compared with computation
* Put caches everywhere and have the machine decide what to cache
* Synchronisation is hard and puts tremendous load on your scarce memory resources
* It's much easier to do the same job on different data a million times than to do a million different jobs on the same data. In other words, high throughput is easier to achieve than low latency.
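That last point is basically the worker-pool shape; something like the following sketch (Go, illustrative only) scales out happily because the workers never coordinate with each other at all - they only pull the next piece of data off a shared queue.

    // Sketch of "same job on different data": N workers apply one function
    // to a stream of independent inputs; the only coordination is the queue.
    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func work(x int) int { return x * x } // the one job everybody runs

    func main() {
        jobs := make(chan int)
        results := make(chan int)

        var wg sync.WaitGroup
        for w := 0; w < runtime.NumCPU(); w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for x := range jobs {
                    results <- work(x)
                }
            }()
        }

        go func() {
            for i := 0; i < 1000; i++ {
                jobs <- i
            }
            close(jobs)
            wg.Wait()
            close(results)
        }()

        total := 0
        for r := range results {
            total += r
        }
        fmt.Println(total) // sum of squares of 0..999
    }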
I'm not sure if those architectures are comparable to the one discussed in the article, except that both are highly parallel. GPUs and Cell are, as you mention, data-parallel.
The article talks about a much more "anarchistic" parallelism where thousands of different (in code and data) objects are each doing their own thing, sending messages to each other when necessary. I guess that Erlang/Elixir's processes are the closest thing currently, as mentioned in the article.
Cell's SPUs and IXP's microengines aren't data-parallel any more than a regular CPU. They're minimal CPUs with local RAM and fast connectivity between each other (usually a FIFO queue).
Every single one of those CPUs was independent and happy to run as many branches and/or memory accesses as you want without significant performance penalty, unlike modern GPUs.
So yeah, you could put different objects on different CPUs if you want. Except that that's not where the bottleneck in either energy or computation is. Remember that that local RAM needs to be powered if it's to retain state (ignoring FeRAM), so CPUs are no longer free; you have to commit objects back to main DRAM before switching off the CPU. And so you've just reinvented caching and might as well just run on a fast few-core CPU anyway.
Message-oriented no-shared-state approaches (whether you call it Actors, objects, CSP, session types, or whatever) to concurrency have proved much easier to write correct software in than the traditional threads, locks, and shared memory approach. And it's the message-oriented approach that the article proposes basing the hardware design on.
Yes, but Erlang provides a run-time that manages the messages between the actors/objects. Where does the run-time live in such a machine? Would it even be possible to have a run-time, given that the processor units exist at the lowest level?
Assuming process residency/pinning and an Erlang-like model, it could be pretty easy. Multi-threading, I think, would be rather bad, as you'd have a lot of churn trying to maintain cache coherency, at least at the higher end (hundreds of cores and up). At 18 cores it's nothing out of the ordinary: Xeon E5 v4 (Broadwell) tops out at 22 cores + HT (so 44 threads), and Sun's UltraSPARC T2 topped out at 8 cores x 8 threads (64 concurrent threads).
Though you'd probably want to read the old Connection Machine papers too; the CM-1 had up to 65,536 CPUs (although individual CM CPUs were much, much simpler than ARM cores).
The notion that multi-threading is hard to get right is a common one, but it does not match my experience. I've been running 100+ threads in C++ routinely for 20 years now.
So long as you are sensible (using mutexes for shared data structures, using queues or whatever for inter-thread messaging, using semaphores for inter-thread signaling, etc) it just isn't hard.
The only thing that briefly got me in trouble was when I tried my hand at lock-free data structures. Once they are working, they are fine, but it takes a few surprises before you really understand the A-B-A problem.
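For what it's worth, the "sensible" toolkit described above looks roughly like this in Go (a different language than the parent's C++, but the shape is the same): a mutex around the shared structure, and a queue for inter-thread messaging.

    // The "sensible" toolkit: a mutex guarding a shared structure, and a
    // queue (a buffered channel here) for passing work between threads.
    package main

    import (
        "fmt"
        "sync"
    )

    type SharedStats struct {
        mu     sync.Mutex
        counts map[string]int
    }

    func (s *SharedStats) Bump(key string) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.counts[key]++
    }

    func main() {
        stats := &SharedStats{counts: map[string]int{}}
        queue := make(chan string, 128) // inter-thread message queue

        var wg sync.WaitGroup
        for w := 0; w < 4; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for key := range queue {
                    stats.Bump(key)
                }
            }()
        }

        for i := 0; i < 1000; i++ {
            queue <- "events"
        }
        close(queue)
        wg.Wait()
        fmt.Println(stats.counts["events"]) // 1000
    }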
Yes, I suppose you can make multi-threading sound easy if you list all of the really easy problems. Mutexes, queues, and semaphores (or condition variables, if you have a modern code base) are all the easiest parts of multithreading.
But the "so long as you are sensible" advice is terribly, terribly wrong. You have to rethink how parts of your code are isolated from one another, and you have to avoid entirely new classes of errors—such as deadlocks, which can appear when you combine two well-tested and correct multithreaded components. And the number of threads has nothing to do with it.
The same "so long as you are sensible" caveat applies to every other tool for multithreaded programming. Actors, for example, are great so long as you are sensible and don't push too much of your problem into the protocol. Unless, of course, you like complicated protocols.
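The deadlock-by-composition point deserves a concrete example, because it genuinely surprises people: two components, each correct and well-tested on its own, each taking its own lock and then calling into the other, can deadlock the moment they run concurrently. A minimal Go sketch (names invented):

    // Two components that are each fine in isolation but deadlock when
    // combined: each takes its own lock and then calls into the other, so
    // two concurrent calls can end up waiting on each other forever.
    package main

    import "sync"

    type Account struct {
        mu      sync.Mutex
        balance int
    }

    // Transfer locks the source, then calls Deposit, which locks the target.
    func (a *Account) Transfer(to *Account, amount int) {
        a.mu.Lock()
        defer a.mu.Unlock()
        a.balance -= amount
        to.Deposit(amount)
    }

    func (a *Account) Deposit(amount int) {
        a.mu.Lock()
        defer a.mu.Unlock()
        a.balance += amount
    }

    func main() {
        x := &Account{balance: 100}
        y := &Account{balance: 100}

        var wg sync.WaitGroup
        wg.Add(2)
        go func() { defer wg.Done(); x.Transfer(y, 10) }() // locks x, then y
        go func() { defer wg.Done(); y.Transfer(x, 10) }() // locks y, then x
        wg.Wait() // with unlucky timing this never returns: a classic lock-order deadlock
    }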
It could allow us to represent data on the machine in a way that better matches the shape of the actual problem, so that data can be clustered together physically in two or three dimensions and processed locally. The current approach makes all data linear, which is not a good fit for some problems.
Communicating sequential processes[0][1] more or less model that. Golang has some history with CSP (Go's channels are based on the concept of pipes, and CSP was originally described in terms of pipes), while Haskell/Erlang and similar functional languages model the "sequential" part of processing data.
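Since Go came up: its channels are essentially CSP's pipes with types, and the idiomatic shape is a pipeline of small sequential stages. A minimal sketch:

    // CSP in miniature: sequential stages connected by channels. Each stage
    // is an ordinary sequential loop; the concurrency lives in the plumbing.
    package main

    import "fmt"

    func generate(n int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for i := 1; i <= n; i++ {
                out <- i
            }
        }()
        return out
    }

    func square(in <-chan int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for x := range in {
                out <- x * x
            }
        }()
        return out
    }

    func main() {
        for v := range square(generate(5)) {
            fmt.Println(v) // 1 4 9 16 25
        }
    }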
I guess to take this a bit further, the thought that came to mind for me is: what if we processed every user on their own VM, with its own core and memory? Is anyone doing that with web sites today? It would seemingly be horribly inefficient, but it's an interesting thought experiment.
It's not that inefficient if you are not working with real 'VMs', but with containers (or something similar). The unikernel projects mentioned by the peer below (Ling, MirageOS) can spin up a container per request, respond and then tear it down in milliseconds. They are pretty interesting from a security perspective, especially when you couple them with a read-only image - I imagine it would be pretty hard to attack something that only persists for the time it takes to handle a single request.
"I imagine it would be pretty hard to attack something that only persists for the time it takes to handle a single request."
I don't see why. If the request is the attack (as it usually is), then it'll persist for just long enough to accomplish it. What kind of attacks do you see it avoiding?
I think the big benefit is that it avoids attacks that infect the server, because in this case the server is "destroyed" when the request finishes. So a request that maliciously uploads "hackertools.php" would be useless, because the file lands in the short-lived container rather than on a persistent web server, and it disappears when the container does.
Yes, this is what I meant. It doesn't make the server any less vulnerable to an individual attack, but it makes it very hard to escalate it. Though there was a really interesting video about a security guy breaking out of Lambda recently and uncovering a persistent file system somewhere - will try to find it. Edit: Found it:
https://media.ccc.de/v/33c3-7865-gone_in_60_milliseconds
Communications overhead will not scale, and your parallelism will be Amdahl's-law-bound. You're better off being Gustafson's-law-bound and using those processors to run more copies of the single-threaded program with different inputs.
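To make the distinction concrete (p = the parallelizable fraction of the work, n = processors):

    Amdahl (fixed problem size):       speedup = 1 / ((1 - p) + p/n)   -> capped at 1 / (1 - p)
    Gustafson (problem grows with n):  scaled speedup = (1 - p) + p*n  -> keeps growing with n

With p = 0.95, Amdahl caps you at 20x no matter how many processors you add; in the Gustafson framing, 100 processors instead do about 95x as much work in the same wall-clock time - which is exactly the "more copies with different inputs" strategy.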
It's not the cost of processing that holds us back so much as it's the cost of coordinating. Doing things in parallel and concurrently are where things get tricky. So an opportunity to trade off software complexity for transistors is interesting, of course!
FPGAs will get us most of the way there. We will want to express programs as concrete circuits plus message passing - something like http://www.clash-lang.org . Stream-level operations are preferable to per-"object" messages.
There is a tradeoff between large-area, high-bandwidth circuits and smaller circuits with less bandwidth; the inherently serial stuff goes on the small circuits.