
Sure, but, as I was (rather unpopularly) pointing out in another comment, that point was pretty hard to reach in 1982. Specifically, the point where you've met both criteria: a bigger computer is cost-prohibitive to get, and lots of smaller computers are easier. At the time of this lecture, parallel computers had a nasty tendency to achieve poorer real-world performance on practical applications than their sequential contemporaries, despite greater theoretical performance.

It's still kind of hard even now. To date in my career I've had more successes with improving existing systems' throughput by removing parallelism than I have by adding it. Amdahl's Law plus the memory hierarchy is one heck of a one-two punch.
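To put a rough number on the Amdahl's Law half of that punch, here's a back-of-the-envelope sketch (plain Python, with a made-up 95% parallel fraction):

    # Amdahl's Law: speedup with n workers when a fraction p of the work
    # parallelizes perfectly and the remaining 1 - p stays serial.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    # Even with 95% of the work parallel, 64 cores buy only ~15x,
    # and no core count ever beats 1 / (1 - p) = 20x.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.95, n), 1))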




In 1982 you still had "supercomputers" like

https://en.wikipedia.org/wiki/Cray_X-MP

because you could still make bipolar electronics that beat out mass-produced consumer electronics. By the mid-1990s even IBM had abandoned bipolar mainframes and had to introduce parallelism so that a cluster of (still slower) CMOS mainframes could replace a bipolar one. This great book was written by someone who worked on that project

https://campi.cab.cnea.gov.ar/tocs/17291.pdf

and of course for large-scale scientific computing it was clear that "clusters of rather ordinary nodes" like the

https://www.cscamm.umd.edu/facilities/computing/sp2/index.ht...

we had at Cornell were going to win (ours was way bigger) because they were scalable. (E.g., the way Cray himself saw it, a conventional supercomputer had to live within a small enough space that the cycle time was not unduly limited by the speed of light, so that kind of supercomputer had to become physically smaller, not larger, to get faster.)

Now, for very specialized tasks like codebreaking, ASICs are a good answer, and you'd probably stuff a large number of them into expansion cards in rather ordinary computers. Clusters today possibly also have some ASICs for glue and communications, such as

https://blogs.nvidia.com/blog/whats-a-dpu-data-processing-un...

----

The problem I see with people who attempt parallelism for the first time is that they don't realize the per-task overhead of transferring work between cores or nodes has to be much smaller than the task itself. That is, if you are processing most CSV files you can't round-robin assign rows to threads, but 10,000-row chunks are probably fine. You usually get good results over a large range of chunk sizes, but chunking is essential if you want most parallel jobs to really get a speedup.

I find it frustrating as hell to see so many blog posts pushing the idea that some programming scheme like Actors is going to solve your problems, and to meet people who treat chunking as a mere optimization you'll apply after the fact. My inclination is that you can get the project done faster (human time) if you build in chunking right away, but I've learned you just have to let people learn that lesson for themselves.
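Here's a minimal sketch of what I mean by chunking, in Python; the file name and process_chunk are placeholders for whatever per-row work you actually do:

    import csv
    from concurrent.futures import ProcessPoolExecutor

    CHUNK_SIZE = 10_000  # big enough that per-chunk overhead is negligible

    def process_chunk(rows):
        # stand-in for the real per-row work
        return sum(len(row) for row in rows)

    def chunked(reader, size):
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        with open("data.csv", newline="") as f, ProcessPoolExecutor() as pool:
            # one task per 10,000-row chunk, never one task per row
            results = pool.map(process_chunk, chunked(csv.reader(f), CHUNK_SIZE))
            print(sum(results))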


To your last point, it's been interesting to watch people struggle to effectively use technologies like Hadoop and Spark now that we've all moved to the cloud.

Originally, the whole point of the Hadoop architecture was that the data were pre-chunked and already sitting on the local storage of your compute nodes, so that the overhead of moving data for at least that first map step was effectively zero, and your big data-transfer cost was collecting all the (hopefully much smaller than your input data) results into one place in the reduce step.
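For anyone who never touched it, the shape of that design is roughly this toy word-count sketch (not the real Hadoop API): the map work happens where each chunk already lives, and only the small per-chunk results travel for the reduce.

    from collections import Counter

    # Pretend each list is a chunk already sitting on a different node's
    # local disk; HDFS scheduled the map task onto that node.
    local_chunks = [
        ["to", "be", "or", "not", "to", "be"],
        ["that", "is", "the", "question"],
    ]

    # Map: runs data-local, so the bulky input never crosses the network.
    partial_counts = [Counter(chunk) for chunk in local_chunks]

    # Reduce: only the small per-chunk counts get shipped and merged.
    totals = sum(partial_counts, Counter())
    print(totals.most_common(3))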

Now we're in the cloud and the original data's all sitting in object storage. So shoving all your raw data through a comparatively slow network interface is an unavoidable first step of any job, and it's not nearly as easy to get speedups as impressive as what people were getting 15 years ago.

That said, I wouldn't want to go back. HDFS clusters were such a PITA to work with, and I'm not the one paying the monthly AWS bill.


> The problem I see with people who attempt parallelism for the first time is that they don't realize the per-task overhead of transferring work between cores or nodes has to be much smaller than the task itself.

My big sticking point is that for some key classes of tasks, it's not clear that this is even possible. I've seen no credible reason to think that throwing more processors at the problem will ever build that one tool-generated template-heavy C++ file (IYKYK) in under a minute, or accurately simulate an old game console with a useful "fast forward" button, or fit an FPGA design before I decide to take a long coffee-and-HN break.

To be fair, some things that do parallelize well (e.g. large-scale finite element analysis, web servers) are extremely important. It's not as though these techniques and architectures and research projects are simply a waste of time. It's just that, like so many others before it, parallelism has been hyped for the past decade as "the" new computing paradigm that we've got to shove absolutely everything into, and I don't believe it.


It isn't, for a great many tasks. Basically, whenever you're computing f(g(x)), you can't execute f and g concurrently.

What you can do is run g and h concurrently in something that looks like f(g(), h()). And you can vectorize.
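For example, a sketch with made-up g and h (threads shown; a process pool works the same way if g and h are CPU-bound Python):

    from concurrent.futures import ThreadPoolExecutor

    def g():
        return 2   # stand-in for one expensive, independent computation

    def h():
        return 3   # stand-in for another

    def f(a, b):
        return a + b   # f still has to wait for both of its inputs

    with ThreadPoolExecutor() as pool:
        g_future, h_future = pool.submit(g), pool.submit(h)
        # g and h run concurrently; f runs only after both finish
        print(f(g_future.result(), h_future.result()))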

A lot of early multiprocessor computers only gave you that last option, vectorization. They had a special mode where you'd send exactly the same instructions to all of the CPUs, with the CPUs mapped to different memory. So in many respects it was more like a primitive version of SSE instructions than like what modern multiprocessor computers do.
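The closest everyday descendant of that "same instructions, different data" mode is ordinary vectorization, roughly like this NumPy sketch (an analogy, not a model of any particular machine):

    import numpy as np

    # One logical instruction stream applied in lockstep across many data
    # elements -- the lockstep-machine idea, shrunk onto a single chip.
    x = np.arange(1_000_000, dtype=np.float64)
    y = np.sqrt(x) * 2.0 + 1.0   # every element sees the same operations
    print(y[:3])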




