I find it interesting that the language that best deals with parallelism (IMO) was invented significantly before we moved in the multi-core direction.
The programming language landscape evolved and developed in the face of multi-code - async being a classic example. But the language that's often held up as the best solution to any given parallelism problem is Erlang. Erlang was built as a good programming model for concurrency on a single core, and then when multi-core came along, SMP was 'just' a VM enhancement and no programs needed changing at all to get the full advantage of all the cores (barring some special cases).
> no programs needed changing at all to get the full advantage of all the cores (barring some special cases)
Some programs just don't have any available parallelism in them - no matter how many processes you split them up into. You need to build parallel algorithms and data structures. That's still often a research-level topic.
Erlang's not going to magic parallelism out of thin air in a program that doesn't have any.
Kind of strawmanning here, no? Obviously sequential tasks are run sequentially. The OP isn't saying "magic" happens. Just, Erlang was written from the beginning to encourage parallelized code. Its whole programming model is around that. Unlike pretty much every language it was contemporary with, which assumed that the default was a single unit of execution (and that code running in parallel was the exception).
Think of task scheduling; the normal Algol/C family of languages would say "create a priority queue for your tasks, wait until the top of the queue is ready to run, pull it off, run it, repeat". Very sequential thinking. Of course, you then end up with issues around what happens if the processes take non-negligible amounts of time, what happens if you need to add new items into the queue, etc. At the lowest level that might be solved with interrupts, at a higher level it might be threads or separate processes, but you still have to worry about concurrent data access (locks and the like around the queue). But both the general approach, and the 'best practice' guidelines, especially prior to multi-core, (due to the difficulties of shared data), were/are to minimize concurrency.
Erlang, you just spin up every task as its own process. The process sleeps until it's time to do something, then it does it. The language encourages you to solve the problem in a parallel manner, and it did so from the beginning.
So, yes, you "need to build parallel algorithms and data structures". The point is, Erlang took advantage of them, as a language, even before the hardware could. The way you would write Erlang for a single core was the way you would write Erlang for many cores, which was unique for a language at the time, and the OP is arguing that not only was Erlang unique in doing that, but that there really isn't a better model for handling parallelism from a language and correctness standpoint (presumably; from a performance perspective m:n threading, of which Erlang actors are a form, is generally agreed upon to lose some efficiency).
> no programs needed changing at all to get the full advantage of all the cores (barring some special cases)
Those 'special cases' are all the algorithms which aren't inherently parallel, which is most of them! How many algorithms inherently have a large amount of natural parallelism? I'd say it's very few.
For example no matter how you write MD5 on Erlang... it's not going to run in parallel is it?
As I said before - 'Obviously sequential tasks are run sequentially.'
Innately sequential algorithms will still be sequential. Erlang already gave developers the tools, and incentive, to write things in a parallelized fashion. If a developer had a need that was directly sequential, it will, of course, still be sequential. If they had a need that would benefit from parallelism, they had the tools, the context, and the encouragement from the language itself to write it in parallel.
Again, I think you're strawmanning here. You kind of implied it yourself with your earlier statement, "Erlang's not going to magic parallelism out of thin air in a program that doesn't have any". Saying that implies you think that's the OP's claim.
As I said before, the OP wasn't claiming magic happens, and algorithms written serially suddenly become parallel. You can, of course, still write sequential code in Erlang, same as every language, because execution takes place in time. Just that the natural, ergonomic way of writing the language, of solving problems in the language, has always been parallel.
And if you mean "well, I run MD5 checksums on individual blocks of the file, and parallelize it that way" - yes, that is going to be one of those special cases the OP mentioned, if you didn't already optimize it that way due to memory pressure concerns. But if it's "run an md5 checksum against every file in this directory and check it against a manifest", the ergonomic approach in Erlang pre-multicore is to scatter/gather it across actors (since any one file might fail in strange ways and you want to isolate that failure and handle it predictably with a supervisor), and that would parallelize "for free", with no concurrency concerns. Unlike every other contemporary language.
> they had the tools, the context, and the encouragement from the language itself to write it in parallel
No I think Erlang gives you the tools, context, and encouragement to write it concurrently. After you have concurrency, you may or may not have parallelism.
And if you don't you will have to restructure your program to get it, even if your program was already highly concurrent. That's contrary to what the person I was replying to said.
If you actually have concurrency, and you have the hardware able to handle it, why do you not have parallelism?
I.e., simply spinning up an actor obviously does not imply concurrency; the classic Erlang 'ping pong' example shows processes waiting to receive a message and responding when they do. Not actually concurrent, despite multiple actors and multiple processor cores it's a sequential program, not a concurrent one.
Likewise, if you have 5 processes all adding a list of numbers (independently and sequentially, one element at a time), but only 4 cores, you would have only 4 processes running in parallel at a time; a hardware limit. Up the cores, you can have all 5.
Can you give me an example of a concurrent program that is not also parallel, due to non-hardware reasons, in Erlang? Because switching the goalposts around - as you say, Erlang encourages you to write concurrent programs, even before CPUs allowed true parallelism, and because of that, once the CPU allowed true parallelism, it only required a VM tweak and suddenly Erlang programs were running in parallel. But you say no, that that wasn't the case; just because you wrote things to be concurrent, and now the hardware has changed to allow them to be in parallel, they aren't necessarily parallel. I'd love an example of that.
Concurrency produces parallelism when the concurrent portions can actually run at the same time. That is, between synchronization points which force sequential behavior.
It's very feasible to write a concurrent program that overuses (versus what is strictly needed) synchronization points in order to produce something which offers a better codebase (by some measure) but provides no parallelism. You have to remove/minimize these synchronization points in order to obtain parallelism. It's also possible that by design these synchronization points can't be eliminated.
Erlang's message passing is inherently asynchronous, but you could conceive of similar synchronization points in a program that will reduce the potential parallelism or eliminate it depending on the overall structure. For instance:
You still have a concurrent design (which can be easier to reason about), but because you're waiting on a response you end up without any parallelism no matter how many cores you throw at it.
Right, when I said "actually concurrent", I meant the code -could- run simultaneously (or more formally, that the items of execution aren't necessarily causally related). Any sequential operation could be spread out over an arbitrary number of Erlang processes, without it actually being concurrent. That design isn't concurrent, you just are involving a bunch of processes that wait.
That said, you are right that a synch point around a controlled resource (i.e., "we only want to support processing 5 requests at a time") would still have concurrent tasks (i.e., ten requests come in; there is no casual relation between when they run). Of course...that's an intentional point of synchronization necessary for correctness; it still strikes me as a strawman to say that you don't get parallelization for concurrent code, just because you forced a bottleneck somewhere.
> If you actually have concurrency, and you have the hardware able to handle it, why do you not have parallelism?
You can have concurrent tasks but with dependencies between them which means there is no parallelism available.
> Can you give me an example of a concurrent program that is not also parallel, due to non-hardware reasons, in Erlang?
For example....
> the classic Erlang 'ping pong' example shows processes waiting to receive a message and responding when they do. Not actually concurrent, despite multiple actors and multiple processor cores it's a sequential program, not a concurrent one.
No that is actually concurrent. These are two concurrent tasks. They just don't have any parallelism between them because there is a sequential dependency and only one is able to make progress at any given time.
Concurrent and parallel are not the same thing, if you didn't know. Concurrency is having work in progress on multiple tasks at the same time. You don't need to actually be advancing more than one task at the same time for it to be concurrent.
"You can have concurrent tasks but with dependencies between them which means there is no parallelism available." - that isn't really concurrent then, is it? Concurrency implies running two things concurrently; you're just creating things to wait. No useful work is being done. You broke a sequential task across processes. Just because those processes are stuck on a receive loop doesn't mean they're doing anything. Might as well say your Java program is concurrent because you spun up a second thread that just waits.
That was what I was getting at with my question; arbitrary processes that do their work in a serial fashion...isn't a concurrent program. It's a badly written serial one.
Concurrency does not require that concurrent tasks are able to make independent progress at the same time, just that they are in some point of progress at the same time.
Back in 2012 when I was working on my PhD I wrote a paper about an example algorithm with very little inherent parallelism, where even if you have multiple concurrent tasks all making forward progress you may find it extremely hard to actually produce results in parallel, if you want to do a deep dive into this problem.
Just to add some support here, concurrency definitely has very little to do with actual independence of progress. It has far more to do with encapsulation of knowledge (for an active agent, not a passive entity like an object [^]), and how different agents coordinate to exchange that knowledge.
An echo server + client is a concurrent system, despite having only one meaningful scheduling of actions over time (i.e. no interleaving or parallelism). Serialization and parallelism are both global properties, as they require a perspective "from above" to observe, while concurrency is a property local to each task.
I can appreciate everyone is saying upthread, but I've definitely found it valuable to think about concurrency in these terms.
[^] Even this is overselling it. You can easily have two logical tasks and one physical agent responsible for executing both of them, and this is still a concurrent system. Dataflow programming is a lot like this -- you're not really concerned with the global parallelism of the whole system, you're just describing a series of tasks over local knowledge that propagate results downstream.
Yes, any time there are shared resources, concurrent tasks can hit synchronization points, and end up bottlenecking on them. Of course, I'd contend that a point of synchronization is a form of serialization, but I agree that the tasks being synchronized would still be described as concurrent (since they don't causally affect their ordering). But such a synchronization is a necessary part of the algorithm you're choosing to use (even prior to SMP support), or it would be completely unnatural to include it. I don't know that it really invalidates the OP's point, that the language didn't remove the bottlenecks you introduced?
Who says "most" algorithms aren't inherently parallel? How do you even enumerate that set? What counts as an "algorithm" anyway? Parallel processing is the norm in biology. And even if we grant that most of the algorithms humans have devised to run on our von Neumann machines are sequential - doesn't that say more about us, than about some intrinsic property of algorithms?
The writing was on the wall as early as the beginning of the Pentium 4 era.
There was this assumption that clock speed would just continue to grow at the insane rates it did in the 80s and 90s, and then reality caught up when the Netburst architecture didn't scale well enough to avoid heat issues and crazy insane pipelines. When Intel rolled out the Core architecture, they went back to a P6-derived design, a microarchitecture that dated back to 1995 - effectively an admission that Netburst had failed.
And initial forays into many cores was decades earlier then that. Aside from big iron, going back to the early 60s, there was some business class devices like those made by Sequent:
But the language that's often held up as the best solution to any given parallelism problem is Erlang.
Occam implements a lot of similar ideas very neatly too. The code looks a lot like programming with Go channels and CSP. I am surprised it hasn't caught on better in the multi-core world, it being older than even Erlang.
I had to double check, the last Occam release was in 1994. I think the language was way to niche, and global communication way too poor still (compared to the '00s and '10s with ubiquitous or near ubiquitous internet access) to catch on. It was nearly a decade later that the first consumer multicore x86 processors started coming out (2003 if I've tracked down the right dates). And multicore didn't become common for a few more years and then the baseline near the end of the '00s, beginning of the '10s.
That time gap and relative unknown of Occam pretty much doomed it even if it would have been wildly useful. By that point you had lots of other languages to come out, and Google pushing Go.
The layers over layers of software bloat increased with the hardware speed. The general trade-off is: We make the software as slow as possible to have the shortest development time with the least educated programmers. "Hey web dev intern, click me a new UI in the next two hours! Hurry!"
This intern-clicks-a-new-UI works because we have a dozen layers of incomprehension-supporting-technologies. You don't need to know how most parts of the machine are working, Several libraries, VMs, and frameworks will make you a bed of roses.
My point is that we a overdoing it with the convenience for the developers. Today there is way too much complexity and bloat in our systems. And there are not enough programmers trained to handle memory management and similar low-level tasks. Or their bosses wouldn't allow it because the deadline, you know.
I think the general trend is bad because there is no free lunch. No silver bullet. Everything is a trade-off. For example C is still important because C's trade-off between programming convenience and runtime efficiency is very good. You pay a lot and you get a lot.
This is also true for parallel programming. To write highly efficient parallel code you need skill and education. No silver bullet tooling will make this less "hard".
And here I see the irony. Faster CPUs were used to have lower educated devs delivering quicker. More parallel CPUs need higher skilled devs working slower to utilize the chip's full potential.
Two types of engineer now exist:
Those that are largely "code monkeys" implementing features using high level tooling and guard rails and those that are responsible for building that tooling and low level components. FWIW I've met plenty of engineers even from FAANG companies with fairly limited exposure to writing performant software.
Over time, the former bucket has grown significantly as barriers to entry have dropped but this has led to extremely bloated and inefficient codebases. Deployment patterns like micro services running on container orchestration platforms mean that it's easier to scale bad code horizontally pretty easily and these "high level frameworks" are generally "good enough" in terms of latency per-request. So the only time efficiency ever becomes a concern for most companies are when cost becomes an issue.
It'll be interesting to see how this all unfolds. I wouldn't be surprised if the huge incoming supply of unskilled engineers doesn't cause compensation to drop significantly in general.
To save money on infrastructure you need to pay your engineers more. Companies are choosing to hire cheaper engineers as they are the ones that are harder to replace.
The whole software vs hardware and developer vs runtime efficiency discussion is way more nuanced. No, the layers as they are currently are at least partially hindering productivity, because they all have bugs, rather painful limitations and are constantly changing under your feet. There is so much complexity, you don't have a chance to understand it (and the implications of it) completely from the click in a client's web browser all the way to the instructions executed on the server. This is not convenient and nobody (not even Google, Facebook and others) can handle the complexity, which manifests itself in bugs, weird behaviour, surprising security holes, other rare defects and more every day.
You don't want to manage memory manually unless you absolutely have to. Almost certainly, it is going to be way more demanding, you will make mistakes and the performance will not really be much better if you do whole system benchmarks for like 95% of applications. It is like using mining equipment and explosives to hang a picture on your concrete wall, it is just totally over the top. I am also not denying, there are use cases for something, where most of the responsibility is on you e.g. embedded or infrastructure software. Most people just aren't aware that they need the skill and understanding to reliably and most importantly responsibly wield it. In both cases, there is considerable room for better, more robust tools, and better methods/ practices and a great reduction in developer hubris.
Some have used better hardware for saving on development effort to some degree. There are applications that really need the extra performance to help users do their work in a more convenient way - not just developers. Some of the performance is lost due to systemic inefficiencies such as the web browser and most related technologies. E.g SVG rendering in web browsers is a mine field of bugs and bad performance, so you do the rendering yourself using canvas and JavaScript, which visually is the same thing but there is a lot more work for the developer and e.g. printing/ export into PDF (where you cannot just recalculate) is where you still kind of want the vector graphics capability in some form. Virtualization has enabled slicing of the cores to serve more customers with low cost and readily available compute. We also have way more users and different security expectations now than we used to in the past, so a single machine has to handle and check more stuff or we can use a single machine for stuff that would need a data centre or wouldn't be possible/ feasible before.
Based on what I can tell... the programmers who made huge contributions to parallel compute are those programmers who grossly understood the hardware.
It didn't matter if they were Java programmers (Azul), Lisp Programmers (1980s CM2: Steel / Hille), GPU programmers, or C programmers. If you understood the machine, you figured out a fast and concurrent solution.
The free lunch isn't quite over, although significant advancements were made in parallel computing... it turns out that CPUs have been able to "auto-parallelize" your sequential code all along. Just not nearly as efficiently as explicitly parallel methodologies.
In 2005, your typical CPU was a 2.2GHz Athlon 64 3700+. In 2021, your typical CPU is a Ryzen 5700x at 3.8 GHz.
Single-threaded performance is far better than 72% faster however. The Ryzen 5700x has far more L3 cache, far more instructions-per-clock, far more execution resources than the ol' 2005 era Athlon.
In fact, server-class EPYC systems are commonly in the 2GHz range, because low-frequency saves a lot on power and servers want lower power usage. Today's EPYCs are still far faster per-core than the Athlons of old.
-------------
This is because your "single thread" is executed more-and-more in parallel today. Thanks to the magic of "dependency cutting" compilers, the compiler + CPU auto-parallelizes your code and runs them on the 8+ execution pipelines found on modern CPU "cores".
Traditionally, the CPUs in the 90s had a singular pipeline. But the 90s and 00s brought forth out of order execution, as well as parallel execution pipelines (aka: superscalar execution). That means 2 or more pipelines execute your "sequential" code, yes, in parallel. Modern cores have more than 8 pipelines and are more than capable of 4+ or 6+ operations per clock tick.
This is less efficient than explicit, programmer given parallelism. But it is far easier to accomplish. The Apple M1 continues this tradition of wider execution. I'm not sure if "sequential" is dead quite yet (even if there's a huge amount of machine working to auto-translate sequential into parallel technically... our code is largely written in a "sequential" fashion)
-------------
But the big advancements after 2005 was the rise of GPGPU compute. It was always known that SIMD (aka: GPUs) were the most parallel systems, from the late 1980s and early 90s the SIMD supercomputers always had the most FLOPs.
OpenCL and CUDA really took parallelism more / SIMD mainstream. And indeed: these SIMD systems (be it AMD MI100 or NVidia A100) are far more efficient and far higher compute capabilities than anything else.
The only "competitor" on the supercomputer scale is the Fugaku supercomputer, with SVE (512-bit SIMD) ARM using HBM RAM (same as the high-end GPUs). SIMD seems like the obvious parallel compute methodology if you really need tons and tons of compute power.
It seems to me that where we really ended up was distributed systems. We solve problems by not just making our code concurrent to use more cores, but by also making it use more computers.
There seems to be no way to efficiently replay concurrent programs in a deterministic fashion on multiple cores. Nondeterminism makes parallelism and concurrency inherently hard and unfriendly to new comers. It becomes even more difficult in recent years due to architecture decisions: weak memory order makes things worse.
Supposing you are going to write nontrivial concurrent programs like toy Raft, I believe that looking through RPC logs will be the most painful thing.
In contrast, on a single core, gdb is good enough. And there are also advanced examples like VMware's fault-tolerant VM and FundationDB's deterministic simulation. If we can debug concurrent programs without dirty tricks, just like single-threaded ones, I guess utilizing concurrency will be as handy as calling a function.
what about if you throw in async IO into the mix? seems like you'd have to have some kind of vmware-like virtualization where you record all IO interrupts into a linearalizable log
When you throw async I/O in the mix it’s no longer just your application you’re debugging, it’s the whole OS with other processes, the OS kernel with these drivers and DMAs, and external hardware that’s not a code at all.
Can be tricky to debug or implement, but when I need that I normally use async-await in C#. The debugger in the IDE is multi-threaded, and does support these tasks.
"concurrency is still widely regarded as 'too hard'."
Is it? There isn't going to be an official declaration from the Masters of Computer Science that "2019 was the year concurrency ceased being Too Hard." or anything.
My perception is that it is steadily becoming less and less notable for a program to be "concurrent". Tooling is becoming better. Common practices are becoming better. (In fact, you could arguably take common practices back to the 1990s and even with the tools of the day, tame concurrency. While the tooling had its issues too, I would assert the problem was more the practices than the tooling.) Understanding of how to use it reasonably safely is steadily spreading, and runtimes that make it safer yet are starting to get attention.
I'm not sure I've seen a case where there was a problem that ought to be using concurrency, but nobody involved could figure out any way to deal with it or was too afraid to open that door in a long time. There's still plenty of cases where it doesn't matter even now, of course, because one core is a lot of computing power on its own. But it seems to be that for everyone out there who would benefit from concurrency, they're mostly capable of using it nowadays. Not necessarily without issue, but that's an unfair bar; you can still get yourself in concurrency trouble in Haskell or Erlang, but it's a lot easier than it used to be to get it right.
I do feel similarly, even though I wouldn't classify myself as a great engineer.
I've been writing concurrent software in managed languages, such as Java and C#[0], from the very beginning of my career, up until today. The level of multithreading has varied, but it's always been there. For anything beyond basic CRUD it pretty much becomes a requirement, both on desktop and on the web[1].
That doesn't mean I've never had a tricky race condition to debug (and, yes, they're hard to debug) during development, but I've never shipped a concurrency related bug to production[2].
The canonical examples of concurrency gone wrong are things like giving somebody a deadly radiation dose from a medical device but, in terms of serious software bugs, I do wonder how common concurrency bugs are relative to other types of bug, and whether they're really more serious in aggregate than those other types of bug.
[0] Admittedly these languages make it a lot easier to avoid shooting yourself in the foot than C and C++ do.
[1] Also worth bearing in mind that an inherent property of most, if not all, distributed software is that it's also concurrent: the moment you have multiple processes running indepedently or interdependently you also usually have a concurrent system, with the potential for "distributed race conditions" to occur. I.e., if you have a SPA that also has a non-trivial back-end, you have a concurrent system - just spread across different processes, generally on different machines.
[2] In the context of in-process concurrency. Distributed concurrency is a different matter.
There's been a lot of work done in C++ concurrency. Some higher level libraries that leverage recent (primitive) additions are emerging and they look pretty cool.
The hard part is to improve perceived performance or resource overhead for general software, where each task outcome is tightly coupled with app state, eg. a parallelize a slow task on button click and have guarantee that final state is coherent.
Nothing out of reach, but doing that without compromising too much on maintainability is still an opened question.
> I'm not sure if this is justified (e.g. concurrency is inherently too hard to be viable), or due to the lack of tooling/conventions/education.
I think it's the tooling. Rust's modelling of concurrency using it's type system (the Send and Sync traits) make concurrency pretty straightforward for most use cases. You still have to be super-careful when creating the core abstractions using unsafe code, but once you have them they can easily be shared as libraries and it's a compile error to violate the invariants. And this means that most projects will never have to write the hard parts themselves and get concurrency for close to free.
The rise of async programming in backend web dev is making some people even more confused about models. For instance, many senior engineers out there don't understand the difference between sync multithreaded and async single threaded.
I see a lot of potential in pipeline concurrency, as seen in dataflow (DF) and flow-based programming (FBP). That is, modeling computation as pipelines where one component sends data to the next component via message passing. As long as there is enough data it will be possible for multiple components in the chain to work concurrently.
The benefits are that no other synchronization is needed than the data sent between processes, and race conditions are ruled out as long as only one process is allowed to process a data item at a time (this is the rule in FBP).
The main blockers I think is that it requires quite a rethink of the architecture of software. I see this rethink happening in larger, especially distributed systems, which are modeled a lot around these principles already, using systems such as Kafka and message queues to communicate, which more or less forces people to model computations around the data flow.
I think the same could happen inside monolithic applications too, with the right tooling. The concurrency primitives in Go are superbly suited to this in my experience, given that you work with the right paradigm, which I've been writing about before [1, 2], and started making a micro-unframework for [3] (though the latter one will be possible to make so much nicer after we get generics in Go).
But then, I also think there are some lessons to be learned about the right granularity for processes and data in the pipeline. Due to the overhead of message passing, it will not make sense performance-wise to use dataflow for the very finest-grain data.
Perhaps this in a sense parallels what we see with distributed computing, where there is a certain breaking point before which it isn't really worth it to go with distributed computing, because of all the overhead, both performance-wise and complexity-wise.
Most of the computers in the world are either dedicated embedded controllers or end user devices. Concurrency in embedded controllers is pretty much an ordinary thing and has been since the days of the 6502/Z80/8080. For end user devices the kind of concurrency that matters to the end user is also not extraordinary, plenty of things happen in the background when one is browsing, word processing, listening to music, etc.
So that leaves concurrency inside applications and that just isn't something that affects most of the end users. There really isn't much for a word processor to actually do while the user is thinking about which key to press so it can do those few things that there was not time for during the keypress.
Mostly what is needed is more efficient code. Niklaus Wirth was complaining that code was getting slower more quickly than hardware was getting faster forty years in 1995 and it seems that he is still right.
> So that leaves concurrency inside applications and that just isn't something that affects most of the end users. There really isn't much for a word processor to actually do while the user is thinking about which key to press so it can do those few things that there was not time for during the keypress.
This obviously depends on the application. Whenever you need to wait for an app to finish an operation that is not related to I/O, there is some potential for improvement. If a CPU-bound operation makes you wait for more than, say, a minute, it's almost definitely a candidate for optimization. Whether multithreading is a good solution or not depends on each case - when you need to communicate/lock a lot, it might not make sense. A good part of the solution is figuring out if it makes sense, how to partition the work etc.; the other part of this hard work is implementing and debugging it.
Memory models are subtle. Temporal reasoning in context of h/w, e.g. multi-core coherence, and language runtime MM is non-trivial. So there is a baseline level of difficulty baked into the domain. Education certainly is necessary, but here can only inform of what needs to be considered, known pitfalls, patterns of concurrency, etc.
As to OP, well it better be viable, because we certainly need to deal with it. So better tooling and conventions encapsulated in expert developed libraries. The education level required will naturally fall into the categories for those who will develop the tools/libraries, and those that will use them.
My 5 cents would be wrong abstraction and wrong tooling, let me elaborate a bit of them:
- Current abstractions are mostly based on POSIX threads and C++/Java memory models. I think they are poorly representing what is actually happening in hardware. For example Acquire barrier in C++ makes it really hard to understand that in hardware it equals to a flush of invalidation queue of the core, to sanity check your understanding try answering the question "do you need memory barriers if you have multithreaded application (2+ threads) running a lock-free algorithm on a single core?", correct answer is no, because same core always sees it's own writes as they would happen in program order, even if OOO pipeline would reorder them. Or threads, they seem to be an entity that can either run or stop, but in hardware there are no threads, CPU just jumps to a different point in memory (albeit through rings 3->0->3). Heck even whole memory allocation story, we have generations of developers thinking about memory allocation and it's safety, yet hardware doesn't have that concept at all, memory range mapping concept would be the closest to what MMU actually does. Hence the impedance mismatch between hardware and current low level abstractions created a lot of developers who "know" how all of this works but doesn't actually know, and a bit afraid to crush through layers. I want more engineers to not be afraid and be comfortable with all low level bits even if they would never touch them, because one day you will find a bug like broken compare-exchange implementation in LLVM or similar.
- Tooling is way off, the main problem with multithreading is that it's all "in runtime" and dynamic, for example if I'm making a lockfree hashmap, the only way for me to get into the edgecases of my algorithm (like two threads trying to acquire same token or something) is to run a bruteforce test between multiple threads and wait until it actually happens. Bruteforce-test-development scales very poorly, and testing something like consensus algorithms for hundreds of threads is just a nightmare of complexity of test fixtures involved. Then you get into ok, so how much testing is enough? How do you measure coverage? Lines of code? Branches? Threads-per-line? When are you sure that your algorithm is correct? Don't get me wrong, I've seen it multiple times, simple 100 lines of code passing tons of reviews only for me to find a race condition (algorithmical one) half a year later, and now it's deployed everywhere and very costly to fix. Another way would be to skip all of that and start modeling your algorithms first, TLA+ is one of the better tools for that out there, prove that your model is correct, and then implement it. Using something like TLA+ can make your multithreading coding a breeze in any language.
And probably absence of transactional memory also contributes greatly, passing 8 or 16 bytes around atomically is easy, but try 24 or 32? Now you need to build out an insanely complicated algorithm that involves a lot of mathematics just to prove that it's correct.
One thing I’d love to do is a Smalltalk implementation where every message is processed in a separate thread. Could be a nice educational tool, as well as a great excuse to push workstations with hundreds of cores.
Interesting idea. One could imagine a kind of Actor-Smalltalk where each object is actually its own "thread" (whatever that means at the VM level). The issue then becomes the asynchronous nature of the beast and how to "respond to" senders -- as you know, in current ST-80 like systems, returning from a method is the same as "responding," but this would no longer be the case.
Another idea is this: in the ST-80 VM, the only "real" thing at the OS level is SmallInteger. Everything else is a true object reference. One could imagine expanding the base integer size to 128 bits and having all references also described at that size. The reason to do this would be that IPv6 uses 128-bit addresses, and therefore we could bind object references to IP addresses at a lower level.
That aside, the fact that the Opensmalltalk VM is made inside of Smalltalk means that -- in theory -- a team could iterate to that point using the environment itself.
Snarkiness aside, that would be interesting. Not just as an educational tool, but it might be useful in the same way as erlang is (large, fault-tolerant, massively parallel systems). But with OOP instead of FP (Elixir attempts to be the "friendly"/more conventional version of erlang, but it is still very much a functional language).
The nice thing about doing a Smalltalk is that there is a colossal amount of software built on top of it. I'd love to know how much of that corpus could operate like this. I don't recall ever using locks, but I also never ran ST on a multicore machine either. It's just that the message-passing idea matches a many-core machine so well it's a shame not to play with it.
2005, when the most important platform was the desktop, when the virtualization was yet to be mature, and when the dominant system programming language is C++.
Today CPU cores are sliced by cloud vendors to sell out in smaller portion, and the phones are hesitant to go many-core as it will eat your battery in light-speed. Dark silicon is spent for domain specific circuits like AI, media or networking instead of generic cores.
Parallelism is still very hard problem in theory, but its practical need isn't as prevalent as we thought on a decade-plus ago, partly thanks for the cloud and mobile. For most of us, at least the parallelism is kind of solved-by-someone-else problem. It is left for the small number of experts.
Concurrency is still there, but the situation is much better than before (async/await, immutable data types, actors...)
I recall transactional memory being pitched to take a bite out of lock overheads associated with multithreading. (as opposed to lockless algorithms). Has it become mainstream?
In 2021 on Windows, is safe to assume the GPU does at least D3D 11.0. The last GPU which does not is Intel Sandy Bridge, discontinued in 2013. D3D 11.0 supports lots of features about GPGPU. The main issue with D3D11 is FP64 support being optional, the hardware may or may not support.
> and easier to work with
I don’t find neither DirectCompute 11, nor CUDA, exceptionally hard to work with. D3D12 and Vulkan are indeed hard, IMO.
> I don’t find neither DirectCompute 11, nor CUDA, exceptionally hard to work with.
It still sucks that, if you're not careful or unlucky, GPUs may interfere with normal video operation during setup.
Also on one machine I'm getting "GPU lost" error messages every once in a while during a long computation, and I have no other option than to reboot my machine.
Further, the form factor sucks. Every card comes with 4 or more DVI connectors which I don't need.
If parallel is the future, then give me something that fits on the mainboard directly, and which is more or less a commodity. Not the proprietary crap that vendors are currently shoving at us.
> on one machine I'm getting "GPU lost" error messages every once in a while during a long computation, and I have no other option than to reboot my machine.
If that’s your computer, just change the registry setting disabling the TDR. By default, Windows uses 2 seconds to limit max.pipeline latency. When exceeded, the OS resets the GPU, restarts the driver, and logs that “device lost” in the event log.
If that’s your customer’s computer you have to be more creative and fix your compute shaders. When you have a lot of things to compute and/or you detect underpowered hardware, split your Dispatch() calls into a series of smaller ones. As a nice side effect, this minimizes effect of your application on 3D rendering things running on the same GPU by other programs.
Don’t submit them all at once. Submit two initially then one at a time using ID3D11Query to track completion of these compute shaders. ID3D11Fence can do that better (can sleep on WaitForSingleObject saving electricity) but fences are less compatible unfortunately. That’s an optional feature introduced in Win10 post-release in some update, and e.g. VMWare SVGA 3D doesn’t support fences.
> the form factor sucks. Every card comes with 4 or more DVI connectors which I don't need.
Some datacenter GPUs, and newer mining GPUs come without connectors. The mainstream commodity cards to have connectors because the primary market for them is supposed to be gamers.
> something that fits on the mainboard directly
Unlikely to happen because thermals. Modern high-end GPUs consume more electricity than comparable CPUs, e.g. RTX 3090 needs 350W, similarly priced EPYC 7453 needs 225W.
> my point is that the GPU really is not the "first-class citizen" that the CPU is.
On Windows GPU is the first-class citizen since Vista. In Vista, Microsoft started to use D3D10 for their desktop compositor. In Windows 7 they have upgraded to Direct3D 11.
The transition wasn’t smooth. Only gamers had 3D GPUs before Vista, many people needed new computers. Technically, Microsoft had to change driver model to support a few required features.
On the bright side, now that XP->Win7 transition is long in the past, and 3D GPUs are used for everything on Windows. All web browsers are using D3D to render stuff, albeit not directly, through the higher-level libraries like Direct2D and DirectWrite.
It’s very similar situation with GPU compute on Linux. The kernel and driver support has arrived by now, but the higher-level user mode things are still missing.
P.S. If you’re on Linux, pretty sure you can still refactor your compute shaders the same way and they will work fine afterwards. You obviously don’t have ID3D11Query/ID3D11Fence but most GPU APIs have replacements: VkFence(3) in Vulkan, EGL_KHR_fence_sync/GL_OES_EGL_sync/VG_KHR_EGL_sync extensions for GL and GLES, etc.
I find the distinction far less interesting than most people. I thing it's easier to think of parallelism simply as an interesting special case of concurrency, and to think of runtimes and systems that are "concurrent" but can't run literally simultaneously, like Javascript or Python, as simply accidents of history not worth specially writing into the definitions of our terms. Every year "concurrent" code that can't be scheduled to run on multiple CPUs simultaneously is less and less interesting. And I don't anticipate any new language coming out in the future that will try to be "concurrent but can only run on one CPU", so this meaning is just going to fade into history as a particular quirk of some legacy runtimes.
No, I do not consider any runtime that can't run on multiple CPUs simultaneously to have "good" concurrency support. It would, at best, be bad concurrency support. Better than "no" support, sure, but not good in 2021. If a new language came out with that support, nobody would call it "good" support, they'd call it failing to even raise the table stakes a new language needs nowadays.
I have usually seen the distinction between concurrency and parallelism being drawn for precisely the opposite reason: parallelism without concurrency is a relatively commonly used niche, it is much simpler than concurrency, and it has special tools that make sense just for it.
For example, CUDA exposes a parallel programming model with low support for actual concurrency. Similarly, OpenMP is mostly useful for parallel computations that rarely need coordination. C#'s Parallel package is another example.
By contrast, concurrency often rears its ugly head even in single-threaded workflows, even ones that are part of larger multi-threaded programs.
What's your definition of running simultaneously because CPython's approach to running on multiple cores is multiprocessing which works honestly fine. The tooling to do it is pretty slick where you can ignore a lot of the typical pain of IPC.
Because if "good" concurrency support means single-process multi-threaded on multiple cores running with enough locks that you can have multiple threads executing code simultaneously in a shared memory space then a lot of languages are going to fall down or punt all responsibility for doing that safely to the programmer which might as well be no support.
Shared memory spaces. Your "slick tooling" is one of the hacks I mentioned that history will forget, by the standard of "if a new language emerged that tried to call that 'concurrency' nobody would take it seriously".
I should say that "hack" here isn't necessarily a perjorative. There are reasons for communities to create and deploy those. There are plenty of cases where existing code can be leveraged to work better than it could without it, and that's the relevant standard for whether something is useful, not whether or not in a parallel universe the code could have been written in some completely different way or whether in some hypothetical sense if you could push a button and rewrite something for free you'd end up with something better. Hacks can be good, and every language will pick them up at some point as history evolves around it and some of the core assumptions a language/runtime made go out of date. But it's still a hack.
While OS process boundaries do provide some nice features, they are also in general overkill. See Erlang and Pony for some alternate riffs on the idea, and if you look at it hard enough and understand it deeply enough, even what Rust does can provide some similar benefits without raising a full OS process boundary between bits of code.
You're saying that shared memory concurrency is the future? I think you've got it completely wrong. Shared memory concurrency is the past. It was thought to be a good idea in the 80s and 90s, but we now know that it's both hard to program, and that it works poorly with highly-parallel, high-performance hardware. In the future, we'll see more and more programming systems which don't provide shared memory concurrency at all.
I believe there will always be situations (though specific) where a shared-memory concurrency model is the most efficient and performant use of available resources. For this reason alone, shared-memory concurrency will always have a place. That said, I generally agree that the isolated parallel memory model is preferable for simplicity's sake.
Shared memory space, not necessarily shared memory. I referenced Erlang & Pony after all so it's not like I'm unaware of the issues here. You don't need a full OS process boundary to separate things; it is a very crude and blunt instrument and there are much better options available. Separation is nice but paying to cross OS boundaries with every message kills performance dead. A full OS process boundary between all concurrently-running threads would be insane overkill for a Rust, Erlang or Pony program, and in practice, with best practices and the use of some tooling, even for Go, even if it is just old-fashioned shared-memory at its core.
"Memory space" is usually called "address space". And I think you're confused - context switching is expensive but there's no context switching involved when you're running in parallel across N cores. We're talking about parallelism, not concurrency.
No, we're talking about what I'm talking about. I've started the whole thread and participated all the way down to here. I also find myself wondering if you understand what we're talking about. I know what I'm saying about terminology is a bit controversial, but the nature of what existing run time systems are doing is not. They've been doing it for decades.
Parallelism in JavaScript is simple using web workers. of course the tricky part as with all parallel applications is managing the complexity you create yourself. The only languages that seem to do a good job at handling this for you are challenging in other ways like Haskell and Erlang.
I don't think web workers are a good design. They are maybe simple, but lack in flexibility and capability. E.g. if I am not mistaken, you cannot control animations from a web worker. You also have to communicate with them using strings, therefore there is considerable overhead.
JavaScript and the related ecosystem in the browser lack a number of things, that are then patched over using additional complexity such as web assembly, bug handling/ feature workarounds in applications, mobile apps that you basically have to develop, because the web isn't useable/ performant enough for some stuff. Of course we have some extra experience now and it is easy to criticize in hindsight. E.g. the decision to make JavaScript less LISPy was perhaps a good decision for it's large adoption but longer term the homoiconicity could have been used for easier generation instead of e.g. stuff like WebAssembly. We also know, that stuff like Clojure(Script) is possible now. We also know, that we really want a native 64bit integer type instead of hodge-podge solutions with reduced precision of 53 bits or big decimal with worse/ complicated performance characteristics.
For the last 12 or so years, the web and related technologies became an application platform and a video delivery network with demands that rival native applications. We even use JavaScript on the server side through Node.JS. This has enabled tremendous things to happen but it is also perhaps the least stable platform to develop for. The web actively breaks some of the more complex applications that it has enabled precisely because the foundations are not solid enough. The current state of affairs is we keep piling on new and shiny APIs and technologies, that perhaps have considerable merit but we don't really fix the broken things in a way normal devs could keep up. I mean, how do you imagine to keep up, if even GMail, Google Maps, YouTube, Facebook and all its web apps, even major news sites and other applications backed by big companies still have rather obvious bugs in these products? I guess, "move fast and break things [for the later generations to fix all this mess]" it is.
Perhaps in 2005 we were still able to call across browser frames and abusing the one thread per frame model.
async / await isn't really concurrency. It's a mechanism for picking up the job after something else has been done (perhaps like a callback). In a way it's been an advantage of Javascript: one or two threads do all the work in a timely way without all that messy thread scheduling and context switches.
What is Python missing in your eyes? Cooperative multitasking is a relatively recent thing to be in vogue again, and python has had good support for multithreading and multiprocessing since forever.
Without in-process multi-threading with OS threads, writing performant parallel code is nearly impossible. Most of the time you have to drop down to C/C++, disable the GIL and implement parallelism there.
Every time someone brings up the GIL and performance, the answer is pretty much always "use a different language for the places you need especially high performance". All this sturm and drang about Python not being performant enough is missing the point. It's not supposed to be the answer to special cases where maximum performance is required! It's a general-purpose language that is built around community and easy readability and elimination of boilerplate, and be good-enough for a vast majority of usecases. Just like any language, there are usecases where it is not appropriate. Clearly, it has succeeded given its significant reach and longevity.
Most high-level languages implement their performance critical functions at a lower level. However, languages that rely on a GIL (Python and Ruby) or isolated memory spaces (V8) have an additional hurdle that if you want a model of concurrency with many threads acting simultaneously on a single shared memory space you have to do additional work.
For Python you either have to have your library pause execution of Python bytecode and do multithreaded things on your memory space, allocate an isolated memory arena for your multithreaded work, copy data in and out of it, and then spawn non-Python threads (see PEP 554 for an IRL example of this idea with the Python interpreter itself), or copy data to another process which can do the multithreaded things.
With PEP 554 (right now using the C API) I personally have no issue with multithreaded Python since the overhead of copying memory between interpreters is fast enough for my work but it is overhead.
There is nothing about the Python language spec (as gleaned from looking at how CPython works) that forces it to be slow or not handle parallelism. In fact the addition of the massive amounts of library code and language changes to support async show that Python isn't even immune to preventing added complexity.
The programming language landscape evolved and developed in the face of multi-code - async being a classic example. But the language that's often held up as the best solution to any given parallelism problem is Erlang. Erlang was built as a good programming model for concurrency on a single core, and then when multi-core came along, SMP was 'just' a VM enhancement and no programs needed changing at all to get the full advantage of all the cores (barring some special cases).