I mean, the law part is still completely obvious. If you have a part that scales proportionally with workers (p) and a part that does not scale with workers (1 - p), then only the part that scales proportionally with workers actually scales with the number of workers (p / s); it is pretty much a tautology.
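For concreteness, here is that formula as a tiny sketch (function name and example numbers are mine, just to illustrate): p is the fraction that parallelizes, s the number of workers.

  # Minimal sketch of Amdahl's law: speedup with s workers when a fraction p
  # of the work parallelizes perfectly and (1 - p) stays serial.
  def amdahl_speedup(p: float, s: int) -> float:
      return 1.0 / ((1.0 - p) + p / s)

  print(amdahl_speedup(0.95, 8))          # ~5.9x
  print(amdahl_speedup(0.95, 1_000_000))  # ~20x: the serial 5% caps the speedup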
The only reason it is non-obvious how to take advantage of it in practice is that real-world problems very rarely break down so cleanly into a completely serial piece and a trivially, arbitrarily parallelizable piece. Even if you are lucky enough to find such a problem at a certain scale, the very act of breaking it down almost always eventually causes other previously ignorable constant or scaling factors to become important, either as you chop the problem into smaller pieces or as you scale to bigger problems. The difficulty lies entirely in posing the problem to be "solved", not in applying the law itself.
There are a lot of things that are obvious when stated, but for various reasons people don't really grasp their ramifications. Put another way, it often seems difficult to grasp just how significant an already superficially internalized concept really is.
E.g. the pigeonhole principle -> no compression scheme can losslessly shrink every input (among other similar proofs)
Another favorite of mine is appreciating just how "weird" it is that pi, a constant, exists. Take a rope and tie it around the Earth so that it fits snugly on the Earth's surface. How much more rope does it take so that the entire rope circling the whole of the Earth can be lifted 10 feet off its surface? About 63 feet (this was at least surprising to me the first time I was asked, which was long after I first learned about pi).
It doesn't have anything to do with pi though. If you had a cube-shaped earth and the same rope, you would need 80 ft of additional rope (instead of about 63 ft for a spherical planet).
We're talking about a rope around the equator, right? Circumference = 2*pi*radius, so extra circumference = 2*pi*(extra radius): lifting the rope 10 feet everywhere takes about 63 more feet of rope, regardless of how big the planet is.
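A quick check of the geometry (my arithmetic; note that the answer does not depend on the planet's radius at all):

  import math

  # Raising a circular loop by h everywhere adds 2*pi*h to its circumference,
  # regardless of the original radius.
  h = 10
  print(2 * math.pi * h)  # ~62.8 extra feet for a round planet
  # If the planet is a cube and the rope turns sharp corners (each side simply
  # pushed out by h, which is how I read the 80 ft figure above), the extra
  # length is 2*h per side instead:
  print(8 * h)            # 80 extra feet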
Yes. This is a constant issue when non-scientists discuss scientific theories. A law is just a convenient equation that works sometimes, in certain conditions.
Except this is closer to a mathematical theorem than a law like, say, Boyle's law. You might be able to rewrite an algorithm so it has a larger portion that's parallelizable, but that's the only way to get around it. You can't just keep adding parallel computers and enter a qualitatively different regime.
It's a model that tells you how something will behave under certain conditions. It works if the program being run can be neatly separated into purely serial and purely parallel (infinitely scalable) parts. In reality it breaks down at some point, because nothing scales forever and there is some communication overhead.
It is similar to the ideal gas law, for example, which can be derived from the kinetic theory of non-interacting molecules and yet fails when the molecules start interacting.
Laws span the whole spectrum from purely empirical to provable within some theories' frameworks.
I predicted that because some computations are hard or impossible to parallelize, attention would shift to solving real-world problems with computations that can be parallelized.
Maybe this is one factor in ML: deep learning neural nets vs. symbolic approaches?
However, engineers tend not to develop techniques without actual hardware [1]. And CPU core-count has remained much lower than I expected [2]. Therefore, my prediction has not come to pass, yet. But it totally will!
1. Speaking for myself: I can imagine a maybe-possible situation one or two steps away from present reality but not see the consequences... yet the instant I learn it is definitely possible, I can see the consequences.
2. Yes, GPU core-count is sky-high, but is limited by memory access.
I'm saying that techniques that are easy to parallelize will become more popular.
Of course he's right that if you can make a serial technique faster, you should do that.
I see a way to support his position: in past decades we got free speedups from silicon - then we hit a kind of "Peak Silicon". Just as for Peak Oil, it makes sense to try techniques that weren't worth it before.
Personally, I think the first step should be to give up all the horrific layers of bloat-on-bloat; but instead we got (e.g.) faster JVMs and JIT JS compilation.
But what about the step after that? He thinks we should mine the tailings (and he's surely right there's a lot to be found there).
I think multi-core seems the only extendable solution (til I guess computers are the size of houses again). But because it's so difficult to parallelize code, we'll instead change our approach, and favour techniques that are easy to parallelize. Arguably, this is currently happening with DL.
I've only skimmed the links you gave, so I don't have a deep appreciation of his position.
Formal problems often have theoretical limits, and solutions approaching them. But for "real-world problems", you can often change the problem, and even change the context of where that problem originates - perhaps eliminating that specific problem entirely [1].
So we'll change problems/contexts to ones that are easily parallelizable.
In that dispelling-myths paper, Patrick Madden notes that a less efficient but faster-because-parallel method still uses more energy. A good point, but I'll note that because energy consumption is superlinear with frequency (e.g. a 2-core method at half the clock speed of a 1-core method will use less energy, assuming perfect parallelization), a parallel solution can be both faster and more energy efficient. GPU compute shows this, e.g. bitcoin mining.
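A rough back-of-the-envelope for that energy point (my numbers; assuming dynamic power scales roughly with f^3 because supply voltage is scaled down along with frequency, and assuming the work parallelizes perfectly):

  # Hypothetical comparison: 1 core at frequency f vs 2 cores at f/2.
  # Assumptions: throughput ~ cores * freq, per-core power ~ freq**3.
  def relative_energy(cores: int, freq: float) -> float:
      time = 1.0 / (cores * freq)   # time to finish a fixed amount of work
      power = cores * freq ** 3     # total power draw under the f^3 assumption
      return power * time

  print(relative_energy(1, 1.0))  # 1.0  (baseline)
  print(relative_energy(2, 0.5))  # 0.25 (same finish time, roughly 4x less energy)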
His last paragraph begins:
> Despite these challenges, there is no choice but to forge ahead with parallelism
...which doesn't sound like disagreement. He's mainly criticising breathy papers.
Finally... why is he known by the name of the Half-Life protagonist? Does he look like him, or is there some thematic connection?
1. Maybe the entire context can be formalized and solved, so you can't "change the problem", but even then, real-world problems tend to keep changing, because of changing customer demands, competition, technology, legislation etc.
>Personally, I think the first step should be to give up all the horrific layers of bloat-on-bloat; but instead we got (e.g.) faster JVMs and JIT JS compilation.
I think this is already happening. The trending languages I see on HN right now are Rust, Nim, and Zig. We're going back to native, and in Rust's case also with fearless concurrency as a paradigm.
Right, but how does that compare to the incumbents? The JVM and .NET CLR don't seem to be going anywhere.
In this day and age I don't think the abstraction layer is the issue. Modern bytecode VMs are already able to be within an order of magnitude of native code, and I'd imagine there's still plenty of room there for improvement (WASM comes to mind). Unless you're on constrained hardware, or developing directly against hardware (rather than using an operating system, which is itself a horrific layer of bloat-on-bloat), a bytecode VM is still perfectly reasonable even for soft-real-time applications (let alone for applications that are throughput-sensitive rather than latency-sensitive).
>Modern bytecode VMs are already able to be within an order of magnitude of native code
Looking at the slowing of hardware advances, this could be a 10-year lag in performance. I think that's quite significant.
I think byte code VMs have their place, and for many applications that may be enough, but if you have high performance needs it may be worth checking if the price is one you're willing to pay.
The significance of a "10 year lag in performance" has wildly different meanings between the present day and, say, 10 years ago. 2010 (serial) performance on 2020 hardware would be hardly perceptible. 2000 performance on 2010 would be much more noticeable. 1990 performance on 2000 hardware would be catastrophic.
And yet, Java, C#, Erlang, Perl, Python, Ruby, Tcl... all of these languages and their VMs are decades old now. And yeah, the languages toward the end of that list ain't exactly speed demons, but the ones toward the front are already commonly used for high-performance applications with pretty dang good success, and even the ones toward the back of the pack are there for other reasons beyond just the VM (global interpreter locks, parsing overhead, things like that).
The VM, that is to say, ain't the issue. Hell, things like SPIR-V demonstrate that VMs are perfectly fine even for GPU computation. There's a price, sure, but eliminating that price is about as premature of an optimization as it gets for all but the most constrained of environments. There are almost certainly worse bottlenecks - and those bottlenecks are probably around I/O and memory, knowing most applications (and a bytecode VM often helps here, since the opcodes can be tuned for minimum size, often giving even RISC CPU opcodes a run for their money).
If it makes your life harder, yes, it's premature optimization. But Rust, Nim and Zig are quite high-level languages; they don't drag me down in terms of developer productivity.
With C/C++ I agree: depending on how long-lived your application is, you'll spend weeks or even years on debugging that you wouldn't have in Java or C#.
In the top 10 you will always find bytecode/VM-based solutions not far from the C/Rust implementations. Of course, these are highly optimized, but the assumption that native is always better is not always true.
I've seen a few of them; the highly optimized ones usually fall back to some form of manual memory management, which then becomes way more complicated than in a language like Rust which has all this by default.
What I find so beautiful about the native/close-to-the-metal languages is that you can have clean code that's also near optimal, without optimizing a lot. Compared to the lengths people go to in order to get some Python code to run fast, it's the easier way.
> which then becomes way more complicated than in a language like Rust which has all this by default.
Um, no:
1. Manual memory management doesn't have to be complex at all - and is arguably much easier to reason about than garbage collection, even if it's more error prone / less guaranteed-to-be-safe
2. Rust's manual memory management (i.e. the default kind of memory management in Rust) is of at least the same complexity as any other sort of manual memory management - if not more due to the need to be explicit around lifetimes and borrowing.
The memory management strategy is, in any case, orthogonal to whether there's a VM involved; there are garbage-collected native languages (Lisp, D) and not-garbage-collected VMs (WASM, or languages like Perl and Tcl if you count reference-counting as "not-garbage-collected") galore.
... how high did you expect CPU core count to be at this point with close to uniform memory access costs, and what changes did you expect to enable that degree of concurrent bus access without catastrophic stalling due to contention?
I was thinking of GPU cores having random write access, but yes that has concurrent access problems.
BTW how do GPUs solve the concurrent bus access you mention, with many cores writing pixels?
Also, is this GPU-style processing, where each core writes a separate result (a texture/display pixel or transformed vector) but can read randomly (from textures), the only approach to parallelization that works? It's a form of scatter-gather.
I thought about Amdahl's Law a lot when I started Sauce Labs. To make browser test automation faster, the initial plan was to throw more hardware at the problem by getting more servers and running tests in parallel. But we were always looking for other things to optimize, because we knew "throw more hardware at the problem" was going to stop working at some point. One example was virtual machine startup speed. In the early days on EC2, Windows VMs took 10-ish minutes to boot up; I think part of the reason was that they had to do a disk image transfer first. But if you keep lots of redundant copies of disk images in lots of places, you can reduce that waste of time on pre-boot network data transfer and ride the Amdahl's Law curve a little longer. Another trick is having the VM "pre-booted" by keeping a VM snapshot to jump to, avoiding the entire boot process. Of course, getting this much low-level control over the boot process to optimize for parallelizability means building out your own cloud, which is another story for another day...
One thing I was taught in university related to this is that in optimisation, "every little bit" does NOT help. This insight applies to everything, and is the cause of great frustration. The world should know about this law, and that optimising the 3% case is useless in any field if there's an 80% case staring you in the face.
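To put rough numbers on the 3% vs 80% point (same Amdahl-style arithmetic, applied to optimization in general rather than to parallelism; the improvement factors are made up):

  # Overall speedup from improving a fraction f of the runtime by a factor k.
  def overall_speedup(f: float, k: float) -> float:
      return 1.0 / ((1.0 - f) + f / k)

  print(overall_speedup(0.03, 1000))  # ~1.03x: eliminating the 3% case buys almost nothing
  print(overall_speedup(0.80, 2))     # ~1.67x: merely halving the 80% case buys far more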
Amdahl's law describes the scaling behavior of systems with contention ("serialized workload portion"). This was generalized by Neil Gunther into the Universal Scalability Law, which additionally takes into account the "cross talk" needed for coordination between nodes. Systems with cross talk always run into configurations where adding more nodes will decrease the throughput of the system.
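For reference, a sketch of the USL as I understand it (parameter names and example values are mine): sigma models contention, kappa models the coherency cross talk.

  import math

  # Universal Scalability Law (Gunther): relative throughput with n nodes.
  def usl_throughput(n: int, sigma: float, kappa: float) -> float:
      return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

  sigma, kappa = 0.05, 0.001
  for n in (1, 8, 32, 128, 512):
      print(n, round(usl_throughput(n, sigma, kappa), 1))
  # With kappa > 0 the curve peaks near n = sqrt((1 - sigma) / kappa) (~31 here)
  # and then declines: past that point, adding nodes reduces throughput.
  print(round(math.sqrt((1 - sigma) / kappa)))  # 31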
Rule 42a.1: if someone has created a novel thread, the community should support it with upvotes even if they disagree with the conclusions or assertions.
Is there a separate rule (or phrase/idiom) one can quote for the "Serial Programs" part of Amdahl's Law?
This is one area where teams waste a lot of time; spending too much effort improving an area which doesn't matter so much. Amdahl's law could be quoted, but the articles generally skew heavily towards parallelism.
This concept generalizes really well to many things outside of computing. For example, this is why adding more engineers to a team can slow things down.
Amdahl's law is more simplistic than that. You're suggesting that the benefit of adding an additional worker is negative. Amdahl's law is not representative of most systems; your generalization seems better. The marginal benefit as a function of resources depends on the system in question. It isn't just a flat positive percentage.
I’m not so sure of that. Amdahl's law tells you the speedup to expect when adding more cores to the parallelized portion of an algorithm, whereas adding a new member to a team slows down one or more members for a matter of weeks/months, because those members are spending time teaching the new member that they would otherwise spend producing.
The insight from Amdahl's law is what led to the creation of hybrid processors like the Cell (used in the PlayStation 3 console). They quickly became irrelevant as GPGPU became more mainstream.
> ... if a program needs 20 hours to complete using a single thread, but a one-hour portion of the program cannot be parallelized ... then regardless of how many threads are devoted to a parallelized execution of this program, the minimum execution time cannot be less than one hour.
It really does seem obvious. You can either do the sync part before or after the parallelized part. Since you cannot have infinite horizontal compute, the parallelized part will take some time ε > 0 to finish, and the total execution time will be 1 + ε hours.
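With the numbers from the quote (1 serial hour plus 19 perfectly parallelizable hours split across s threads):

  # 20-hour job: 1 hour serial, 19 hours perfectly parallelizable.
  def runtime_hours(s: int) -> float:
      return 1 + 19 / s

  for s in (1, 4, 19, 1000):
      print(s, round(runtime_hours(s), 2), round(20 / runtime_hours(s), 1))
  # The runtime approaches 1 hour but never reaches it, so the speedup is
  # capped at 20x no matter how many threads are added.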
I had a special topics course in parallel computing in university - Amdahl's law is probably the single biggest takeaway from that course.
It's amazing how applicable it is to more general things in life as well - can't remember examples right now but I've reasoned about many day-to-day things and realised Amdahl's law applies
It also applies to chemistry, physics, and everything else. If you have a reaction that depends on step A and B, and A must happen before B, but B can be sped up arbitrarily with diminishing returns by introducing an enzyme, then the whole reaction will still take slightly longer than the time it takes to complete A.
In my research group we sometimes benchmark parallel systems (databases with parallelizable queries). Whenever there's a student thesis, I make the student explain Amdahl's Law and contextualize their work with respect to it.
From this, most students correctly get the idea that they have to analyze the parts of the workload separately, as their parallelization behaviour is not uniform.
For example, loading the data from a spinning disk - given one disk - behaves differently than the evaluation of a query once the data has been loaded.
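A toy model of what "analyze the parts separately" looks like (entirely made-up numbers; loading is bound by the number of disks, evaluation by the number of cores, each phase assumed to scale perfectly within its own resource):

  # Two phases of a query that scale with different resources.
  def query_seconds(load_s: float, eval_s: float, disks: int, cores: int) -> float:
      return load_s / disks + eval_s / cores

  print(query_seconds(60, 60, 1, 64))  # ~60.9 s: more cores alone barely help
  print(query_seconds(60, 60, 4, 64))  # ~15.9 s: the loading phase needs its own scaling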
I often use a variation of this line of thinking for determining if something is worth optimizing.
Sometimes it can be hard to tell from profiling output or flame graphs just how much time is spent somewhere. Before trying to make something faster the first thing I do is turn it into a no-op.
If avoiding doing something entirely makes the program run 2% faster, then making that run in half the time is only going to give an overall 1% improvement.
I don't know if there's a name for this variation. It's just the same thing applied to different components and not specifically related to parallelism.
Technically that's still just Amdahl's law; the statement has nothing to do with parallelism. I'm not sure when it became so heavily associated with parallelism.
"In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967."
Maybe I'm missing something here but isn't this mind-numbingly obvious? I've never read this before but this is exactly how my mental model for parallelisation has always worked.
...later I became a developer. Turns out it's not obvious.