When it comes to the Go language, bare-metal performance certainly seemed to be something Pike was giving a shit about when he spoke last week at OSCON and the Emerging Languages Camp I organized. Go in its current fairly young stage has performance that's not too terrible; they've mostly optimized for raw compilation speed so far, which is interesting, and uniquely suited to Google's development problems. Pike has said that he wants Go to be a replacement for other systems languages. That's going to mean competing with those languages, performance-wise.
I think Clojure's concurrency model ends up more concisely and correctly expressing the "parallelism of the real world" when you consider the dimension of time. Rich Hickey has done some really important thinking there.
Finally, I think you'll find that anyone with a fixed budget for servers isn't going to think that making the most of "local physical concurrency" on every machine in their cluster is a "fool's errand". Hardware is cheaper than it used to be, but it isn't free, and deploying and maintaining it is costly and time-consuming. If you can make the most of your hardware and operations investments with a little more thought-work, why not do so?
The first public release of Go made little use of physical concurrency, though that's undoubtedly improved considerably given that the model allows for it. His earlier languages didn't go there, probably because they didn't have nearly as many people working on them, and physical concurrency wasn't the focus of the research. Go is more performant and provides more static assurances than the previous iterations, but it still has global GC. Algorithms and Correctness are still the primary focal points.
I find it funny that you're hung up on local physical concurrency — for me that's the prime signifier of "Scaling in the Small"! If you're going to have to distribute your workload across multiple machines, why not just run multiple copies of your single-cpu process on each machine, and let your network-level work distribution mechanism handle it?
We're not talking about shared memory supercomputing using OpenMP and MPI on special network topologies or NUMA hardware, just commodity machines running HTTP app servers in front of data stores. We aren't curing cancer, and our workload is already embarrassingly parallel (responding to discrete requests).
I've actually disabled some intra-request concurrency (some ImageMagick operations are multithreaded by default) on a system I'm working on now because it makes the workload wildly inconsistent — when independent requests try to take advantage of all the available CPUs, the latencies spike for everybody. It's just more software to have to monitor, and the ideal returns are slim.
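For anyone wanting to do the same thing, here is a rough sketch in Go of pinning an ImageMagick call to a single thread. It only builds the command; it doesn't run it. The `-limit thread` option and the `MAGICK_THREAD_LIMIT` environment variable are real ImageMagick settings, but the `resizeCommand` helper, the file names, and the use of the IM6-style `convert` binary name are assumptions for illustration, not a description of the setup above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// resizeCommand is a hypothetical helper: it builds (but does not run) an
// ImageMagick resize that is limited to one thread, so a single request
// cannot fan out across every CPU on the box.
func resizeCommand(src, dst, geometry string) *exec.Cmd {
	// "-limit thread 1" caps ImageMagick's internal threading for this call.
	cmd := exec.Command("convert", "-limit", "thread", "1", src, "-resize", geometry, dst)
	// The environment variable sets the same limit; it is included here as a
	// belt-and-braces measure.
	cmd.Env = append(os.Environ(), "MAGICK_THREAD_LIMIT=1")
	return cmd
}

func main() {
	// File names and geometry are placeholders.
	cmd := resizeCommand("in.jpg", "out.jpg", "800x600")
	fmt.Println(cmd.Args)
}
```

Calling `cmd.Run()` on the result would execute it, assuming ImageMagick is installed and on the PATH.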
It's not so much that I'm "hung up" on local physical concurrency, just that I don't see any reason to ignore easy gains. You can write maintainable, correct, concurrent programs that scale across cores today if you use technologies other than Node. So why wouldn't you?
Anyway, that's a good reply, and your ImageMagick case study is interesting. It just goes to show how individualized "scaling" really is. Thanks for taking the time.
I haven't actually had a project to use Node on yet, though I have an affinity for its Purity of Essence approach, and I've used Twisted a fair deal.
If it's going to actually scale or have high-availability, a system is already going to have to have a socket pool to distribute across machines. Since that already works to distribute across processes on one machine, why bother adding another layer to pool native threads? It just adds to the complexity budget.
It could easily get you much lower (1/N) latencies on unloaded systems, but in most cases at high volume the gains aren't going to be very big compared to running another N-1 single-threaded processes per box and it would take more concerted effort to keep the request latencies consistent.
It seems obvious that it's worth compromising the model and adding locks or STM to get an N00% gain for a solitary request, but what if that's only a 10% gain when you're pumping out volume? Do stop to consider that not everyone wants their individual app processes to scale across cores: actual simultaneous execution within one address space comes at a cost.
There are probably some cases at scale where the gains from thread-pooling are substantial, but I could see a lot of them being cases where the work wasn't very event-ish to start with, like big-data, batch-flavored stuff where Hadoop would work great (especially for data locality).
"some ImageMagick operations are multithreaded by default"
And a really icky default at that, for the performance reasons you mention and with the added bonus of a chance of hitting some pointless bug. I've been turning it off ever since I spent a pleasant evening staring at the stacks of wedged apaches that ended with a pile of ImageMagick functions culminating in futex-something-or-other.
It gets ickier too — IM has flags for limiting its memory usage, but instead of modifying the algorithm, it just implements its own VM system and swaps to tempfiles directly from userspace when it hits the constraint.
It is, undoubtedly, one of the more appalling pieces of software in wide use. Just to complete the wild tangent, GraphicsMagick is an interface-compatible fork that claims better performance and less cruft (while still weighing in at above 280 klocs). Flickr uses it so presumably it only gives you two brain aneurysms instead of 4.7.
I also can't help but gush about Varnish a little (a handy thing to put in front of an image processing server, for one thing), which has a number of post-1975 features beyond its memory management, such as an actually useful configuration language, statistics aggregation and reporting, and the inexplicably uncommon ability to manage and modify configuration without a restart.