
Rachel presumably wrote her server in a reasonable language like C++ (though I don't see a link to her source), but when I wrote httpdito⁰ ¹ ² I wrote it in assembly, and it can handle 2048 concurrent connections on similarly outdated hardware despite spawning an OS process per connection, more than one concurrent connection per byte of executable†. (It could handle more, but I had to set a limit somewhere.) It just serves files from the filesystem. It of course doesn't use epoll, but maybe it should — instead of Rachel's 50k requests per second, it can only handle about 20k or 30k on my old laptop. IIRC I wrote it in one night.
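In outline it's the classic accept-and-fork loop: accept a connection, fork a child, have the child read the request, open the named file, and stream it back. Here's a hypothetical C sketch of that architecture (the real code is the assembly linked below; this sketch omits error handling and path sanitization, so don't run it on anything you care about):

    /* Hypothetical C sketch of httpdito's fork-per-connection
       architecture.  NOT the real code (that's server.s, below);
       error handling and path sanitization are omitted. */
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
      signal(SIGCHLD, SIG_IGN);    /* let the kernel reap dead children */
      int sock = socket(AF_INET, SOCK_STREAM, 0);
      struct sockaddr_in addr = {0};
      addr.sin_family = AF_INET;
      addr.sin_port = htons(8080);
      int one = 1;
      setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
      bind(sock, (struct sockaddr *)&addr, sizeof addr);
      listen(sock, 128);
      for (;;) {
        int conn = accept(sock, 0, 0);
        if (conn < 0) continue;
        if (fork() == 0) {         /* one whole OS process per connection */
          char req[4096], buf[8192], path[1024] = ".";
          ssize_t n = read(conn, req, sizeof req - 1);
          if (n > 0) {
            req[n] = 0;
            sscanf(req, "GET %1022s", path + 1);  /* "." + "/foo.html" */
            int fd = open(path, O_RDONLY);
            if (fd >= 0) {
              dprintf(conn, "HTTP/1.0 200 OK\r\n\r\n");
              while ((n = read(fd, buf, sizeof buf)) > 0)
                write(conn, buf, n);
            } else {
              dprintf(conn, "HTTP/1.0 404 Not Found\r\n\r\n");
            }
          }
          _exit(0);
        }
        close(conn);               /* parent drops its copy and loops */
      }
    }

Fork-per-connection scales worse than epoll, but as the numbers above show, it's not as crazy as it sounds.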

It might sound like I'm trying to steal her thunder, but mostly what I'm trying to say is that she is right. Listen to her. Here is further evidence that she is right.

As I wrote in https://gitlab.com/kragen/derctuo/blob/master/vector-vm.md, single-threaded nonvectorized C wastes on the order of 97% of your computer's computational power, and typical interpreted languages like Python waste about 99.9% of it. There's a huge amount of potential that's going untapped.

I feel like with modern technologies like LuaJIT, LevelDB, ØMQ, FlatBuffers, ISPC, seL4, and of course modern Linux, we ought to be able to do a lot of things that we couldn't even imagine doing in 2005, because they would have been far too inefficient. But our imaginations are still too limited, and industry is not doing a very good job of imagining things.

⁰ http://canonical.org/~kragen/sw/dev3/server.s

¹ http://canonical.org/~kragen/sw/dev3/httpdito-readme

² https://news.ycombinator.com/item?id=6908064

† It's actually bloated up to 2060 bytes now because I added PDF and CSS content-types to it, but you can git clone the .git subdirectory and check out the older versions that were under 2000 bytes.




> I feel like with modern technologies like ... we ought to be able to do a lot of things that we couldn't even imagine doing in 2005...

As a self-taught programmer, I would say that what all these less efficient but easier-to-learn technologies have done is enable people like me, who evidently are not geniuses like yourself, to write software. Should programming always be an ivory-tower thing?


It does take geniuses to design and operate the incredibly complex cloud-native distributed systems we all insist on building now.

It’s fair to point out how far you can get just programming one computer using traditional and well understood concepts like sockets and threads. And how weird it is that we live in a world where Kubernetes is mainstream and fun but threads are esoteric.


I'm not a genius. I started programming in BASIC. I haven't had a lot of schooling: I haven't had a computer science class since I was 12, and I didn't finish high school. I just didn't stop learning.

Adding unnecessary complexity doesn't always make things easier to learn.


Very well put. It's also a matter of use cases. These people seemingly implement servers for big companies; I just make shitty websites. We aren't their target audience, yet it comes off as 'this is what everyone should be doing'.


I think part of the point of this is that this approach isn't particularly complicated (writing in assembly is unnecessary, but everything else described in both the article and the comment above is basically the simplest way to make a webserver).


> single-threaded nonvectorized C wastes on the order of 97% of your computer's computational power

Can you elaborate on what this means exactly? For example, is there some reasonable C code that runs 33 times slower than some other ideal code? In what sense are we wasting 97% of our computer's computational power?


A good example of getting a ~3000x speedup from naive matrix multiplication in C here (slides 20 onward): https://ocw.mit.edu/courses/electrical-engineering-and-compu...

Includes a 9-level nested for loop, which is always great to see.


Thank you very much for posting this!

Roughly that 3000× is 18× from multithreading, 3× from SIMD instructions, 15× from tuning access patterns for locality of reference, and 3× for turning on compiler optimization options. This is a really great slide deck!
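For concreteness, the locality part of that win starts with something as simple as loop order. A minimal sketch (mine, not Leiserson's code):

    /* Locality of reference in matrix multiplication: the naive i-j-k
       order strides down a column of b, roughly one cache miss per
       iteration, while i-k-j walks b and c sequentially, row by row.
       A sketch, not Leiserson's code. */
    #define N 1024
    double a[N][N], b[N][N], c[N][N];

    void matmul_ijk(void) {   /* naive order: poor locality on b */
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

    void matmul_ikj(void) {   /* reordered: sequential access to b and c */
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          for (int j = 0; j < N; j++)
            c[i][j] += a[i][k] * b[k][j];
    }

The tiling in the slides goes further, but this one transposition of the loop nest already buys a good chunk of the locality win, and it's also what makes the inner loop vectorizable.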

I was assuming "single-threaded nonvectorized C" already had compiler optimization turned on and locality of reference taken into account. As the slide deck notes, you can get some vectorization out of your compiler — but usually it requires thinking like a FORTRAN programmer.

So I think in this case reasonable C code runs about 54× slower than Leiserson's final code. However, you could probably get a bigger speedup in this particular case with GPGPU. Other cases may be more difficult to get a GPU speedup, but get a bigger SIMD speedup. So I think my 97% is generally in the ballpark.

A big problem is that we can't apply this level of human effort to optimizing every subroutine. We need better languages.


That's why you have people working on Halide, Taichi, DaCe, Tiramisu.

- https://halide-lang.org/

- http://taichi.graphics/

- http://spcl.inf.ethz.ch/Research/DAPP/

- http://tiramisu-compiler.org/

This way you can have a researcher implementing the algorithm (say bilinear filtering) and a HPC expert who tunes it with parallelism, SIMD, tiling.

I wrote an overview of most DSLs for high performance or image processing in this issue: https://github.com/mratsim/Arraymancer/issues/347#issuecomme...


This is great! Which of these do you think could be extended to general-purpose programming without the HPC expert? Taichi and DAPP seem to be aimed at that goal, but you seem to be implying they don't reach it yet?


You can use them without the HPC expert; Halide, for example, has a good autotuner and has been used by Google and Adobe to create image filters for mobile devices.


Thank you!


I think it's referring to SIMD instructions [0].

I'm still kind of a newb myself, but from what I understand these are special CPU instructions that allow you to execute the same operation in parallel against multiple data points. This allows you to eke out a lot more performance. It's how simdjson[1] is able to outperform all other C++ json parsers. (A minimal sketch follows the links.)

[0] https://en.wikipedia.org/wiki/SIMD

[1] https://github.com/simdjson/simdjson
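The sketch, assuming x86 and the standard <immintrin.h> intrinsics:

    /* Minimal SIMD sketch: add two float arrays four lanes at a time
       using SSE intrinsics.  Assumes n is a multiple of 4 for brevity. */
    #include <immintrin.h>

    void add_scalar(const float *a, const float *b, float *out, int n) {
      for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];                 /* one add per iteration */
    }

    void add_sse(const float *a, const float *b, float *out, int n) {
      for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);      /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* 4 adds at once */
      }
    }

(Compilers will often auto-vectorize a loop this simple on their own; simdjson's trick is doing this sort of thing to parsing, where the compiler can't.)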


8 cores times 4 SIMD lanes is a 32× speedup, and 1 − 1/32 ≈ 97%; that's where "97%" comes from, as explained in the note I linked to.

It's pretty variable: some things we haven't figured out how to speed up with SIMD, sometimes we have a GPU, sometimes we can get 8 or 16 SIMD lanes out of SSE3 or AVX128 or 32 of them out of AVX256, sometimes you only have four cores, sometimes make -j is enough parallelism to win you back the 8× factor from the cores (though not SIMD and GPGPU). But I think 97% is a good ballpark estimate in general.


A realistic speedup to expect from vectorization is probably closer to 10x. Other overheads start to dominate once you've optimized your inner loops.

It's also not reasonable to expect to vectorize everything; for example a web server is unlikely to benefit from vectorization.


Just to clarify, I was only estimating a speedup of 4× from vectorization, while the other 8× comes from multithreading.

Fifteen years ago we thought regular expression matching and compilers were unlikely to benefit from vectorization, but now we have Hyperscan and Co-dfns, so they did. So I think it's likely that we will figure out ways to do a wider variety of computations in vectorized ways, now that the rewards are so great.


A great talk which gets you thinking in this mindset is "Data-Oriented Design and C++" by Mike Acton https://www.youtube.com/watch?v=rX0ItVEVjHc

As an example, if you are checking a boolean flag (1 bit) on an object, and it ends up being a cache miss (and the x86-64 cache line size is 64 bytes), then your computer just went through all the expense of pulling in 512 bits from RAM, yet it only used 1 of them. You are achieving 0.2% of the machine's possible throughput.
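A hypothetical sketch of the data-oriented fix Acton advocates: pull the hot flag out of the object so that every cache line you fetch is full of the data you're actually scanning:

    /* Hypothetical sketch: checking one flag per 64-byte object drags
       in a whole cache line per object; keeping the flags in their own
       dense array packs 64 of them into each line fetched. */
    #include <stddef.h>
    #include <stdint.h>

    struct entity {            /* 64 bytes: one cache line per entity */
      char payload[63];
      uint8_t alive;
    };

    size_t count_alive_aos(const struct entity *e, size_t n) {
      size_t live = 0;
      for (size_t i = 0; i < n; i++)
        live += e[i].alive;    /* ~one cache miss per entity */
      return live;
    }

    size_t count_alive_soa(const uint8_t *alive, size_t n) {
      size_t live = 0;
      for (size_t i = 0; i < n; i++)
        live += alive[i];      /* 64 flags per cache line fetched */
      return live;
    }

Pack the flags down to bits and you get the full 512 per line.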


I imagine if you could make the most of the vector instruction set in your code (where instructions operate on a vector of data at once instead of one element at a time), you'd get a huge performance boost for "free". GP seems to be working on a VM that lets you do that (a lot of it was flying over my head though; need some coffee).


Despite all my rants here about C, on my travel netbook I switched to XFCE as I could not stand the performance impact of all JavaScript and Python based extensions on GNOME.


If you're going to serve static files, nginx absolutely leaves this in the dust (think 50k rps on a beefy machine).


I would expect so, but did you mean 500k or something? "50k rps on a beefy machine" sounds like about the same as, or maybe even a bit slower than, 20k–30k on this 2011 laptop, which was how fast httpdito was last time I measured it.


Went back and looked at it. A webserver written in asm for the lols is okay, but my point was that you probably want a proven, battle-ready web server (along the lines of nginx or apache) if running something in production. So, 50k rps on a vanilla, well-used/well-maintained server > 50k rps on an experiment (and don't get this wrong, it's pretty impressive for what it is).


Yeah, I definitely wouldn't advise anyone to run httpdito in production. It's so lacking in observability that it doesn't even log hits, it doesn't do timeouts, and its MIME types are configured by writing in assembly language. But it shows that some surprising things are possible. And it can be handy to serve up some static pages from your laptop to your phone or whatever.


https://gist.github.com/willurd/5720255

"Each of these commands will run an ad hoc http static server in your current (or specified) directory, available at http://localhost:8000."


Yeah, httpdito is similar to those, but maybe with fewer security holes and definitely with fewer dependencies.


Of course it does!!





