
Rachel presumably wrote her server in a reasonable language like C++ (though I don't see a link to her source), but when I wrote httpdito⁰ ¹ ² I wrote it in assembly, and it can handle 2048 concurrent connections on similarly outdated hardware despite spawning an OS process per connection, more than one concurrent connection per byte of executable†. (It could handle more, but I had to set a limit somewhere.) It just serves files from the filesystem. It of course doesn't use epoll, but maybe it should — instead of Rachel's 50k requests per second, it can only handle about 20k or 30k on my old laptop. IIRC I wrote it in one night.
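In outline it's the classic accept-and-fork loop: accept a connection, fork a child, have the child read the request, open the named file, and stream it back. Here's a hypothetical C sketch of that architecture (the real code is the assembly linked below; this sketch omits error handling and path sanitization, so don't run it on anything you care about):

    /* Hypothetical C sketch of httpdito's fork-per-connection
       architecture.  NOT the real code (that's server.s, below);
       error handling and path sanitization are omitted. */
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
      signal(SIGCHLD, SIG_IGN);    /* let the kernel reap dead children */
      int sock = socket(AF_INET, SOCK_STREAM, 0);
      struct sockaddr_in addr = {0};
      addr.sin_family = AF_INET;
      addr.sin_port = htons(8080);
      int one = 1;
      setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
      bind(sock, (struct sockaddr *)&addr, sizeof addr);
      listen(sock, 128);
      for (;;) {
        int conn = accept(sock, 0, 0);
        if (conn < 0) continue;
        if (fork() == 0) {         /* one whole OS process per connection */
          char req[4096], buf[8192], path[1024] = ".";
          ssize_t n = read(conn, req, sizeof req - 1);
          if (n > 0) {
            req[n] = 0;
            sscanf(req, "GET %1022s", path + 1);  /* "." + "/foo.html" */
            int fd = open(path, O_RDONLY);
            if (fd >= 0) {
              dprintf(conn, "HTTP/1.0 200 OK\r\n\r\n");
              while ((n = read(fd, buf, sizeof buf)) > 0)
                write(conn, buf, n);
            } else {
              dprintf(conn, "HTTP/1.0 404 Not Found\r\n\r\n");
            }
          }
          _exit(0);
        }
        close(conn);               /* parent drops its copy and loops */
      }
    }

Fork-per-connection scales worse than epoll, but as the numbers above show, it's not as crazy as it sounds.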

It might sound like I'm trying to steal her thunder, but mostly what I'm trying to say is that she is right. Listen to her. Here is further evidence that she is right.

As I wrote in https://gitlab.com/kragen/derctuo/blob/master/vector-vm.md, single-threaded nonvectorized C wastes on the order of 97% of your computer's computational power, and typical interpreted languages like Python waste about 99.9% of it. There's a huge amount of potential that's going untapped.

I feel like with modern technologies like LuaJIT, LevelDB, ØMQ, FlatBuffers, ISPC, seL4, and of course modern Linux, we ought to be able to do a lot of things that we couldn't even imagine doing in 2005, because they would have been far too inefficient. But our imaginations are still too limited, and industry is not doing a very good job of imagining things.

⁰ http://canonical.org/~kragen/sw/dev3/server.s

¹ http://canonical.org/~kragen/sw/dev3/httpdito-readme

² https://news.ycombinator.com/item?id=6908064

† It's actually bloated up to 2060 bytes now because I added PDF and CSS content-types to it, but you can git clone the .git subdirectory and check out the older versions that were under 2000 bytes.




> I feel like with modern technologies like ... we ought to be able to do a lot of things that we couldn't even imagine doing in 2005...

As a self-taught programmer, I would say that what all these less efficient but easier-to-learn technologies have done is enable people like me, who evidently are not geniuses like yourself, to write software. Should programming always be an ivory-tower thing?


It does take geniuses to design and operate the incredibly complex cloud-native distributed systems we all insist on building now.

It’s fair to point out how far you can get just programming one computer using traditional and well understood concepts like sockets and threads. And how weird it is that we live in a world where Kubernetes is mainstream and fun but threads are esoteric.


I'm not a genius. I started programming in BASIC. I haven't had a lot of schooling: I haven't had a computer science class since I was 12, and I didn't finish high school. I just didn't stop learning.

Adding unnecessary complexity doesn't always make things easier to learn.


Very well put. It's also a matter of use cases. These people seemingly implement servers for big companies; I just make shitty websites. We aren't their target audience, yet it comes off as 'this is what everyone should be doing'.


I think part of the point of this is that this approach isn't particularly complicated (writing in assembly is unnecessary, but everything else described in both the article and the comment above is basically the simplest way to make a webserver).


> single-threaded nonvectorized C wastes on the order of 97% of your computer's computational power

Can you elaborate on what this means exactly? For example, is there some reasonable C code that runs 33 times slower than some other ideal code? In what sense are we wasting 97% of our computer's computational power?


A good example of getting a ~3000x speedup from naive matrix multiplication in C here (slides 20 onward): https://ocw.mit.edu/courses/electrical-engineering-and-compu...

Includes a 9-level nested for loop, which is always great to see.


Thank you very much for posting this!

Roughly that 3000× is 18× from multithreading, 3× from SIMD instructions, 15× from tuning access patterns for locality of reference, and 3× for turning on compiler optimization options. This is a really great slide deck!
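For concreteness, the locality part of that win starts with something as simple as loop order. A minimal sketch (mine, not Leiserson's code):

    /* Locality of reference in matrix multiplication: the naive i-j-k
       order strides down a column of b, roughly one cache miss per
       iteration, while i-k-j walks b and c sequentially, row by row.
       A sketch, not Leiserson's code. */
    #define N 1024
    double a[N][N], b[N][N], c[N][N];

    void matmul_ijk(void) {   /* naive order: poor locality on b */
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

    void matmul_ikj(void) {   /* reordered: sequential access to b and c */
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          for (int j = 0; j < N; j++)
            c[i][j] += a[i][k] * b[k][j];
    }

The tiling in the slides goes further, but this one transposition of the loop nest already buys a good chunk of the locality win, and it's also what makes the inner loop vectorizable.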

I was assuming "single-threaded nonvectorized C" already had compiler optimization turned on and locality of reference taken into account. As the slide deck notes, you can get some vectorization out of your compiler — but usually it requires thinking like a FORTRAN programmer.

So I think in this case reasonable C code runs about 54× slower than Leiserson's final code. However, you could probably get a bigger speedup in this particular case with GPGPU. Other cases may be more difficult to get a GPU speedup, but get a bigger SIMD speedup. So I think my 97% is generally in the ballpark.

A big problem is that we can't apply this level of human effort to optimizing every subroutine. We need better languages.


That's why you have people working on Halide, Taichi, DaCe, Tiramisu.

- https://halide-lang.org/

- http://taichi.graphics/

- http://spcl.inf.ethz.ch/Research/DAPP/

- http://tiramisu-compiler.org/

This way you can have a researcher implementing the algorithm (say bilinear filtering) and a HPC expert who tunes it with parallelism, SIMD, tiling.

I wrote an overview of most DSLs for high performance or image processing in this issue: https://github.com/mratsim/Arraymancer/issues/347#issuecomme...


This is great! Which of these do you think could be extended to general-purpose programming without the HPC expert? Taichi and DAPP seem to be aimed at that goal, but you seem to be implying they don't reach it yet?


You can use them without the HPC expert; Halide, for example, has a good autotuner and has been used by Google and Adobe to create image filters for mobile devices.


Thank you!


I think it's referring to SIMD instructions [0].

I'm still kind of a newb myself, but from what I understand these are special CPU instructions that allow you to execute the same operation in parallel against multiple data points. This allows you to eke out a lot more performance. It's how simdjson[1] is able to outperform all other C++ json parsers. (A minimal sketch follows the links.)

[0] https://en.wikipedia.org/wiki/SIMD

[1] https://github.com/simdjson/simdjson
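The sketch, assuming x86 and the standard <immintrin.h> intrinsics:

    /* Minimal SIMD sketch: add two float arrays four lanes at a time
       using SSE intrinsics.  Assumes n is a multiple of 4 for brevity. */
    #include <immintrin.h>

    void add_scalar(const float *a, const float *b, float *out, int n) {
      for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];                 /* one add per iteration */
    }

    void add_sse(const float *a, const float *b, float *out, int n) {
      for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);      /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* 4 adds at once */
      }
    }

(Compilers will often auto-vectorize a loop this simple on their own; simdjson's trick is doing this sort of thing to parsing, where the compiler can't.)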


8 cores times 4 SIMD lanes is a 32× speedup, and 1 − 1/32 ≈ 97%; that's where "97%" comes from, as explained in the note I linked to.

It's pretty variable: some things we haven't figured out how to speed up with SIMD, sometimes we have a GPU, sometimes we can get 8 or 16 SIMD lanes out of SSE3 or AVX128 or 32 of them out of AVX256, sometimes you only have four cores, sometimes make -j is enough parallelism to win you back the 8× factor from the cores (though not SIMD and GPGPU). But I think 97% is a good ballpark estimate in general.


A realistic speedup to expect from vectorization is probably closer to 10x. Other overheads start to dominate once you've optimized your inner loops.

It's also not reasonable to expect to vectorize everything; for example a web server is unlikely to benefit from vectorization.


Just to clarify, I was only estimating a speedup of 4× from vectorization, while the other 8× comes from multithreading.

Fifteen years ago we thought regular expression matching and compilers were unlikely to benefit from vectorization, but now we have Hyperscan and Co-dfns, so they did. So I think it's likely that we will figure out ways to do a wider variety of computations in vectorized ways, now that the rewards are so great.


A great talk which gets you thinking in this mindset is "Data-Oriented Design and C++" by Mike Acton https://www.youtube.com/watch?v=rX0ItVEVjHc

As an example, if you are checking a boolean flag (1 bit) on an object, and it ends up being a cache miss (and the x86-64 cache line size is 64 bytes), then your computer just went through all the expense of pulling in 512 bits from RAM, yet it only used 1 of them. You are achieving 0.2% of the machine's possible throughput.
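A hypothetical sketch of the data-oriented fix Acton advocates: pull the hot flag out of the object so that every cache line you fetch is full of the data you're actually scanning:

    /* Hypothetical sketch: checking one flag per 64-byte object drags
       in a whole cache line per object; keeping the flags in their own
       dense array packs 64 of them into each line fetched. */
    #include <stddef.h>
    #include <stdint.h>

    struct entity {            /* 64 bytes: one cache line per entity */
      char payload[63];
      uint8_t alive;
    };

    size_t count_alive_aos(const struct entity *e, size_t n) {
      size_t live = 0;
      for (size_t i = 0; i < n; i++)
        live += e[i].alive;    /* ~one cache miss per entity */
      return live;
    }

    size_t count_alive_soa(const uint8_t *alive, size_t n) {
      size_t live = 0;
      for (size_t i = 0; i < n; i++)
        live += alive[i];      /* 64 flags per cache line fetched */
      return live;
    }

Pack the flags down to bits and you get the full 512 per line.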


I imagine if you could make the most of the vector instruction set in your code (where instructions operate on a vector of data at once instead of one element at a time), you'd get a huge performance boost for "free". GP seems to be working on a VM that lets you do that (a lot of it was flying over my head though; need some coffee).


Despite all my rants here about C, on my travel netbook I switched to XFCE as I could not stand the performance impact of all JavaScript and Python based extensions on GNOME.


If you're going to serve static files, nginx absolutely leaves this in the dust (think 50k rps on a beefy machine).


I would expect so, but did you mean 500k or something? "50k rps on a beefy machine" sounds like about the same as, or maybe even a bit slower than, 20k–30k on this 2011 laptop, which was how fast httpdito was last time I measured it.


Went back and looked at it. A webserver written in asm for the lols is okay, but my point was that you probably want a proven, battle-ready web server (along the lines of nginx or apache) if running something in production. So, 50k rps on a vanilla, well-used/well-maintained server > 50k rps on an experiment (and don't get this wrong, it's pretty impressive for what it is).


Yeah, I definitely wouldn't advise anyone to run httpdito in production. It's so lacking in observability that it doesn't even log hits, it doesn't do timeouts, and its MIME types are configured by writing in assembly language. But it shows that some surprising things are possible. And it can be handy to serve up some static pages from your laptop to your phone or whatever.


https://gist.github.com/willurd/5720255

"Each of these commands will run an ad hoc http static server in your current (or specified) directory, available at http://localhost:8000."


Yeah, httpdito is similar to those, but maybe with fewer security holes and definitely with fewer dependencies.


Of course it does!!





