I find these optimizations fascinating. Anyone familiar with Lemire likely knows about this, but I listened to a podcast episode[1] with him a few days ago and learned about `simdjson`, the tool he authored that parses JSON at 25x the speed of the standard C++ library.[2][3] It's worth looking at if you're into this sort of thing.
yep, great fan of his work here. thanks for sharing that podcast.
that kind of optimization requires you to know your machine architecture quite well. SIMD optimizations aren't new. but it's always amazing to see these performance increases on a single machine!
our CPUs and GPUs are quite amazing. we have decided, as a field, that we can get enough virtual CPUs, GPUs, or RAM on-demand, and that we shouldn't concern ourselves with what happens at that level.
it's actually a situation that saddens me sometimes.
SIMD convinced me to take college courses in algorithms and to learn higher maths. Things like image decoding rely heavily on doing transformations, and you have to know the math behind them exactly to be able to turn scalar math into vector math effectively.
On top of this, you have to identify what can and cannot be vectorized and how it can be integrated.
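To make the scalar-to-vector point concrete, here's a toy sketch of my own (nothing from a real codec): brightening 8-bit pixels. The interesting bit is that the clamp in the scalar loop disappears in the vector version, because the ISA happens to have a saturating add. You only spot mappings like that if you know both the math and the instruction set.

    // Scalar vs. SSE2 brighten; illustrative only.
    #include <emmintrin.h> // SSE2 intrinsics
    #include <cstdint>
    #include <cstddef>

    void brighten_scalar(uint8_t *px, size_t n, uint8_t delta) {
        for (size_t i = 0; i < n; ++i) {
            unsigned v = px[i] + delta;
            px[i] = v > 255 ? 255 : (uint8_t)v; // manual clamp
        }
    }

    void brighten_sse2(uint8_t *px, size_t n, uint8_t delta) {
        const __m128i d = _mm_set1_epi8((char)delta);
        size_t i = 0;
        for (; i + 16 <= n; i += 16) { // 16 pixels per iteration
            __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
            // saturating add replaces the explicit clamp
            _mm_storeu_si128((__m128i *)(px + i), _mm_adds_epu8(v, d));
        }
        for (; i < n; ++i) { // scalar tail for the leftovers
            unsigned v = px[i] + delta;
            px[i] = v > 255 ? 255 : (uint8_t)v;
        }
    }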
Working in SIMD isn't too hard in itself once you get down to the assembly and toss out all the extra stuff compilers add. If you look at how ffmpeg does it, they just assemble the hand-written assembly as the C function itself. Arm64 is very nice to do this in because it has an elegant way of defining vectors and the instructions that operate on them.
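To be clear, ffmpeg writes those kernels as hand-written .S files bound to C symbols; I can't reproduce theirs here, but the same sort of byte kernel written with NEON intrinsics gives a feel for how naturally Arm64 expresses "a vector of 16 bytes":

    // NEON take on a byte-brighten kernel; intrinsics, not ffmpeg's
    // actual hand-written assembly.
    #include <arm_neon.h>
    #include <cstdint>
    #include <cstddef>

    void brighten_neon(uint8_t *px, size_t n, uint8_t delta) {
        const uint8x16_t d = vdupq_n_u8(delta);
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            uint8x16_t v = vld1q_u8(px + i);
            vst1q_u8(px + i, vqaddq_u8(v, d)); // UQADD: saturating add
        }
        for (; i < n; ++i) { // scalar tail
            unsigned v = px[i] + delta;
            px[i] = v > 255 ? 255 : (uint8_t)v;
        }
    }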
After listening to the JSON episode, I've been spending the last few days coding up a SIMD implementation of LZ77 substring matching, for use in DEFLATE.
I used Zig, which has first-class SIMD support: no need to drop down to assembly, use intrinsics, or even pull in a library. I've only just gotten it working and haven't had time to profile it yet (I'm new to profiling code).
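The core of the matcher is something like this, though sketched here with x86 intrinsics rather than Zig's vector types, and simplified, so it's the idea rather than my actual code:

    // LZ77 match length: how many leading bytes of a and b agree,
    // compared 16 bytes at a time instead of one.
    #include <emmintrin.h>
    #include <cstdint>
    #include <cstddef>

    size_t match_length(const uint8_t *a, const uint8_t *b, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            // bit k of mask is set iff a[i+k] == b[i+k]
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
            if (mask != 0xFFFF)
                return i + __builtin_ctz(~mask); // first mismatching byte
        }
        for (; i < n; ++i) // scalar tail
            if (a[i] != b[i]) return i;
        return n;
    }

In Zig the same comparison is expressible directly on @Vector(16, u8) values, which is what makes it feel first-class.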
If you like this stuff, data-oriented design is a nice framework that isn't architecture-specific (though it can be extended to be arch-aware if you need to squeeze out more gains). Back when I worked in the PS3/X360 days, engines written for the PS3 had better cache locality (the SPUs' hard local-store limits forced it, rather than letting you eat cache misses), and as a result they ran faster on the X360 when ported.
You can do fun things like using radix sort with bucket sizes that line up well with L2/L3 caches (I saw a multi-order-of-magnitude speedup from that one) and data-aware layouts that net 10-30x speedups for the same data. Many RDBMSs use similar approaches, from what I understand.
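As a toy illustration of the layout point (names and numbers are made up):

    // If the hot loop only touches position, array-of-structs drags every
    // entity's cold fields through the cache; struct-of-arrays streams only
    // the bytes the loop actually reads.
    #include <vector>
    #include <cstdint>

    struct EntityAoS {            // array-of-structs
        float x, y, z;
        std::uint8_t health, team;
        char name[26];            // cold data sharing the cache line
    };

    struct EntitiesSoA {          // struct-of-arrays
        std::vector<float> x, y, z;
        std::vector<std::uint8_t> health, team;
    };

    void integrate(EntitiesSoA &e, float dx) {
        // touches only e.x: sequential, prefetch-friendly, auto-vectorizable
        for (float &x : e.x) x += dx;
    }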
I was recently disappointed and frustrated to learn that GCC will be enabling vectorization at -O2 soon. The realization that you have to specify both -O3 AND an architecture to get it to use AVX basically invalidated a bunch of benchmarking and testing I had done.
What's the point of building with x86-64-v3 if all your code is built at -O2 without vectorization enabled? Doh!
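A minimal loop for checking what your flags actually do (behaviour varies by GCC version, so treat the commands as a starting point and verify with the vectorizer report):

    // saxpy.cpp (hypothetical file name)
    //
    //   g++ -O2 -c saxpy.cpp                   # older GCC: no vectorization
    //   g++ -O3 -c saxpy.cpp                   # vectorized, baseline SSE2 only
    //   g++ -O3 -march=x86-64-v3 -c saxpy.cpp  # vectorized with AVX2
    //
    // Add -fopt-info-vec to have GCC report what it vectorized.
    void saxpy(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }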
It's a tricky situation. CPUs reduce their clock speed when you use wide vector instructions, so you can't enable everything by default. You have to use LTO+PGO, or the compiler won't know whether your code actually needs to be vectorised at the expense of slowing everything else down. That's one reason -O3 is sometimes slower than -O2 or -Os.
If so, I've been following it for a couple of years, but I put it out of my mind recently after moving to AMD. I could swear it was an Intel-only project, but a quick scan of the git repo suggests I'm wrong. So either I'm totally misremembering, or AMD support was added later.
Anyway, I can't wait to try it out again. I wonder why most projects don't just use this as their default JSON parser now?
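For anyone who hasn't tried it, usage is roughly this (adapted from the On-Demand example in simdjson's own docs; the file and field names are theirs):

    #include "simdjson.h"
    #include <cstdint>
    #include <iostream>

    int main() {
        simdjson::ondemand::parser parser;
        simdjson::padded_string json =
            simdjson::padded_string::load("twitter.json"); // throws on error
        simdjson::ondemand::document doc = parser.iterate(json);
        std::cout << uint64_t(doc["search_metadata"]["count"])
                  << " results\n";
    }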
For some reason, writing a less-than-completely-naive JSON parser in C++ seems to be a rite of passage for all C++ developers who eventually go on to great things.
1. https://corecursive.com/frontiers-of-performance-with-daniel...
2. https://www.youtube.com/watch?v=wlvKAT7SZIQ
3. https://arxiv.org/pdf/1902.08318.pdf