1BRC merykitty's magic SWAR: 8 lines of code explained in 3k words (questdb.io)
257 points by signa11 8 months ago | 83 comments



What an amazing step by step explanation!

More than 2 years ago I found that byte array view var handles are well suited to cooking up efficient SWAR routines in Java/Scala.

See many other examples of SWAR usage, such as parsing Base16/64 strings, java.time.* values, and numbers directly from byte arrays:

https://github.com/plokhotnyuk/jsoniter-scala/blob/master/js...


Great article, and given the context of the code this is a brilliant solution, but note that this assumes that the data is well-formed. Much of the value in an effective, battle-worn parser is efficient error checking and recovery.


It would be interesting to see a breakdown of the ways in which ill-formed inputs could affect the output. And how much work it would be to detect such inputs and return some kind of sentinel error value - in the same style as the current code.

(Not quite interesting enough for me to do it myself though ;-)


Multiplying digit bitfields by their respective powers of 10 and shift/adding via MUL is a (well?) known technique, see Lemire https://lemire.me/blog/2023/11/28/parsing-8-bit-integers-qui... .
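
Roughly, the 8-digit version of that trick looks like this in Java - an illustrative sketch (hypothetical helper, not code from the post or from Lemire), assuming 8 well-formed ASCII digits and a little-endian read through a byte-array view var handle:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import java.nio.ByteOrder;

    class SwarParseSketch {
        private static final VarHandle LONGS =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

        // Parses exactly 8 ASCII digits starting at `offset`; no validation.
        static int parseEightDigits(byte[] buf, int offset) {
            long v = (long) LONGS.get(buf, offset);
            v = (v & 0x0F0F0F0F0F0F0F0FL) * 2561L >>> 8;      // 10*2^8 + 1: digit pairs, 0..99
            v = (v & 0x00FF00FF00FF00FFL) * 6553601L >>> 16;  // 100*2^16 + 1: quads, 0..9999
            return (int) ((v & 0x0000FFFF0000FFFFL) * 42949672960001L >>> 32);  // 10000*2^32 + 1
        }
    }

Each multiply-and-shift folds adjacent lanes together: pairs of digits, then pairs of pairs, then the full 8-digit value.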


As per the article: SWAR is "SIMD Within A Register"


If people like stuff like this, the simdjson paper uses similar techniques, is extremely well-written and has great examples.

Paper: https://arxiv.org/abs/1902.08318

Github: https://github.com/simdjson/simdjson


This isn't SWAR though but I get how it might appeal.


Can someone explain to me how the 1BRC isn't bottlenecked on I/O? I don't understand how the CPU is the bottleneck.


Local disk I/O is no longer the bottleneck on modern systems: https://benhoyt.com/writings/io-is-no-longer-the-bottleneck/

In addition, the official 1BRC explicitly evaluated results on a RAM disk to avoid I/O speed entirely: https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva... "Programs are run from a RAM disk (i.e. the IO overhead for loading the file from disk is not relevant)"


Unfortunately there are still cases where local disk I/O can be a serious bottleneck that you do have to be aware of:

1. Cloud VMs sometimes have very slow access even to devices advertised as local disks, depending on which cloud

2. Windows

The latter often catches people out because they develop intuitions about file I/O speed on Linux, which is extremely fast, and then their app runs 10x slower on Windows due to opening file handles being much slower (and sometimes they don't sign their software and virus scanners make things much slower again).


Apparently, the cause of the long-standing Windows disk I/O problem was discovered a month or so ago, and MS was said to be working on a fix.

Whether it'll be constrained to Win 11 is yet to be seen.


That's interesting. A project at work is affected by Windows' slow open() calls (relative to Linux/Mac), but we haven't found a better solution than "avoid open() as much as you can".


It's likely Windows Defender, which blocks on read I/O to scan files. Verify by adding a folder to its exclusion list, though this isn't helpful if it's a desktop app. The difference is most noticeable when you're reading many small files.


Really? Where did you hear that? Conventional wisdom is based on a post by an MS employee years ago that described the performance issues as architectural.

MS does have something called "dev drive" in Win11. Dev Drive is what ReFS turned into and is an entirely different filesystem that isn't NTFS, which is better optimized for UNIX-style access patterns. The idea is that core Windows remains slow, but developers (the only people who care about file IO performance apparently) can format another partition and store their source/builds there.


I was surprised by this recently when optimizing a build process that uses an intermediate write-to-disk step. I replaced the intermediate filesystem API with an in-memory one and it was not measurably faster. Not even by a single millisecond.


How much data were you writing? If you don't fill the OS's page cache and the SSD controller's DRAM cache, and you're not blocking on fsync() or O_DIRECT or some other explicit flushing mechanism, then you're not going to see much of a difference in throughput.
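
For example, a small write like the sketch below (hypothetical names) normally returns as soon as the bytes are in the page cache; only the force() call would actually block on the device:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import static java.nio.file.StandardOpenOption.*;

    class WriteSketch {
        // Without force(), this mostly measures memory copies into the page cache, not the disk.
        static void write(Path path, byte[] data, boolean flush) throws IOException {
            try (FileChannel ch = FileChannel.open(path, CREATE, WRITE, TRUNCATE_EXISTING)) {
                ch.write(ByteBuffer.wrap(data));
                if (flush) ch.force(true);  // block until the data reaches the device, i.e. fsync
            }
        }
    }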


Only a few MB written then read, so that is a likely explanation.


Ahh, thanks, I didn't know about the ramdisk. Very interesting about I/O not being the bottleneck, though.


For some background: here is an interview with Daniel Lemire, who built an entire career based on the observation that I/O is often not the bottleneck. https://corecursive.com/frontiers-of-performance-with-daniel...


I haven't looked at the problem closely enough to answer, but could we start from the other direction: what makes you think that memory I/O would be the bottleneck?

From my limited understanding, we sequentially bring a large text file into L1 and then do a single read for each value. On most processors we can do two of these per cycle. The slow part will be bringing it into L1 from RAM, but sequential reads are pretty fast.

We then do some processing on each read. At a glance, maybe 4 cycles worth in this optimized version? Then we need to write the result somewhere, presumably with a random read (or two?) first. Is this the part you are thinking is going to be the I/O bottleneck?

I'm not saying it's obviously CPU limited, but it doesn't seem obvious that it wouldn't be.

Edit: I hadn't considered that you might have meant "disk I/O". As others have said, that's not really a factor here.


> maybe 4 cycles worth in this optimized version?

It's quite a bit more than that: just the code discussed in the post is around 20 instructions, and there are a bunch more concerns like finding the delimiter between the name and the temperature, and the hashtable operations. All put together, it comes to around 80 cycles per row.

When explaining the timing of 1.5 seconds, one must take into account that it's parallelized across 8 CPU cores.


You are right. In my defense, I meant to say "about 4 cycles per byte of input" but in my editing I messed this up. I'd just deleted a sentence talking about the number of bytes per cycle we could bring in from RAM, but was worried my estimate was outdated. I started trying to research the current answer, then gave up and deleted it, leaving the other sentence wrong.


Sorry, yes, I meant disk I/O, I should have clarified.


The test is executed with memfs. The file and everything else is in RAM at startup.


Since the dataset is small enough to fit into the Linux kernel page cache, and since the benchmark is repeated 5 consecutive times, the first iteration of the benchmark will be bottlenecked by disk I/O but the remaining 4 will not - i.e. all the data will be in RAM (the page cache).


Not being IO bound is part of the test design in this case. https://github.com/gunnarmorling/1brc?tab=readme-ov-file#eva...

"Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant)"


Used to do SWAR on the 68000 to good effect. 4 bytes operated on in parallel in a single instruction. Tricky to handle the overflow IIRC. I really like this article!
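
The overflow handling is the classic part: mask off the top bit of each byte so carries can't leak across byte boundaries, then patch the top bits back in. A Java sketch of the same idea on a 32-bit int (not the original 68000 code):

    // Adds four packed bytes lane-wise, without carries crossing byte boundaries.
    static int addPackedBytes(int a, int b) {
        int low  = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F);  // low 7 bits of each byte; carry stays inside the byte
        int high = (a ^ b) & 0x80808080;                 // top bit of each byte, ignoring carry-in
        return low ^ high;                               // fold the carry-in sitting in `low` into the top bits
    }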


> the real mystery is how a single guy working alone could come up with all this in just a few days of casually doing an online challenge with a T-shirt and a coffee mug as a reward.

Why is that a mystery? There are still people out there who actually know how to program a CPU and actually understand what they're doing. The real mystery is the lack of deeper understanding of most people who call themselves programmers, combined with the fact that they don't appear to know that they're seriously lacking.


You don't need to do such SWAR tricks in C# - it offers a first-class cross-platform SIMD API instead.

Evidently it works really well, as can be seen from a C# solution being the fastest 1BRC implementation that has been publicly seen so far: https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...


Can you vectorize this with SSE? Most of the magic could be done with a vector of four 32-bit integers.

The question is whether the cost of setting up the initial vector and extracting the result isn't prohibitive.


It can, and various other 1BRC implementations have (though I doubt hotspot would manage to do it itself, never mind that most 1BRC submissions were run with Graal for the lower startup overhead). The 32×32→64-bit multiplication poses an issue on the default SSE2 as it has no 32-bit or 64-bit multiplication, but SSE4.1 adds exactly the needed thing, pmuldq (although, as that returns a 64-bit result, you'd need two of such to process a full vector of 32-bit ints).


Did the challenge limit submissions to only using SSE2? Seems odd, given the prevalence of SSE4.2 support.

PMULUDQ is in SSE2, though I haven't checked if that's usable for the problem here. There's also PMULLD in SSE4.1 if you only need a 32-bit result. But for summing digits, perhaps SSE2's PMADDWD could be sufficient?


The official 1BRC was Java-only, so no using any architecture-specific SIMD at all; the test system did have AVX2 though, and that's what most non-competing native solutions (including mine) targeted.

Completely forgot about pmuludq, that works too for SSE2. But a 32-bit result is insufficient for the magic number method, needs to be at least 36-bit. I originally used vpmaddubsw+vpmaddwd, but switched to vpmuldq for the reduced register pressure, and I was already only parsing 4 numbers in ymm registers so the 64-bit result didn't affect me (after parsing 4 temperatures I immediately did the respective hashmap stuff for each).


The temperature fields are interleaved with name fields, so I don't think you'd get any extra benefit from SSE. Also, since the temperature field is variable-sized, it would probably not pay off even if it was stored by column.

SSE was successfully applied to finding the delimiter between the name and the temperature, though.


It can still be beneficial - yes, you need to load the temperatures individually, but there are enough operations afterward to make it worth it.


I imagine code like this gets auto-vectorized either right from the start or after Hotspot detects a hotspot


I wonder what you mean here. What code exactly would get auto-vectorized? Parsing the temperature surely not, since it's not in an array of fixed-size fields.


Once you've done the work to implement parsing as branchless bitwise operations on integers, I think that code can be extended automatically to operate on larger registers.


I think it would work if that was the only code in the loop. But the loop spans several more nontrivial operations, including hashtable insertion.


This algorithm could be done 32x in parallel with AVX instructions...

But Java doesn't really support SIMD use of the CPU except in a very few specific cases, which is sad.


Not quite 32x, as there are 3-5 bytes of input per number, and a range of -999…999 for the output (and you'd generally want to load those consecutive bytes at the same time). 4x or 8x is certainly possible though; and you can even do that in Java with jdk.incubator.vector to some extent, although the available operations are somewhat limited, and Hotspot's codegen leaves enough to be desired.


But there are 1 billion rows... So plenty of parallelism. Your 32 lanes can be working on totally different parts of the dataset.


Processing 32 inputs at 1-byte granularity means that you'll need to expand each of those 32 inputs across 5 registers, and means you lose the ability to trivially shift them along (would need 15 blends to align the 5 input bytes to 3 output bytes for the 32×i8 approach, vs a single vpsllvd or some shuffle for 8×i32), or do any arithmetic wider than 8-bit (i.e. no 64-bit multiply at the core of the method in the article). More lanes just for the sake of it makes no sense (not counting unrolling as you can unroll anything to any amount (give or take register pressure)).


Also, fwiw, re: your gather/scatter approach at https://news.ycombinator.com/item?id=38867074:

I believe you can't get away with just 16+16-bit per state transition case - you have to validate that the input is what the entry was for, and your proposed 16-bit fields don't include anything for that. (having a full 2^32-element table won't help you, and would be a massive 16GB by itself anyways).

I don't see how 2^16 counters can work, as there are clearly 400×400 = 160000 different name×temperature combinations to track; lowering to 160 temperatures is awful. (maybe you could squeeze that together with the output state bits though, which are somewhat fewer, but doesn't help much).

Which leaves with just the option of increasing to 8 bytes per entry.

That ends up as a massive table as e.g. a window of "4\nAb" has 400×40 possible input states, and is 10×141 cases, for a total of 22560000 transitions, 180MB; and there are more than just this window, and you need some empty space to reduce collisions (though some enormous L3-s maybe could still cover it. maybe.)

The current best native solution works at ~2 cycles per input byte per core, so your method per 4 bytes has to beat 8 cycles/element. Your described solution is (cycle counts on Intel, as Zen 4 has much slower gather/scatter):

1. gather the next block of 4 bytes for each element; let's say 0.33c/elt as these are all in L1.

2. gather 8-byte values from the state transition table; as the table is massive and random access, that's limited by the RAM to cache interface; though you only need 8 bytes, the CPU will fetch 64B cache lines, so even with DDR5 at 64GB/s that's just 1B reads per second, or 3c/elt at 3GHz.

3. vpconflictd for when elements hit the same histogram bucket (~1.2c/elt on Intel; 0.08c/elt on Zen 4; also needs some fallback for when the collisions do happen (luckily unlikely if you mask out states that don't terminate the record); also means that these steps 3&4&5 cannot be interleaved without a separate histogram table for each).

4. gather the histogram previous value (0.5c/elt?).

5. scatter the incremented histogram values (0.7c/elt?).

Furthermore, to handle the rarer temperatures/collisions (the distribution of names is uniform so there's no point in handling only a subset of those), you need some fallback mechanism. Targeting 400 temperatures gives you a 5% chance of a failed lookup per entry (i.e., for a 16-element register, there's a 56% chance that at least one will hit a bad temperature), and you must restore the failed lane to a working one very fast, or otherwise all lanes will end up dying quite quickly. Some options:

6.a. do a scalar ctz+blsr loop over a mask of the failed entries (a bunch of cycles and very branch-mispredicty);

6.b. increase to ~600 targeted temperatures, for only 0.2% chance of a failed lookup (still 3.1% over a 16-element vector), and do a scalar fallback when it does happen (still somewhat mispredicty);

6.c. store all failed entries to a buffer (scatter, 0.7c/elt, probably even with most entries being masked off? could be wrong though); and then of course whatever code for actually handling these 5% of more complex entries.

So that's somewhere around 0.33 + 3 + 1.2 + 0.5 + 0.7 + 0.7 = 6.43 cycles per 4 bytes, not far from the 8 cycles of current solutions, and that's not counting any of the intermediate arithmetic, assuming ideal cache & RAM performance (and that the CPU can actually OoO over all of the RAM latency and not flop on the gather/scatter of the histogram not having known addresses for a while), and doesn't leave much for the table initialization.


Can you provide more details on how that would work? Given that the input is CSV rows and not an array of integers alone, and the fact that getting to the next row is dependent on finding the delimiter in the current row.


There isn't really a dependency there. You can scan for a bunch of delimiters and save a bunch of row start and end positions and process the rows from there.

If (unlike the solutions that won this challenge) you would like to use a hash that consumes all the bytes in the input, then scanning for all the delimiters and bucketizing strings by length also lets you avoid branch mispredictions caused by random lengths.


You could choose 32 start points approximately evenly through the file, and process starting at each point in parallel.
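
Something along these lines (a sketch; assumes the whole file is in a byte[] and every row ends in '\n'):

    // Picks `n` starting offsets spread roughly evenly through `buf`, each snapped
    // forward so that every worker starts at the beginning of a row.
    static int[] chunkStarts(byte[] buf, int n) {
        int[] starts = new int[n];
        starts[0] = 0;
        for (int i = 1; i < n; i++) {
            int pos = Math.max(1, (int) ((long) buf.length * i / n));
            while (pos < buf.length && buf[pos - 1] != '\n') pos++;
            starts[i] = pos;
        }
        return starts;
    }

Worker i then processes the range from starts[i] up to starts[i+1] (or to the end of the file for the last worker).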


This article helped me understand bit fiddling better than any before it. Thanks to the author and the author of the really novel Java solution to the 1BRC challenge.


I remember the first time I read about this technique used by the Linux kernel to find the \0 at the end of a string. Of course, XORing with another char will detect that char instead.

But before you run to implement it as a drop-in replacement for strlen() all over your code, please note that in real-world cases the string length is not guaranteed to be a multiple of 8. When you read the last long, you might overrun your memory buffer and trigger a page fault.
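
For reference, the trick in question as a Java sketch - the classic zero-byte test from Bit Twiddling Hacks, plus the XOR variant for finding an arbitrary byte such as ';':

    // Nonzero result iff some byte of `word` is 0x00.
    static boolean hasZeroByte(long word) {
        return ((word - 0x0101010101010101L) & ~word & 0x8080808080808080L) != 0;
    }

    // Detects any other byte value by XORing it away first, e.g. b = ';' or '\n'.
    static boolean hasByte(long word, byte b) {
        long broadcast = (b & 0xFFL) * 0x0101010101010101L;  // copy b into all 8 byte lanes
        return hasZeroByte(word ^ broadcast);
    }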


> When you read the last long, you might overrun your memory buffer and trigger a page fault.

You know that page boundaries are always on a `PAGE_SIZE` alignment, which you can get with `getpagesize()` or `sysconf(_SC_PAGESIZE)`.

...but overrunning the declared size of a memory allocation is still UB, even if you don't trigger a page fault, so YMMV ;-)


Also, as I learned recently: memory read via UB is allowed to change under you. You're not guaranteed to get the same value the next time you access it, even if nothing else in your process is touching that memory.


Anything is allowed to happen if your program invokes UB. All bets are off (according to the C standard/abstract machine) for the rest of the runtime of your program, from the first time it happens.

In fact, because the compiler is allowed to re-order your program provided that the end result of any non-UB computation is the same (the "as-if" rule[0]), if part of your program invokes UB, it's possible for that to "reach backwards" and invalidate computations that appear to happen before the UB is/was invoked (by a straightforward reading of the source code).

[0] https://en.wikipedia.org/wiki/As-if_rule


My point is: we normally speak of UB as a C-source-specific thing. But some UB exists even in compiled code from any language.


That's because UB has a well-defined meaning in the C programming language specification, for over 35 years now. And C has been one of the most widespread programming languages in use over that time.

One of the important things about UB and C is that C is a language defined by a standard, not any one compiler - and there are a lot of compilers. If a C compiler does something weird with your source code, we can say whether the compiler is correct or not, or whether your code is at fault. For many languages, that is not the case, and "whatever the compiler does to your code" is de facto correct, and "whatever the compiled version of your code happens to do on any given platform" is well-defined, even if it wasn't what you meant.

I'd argue that if a compiled language does not define what it means by UB, it doesn't have "UB".

But also, if a person talks about "UB", it is reasonable for a listener to assume they are by default talking about the well-defined term in the C programming language standard - unless they specify otherwise. Because that's the most enduring definition of the term we have.


We used to use the BCD opcodes for this kind of thing: masking off the 0x30, shifting digits (if you want packed BCD).

I can't imagine that has been efficient for decades tho


The BCD instructions are one of the few things that was dropped in the x86-64 transition, so it wouldn't work at all anymore.


they'd never been extended to work on more than 8 bits, even in the 8086; they only really existed for 8080 compatibility, and arguably the 8080 primarily had them to ease the path from earlier intel processors designed for, if i'm not mistaken, literal pocket calculators

in pocket calculators you have to display the result in decimal after performing a single arithmetic operation, so it's much more efficient to do the arithmetic in bcd than to convert from decimal to binary, perform the arithmetic operation, and then convert back


Yeah, if you wanted numbers greater than 99 you had to use them in the x87 (if you had one)


Agner says that AAA has latency 5 on Cannon Lake, so using that instruction is a bit faster than doing the operations manually. But if you vectorize (or use SWAR) I imagine you can start to beat the legacy instructions with larger numbers.


This state space is pretty small - is there a super-optimizer that will just spit out a similar result?


Pretty cool. I'll have to come back to this

But can't someone just ask merykitty where they got the idea from?


Not saying the idea came from this, but if you were coding in late 70s to mid 80s, you may well have written lots of code using these techniques because 6502 assembly was fun and manageable enough to just write your software in it, especially if you had a Beagle Bros book at hand.


Anyone who coded in the demoscene too was writing oodles of weird bit-twiddling guff like this, from experience...


I think something like this was a lot more common knowledge around 20 years ago, and now seems like mystic arts. It just takes a mindset of thinking that everything is bytes and binary, and after you've seen a couple of SWAR snippets for stuff like popcount or first significant bit, you're able to make these inferences yourself.
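
The classic SWAR popcount is exactly that kind of snippet (a sketch; essentially the pure-Java fallback behind Long.bitCount, before HotSpot intrinsifies it to a POPCNT instruction):

    static int popcount(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);                          // 2-bit partial sums
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);  // 4-bit partial sums
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;                          // per-byte sums, 0..8
        return (int) ((x * 0x0101010101010101L) >>> 56);                    // add the 8 bytes together
    }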


Why do they compute and then throw away nextLineStart?

long nextLineStart = ...


That’s explained at the end of the post: in the real code, that’s the most efficient place to calculate it, and it was passed back out as well.


That was some key context lost in the simplification of the original code[1] for this blog post, though. The blog post's code returns the temperature (throwing away the next line), whereas the original adds it to a map and returns the next line.

[1] https://github.com/gunnarmorling/1brc/blob/dfec2cdbe6a0334cf...


There are many variations of the original code used in different solutions. Many of them return the temperature like the variant used in the post, but they split out the part that finds the decimal dot into a separate function. Then you can reuse that twice: to finish parsing the temperature, and to advance the cursor to the next row.


For context on why you should read this: this is a well written article that explains in detail how some bitwise operations work to parse a temperature (think -10.3, or 3.4) with no conditionals (no if-statements or switches). This results in very fast parsing. The language used is Java but it could just as easily be C or Swift.


Result is fast parsing with no error checking.

If the input wasn't in the expected format (e.g. a different decimal separator), the result would be incorrect.

Fun as an exercise, but hard to understand and maintain.


> the real mystery is how a single guy working alone could come up with all this in just a few days

Really not a mystery. This is not that complex for people with low level programming experience (embedded, (older) games, demoscene, performance systems, etc.).

Knowing these tricks and knowing how to apply them is basically a requirement in these areas.


Bit twiddling hacks are increasingly becoming a lost art even in the embedded/low level space. There's relatively few people around with any day-to-day need for them and I've found myself deleting a lot of the legacy ones I find because the compiler situation has changed so much from when they were originally written.

I think this is an excellent skills showcase, even if I don't think it's magical.


Back when the word "nanotechnology" used to mean itty-bitty robots* rather than some extremely fine chemical processes done in sequence, or extremely small particle size, I had wondered if we could get to the point of said itty-bitty robots before all of the brilliant people who made playable games for restricted consoles like the Atari had retired.

* More specifically, miniaturized robots consisting of one or more robot arms with multiple degrees of freedom, a CPU, some kind of storage (perhaps something like wound-up DNA serving as tape on reels), executing programs written by humans.


That concept sounds like fantasy; cells are shaped like cells because 'itty-bitty robots', and most other configurations, don't work thermodynamically, physically, or chemically at such small scales.


Cells are shaped like cells (which is to say ugly bags of mostly water) because they are self-perpetuating chemical reactions in solution. We usually conduct our chemical reactions in vessels that are round along at least one axis, maybe almost two for our round-bottomed flasks. This is natural because we're enclosing a maximum volume of our solution with a minimal surface (cell membrane or borosilicate glass).

On a macro scale, life tends to be bilaterally or radially symmetrical, but our robots are not necessarily like that, just considering even factory robots which are often an arm and little else. So, at the micro scale, I don't think they have to resemble "life" there, either. I'm hardly suggesting some kind of tinkertoy with one atom here and another atom there and the strut being, instead of wood, a covalent bond. No, I think we would need more atoms than that.

Frankly, we haven't much tried to scale down our robotics. Oh, you'll find someone who will produce the flagellar motor (a mere forty-five nanometers, compared to the ten thousand nanometers of a human cell) but not much else. I wouldn't worry about the thermodynamics and quantum effects until we're down to that motor level.


Do you realize you can determine this for yourself, right now?

By just taking standard equations found in textbooks, plugging in some numbers, and realizing that things such as the Ideal Gas Law don't hold at such scales.

You can even derive the equations for yourself to confirm you're not being misled by the textbooks, but that takes more skill and knowledge than is likely feasible to acquire without advanced study.


Fortunately, I have a physics degree so I am quite sure I am not being misled by the textbooks.

We will face different challenges at different scales, of course, but even as recently as 2016, researchers using a scanning tunneling microscope crafted an 8,192-bit message using fewer than ten thousand atoms. So, no, I see absolutely no reason why we couldn't have robots within an order of magnitude of the human cell -- we have bacteria which are much, much smaller and capable of a variety of functions as well as reproducing themselves.

You might look into Feynman's "There's Plenty of Room at the Bottom."


Are you getting confused as to the topic?

Clearly cell-like robots are 100% possible at very small scales, there’s no one debating you on this.


Yes. These are standard idioms in low-level performance engineering. This is the kind of thing that would be a fun puzzle for someone with this skill set, something you could come up with in an hour. This particular example does demonstrate a relatively complete understanding of the toolset.

I feel like this kind of knowledge is becoming a lost art these days despite having lost no relevance.


"a sufficient smart compiler...." (/s)

I had a question recently: would an extra 65th bit in some registers have a good use case for SWAR-style tricks?


Not directly, but I think it could be an alternative to the carry flag present on most CPU architectures (today that means x86 and ARM, and also historically pretty much anything else that was at all significant).

The much-hyped RISC-V is one of the few that doesn't have a flag register, because it's been claimed to be a performance bottleneck - in my somewhat heretical opinion, the actual reason is that its designers are deluded into thinking that C and UNIX are as fundamental to software as transistors are to hardware, and thus anything beyond what's needed to support that environment is not worth implementing.

But an extra bit per register would be like having a separate carry flag for each, potentially solving some problems with parallelization and also allowing checks for integer overflow at the point a value is stored into memory or used in a subsequent operation.

Having some kind of carry flag enables various programming tricks that can also be useful for SWAR, for example subtract-with-carry of a register with itself will either set or clear all bits.


These techniques are mostly not that useful because you can just use much bigger registers.


Imo the most important class in comp sci is computer architecture, specifically one that takes you from gates to a rudimentary ALU, to registers, to microcode, to some form of assembly coding.

There is nothing magical in computers once you can do that deconstruction in your head.



