Hacker News new | past | comments | ask | show | jobs | submit login
Parsing integers quickly with AVX-512 (lemire.me)
163 points by usdogu on Sept 24, 2023 | hide | past | favorite | 39 comments



17 cycles still seems pretty slow compared to the methods explored in this article, which got the result down to 0.75ns (~3 cycles?) (for integer strings of a known length):

https://kholdstare.github.io/technical/2020/05/26/faster-int...

Most of the overhead is found in dealing with variable length input, minus signs, and overflow checking.


I'm guessing the instruction pipeline gets packed, so 4-5x more than 3 cyles.


If all you care about is throughput, that's what you want


I think the technique there is largely covered within the op, but it seems like this may be a somewhat well known technique.


Also, the benchmark of exactly as written there is missing some overhead, as it will always be cached. Memory accesses aren't likely to be an issue here, but I always try to run over a couple dozen or hundred of megs of input in benchmarks to ward against that.


This has been floating around in my random stuff directory for ages, while I think it's technically somewhere in the repo of a former employer if that recollection is correct it was copied without attribution from some random website so, fair game. I've always found it useful for thinking about this family of algorithm:

https://gist.github.com/b7r6/0f80c18035d4db0dba298decc0b1ada...


Does this handle the case where you try to parse a short string which is abutting the end of a page, and the next page is unmapped? The problem is you over-read and crash when you try to read the unmapped page. Looking at the code it doesn't seem to take account of that case. (On the other hand, maybe the AVX instructions don't do an actual read if the field is masked? In which case it would be fine.)

We had a similar case with some AVX-optimized string operation in glibc.


Yes. The bounds from the std::string are used to create a read-mask which is used in a masked load that does not fault on lanes outside the mask.

If the length had been unknown (say, a C-style null-terminated string) then the code would have needed to be more complex. You would either have to find the length before-hand or done only aligned loads. BTW. ARM's SVE and RISC-V's Vector extension have special "first fault" load-instructions that make that case simpler: they instead modify the vector's mask/length to the lowest-numbered lane that would have faulted.


It looks like the instructions handle this https://stackoverflow.com/questions/54497141/when-using-a-ma...


If you're parsing lots of numbers and they are often of similar length, then could you go for GPU-style SIMT with AVX-512 and parse 64 numbers at once?


It's not clear that this will be a win because the really-wide transpose at the beginning could be pretty expensive.


Considered as a naieve problem, inescapably you want to perform a series of multiply and addition phases. For short strings, "just do it" might be slower than "look it up in a table". For longer strings, divide-and-conquer would be to chop it into short strings and do the table lookup.

I don't know this is faster than simply doing the "mutiply by ten and add" sequence. I'd have to understand modern ALU behaviour, parallelism, shortcuts, chip cache dynamics you-name-it.

It would all have been so much simpler if we'd made decimal be hex instead. Octal doesn't work as well for me, no idea why. That said, it's much easier to ignore thumbs and count on 4 fingers of each hand than grow another 6

(I read the article. It descended into arcanum of instruction sets and assumptions which post date my DEC-10 ISA lessons, although I recall Digital had BCD instruction handling and a 6 bit byte model in a 36 bit word accordingly. DEC-10 instructions could take many, many clock cycles and have 5 components of from, to, via, because, maybe attached to them)


Maybe in the future don't comment on articles until you have read and understand them.

And while your comments on DEC-10 and 6-bit bytes might be interesting, there are very few people still working in CS who have seen a DEC-10 outside a museum. I find it always interesting to learn about computers that were obsolete before I wrote my first program, but their performance characteristics were vastly different than what we have today. And current college grads have always lived in a world with 64-bit SIMD machines that can do a scalar multiply in 3-4 cycles.

[Edit to add] I don't know why I'm salty about this, but Doug Lemire is real good and doesn't deserve to be blown off, especially not by someone referencing a computer that was obsolete before today's retirement-age CS people were in high school. VAX obsoleted the DEC-10 in 1977.


I agree but it's Daniel not Doug.


you are correct, sir. Sorry to Daniel.


I agree with the parent, but from a different angle.

If you need to parse integers so quickly, why are you storing them as decimal strings?

If you're dealing with a massive CSV why not just compile to a more easily parsable format ahead of time?


Because we receive them over the wire as decimal strings.


Define over the wire here then.

The article mentions 0.8gb/s to 1.8gb/s throughput.

If youre downloading something off the internet, the bottleneck isn't going to be converting the strings to ints.

Further if you're downloading that much data, it suggests even more that you should be using a more space efficient representation.


For trading we focus on latency rather than throughput.

We cannot really accomplish much by telling a bunch of other companies to change the format they use to send data to all of their customers.


Ah ok, the post didn't mention trading, and that use case didn't occur to me.


> I am going to assume that you know the beginning and the end of sequence of digits.

First, that's a strong assumption. But...

> We check whether some value exceeds 9, in which case we had a non-digit character.

Huh? You already know where the end of the digits is. At least, I thought you did? So why would this be necessary?


You know how many bytes you're going to want to read, and you know they're supposed to be decimal digits, but you also know they might not be.


I don't understand how that is supposed to resolve the contradiction.


> We check whether some value exceeds 9,

If I'm seeing the misunderstanding here, the article with that statement is not talking about string length. They're talking about if non numerical input was given, ie, "a" would "exceed 9"


Right, but an "a" would be past the end of the digits, and the code assumes you know the end of the digits.

It would make more sense if it was talking about the string length. Then you might find a non-digit before the end. But it specifically says you know where the end of the digits is, not the end of the string.

Edit: Or, I'm trying to figure out what code might exist that found the end of the digits, but not always? The article says "assume you already know this", but computers aren't magic. There's code that figures out where the end of the digits is. So how does that code work, where it finds the end of the digits, but sometimes counts non-digits as digits?


Do library atoi() actually do SIMD?


Not literally atoi() because most people don't use that anymore, but the C++ std::from_chars() uses SIMD in GCC 12+, thanks to some of the same people behind the OP: https://lemire.me/blog/2022/07/27/comparing-strtod-with-from...


That would depend on the implementation. But it doesn't really matter, no one should be using atoi.


Yeah, iirc it was exploitable because it doesn’t do any bounds checking or something like that, right?


It bounds checks just fine, as long as you pass it a null-terminated string. It just returns 0 on error, so there's no way to know if the string was for example, larger than would fit in an int, or even just totally invalid, or if the user actually entered 0.


In general almost nothing uses SIMD for processing strings :(


Readers can try their hand at a similar problem using AVX2 here: https://highload.fun/tasks/1/leaderboard


incidentally: what's the term for a foreign-language word creation in a different culture?

as far as i can tell, "highload" (typically spelt in english) exists only in the russian-speaking communities and means "high performance".


Pseudo-anglicism?


it might be a calque? seems unlikely, though.


I know this was done mostly for the sake of it, but is there a reasonable scenario where parsing integers be the bottleneck with the current method?


For fast "data-oriented" parsers of other formats which contain numbers (which to be fair are uncommon), number parsing is often the slowest part. For fast data transformations in the genre of jq (but not jq specifically), parsing is often the bulk of the runtime.


Do downvoters care to explain?


Not a downvoter, but something doesn't need to necessarily be a bottleneck to warrant speeding up. Put another way, why deliberately pick a slower method if a faster one exists?

Many applications these days suffer from "death by a thousand cuts" - there's no single thing which makes it slow, just lots and lots of slightly slow things piling on top of each other.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: