17 cycles still seems pretty slow compared to the methods explored in this article, which got the result down to 0.75 ns (~3 cycles?) for integer strings of a known length:
Also, the benchmark exactly as written there is missing some overhead, since the input will always be cached. Memory accesses aren't likely to be an issue here, but I always try to run benchmarks over a couple dozen or a couple hundred megabytes of input to guard against that.
This has been floating around in my random-stuff directory for ages. I think it's technically somewhere in the repo of a former employer, but if that recollection is correct, it was copied without attribution from some random website, so: fair game. I've always found it useful for thinking about this family of algorithms:
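If it wasn't exactly this, it was something in the same family as the widely circulated SWAR eight-digit parser (a sketch, assuming little-endian byte order and exactly 8 ASCII digits):

    #include <cstdint>
    #include <cstring>

    // Fold adjacent digits pairwise: each multiply-and-shift combines
    // neighbours (by 10, then 100, then 10000), halving the number of
    // partial values per step.
    uint32_t parse_eight_digits_swar(const char* chars) {
        uint64_t val;
        std::memcpy(&val, chars, sizeof(val));  // little-endian load
        val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;      // 10*a + b per byte pair
        val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;  // 100*p0 + p1 per pair
        return uint32_t((val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32);
    }

The magic constants are just (radix_step << lane_width) + 1, so each multiply performs a shifted add of neighbouring lanes in a single instruction.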
Does this handle the case where you try to parse a short string that abuts the end of a page, and the next page is unmapped? The problem is that you over-read and crash when you touch the unmapped page. Looking at the code, it doesn't seem to account for that case. (On the other hand, maybe the AVX instructions don't do an actual read if the lane is masked? In which case it would be fine.)
We had a similar case with some AVX-optimized string operation in glibc.
Yes. The bounds from the std::string are used to create a read mask, which is used in a masked load that does not fault on lanes outside the mask.
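A minimal sketch of that pattern, assuming AVX-512BW/VL for byte-granular masking (the actual code in the article may use a different granularity):

    #include <immintrin.h>
    #include <cstddef>

    // Build a byte mask from the known length, then do a masked load.
    // Masked-out lanes are architecturally guaranteed not to fault, so
    // this is safe even when bytes just past the string are unmapped.
    __m256i load_digits(const char* p, std::size_t len) {
        __mmask32 mask = (len >= 32) ? 0xFFFFFFFFu
                                     : ((1u << len) - 1);  // low `len` lanes
        return _mm256_maskz_loadu_epi8(mask, p);           // zero-fills the rest
    }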
If the length had been unknown (say, a C-style null-terminated string), the code would have needed to be more complex. You would either have to find the length beforehand or do only aligned loads.
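The aligned-load option works because an aligned vector load can never cross a page boundary. A sketch of the idea (hypothetical helper, not the article's code):

    #include <immintrin.h>
    #include <cstdint>

    // A 32-byte load from a 32-byte-aligned address stays within one page
    // (page sizes are multiples of 32), so if the string's first byte is
    // mapped, the whole load is. The caller must then ignore the bytes
    // before `p` and anything past the terminator.
    __m256i aligned_block_containing(const char* p) {
        auto addr = reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t{31};
        return _mm256_load_si256(reinterpret_cast<const __m256i*>(addr));
    }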
BTW, ARM's SVE and RISC-V's Vector extension have special "first-fault" load instructions that make that case simpler: they truncate the vector's mask/length at the lowest-numbered lane that would have faulted, instead of trapping.
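With ARM's ACLE intrinsics the pattern looks roughly like this (a sketch; handling of the returned lanes is omitted):

    #include <arm_sve.h>
    #include <cstdint>

    // SVE first-fault load: only the first lane can actually fault; any
    // later lane that would fault just clears its bit in the first-fault
    // register (FFR) instead of trapping.
    svuint8_t probe_bytes(const std::uint8_t* p, svbool_t* valid) {
        svsetffr();                                 // reset FFR to all-true
        svuint8_t v = svldff1_u8(svptrue_b8(), p);
        *valid = svrdffr();                         // lanes that actually loaded
        return v;
    }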
If you're parsing lots of numbers and they are often of similar length, then could you go for GPU-style SIMT with AVX-512 and parse 64 numbers at once?
Considered as a naive problem, you inescapably want to perform a series of multiply-and-add phases. For short strings, "just do it" might be slower than "look it up in a table". For longer strings, divide-and-conquer would mean chopping the input into short strings and doing the table lookup (see the sketch below).
I don't know that this is faster than simply doing the "multiply by ten and add" sequence. I'd have to understand modern ALU behaviour, parallelism, shortcuts, chip cache dynamics, you name it.
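To make the comparison concrete, a sketch of both options for a fixed 4-digit input (hypothetical helpers, not from the article):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Baseline: the "multiply by ten and add" sequence, one digit at a time.
    uint32_t parse4_naive(const char* s) {
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) v = v * 10 + uint32_t(s[i] - '0');
        return v;
    }

    // Table variant: precompute every 2-digit ASCII pair (a 64 KiB table),
    // then combine pairs with one multiply per divide-and-conquer level.
    static const std::array<uint8_t, 256 * 256> kPair = [] {
        std::array<uint8_t, 256 * 256> t{};
        for (int hi = 0; hi < 10; ++hi)
            for (int lo = 0; lo < 10; ++lo)
                t[std::size_t('0' + hi) * 256 + std::size_t('0' + lo)] =
                    uint8_t(hi * 10 + lo);
        return t;
    }();

    uint32_t parse4_table(const char* s) {
        uint32_t hi = kPair[std::size_t(uint8_t(s[0])) * 256 + uint8_t(s[1])];
        uint32_t lo = kPair[std::size_t(uint8_t(s[2])) * 256 + uint8_t(s[3])];
        return hi * 100 + lo;  // one combine level for 4 digits
    }

Whether the table wins comes down to exactly the cache dynamics mentioned above: a 64 KiB table doesn't fit in a typical L1 data cache.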
It would all have been so much simpler if we'd used hex instead of decimal. Octal doesn't work as well for me, no idea why. That said, it's much easier to ignore thumbs and count on 4 fingers of each hand than to grow another 6.
(I read the article. It descended into arcana of instruction sets and assumptions that postdate my DEC-10 ISA lessons, although I recall Digital had BCD instruction handling and, accordingly, a 6-bit-byte model in a 36-bit word. DEC-10 instructions could take many, many clock cycles and have 5 components of from, to, via, because, maybe attached to them.)
Maybe in the future don't comment on articles until you have read and understand them.
And while your comments on the DEC-10 and 6-bit bytes might be interesting, there are very few people still working in CS who have seen a DEC-10 outside a museum. I always find it interesting to learn about computers that were obsolete before I wrote my first program, but their performance characteristics were vastly different from what we have today. And current college grads have always lived in a world with 64-bit SIMD machines that can do a scalar multiply in 3-4 cycles.
[Edit to add]
I don't know why I'm salty about this, but Daniel Lemire is real good and doesn't deserve to be blown off, especially not by someone referencing a computer that was obsolete before today's retirement-age CS people were in high school. The VAX obsoleted the DEC-10 in 1977.
If I'm seeing the misunderstanding here, the article with that statement is not talking about string length. It's talking about non-numerical input: i.e., an "a" would "exceed 9".
Right, but an "a" would be past the end of the digits, and the code assumes you know the end of the digits.
It would make more sense if it was talking about the string length. Then you might find a non-digit before the end. But it specifically says you know where the end of the digits is, not the end of the string.
Edit: Or rather, I'm trying to figure out what code might exist that finds the end of the digits, but only sometimes. The article says "assume you already know this", but computers aren't magic: some code has to figure out where the end of the digits is. So how does that code work, such that it finds the end of the digits but sometimes counts non-digits as digits?
It bounds-checks just fine, as long as you pass it a null-terminated string. It just returns 0 on error, so there's no way to know if the string was, for example, larger than would fit in an int, or just totally invalid, or whether the user actually entered 0.
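Assuming this is about atoi: a quick illustration of why returning 0 on error is lossy, and the strtol interface that does report errors:

    #include <cerrno>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // atoi cannot distinguish "user typed 0" from "invalid input".
        std::printf("atoi(\"abc\") = %d, atoi(\"0\") = %d\n",
                    std::atoi("abc"), std::atoi("0"));

        // strtol reports range errors via errno and the parse end via endptr.
        errno = 0;
        char* end = nullptr;
        long v = std::strtol("99999999999999999999", &end, 10);
        if (errno == ERANGE) std::puts("out of range");
        (void)v;
        return 0;
    }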
For fast "data-oriented" parsers of other formats which contain numbers (which to be fair are uncommon), number parsing is often the slowest part. For fast data transformations in the genre of jq (but not jq specifically), parsing is often the bulk of the runtime.
Not a downvoter, but something doesn't need to necessarily be a bottleneck to warrant speeding up. Put another way, why deliberately pick a slower method if a faster one exists?
Many applications these days suffer from "death by a thousand cuts" - there's no single thing which makes it slow, just lots and lots of slightly slow things piling on top of each other.
https://kholdstare.github.io/technical/2020/05/26/faster-int...
Most of the overhead is found in dealing with variable length input, minus signs, and overflow checking.
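A sketch of where that overhead lives in a scalar parser (hypothetical helper; every check below is a branch per call or per digit):

    #include <cstdint>

    // A robust scalar parser must handle an optional sign, validate every
    // character, and guard against overflow; SIMD versions amortize or
    // restructure these checks.
    bool parse_i32(const char* s, int32_t* out) {
        bool neg = (*s == '-');
        if (neg || *s == '+') ++s;                     // optional sign
        if (*s < '0' || *s > '9') return false;        // empty or non-digit
        int64_t v = 0;
        for (; *s >= '0' && *s <= '9'; ++s) {
            v = v * 10 + (*s - '0');
            if (v > (int64_t{1} << 40)) return false;  // early overflow cut-off
        }
        v = neg ? -v : v;
        if (v < INT32_MIN || v > INT32_MAX) return false;
        *out = static_cast<int32_t>(v);
        return *s == '\0';                             // reject trailing junk
    }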