17 cycles still seems pretty slow compared to the methods explored in this article, which got the result down to 0.75 ns (~3 cycles?) for integer strings of a known length:
Also, the benchmark exactly as written there is missing some overhead, since the input will always be cached. Memory accesses aren't likely to be an issue here, but I always try to run benchmarks over a couple dozen or a couple hundred megabytes of input to guard against that.
This has been floating around in my random-stuff directory for ages. I think it's technically somewhere in the repo of a former employer, but if that recollection is correct, it was copied without attribution from some random website, so: fair game. I've always found it useful for thinking about this family of algorithms:
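If it wasn't exactly this, it was something in the same family as the widely circulated SWAR eight-digit parser (a sketch, assuming little-endian byte order and exactly 8 ASCII digits):

    #include <cstdint>
    #include <cstring>

    // Fold adjacent digits pairwise: each multiply-and-shift combines
    // neighbours (by 10, then 100, then 10000), halving the number of
    // partial values per step.
    uint32_t parse_eight_digits_swar(const char* chars) {
        uint64_t val;
        std::memcpy(&val, chars, sizeof(val));  // little-endian load
        val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;      // 10*a + b per byte pair
        val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;  // 100*p0 + p1 per pair
        return uint32_t((val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32);
    }

The magic constants are just (radix_step << lane_width) + 1, so each multiply performs a shifted add of neighbouring lanes in a single instruction.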
Does this handle the case where you try to parse a short string that abuts the end of a page, and the next page is unmapped? The problem is that you over-read and crash when you touch the unmapped page. Looking at the code, it doesn't seem to account for that case. (On the other hand, maybe the AVX instructions don't do an actual read if the lane is masked? In which case it would be fine.)
We had a similar case with some AVX-optimized string operation in glibc.
Yes. The bounds from the std::string are used to create a read mask, which is used in a masked load that does not fault on lanes outside the mask.
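A minimal sketch of that pattern, assuming AVX-512BW/VL for byte-granular masking (the actual code in the article may use a different granularity):

    #include <immintrin.h>
    #include <cstddef>

    // Build a byte mask from the known length, then do a masked load.
    // Masked-out lanes are architecturally guaranteed not to fault, so
    // this is safe even when bytes just past the string are unmapped.
    __m256i load_digits(const char* p, std::size_t len) {
        __mmask32 mask = (len >= 32) ? 0xFFFFFFFFu
                                     : ((1u << len) - 1);  // low `len` lanes
        return _mm256_maskz_loadu_epi8(mask, p);           // zero-fills the rest
    }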
If the length had been unknown (say, a C-style null-terminated string), the code would have needed to be more complex. You would either have to find the length beforehand or do only aligned loads.
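The aligned-load option works because an aligned vector load can never cross a page boundary. A sketch of the idea (hypothetical helper, not the article's code):

    #include <immintrin.h>
    #include <cstdint>

    // A 32-byte load from a 32-byte-aligned address stays within one page
    // (page sizes are multiples of 32), so if the string's first byte is
    // mapped, the whole load is. The caller must then ignore the bytes
    // before `p` and anything past the terminator.
    __m256i aligned_block_containing(const char* p) {
        auto addr = reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t{31};
        return _mm256_load_si256(reinterpret_cast<const __m256i*>(addr));
    }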
BTW, ARM's SVE and RISC-V's Vector extension have special "first-fault" load instructions that make that case simpler: they truncate the vector's mask/length at the lowest-numbered lane that would have faulted, instead of trapping.
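With ARM's ACLE intrinsics the pattern looks roughly like this (a sketch; handling of the returned lanes is omitted):

    #include <arm_sve.h>
    #include <cstdint>

    // SVE first-fault load: only the first lane can actually fault; any
    // later lane that would fault just clears its bit in the first-fault
    // register (FFR) instead of trapping.
    svuint8_t probe_bytes(const std::uint8_t* p, svbool_t* valid) {
        svsetffr();                                 // reset FFR to all-true
        svuint8_t v = svldff1_u8(svptrue_b8(), p);
        *valid = svrdffr();                         // lanes that actually loaded
        return v;
    }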
If you're parsing lots of numbers and they are often of similar length, then could you go for GPU-style SIMT with AVX-512 and parse 64 numbers at once?
Considered as a naive problem, you inescapably want to perform a series of multiply-and-add phases. For short strings, "just do it" might be slower than "look it up in a table". For longer strings, divide-and-conquer would mean chopping the input into short strings and doing the table lookup (see the sketch below).
I don't know that this is faster than simply doing the "multiply by ten and add" sequence. I'd have to understand modern ALU behaviour, parallelism, shortcuts, chip cache dynamics, you name it.
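To make the comparison concrete, a sketch of both options for a fixed 4-digit input (hypothetical helpers, not from the article):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Baseline: the "multiply by ten and add" sequence, one digit at a time.
    uint32_t parse4_naive(const char* s) {
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) v = v * 10 + uint32_t(s[i] - '0');
        return v;
    }

    // Table variant: precompute every 2-digit ASCII pair (a 64 KiB table),
    // then combine pairs with one multiply per divide-and-conquer level.
    static const std::array<uint8_t, 256 * 256> kPair = [] {
        std::array<uint8_t, 256 * 256> t{};
        for (int hi = 0; hi < 10; ++hi)
            for (int lo = 0; lo < 10; ++lo)
                t[std::size_t('0' + hi) * 256 + std::size_t('0' + lo)] =
                    uint8_t(hi * 10 + lo);
        return t;
    }();

    uint32_t parse4_table(const char* s) {
        uint32_t hi = kPair[std::size_t(uint8_t(s[0])) * 256 + uint8_t(s[1])];
        uint32_t lo = kPair[std::size_t(uint8_t(s[2])) * 256 + uint8_t(s[3])];
        return hi * 100 + lo;  // one combine level for 4 digits
    }

Whether the table wins comes down to exactly the cache dynamics mentioned above: a 64 KiB table doesn't fit in a typical L1 data cache.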
It would all have been so much simpler if we'd used hex instead of decimal. Octal doesn't work as well for me, no idea why. That said, it's much easier to ignore thumbs and count on 4 fingers of each hand than to grow another 6.
(I read the article. It descended into arcana of instruction sets and assumptions that postdate my DEC-10 ISA lessons, although I recall Digital had BCD instruction handling and, accordingly, a 6-bit-byte model in a 36-bit word. DEC-10 instructions could take many, many clock cycles and have 5 components of from, to, via, because, maybe attached to them.)
Maybe in the future don't comment on articles until you have read and understand them.
And while your comments on the DEC-10 and 6-bit bytes might be interesting, there are very few people still working in CS who have seen a DEC-10 outside a museum. I always find it interesting to learn about computers that were obsolete before I wrote my first program, but their performance characteristics were vastly different from what we have today. And current college grads have always lived in a world with 64-bit SIMD machines that can do a scalar multiply in 3-4 cycles.
[Edit to add]
I don't know why I'm salty about this, but Daniel Lemire is real good and doesn't deserve to be blown off, especially not by someone referencing a computer that was obsolete before today's retirement-age CS people were in high school. The VAX obsoleted the DEC-10 in 1977.
If I'm seeing the misunderstanding here, the article with that statement is not talking about string length. It's talking about non-numerical input: i.e., an "a" would "exceed 9".
Right, but an "a" would be past the end of the digits, and the code assumes you know the end of the digits.
It would make more sense if it was talking about the string length. Then you might find a non-digit before the end. But it specifically says you know where the end of the digits is, not the end of the string.
Edit: Or rather, I'm trying to figure out what code might exist that finds the end of the digits, but only sometimes. The article says "assume you already know this", but computers aren't magic: some code has to figure out where the end of the digits is. So how does that code work, such that it finds the end of the digits but sometimes counts non-digits as digits?
It bounds-checks just fine, as long as you pass it a null-terminated string. It just returns 0 on error, so there's no way to know if the string was, for example, larger than would fit in an int, or just totally invalid, or whether the user actually entered 0.
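Assuming this is about atoi: a quick illustration of why returning 0 on error is lossy, and the strtol interface that does report errors:

    #include <cerrno>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // atoi cannot distinguish "user typed 0" from "invalid input".
        std::printf("atoi(\"abc\") = %d, atoi(\"0\") = %d\n",
                    std::atoi("abc"), std::atoi("0"));

        // strtol reports range errors via errno and the parse end via endptr.
        errno = 0;
        char* end = nullptr;
        long v = std::strtol("99999999999999999999", &end, 10);
        if (errno == ERANGE) std::puts("out of range");
        (void)v;
        return 0;
    }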
For fast "data-oriented" parsers of other formats which contain numbers (which to be fair are uncommon), number parsing is often the slowest part. For fast data transformations in the genre of jq (but not jq specifically), parsing is often the bulk of the runtime.
Not a downvoter, but something doesn't need to necessarily be a bottleneck to warrant speeding up. Put another way, why deliberately pick a slower method if a faster one exists?
Many applications these days suffer from "death by a thousand cuts" - there's no single thing which makes it slow, just lots and lots of slightly slow things piling on top of each other.
https://kholdstare.github.io/technical/2020/05/26/faster-int...
Most of the overhead is found in dealing with variable length input, minus signs, and overflow checking.
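A sketch of where that overhead lives in a scalar parser (hypothetical helper; every check below is a branch per call or per digit):

    #include <cstdint>

    // A robust scalar parser must handle an optional sign, validate every
    // character, and guard against overflow; SIMD versions amortize or
    // restructure these checks.
    bool parse_i32(const char* s, int32_t* out) {
        bool neg = (*s == '-');
        if (neg || *s == '+') ++s;                     // optional sign
        if (*s < '0' || *s > '9') return false;        // empty or non-digit
        int64_t v = 0;
        for (; *s >= '0' && *s <= '9'; ++s) {
            v = v * 10 + (*s - '0');
            if (v > (int64_t{1} << 40)) return false;  // early overflow cut-off
        }
        v = neg ? -v : v;
        if (v < INT32_MIN || v > INT32_MAX) return false;
        *out = static_cast<int32_t>(v);
        return *s == '\0';                             // reject trailing junk
    }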