Fast function to parse strings into double (binary64) floating-point values (github.com/lemire)
73 points by vitaut on March 10, 2020 | 56 comments



It has about 15 KB of data: 10KB in power_of_ten_components and 5KB in mantissa_128. [1] That's ~23% of the per-core L1 data cache on (eg) the recently announced APU chip. [2, 3] So before getting too excited about the microbenchmark numbers, I'd be careful to ensure the additional cache pressure doesn't slow down my program as a whole.

edit: also, ~2.5KB of that is due to padding in power_of_ten_components. I wonder if it'd be better to split the uint64_t field into two uint32_ts to avoid this (see the sketch after the links below).

[1] https://github.com/lemire/fast_double_parser/blob/master/inc... [2] https://news.ycombinator.com/item?id=22440894 [3] https://en.wikichip.org/wiki/amd/ryzen_embedded/r1606g
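A minimal sketch of the layout difference (illustrative field names, not the library's actual ones):

    #include <cstdint>
    #include <cstdio>

    // The 8-byte mantissa forces 8-byte struct alignment, so the 4-byte
    // exponent is followed by 4 bytes of padding: sizeof is typically 16.
    struct padded_components {
      uint64_t mantissa;
      int32_t exp;
    };

    // Splitting the mantissa into two 32-bit halves drops the alignment
    // requirement to 4 bytes: sizeof is typically 12, a 25% saving.
    struct split_components {
      uint32_t mantissa_hi;
      uint32_t mantissa_lo;
      int32_t exp;
    };

    int main() {
      printf("%zu %zu\n", sizeof(padded_components), sizeof(split_components));
    }

The trade-off is a shift-and-or to reassemble the mantissa on every table lookup.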


Also, I know it's not always possible, but if your application is spending so much time parsing strings into numbers that this could make a difference, seriously consider whether the source of those strings could be changed to not emit strings in the first place. Unless it's something like user input, chances are the values already exist as doubles that could be used as-is.


HN is collectively schizophrenic on this topic. On the one hand, the comments say that nobody really needs anything fancier than JSON. On the other hand, other comments say that if JSON parsing shows up in your profiles, you should switch off JSON!


I think if you're a little more generous and say "you rarely need" rather than "nobody really needs", there's no contradiction. Most of the time, you're not using much CPU in total or your server is doing something much heavier than parsing and serialization. Either way, JSON is fine. Improving its performance is nice but not worth a ton of effort. Once in a while, it's significant, and you switch to a binary format if you can.

And like anything, it depends on what you're doing. If you're doing lots of server-to-server RPCs on big distributed systems like within Google, it's much more reasonable to start with protobuf or the like than with JSON. And not just for performance reasons; having the generated code helps a lot with maintainability. (Some JSON libraries have something similar; some don't.)


Hexfloats are not a binary format; they're as text-based as anything else. And they get rid of significant bottlenecks in float parsing.
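For example, a round trip through the C99 %a format is exact and needs no decimal rounding logic on either side:

    #include <cstdio>
    #include <cstdlib>

    int main() {
      double x = 0.1;
      char buf[32];
      snprintf(buf, sizeof buf, "%a", x);  // e.g. "0x1.999999999999ap-4"
      double y = strtod(buf, nullptr);     // parses the hexfloat back
      printf("%s round-trips exactly: %d\n", buf, x == y);  // prints 1
    }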


But the existence of the hexfloat format, and the fact that it accepts either 0x or 0X, upper- or lowercase characters for A-F, dot or comma for the point depending on locale, and either p or P for the exponent, all contribute to why strtod is so slow. The faster library in this article handles none of those things. It also does not handle NaN or INF.


HN (and a lot of other places, really) has a large population of web or otherwise "web-heritage" developers, as well as those whose ventures into computing started beside/before the web. I suspect the JSON promotion comes mostly from the former group, while the latter, of which I am a member, laments it roughly as much as we did the "XML craze" of the late 90s and early 2000s.


Indeed, hand-rolled fixed-point network marshaling, preferably with delta encoding for time-series-like data (or data that oscillates), is a couple of orders of magnitude faster, and also significantly "smaller", than a double-string-double counterpart. Getting ~1 byte per data point of a time series is quite achievable.
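A minimal sketch of the encoding side, assuming a known tick size (the scale factor of 100 is an arbitrary example):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Convert values to fixed-point ticks, then store only the deltas.
    // For oscillating or slowly-moving series, most deltas are tiny and
    // fit in a single byte once varint-encoded.
    std::vector<int64_t> delta_encode(const std::vector<double>& values,
                                      double scale = 100.0) {
      std::vector<int64_t> deltas;
      int64_t prev = 0;
      for (double v : values) {
        int64_t ticks = std::llround(v * scale);
        deltas.push_back(ticks - prev);
        prev = ticks;
      }
      return deltas;
    }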


How do you encode floating point delta into one byte?


You can't, that's why they said fixed point. Time series data should probably never use floats, for the exact reason that delta-encoding is not effective.


"Can't" is too strong a word. Depending on what your numbers represent, I'd think you could do one or more of:

* have a shorter exponent and/or mantissa to make each number shorter and the overall series more repetitive. (maybe even skip the mantissa if you only care about order of magnitude! call it the opposite of fixed-point, or just storing the logarithm of the value, whatever terminology works for you...)

* skip the sign bit for a non-negative series (e.g., deltas of a non-decreasing series),

* use a variable-width bitstream encoding like exp-Golomb to represent the exponent and/or mantissa,

* use run-length encoding if it's repetitive (after delta-encoding),

* etc.

One byte per value (or less!) is surely possible in some cases. But I've never done it myself and agree fixed-point sounds much more pleasant. One of my projects represents deltas of durations in units of 90,000ths of a second, varint-encoded. I much prefer that to dealing with float seconds.
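For what it's worth, the varint half of that looks roughly like this (ZigZag first, so small negative deltas also stay small):

    #include <cstdint>
    #include <vector>

    // Map signed to unsigned so values near zero use few bytes:
    // 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
    uint64_t zigzag(int64_t v) {
      return (uint64_t(v) << 1) ^ uint64_t(v >> 63);
    }

    // LEB128-style varint: 7 bits per byte, high bit = continuation.
    void put_varint(std::vector<uint8_t>& out, uint64_t v) {
      while (v >= 0x80) {
        out.push_back(uint8_t(v) | 0x80);
        v >>= 7;
      }
      out.push_back(uint8_t(v));
    }

Any delta in [-64, 63] ends up as a single byte.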


There's also the XOR compression scheme described by Facebook:

https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
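The core of that scheme is XOR-ing the bit patterns of consecutive values: nearby samples share sign, exponent, and leading mantissa bits, so the XOR is mostly zeros, and the paper then bit-packs the leading/trailing zero runs. A minimal sketch of just the XOR step:

    #include <cstdint>
    #include <cstring>

    uint64_t xor_delta(double prev, double cur) {
      uint64_t a, b;
      std::memcpy(&a, &prev, sizeof a);  // raw bit pattern, no conversion
      std::memcpy(&b, &cur, sizeof b);
      return a ^ b;  // mostly zero bits when the values are close
    }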


And a tiny minority I am proudly part of, who watch these debates about XML vs JSON or tabs vs spaces with the same amusement as Mike Judge!


C'mon, spaces vs tabs is a serious one and it makes sense... unlike the xml/json shizz.


Spaces versus tabs is a stupid debate because "tabs for indentation, spaces for alignment" is clearly the correct answer.


Couldn't it just be different users posting those?


That's exactly what this person is saying / what they intended by the phrase "collectively schizophrenic".


Okay, but then "people disagree about this" seems a bit less interesting?


that just reflects that both actual engineers and not-actual-engineers frequent hn


A serious question is: how can we make microbenchmarks that take this into account? Along the lines of measuring the speedup of an operation under varying degrees of cache pressure.


I'm not sure about the actual implementation, but one approach would be to copy this function 100 times and invoke those copies sequentially. Of course the linker must not merge them back into one function, which is why I'm not sure about the actual implementation. Maybe the naive approach would work, maybe not.
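Another option, instead of duplicating the function: keep one copy but deliberately evict the caches between timed iterations. A rough sketch (the 1 MiB buffer is an arbitrary assumption; size it past the cache level you care about):

    #include <cstddef>
    #include <vector>

    static std::vector<char> junk(1 << 20, 1);

    // Stream through a large buffer so the parser's lookup tables are
    // evicted before the next timed call; one touch per 64-byte line.
    long evict_caches() {
      long sum = 0;
      for (std::size_t i = 0; i < junk.size(); i += 64) sum += junk[i];
      return sum;  // use the result so the loop isn't optimized away
    }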


> That is, given the string "1.0e10", it should return a 64-bit floating-point value equal to 10.

Err, surely it should be equal to 10,000,000,000. Or more probably, they meant to write "1.0e1".


Oh, you don't use base 10,000,000,000 yet?


More context at Daniel Lemire's blog post at https://lemire.me/blog/2020/03/10/fast-float-parsing-in-prac... . It's about twice as fast as abseil or from_chars and nearly 10x faster than strtod.
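For reference, the API is a single function along these lines (paraphrased from the README; the exact return type has varied between versions):

    #include "fast_double_parser.h"

    double x;
    // Evaluates false in the condition if the input is not a valid number.
    if (fast_double_parser::parse_number("1.0e10", &x)) {
      // x == 10000000000.0
    }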


I was expecting this to include fancy bit twiddling and simd assembly for parsing 8 decimal digits at once...

But the reality is the core of the algorithm is still a while loop that looks at each digit one at a time and multiplies by 10 each time ...
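For the curious, the gadget I was expecting looks something like this: the well-known SWAR trick (used in simdjson, among others) that converts 8 ASCII digits with three multiply-and-shift steps, assuming a little-endian machine and exactly 8 valid digits:

    #include <cstdint>
    #include <cstring>

    uint32_t parse_eight_digits(const char* chars) {
      uint64_t val;
      std::memcpy(&val, chars, sizeof val);
      val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;      // 8 digits -> 4 pairs
      val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;  // 4 pairs -> 2 quads
      return uint32_t((val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32);
    }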


Shouldn't std::from_chars be included in the benchmarks?


Yeah, looking forward to it. Especially with latest version of MSVC, based on this post from the author: https://www.reddit.com/r/cpp/comments/92bkxp/how_to_efficien...


Is there a site out there that collects the absolute fastest ways of doing common low-level operations like this? For something as common as converting strings of digits to IEEE 754 floats/doubles, you would think we would already have the absolute fastest sequence of assembly instructions to do this. It's disconcerting thinking that the functions in the standard c/c++ library may not even be close to optimal.

Very cool btw


I can't think of a single thing in the C or C++ standard library that is state of the art. Often they are actually quite terrible. Some of them just can't be fixed because of ABI compatibility issues. That's why, just as an example, C++'s std::unordered_map is an embarrassment compared to all recent hash tables.


Terrible is hyperbole. They're not state of the art, but they are pretty decent. If you are using multi-gigabyte data structures, or chasing that last 2% of performance, then sure, they will make a difference. But in most large applications, switching to things like that doesn't make much difference compared to just writing a slightly better algorithm, compacting your data structures a little to improve cache hits, or re-organising your code a little to do the expensive stuff outside the loop.

Source: I regularly optimise C++ code in LibreOffice.


That's so true and honestly it kind of gives me a pit in my stomach just thinking about all the C++ programs in particular that use STL and end up being so much slower than they could be. Hopefully Rust and Zig won't end up making this kind of mistake


The most egregious offenders here IMO are C++'s map and unordered_map. The spec constrains the possible implementations, and it's hard to implement them as anything better than a red-black tree or a badly-unoptimized hash table, respectively. And really the unordered one should be the default with the nice short name.

It looks like Rust is doing well on this. HashMap (the one everybody uses) is essentially a clone of Google's heavily-optimized C++ hash table:

https://abseil.io/blog/20180927-swisstables

... and if you need the collection to be sorted, their BTreeMap has much lower overhead than the C++ std::map, plus a big prominent note in its documentation saying that HashMap is usually the one you want.

I'm impressed. (Haven't looked at Zig yet.)


> It looks like Rust is doing well on this. HashMap (the one everybody uses) is essentially a clone of Google's heavily-optimized C++ hash table

It didn't start out this way, but their spec was not so constrained as C++, so they were able to switch to this implementation in a backward-compatible way. In particular, Rust's std::collections::HashMap doesn't guarantee that keys and values have stable addresses across mutations of unrelated entries, so they can use open addressing rather than chaining. (And they did use open addressing from the beginning; the SwissTables-like "hashbrown" implementation was just a refinement of that.)

I think a lot of this is just that newer languages can learn from the mistakes of previous languages. It will be interesting to see how Rust (and other newer languages) are able to evolve when their standard library is 20+ years old and some parts don't age so well. (Although actually C++'s std::unordered_map only goes back to C++11 and still sucks...)

One part of Rust's answer I think is to keep the standard library small. Less there, less to screw up. Make it easy to pull in crates instead. Crates can supply the same functionality but can bump their major version relatively easily if the interface has to change. There are advantages and disadvantages to this approach...


The default hash map probably doesn't need the iterator invalidation constraints, but they are quite useful in certain cases.


> anything better than a red-black tree

Is there an equivalent data structure that has better worst-case time complexity than a red-black tree?


There are plenty of data structures that are also O(log n) but much faster in practice, because of how caches work.


Could you name some please? I'm genuinely interested in learning of data structures with this property.


The simplest one is a B tree. It's like a self-balancing binary tree, except that you put more than two children in each node.

Same big-O, but faster by a big linear multiplier.
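To put rough numbers on it: with a million keys, a binary tree is about 20 levels deep (log2 of 10^6 ≈ 20), so a cold lookup can miss the cache ~20 times, while a B-tree with, say, 16 keys per node is only ~5 levels deep (log16 of 10^6 ≈ 5), and scanning within a node is sequential, which caches love.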


If the data is small enough (up to about n=100 magnitude), linear scan O(n) often beats asymptotically faster algorithms.


What, C++'s spec constrains the implementation of standard data structures? This language has hit a new low for me.


Well, it defines the big-O complexity of the data structure APIs. And to match all the requirements, the implementation is usually very constrained.


My impression is that it's typically the iterator/reference stability requirements that lock implementations down, not the complexity requirements.


Oh yeah, that too.


I think Rust has an intrinsic advantage simply in its power. Standard libraries (and really most code) are written by humans. Giving those humans the confidence to implement complex data structures and algorithms without worrying about null pointers or memory leaks has rippling effects on performance. See C's (“simple”) AVL tree vs Rust's B-Tree [0].

[0]: http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performa...


Great hack, but in a broader sense this is silly. If parsing/pretty-printing floats from/to ASCII strings is a bottleneck you should be using hexfloats, as supported in the latest C/C++ standards and elsewhere. As a bonus, they will reliably evaluate to the same floating point number, eliminating an extra source of ambiguity and possible overhead.


It can be a big chunk of the cost of parsing JSON.


"Stop using JSON" is the message.


I dream about the day when all crypto exchanges get this message. Not today :-(


It's not silly at all. Replacing a lot of sprintf calls with Grisu and later Ryu made a huge difference in the production fleet at my last job. It wasn't by any means "the bottleneck" but it was costing real money. Most of the calls were in peripheral functions like debug logs, status page outputs, and whatnot.

That said, I've never seen strtod showing up in my profiles, but I bet if I was profiling web browsers then I would see it.


I was wondering if ctrl-f for 'Ryu' would match anything.

Jeroen van der Zijp, developer of the lesser-known FOX Toolkit, claims to have done better than the Ryu algorithm. From his changelist [0]:

> New, faster, and often more accurate floating point to string conversion. No, this is not based on Ryu paper. New method reliably generates 16 digits, and does much better rounding.

I believe this [1] is his implementation in C++. See also his paper [2]. (I believe it's not peer-reviewed, but it does appear in Google Scholar.)

[0] http://www.fox-toolkit.org/

[1] https://github.com/gogglesguy/fox/blob/77396c638c1c4e05/lib/...

[2] http://www.fox-toolkit.org/ftp/fasthalffloatconversion.pdf


This conversion was a profiler-verified bottleneck in my projects twice. Nothing silly about it.


What's a scenario where you have a huge flow of floating-point data as text? I assume it's usually JSON, but in what cases does that happen where you need to work around it (parse faster) rather than fix the underlying problem of having a huge stream of numbers as text?


If you control both ends, yes. But sometimes you need to consume data produced by entities you cannot control. Examples? Cryptocurrency exchanges. All of them use ugly, often terrible, JSON-based protocols, and some of them handle a non-trivial amount of trading, enough for JSON parsers to show up as hot spots.


Would having a mode that restricts the types of numbers it accepts (in my particular case: no scientific notation) speed it up significantly?


bad. you should not do that at all.


If you're writing a parser for any language which is expected to deal with floating point numbers, being able to parse those numbers is generally necessary.



