FastDoubleParser: Java port of Daniel Lemire's fast_double_parser (github.com/wrandelshofer)
100 points by mfiguiere on March 22, 2021 | 62 comments



Reading this code makes me think about how differently I review Java than C or C++.

  ch = ++index < strlen ? str.charAt(index) : 0;
If I saw that code scattered across some Java code, I would immediately be like - "rewrite please".

But the same code fragment in C would be like - yeah, we need a bounds check, good job, can you make it a macro please.

Java's default Float parser was a real perf headache when multi-threading, more so than in single-core consumption (until jdk8b96 - JDK-7032154), so in the past Hive has shipped its own strtod impl[1] to avoid the central lock in parseDouble.

Also it worked off a raw utf8 byte[] stream, which is more common coming out of a disk file or network serialization than a CharSequence.

[1] - https://github.com/apache/hive/blob/4446414f4478091db1eb20bc...
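A minimal sketch of why that matters: consuming ASCII digits straight from a utf8 byte[] avoids decoding to a String/CharSequence first (illustrative only, not Hive's actual code).

    // Illustrative only: read a run of ASCII digits directly from a utf8 byte[].
    static long readDigits(byte[] buf, int start, int end) {
        long value = 0;
        for (int i = start; i < end; i++) {
            int b = buf[i];
            if (b < '0' || b > '9') {
                break;
            }
            value = value * 10 + (b - '0'); // overflow handling omitted for brevity
        }
        return value;
    }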


The real question is why C/C++ programmers accept code like that. You can write that same code in an easier-to-follow way, and the compiler will turn it into an identical binary.

Code should be written for humans.


Frameworks produced by the Windows dev team are a very good example of that.

They could produce something that looks like Qt, OWL/VCL, or nowadays something .NET-like, but in C++.

Instead they produce libraries where one is forced to deal with C low-level details, mixed with Win32 and COM pointer handling and casts everywhere; not even MFC escapes it.

Not to mention how many iterations of COM smart pointers they have come up with since COM exists.

C++/CX looked like they had finally learned their lesson and would offer something similar to C++ Builder, but no, that wasn't meant for us to enjoy.

So now anyone doing UWP with C++ gets to manually edit IDL files without tooling support, and merge the generated files by hand.

I really don't get their macho developer culture.


According to Steven Sinofsky, MFC being a thin wrapper around the underlying Win32 API was a deliberate design decision during its development. https://hardcoresoftware.learningbyshipping.com/p/010-our-bi...

In retrospect, do you feel it was a bad idea not to use a heavier abstraction?


I am aware of it, hence my remark.

It was a bad idea versus what other compiler vendors had in their toolbox.

MFC only won because Microsoft is the OS vendor, and Borland lost it trying to fly too high.

Back in the MS-DOS and Windows 3.x days very few would use Microsoft compilers; Borland and Watcom were the favourites.


> I really don't get their macho developer culture.

I'm inclined to not recognize that as a distinct culture and rather call it an infectious habit. Auditable, maintainable code is a skill, after all.


It is. Apart from ++index, this is literally the same as:

  let ch = if index < strlen then str.charAt(index) else 0
Eminently readable and less boring than long if/else chains in Python.


It's not. To be easily human-parsable you want to do one thing at a time. It's that increment that you've omitted that messes things up.

On a side note, having gotten used to newer languages, an option type sure is a nice thing.


You can write it in a very similar way in Python too, if you want to. I think I'd argue this is cleaner syntax, even.

    >>> foo = 10
    >>> bar = foo if foo > 5 else 0
    >>> bar
    10
    >>> bar = foo if foo < 5 else 0
    >>> bar
    0


IDK,

"for ch in index:"

reads pretty cleanly to me.


People who care about correctness wouldn't be using C/C++ in the first place. The culture of those languages is performance above all else, and it only gets worse as more and more "moderates" move to memory-safe languages, leaving only the extremists.


That is not an informed opinion at all. The C++ people I know ALL care about correctness... That's why they use C++. They want to make sure it's done correctly, and not left to some node package that barely works.


Agree, especially as there's so little upside to writing it this way. It isn't going to run any faster. The style emphasises code-density over readability, but density is only a virtue when it aids readability.

I'd far prefer:

    ++index;
    ch = index < strlen ? str.charAt(index) : 0;
I don't see a good reason to make it a macro, though.


    index += 1;
    if (index < strlen) {
      ch = str.charAt(index);
    } else {
      ch = 0;
    }
If I were using Kotlin, I'd do it even more cleanly:

    inline fun String.at(index: Int): Char? =
      if (index >= 0 && index < length) this[index]
      else null

    [...]

    index += 1
    ch = str.at(index) ?: '\u0000'
The bytecode is exactly the same, but I’d argue it’s much cleaner


I haven't used kotlin, but where is length defined? Should it be a parameter to the function?


The above is an extension function (which is basically a member function of a class, written outside that class), and it has an implicit `this` parameter. In other words `length` is actually `this.length`, just as it would be in a regular method.


To me, the original C-style code looks nicer than the Java.

C++:

    bool found_minus = (*p == '-');
    bool negative = false;
    if (found_minus) {
      ++p;
      negative = true;
      if (!is_integer(*p)) { // a negative sign must be followed by an integer
        return nullptr;
      }
    }
Java:

    int strlen = str.length();
    if (strlen == 0) {
        throw new NumberFormatException("empty String");
    }
    int index = 0;
    char ch = str.charAt(index);

    boolean negative = false;
    if (ch == '-') {
        negative = true;
        ch = ++index < strlen ? str.charAt(index) : 0;
        if (ch == 0) {
            throw new NumberFormatException("'-' cannot stand alone");
        }
    }


I suppose the main point I was trying to convey is that the code snippet gopalv quoted is not ugly because it's C-like. That ternary expression doesn't appear in the original C++ implementation at all. Many of the things people are complaining about—like relying on the return value of a prefix increment—are only in the Java version.


As someone who writes both C++ and C# I disagree around writing C/C++ in this way. (I am talking specifically about the side effect to 'index', not the use of a ternary operator, which is fine here.)

I have seen far too many bugs creep in this way. I feel like C/C++ induce a desire to keep the code as short as possible and to be 'clever'; even I often feel the need to do that in C++, while I am happy to write a more readable version in other programming languages.

The bugs are also hard to spot when skimming through the code: when there's a lot of code you will often miss the ++ on 'index' in a long line, compared to having '++index' on its own line.

Another good 'pet peeve' of mine is the following construct:

  for(int i = 10; i--;)
    // Do something until 'i' reaches 0
So many off-by-one errors...
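For reference, written out in Java (which needs the comparison made explicit), the body only ever sees 9 down to 0, never 10, which is exactly where the surprises come from:

    for (int i = 10; i-- > 0; ) {
        System.out.println(i); // prints 9, 8, ..., 0 -- ten iterations
    }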


The irony is that there is a bounds check already within the str.charAt() method...


That throws an exception though, while this code is trying to return 0. The Java "fix" would be to add a charAtOrDefault(), but that doesn't seem to exist.
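Something like this hypothetical helper (not part of the JDK) would capture the intent:

    static char charAtOrDefault(CharSequence s, int index, char defaultValue) {
        return index >= 0 && index < s.length() ? s.charAt(index) : defaultValue;
    }

    // the snippet from above would then read:
    ch = charAtOrDefault(str, ++index, (char) 0);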

The source of charAt is also interesting because it manually bounds checks while the array access into the underlying char[] has its own bounds check. I'm not sure why having the separate exception is better than letting the array access throw directly?

https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/19fb8f93c...


Agreed. Though catching and handling that exception may be more performant if the majority of those indexes are within the bounds of the array.

>> I'm not sure why having the separate exception is better than letting the array access throw directly?

(1) The original String charAt method (Sun/Oracle) had to account for an offset and an index. The exception from the bounds check on the underlying char[] would be misleading/confusing... hence the birth of the separate StringIndexOutOfBoundsException.

Source: http://www.docjar.com/html/api/java/lang/String.java.html

(2) Because of the above and for backward compatibility, OpenJDK does the same exception handling... sometimes it pays to review the original source code to get some context and rationale for the OpenJDK version.


Oh that makes sense, I had forgotten about the old substring sharing optimization.


I disagree. Why exactly would it need to be rewritten because it's in Java? Does the language matter at all? To me this piece of code is concise; I immediately recognized it as a bounds check. If you rewrite this to add braces, newlines, an if statement... that just creates more cognitive load, in favor of what? To me this little line is perfect: it does exactly what it is supposed to do and it doesn't hide its intent.


If being more concise was always better for cognitive load, we'd all be writing APL.


IME naming projects "fast" or things like that is bad karma. Over time something better comes out and now you're not fast anymore, or there are compromises people have to watch out for to make it fast, and now the name is an attractive nuisance.

Swift originally had an option named -Ofast that just turned off all the safety features - eventually it got renamed to -Ounchecked.


I guess Swift just adopted something similar to -ffast-math, which some C compilers offer, where guarantees about the precision of floating-point operations are thrown away to get a similar result faster. Incidentally, that could deserve a better name as well (oh, there seems to be a -funsafe-math-optimizations, which reads a lot closer to what is intended).


Who wouldn't want some fun, safe, math optimizations!

I still don't think it's very descriptive though. Whether or not it's unsafe depends a lot on the domain. In most of my code, I don't care about the sign of zero or floating-point associativity and want my compiler to auto-convert division to multiplication by reciprocals. Better would be something like -fnon-compliant-float-math.

Although I guess -fno-trapping-math is a bit different from the other optimizations, in that it could actually be "unsafe" in a different way, since signaling NaNs are quite useful for debugging.


In Java, a method needs the 'strictfp' modifier to get correct math.


That's not been the case for a very long time and even then it wasn't really about 'correctness'. Actually there's a proposal to delete strictfp from the language:

http://openjdk.java.net/jeps/306


My favorite example of this is Fast Ethernet—a max speed of 100 Mbps.


M1 MacBook Air

    Apple M1 OpenJDK 64-Bit Server VM, Azul Systems, Inc., 16+36
    parsing random integers in the range [0,1)
    === number of trials 32 =====
    FastDoubleParser.parseDouble MB/s avg: 492.324205, min: 479.57, max: 516.26
    Double.parseDouble           MB/s avg: 111.509085, min: 110.63, max: 114.65
    Speedup FastDoubleParser vs Double: 4.415104

    Apple M1 OpenJDK 64-Bit Server VM, Azul Systems, Inc., 16+36
    parsing integers in file data/canada.txt
    read 111126 lines
    === number of trials 32 =====
    FastDoubleParser.parseDouble MB/s avg: 483.642065, min: 420.43, max: 502.15
    Double.parseDouble           MB/s avg: 104.347746, min: 99.03, max: 116.91
    Speedup FastDoubleParser vs Double: 4.634907


The Ryū algorithm, which does the converse (doubles to strings), is also much faster than using Java's number formatting classes.

https://github.com/ulfjack/ryu/blob/master/src/main/java/inf...

As an aside, my JMathTeX fork uses Ryū to achieve real-time rendering of TeX in Java.

https://github.com/DaveJarvis/JMathTeX/


See also the port of Lemire's SIMD UTF-8 validation in Java: https://github.com/AugustNagro/utf8.java/blob/master/src/mai... It's made possible since OpenJDK 16 (the Vector API); however, the results aren't yet impressive, and the OpenJDK devs are currently optimizing the SIMD generation for it.


A surprising fact I just learned about float parsing is that it is absolutely trivial (and, I expect, very fast) if you are willing to tolerate a 1 ULP inaccuracy: https://www.exploringbinary.com/quick-and-dirty-decimal-to-f...

(Though 1 in 100M randomly tested conversions had a 14 ULP inaccuracy, so I guess the inaccuracy is a bit unpredictable.)
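Roughly the shape of that quick-and-dirty trick (my own sketch, not the article's code): accumulate the digits into an integer, then scale by a power of ten in double arithmetic. Each of the two roundings can be off, which is where the occasional multi-ULP error comes from.

    // Sketch only: no exponent syntax, no overflow handling, no validation.
    static double quickAndDirtyParse(String s) {
        int i = 0;
        boolean negative = s.charAt(0) == '-';
        if (negative) {
            i++;
        }
        long digits = 0;
        int decimalExponent = 0;
        boolean seenDot = false;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.') {
                seenDot = true;
            } else {
                digits = digits * 10 + (c - '0');
                if (seenDot) {
                    decimalExponent--; // each fraction digit shifts the scale down
                }
            }
        }
        double value = digits * Math.pow(10, decimalExponent); // two roundings happen here
        return negative ? -value : value;
    }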


It's been fascinating to watch these developments because in my career I've never encountered a performance-sensitive application that wasn't fixed by just eliminating the parsing of doubles from text format. Parsing them faster never seemed to be a good choice. On the other hand producing human-readable doubles (for debug logs or whatever) is a pretty frequent bottleneck that has also seen great improvements recently (Grisu, then Ryu).

Anyway, it's all great programming but it seems to me a symptom of a larger social failure, where we are making bad protocols faster instead of making the protocols better.


Have you never been on projects where a hard requirement was "save files must be editable with notepad.exe"? If so, lucky you.


It is putting lipstick on a pig, but the problem is that we've gotten used to the human convenience that text-based formats offer (HTTP, HTML, XML, JSON, etc), and we don't want to give them up despite the energy wastefulness of parsing them.

That's why I'm building Concise Encoding [1], which is a twin text/binary format that can be 1:1 converted back and forth. You do everything in binary, and only convert to/from text in the few places where a human needs to be involved (like debugging or configuration).

[1] https://concise-encoding.org


This format seems to have a lot of parallels to Amazon’s Ion format, though I’m surprised to see a strongly typed binary encoding with 19 types! Definitely not the 16 I’d expect. Is a whole byte used to store the type information, or is something more complex going on?


Slightly more complex. The type code distribution is based on how often the type is likely to occur in your average document.

The type field is almost always 1 byte long, except for typed arrays (which would generally in aggregate be long enough to offset the cost) and two uncommon types/variations. These have an effective 2-byte type field (0x94 selects a secondary type plane, and the next byte selects the specific type from there).

The most common integer values (-100 to 100) are encoded directly into the types 0x9c to 0x64 (wraparound), such that interpreting them directly as signed 8-bit integers yields their actual value (type 0x00 = integer 0, type 0x64 = integer 100, type 0x9c = -100, etc).

Strings are also optimized such that types 0x80 - 0x8f are used for the most common string lengths (0 to 15) so as not to require a separate length field.

The rest have a 1-byte type field. There are also 3 reserved type codes left in the first plane in case something big comes up in the future. You can see the chart here: https://github.com/kstenerud/concise-encoding/blob/master/cb...
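A rough illustration of that first-byte scheme as I understand it from the description above (not the reference decoder; the secondary plane and array types are glossed over):

    static String describeTypeByte(int b) { // b is the unsigned byte value, 0..255
        if (b >= 0x80 && b <= 0x8F) {
            return "short string, length " + (b & 0x0F);
        }
        int signed = (byte) b; // reinterpret as a signed 8-bit integer
        if (signed >= -100 && signed <= 100) {
            return "small integer " + signed;
        }
        if (b == 0x94) {
            return "secondary type plane; next byte selects the type";
        }
        return "some other 1-byte type code";
    }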


Database bulk loads are where I have hit this many times. Often it's not viable to change the import to a binary format (e.g. cross databases, migrating from old proprietary software, etc).


Is there a downside to this? Or is it just a plain better algorithm? If so, why has it not been implemented in the Java standard libs?


It will load 2 big arrays in memory (per classloader). The code is super complex, so maintenance would be very hard. I guess it makes sense to use it as a library (MIT license) if your code parses a lot of doubles.


Yes, but if you were to modify the Java runtime, it could just call the native C version.


The bottleneck is usually I/O, or parsing into large object structures, not double parsing.


See also the work on jsoniter-scala, which is the fastest serialization library on the JVM: https://github.com/plokhotnyuk/jsoniter-scala/pull/542 I wonder how they compare and whether some of the optimization tricks could be shared.

More on the optimization techniques here: https://github.com/FasterXML/jackson-core/issues/577


Anybody wanting to port the fast_float parser to Java too? https://github.com/fastfloat/fast_float

Lemire really has made a huge contribution to computer science.


This is the port of fast_float, at least that is what fast_float's README says.


Well, this doesn't make sense. This port is for 64-bit, aka double-precision, floating point. I am talking about supporting 32-bit floats. Maybe this can be achieved in the same codebase, I don't know.


The fact that this requires Java 16 just because the author didn’t want to write a single POJO seems kind of silly to me.

I also don’t think parsing doubles will ever be the bottleneck in any of my Java code. If it turns out it is, though...


> parsing doubles will ever be the bottleneck in any of my Java code

In my C# and even in C++, the profiler has sometimes shown me that I am bottlenecked on parsing or printing floats. It has happened quite a few times.

In all cases the data was 3D models or similar things. STL, 3MF, VTK, G-mesh, G-code are all text based, either always (looking at you, 3MF consortium) or as an option (STL and some versions of VTK can be either binary or text). Even low-end modern GPUs can easily render 1E+7 triangles these days, thus high-poly 3D models are not uncommon.

Games are not affected because they preprocess the hell out of their assets, optimizing everything they can about the data format. But in CAD/CAM stuff you’re pretty much guaranteed to bottleneck on parsing numbers when loading these text-based formats.


What's the allocation pressure in comparison? Is this a time-for-memory optimization? I.e., am I sacrificing anything by using this?


This is awesome!


Good job!

Though I would have added a unit test to iterate over the complete 2^64 set of values and compare the output to Double.parseDouble().


Even if you could parse a double in a single cycle, it would take 292 years to run through 2^64 cases on a 2GHz machine. https://www.google.com/search?q=2**64+%2F+2GHz&oq=2**64+%2F+...

Also, while there are only 2^64 distinct double values, there are many more string representations than that. For example, "1" and "1.0" are the same underlying value, but a parser needs to be able to handle both.
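A quick check that several representations land on the same bit pattern:

    long a = Double.doubleToLongBits(Double.parseDouble("1"));
    long b = Double.doubleToLongBits(Double.parseDouble("1.0"));
    long c = Double.doubleToLongBits(Double.parseDouble("1e0"));
    System.out.println(a == b && b == c); // true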


My point is about comparing the accelerated code's behavior to the standard Java Double.


Your point is still invalid, because it's pointless to attempt what you're suggesting. And that's not how unit tests work.

Unit testing doesn't mean that all possible values have to be tested. Usually it's about code paths, i.e. standard, error and edge cases.


The successes of fuzzing projects like oss-fuzz have demonstrated significant shortcomings of hand-curating test cases in the manner you describe. Testing every 64-bit float value is unrealistic, but testing a huge number of randomly selected values by cross-comparison with other libraries is a very good idea for code like this.

https://github.com/google/oss-fuzz


My point is based on experience. Yes, it's probably too much to ask to test all 2^64 cases, but I've been dealing with code in production that was failing precisely because some values led to code paths that gave incorrect results.

In this particular case, I would have tested 2^N where N>32 random, unique values, at a minimum.
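A minimal sketch of what such a randomized differential test could look like (the parser under test is passed in, e.g. FastDoubleParser::parseDouble, so nothing here depends on the library's package layout):

    import java.util.SplittableRandom;
    import java.util.function.ToDoubleFunction;

    class ParserDifferentialTest {
        // Compare a parser under test against Double.parseDouble on random bit patterns.
        static void run(ToDoubleFunction<String> parserUnderTest, long trials) {
            SplittableRandom rnd = new SplittableRandom(42);
            for (long n = 0; n < trials; n++) {
                double d = Double.longBitsToDouble(rnd.nextLong());
                if (Double.isNaN(d)) {
                    continue; // NaN never compares equal by value; skip it here
                }
                String s = Double.toString(d);
                long expected = Double.doubleToLongBits(Double.parseDouble(s));
                long actual = Double.doubleToLongBits(parserUnderTest.applyAsDouble(s));
                if (expected != actual) {
                    throw new AssertionError("mismatch for input " + s);
                }
            }
        }
    }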

Of course, it depends on who the customer is. Some customers can tolerate a few bugs here and there; and some would incur million dollar losses.


That doesn't cover the entire space of possibilities, because the input can be an arbitrarily long decimal number. A correct implementation must (in the limit) expand the entire binary representation of that number and correctly round (using round-to-nearest, ties-to-even), even if the tie-breaking bit is a million places down.


Is there a provable limit to the maximum depth of the tie-breaking bit? Otherwise it sounds like a potential DOS vector.


It's only limited by the number of decimal digits (after the decimal point) input by the user; computing and storing that is linear in the input size. It's no more a DOS vector than just an enormous source file.



