Hacker News new | past | comments | ask | show | jobs | submit login

I don’t think the conclusion need a lot of precision in the benchmark. When the suggested code from the standard library (or some tutorial) is two orders of magnitude slower, something is not right.

The author is right that we are wasting something somewhere when we are only operating at 2% of the possible speed of the hard disk.




From the code samples it's hard to tell whether or not this has to do with de-serialization though. It would have been fun to see profiling results for tests such as these.


Author here, I'm away from my computer atm, but I can cook up a repo with each test in a few hours when I get home.

I designed the tests as a drag race because that mimics my real world usage.


That's nice - I'd encourage you to play around with attaching e.g. JMC [1] to the process to better understand why things are as they are.

I tried recreating your DataInputStream + BufferedInputStream (wrote the 1brc data to separate output files, read using your code - I had to guess at ResultObserver implementation though). On my machine it roughly in the same time frame as yours - ~1min.

According to Flight Recorder:

  - ~49% of the time is spent in reading the strings (city names). Almost all of it in the DataInputStream.readUTF/readFully methods.
  - ~5% of the time is spent reading temperature (readShort)
  - ~41% of the time is spent doing hashmap look-ups for computeIfAbsent()
  - About 50GB of memory is allocated - %99.9 of it for Strings (and the wrapped byte[] array in them). This likely causes quite a bit of GC pressure.
Hash-map lookups are not de-serialization, yet the lookup likely affected the benchmarks quite a bit. The rest of the time is mostly spent in reading and allocating strings. I would guess that that is true for some of the other implementations in the original post as well.

[1] https://github.com/openjdk/jmc

edit: better link to JMC


JMC is indeed a valuable tool, though what you see in any java profiler is to be taken with a grain of salt. The string parsing and hash lookups are present in most of the implementations, yet some of them are up to 10 times faster than the DataInputStream + BufferedInputStream code.

It doesn't seem like it can be true that 90% of the time is spent in string parsing and hash lookups if the same operation takes 10% of the time when reading from a filechannel and bytebuffer.


Aren't the versions that take 10% of the time only reading each city name once, and then doing an array lookup rather than a hashmap lookup?


Nope, see for example "Custom 1":

  var buffer = ByteBuffer.allocate(4096);
  try (var fc = (FileChannel) Files.newByteChannel(tempFile, 
                        StandardOpenOption.READ)) 
  {

    buffer.flip();

    for (int i = 0; i < records; i++) {

        if (buffer.remaining() < 32) {
            buffer.compact();
            fc.read(buffer);
            buffer.flip();
        }

        int len = buffer.get();
        byte[] cityBytes = new byte[len];
        buffer.get(cityBytes);
        String city = new String(cityBytes);
        int temperature = buffer.getShort();

        stats.computeIfAbsent(city, k -> new ResultsObserver())
             .observe(temperature / 100.);
    }
  }


My bad - I got confused as the original DIS+BIS took ~60s on my machine. I reproducing the Custom 1 implementation locally (before seeing your repo) and it took ~48s on the same machine. JFR (which you honestly can trust most of the time) says that the HashMap lookup now is ~50% of the time and the String constructor call being ~35%.


JFR only samples running Java methods.

I would guess at least some of the bottlenecks are in hardware, the operating system or in native code (including the JVM) in this case.


Hi,

Please add https://github.com/apache/fury to the benchmark. It claims to be a drop-in replacement for the built-in serialization mechanism so it should be easy to try.


Will do!




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: