The sorry state of Java deserialization

charleslmunger · 2024-09-22T15:59:10 1727020750

I noticed:

1. Small read buffers. No reason to sequentially read and parse gigabytes only 4kb at a time.

2. parseDelimitedFrom created a new CodedInputStream on every message, which has its own internal buffer; that's why you don't see a buffered stream wrapper in the examples. Every iteration of the loop is allocating fresh 4kb byte[]s.

3. The nio protobuf code creates wrappers for the allocated ByteBuffer on every iteration of the loop.

But the real sin with the protobuf code is serializing the same city names over and over, reading parsing and hashing. Making a header with the city string mapped to an integer would dramatically shrink the file and speed up parsing. If that was done, your cost would essentially be the cost of decoding varints.

krackers · 2024-09-22T21:01:41 1727038901

If you're writing out protobufs, isn't the recommended approach to use some record-oriented wrapper? Something like TFRecord. Or is parseDelimitedFrom how that's implemented in the first place?

But Google must surely have some optimized Java libraries for this given that they use protobufs everywhere.

Edit: I found https://beam.apache.org/releases/javadoc/current/org/apache/... which given it was developed by google is probably close to what they use. I wonder if the logic from this could be pulled out for use with non-distributed workflows.

amluto · 2024-09-22T17:23:40 1727025820

> Making a header with the city string mapped to an integer would dramatically shrink the file and speed up parsing.

Indeed, Parquet will do this for you if you let it.

I also wonder how good Java is at optimizing closure creation in a loop, as in:

    k -> new ResultObserver()

The vast majority of those closures are created and never called. C++ might optimize this well if calling a template but has basically no chance if the parameter is an std::function. Java probably has more ability to optimize it, but I don’t know whether it actually does so.

the-alchemist · 2024-09-22T17:35:47 1727026547

I believe the term for this is "partial escape analysis" (PEA).

GraalVM is the SOTA in the JVM world for this.

Here's an oldish blog on this: https://chrisseaton.com/truffleruby/seeing-escape-analysis/

I'm sure PEA is better now, but it's not sure if it's moved beyond scalars (int, float, etc.)

vips7L · 2024-09-22T23:05:50 1727046350

Wouldn't the lambda here by always escape? It gets created in the function and passed into another one. Looking at the byte code it looks like it always allocates in when its in the interpreter:

    invokedynamic #16,  0             // InvokeDynamic #0:apply:()Ljava/util/function/Function;

amluto · 2024-09-23T01:32:16 1727055136

I would expect the relevant optimization to be inlining and/or specialization. An inlined function could allow the JIT detect that the hot path creates and destroys an object without ever using it. Alternatively, if the JIT could specialize the callee for the specific function that is passed in, then the overhead would go away.

(One think I like about less heavily JIT-reliant languages is that it can be more obvious when something will incur a runtime cost.)

vips7L · 2024-09-23T23:25:34 1727133934

Reading more about how invokedynamic works, it appears that if your lambda doesn't capture it shouldn't allocate it more than one time.

WASDx · 2024-09-22T17:38:45 1727026725

The closure itself is only being created once, it's essentially a singleton. Only if it would capture variables it would have to be recreated every iteration.

marginalia_nu · 2024-09-22T19:26:16 1727033176

So what does the performant way of doing this look like? Your final suggestion brought the nio version down to about 32s.

I was looking for ways of re-using the wrappers in the NIO code, but couldn't figure out how to do that.

charleslmunger · 2024-09-25T02:08:40 1727230120

Well, if you aren't constrained in the format, write it out as fixed width integers as small as you can, mmap the file, madvise as MADV_SEQUENTIAL, and scan with a branch free loop the compiler will vectorize. You'll saturate your read bandwidth for almost any speed.

Another trick I've seen is to set up a ring buffer by mapping the same physical page on either side of another, which allows you to always read a contiguous region even if it wraps around the end of your buffer. This would let you keep a traditional read loop without having to shift bytes.

In Java, these tools aren't all available. Java protobuf uses an object oriented style that comes with performance costs. Making the message object immutable avoids bugs and makes threading easier. If you like the object oriented style, you can have it, but you'll pay in allocation and cache misses if your objects are all tiny.

The protobuf wire format is optimized for flexibility - if you're storing only two ints per message, only 1/3 of the parsing you do is your actual content. You pay for the tag number, the message length delimiter, and then the tag numbers for each of your fields - four varints of framing for two varints of content. This lets you add and remove fields of any type in the message safely, but if you're optimizing for pure speed of many tiny snippets of data, you are paying for flexibility you may not need.

But if you like protobuf you should be able to get respectable performance in Java by making a single CodedInputStream with a large buffer (16kb at least) and using push limit/poplimit yourself to do parseDelimitedFrom repeatedly without making new stream objects or buffer wrappers every time. At that point I'd expect your bottleneck to be allocating and eventually GCing the message objects, but maybe escape analysis has gotten good enough for those to be stack allocated nowadays.

mynegation · 2024-09-22T13:41:50 1727012510

“I admit I don’t understand these results. There’s clearly nothing in the runtime itself that prevents these types of speeds.” Oh, there is. The default Java serialization is sort of like “pickle” module in Python - if you are familiar. It will deal with pretty much anything you throw at it, figuring the data structures and offsets to serialize or parse at runtime. More efficient methods trade universality for speed, where the offsets and calls to read/write the parts of the structure are determined in advance. Also, hard to say without source code but there is a high chance even more efficient methods like Protobuf create a lot of Java objects and that kills cache locality. With Java, you have to go out of your way to maintain good cache locality because you give up control over memory layout for automatic memory management.

jffhn · 2024-09-22T16:10:49 1727021449

>With Java, you have to go out of your way to maintain good cache locality because you give up control over memory layout for automatic memory management.

There is a shadowy cult in a hidden corner of the Java community, an heresy to many, only followed by a handful of obnoxious zealots inspired by the dark ages of Ada 83, C, or even assembly, who take pride in creating Java programs that only allocate a finite amount of objects regardless of how long you run them for, and to which the "new" keyword is a taboo which avoidable use is assimilated to blasphemy.

As a member of this sect, in a few cases of presenting some of our programs on some laptop, I've had dumbfounded observers looking around the laptop for the network cable linking it to the server they thought it must have been running on.

gred · 2024-09-22T18:20:03 1727029203

Ah, so you're one of the six Epsilon GC users :-)

vrighter · 2024-09-23T06:42:13 1727073733

this is pretty much a necessity sometimes. I was writing an audio synthesizer in c# and this is exactly what I had to do. Allocate whatever I need up front and ban the new keyword from then on.

Not that different to c++ had I chosen that instead.

neonsunset · 2024-09-23T10:09:21 1727086161

To be fair, working with audio is great in a way that you usually know the exact size of the buffer you need, so it is easy to either stackalloc it or NativeMemory.Alloc the pointer and wrap it in a span, completely bypassing GC. The downside is there aren’t that many libraries that are up-to-date, but working with system APIs directly or e.g. PortAudio bindings is also an option.

kaba0 · 2024-09-22T14:52:05 1727016725

> Also, hard to say without source code but there is a high chance even more efficient methods like Protobuf create a lot of Java objects and that kills cache locality

I don’t think this can be claimed that easily without more info, generational GCs work pretty much like an arena allocator, with very good cache locality (think of an ArrayList getting filled with objects that are continuously allocated in short order. The objects will be right next to each other, in memory). If the objects are short-lived, they can be similarly cheap to stack allocation (thread-local allocation buffers that just bumping pointers).

marginalia_nu · 2024-09-22T15:32:43 1727019163

GC pressure is another factor. Even in the trivial ownership case, gc:ing in a GB/s allocation environment comes at a cost.

NovaX · 2024-09-22T20:21:01 1727036461

Adding Guava testlib's GcFinalization.awaitFullGc() before a benchmark run and -XX:+UseParallelGC, I saw the runtimes decrease by 30s in Bench_Fury_Ordinal and 20s in Bench_ObjectOutputStream. Ideally you would run using JMH to avoid jit warmup issues, previous runs, etc from polluting your results.

marginalia_nu · 2024-09-22T20:42:27 1727037747

Did you dial up to run for 1 billion items?

In general I'm not a big fan of JMH for testing sustained I/O scenarios, as CPU, OS and storage behavior are extremely relevant, and JMH tends to interfere with access patterns.

NovaX · 2024-09-22T21:24:59 1727040299

Nope, I used the default in your github repository (10_000_000). From a quick profile it looked like previous benchmark allocations where crossing into later runs, who were then penalized unfairly, so I made those small adjustments.

hedora · 2024-09-22T14:27:45 1727015265

The article does have source code.

I don’t think any of the examples use Java’s Serializable. The first attempt reads shorts and utf8 directly from the stream.

marginalia_nu · 2024-09-22T14:29:02 1727015342

ObjectInputStream is one of the faster stream options tested.

jbellis · 2024-09-22T14:27:28 1727015248

True, but none of the slow methods in the article involve this.

garblegarble · 2024-09-22T15:04:34 1727017474

I know it's bad form to comment on style instead of content, but saying Smartphone enjoyers will want to switch to horizontal mode for this article due to code samples that barely fit on desktop while having the article text column shrink to less than 1/3rd of the horizontal space just feels disrespectful

marginalia_nu · 2024-09-22T15:07:43 1727017663

Try refresh? Should cover 60ch, but some browsers bug out when you turn for reasons i don't understand.

garblegarble · 2024-09-22T16:32:26 1727022746

Tried a refresh, a force-refresh... no change at all (Firefox and Chrome)

marginalia_nu · 2024-09-22T16:35:05 1727022905

Interesting. Android or iOS?

garblegarble · 2024-09-22T18:18:48 1727029128

Desktop - that's my complaint... you wrote that the code samples "barely fit on desktop", which is only true because your CSS wastes over 2/3rds of the screen width I have provided to my user agent.

(it's fine though - I use Reader Mode on such user-hostile sites)

marginalia_nu · 2024-09-22T18:43:03 1727030583

Oh yeah, now I understand.

I'm struggling to find a width for the layout that makes sense for text (where narrow columns are preferable), but also for code snippets (where you want wider columns).

garblegarble · 2024-09-22T20:18:51 1727036331

You may be overthinking it - most blogs simply allow the text to be the width of the window... your article is very readable in Firefox's Reader Mode, so for what it's worth, my $0.02 is that'd be fine

SebastianKra · 2024-09-22T15:09:25 1727017765

Yep. I have the same frustration with GitHub and GitLab which add 3km of horizontal padding on phone screens for no reason.

vips7L · 2024-09-22T15:25:48 1727018748

Size 89 font for the code samples too

splix · 2024-09-22T15:09:10 1727017750

I don't think that Java serialization is designed for such a small object with just two fields. It's designed for large and complex objects. Obviously it would be slower and much larger in size that a columnar implementation designed and heavily optimized for this scenarios. It's not a fair comparison and too far from a real use case.

Try with nested objects and at least a dozen of fields across this hierarchy. And different structure for each row. It's still not a use case for Java serialization, but at least closer to what a real code would do.

Same for Protobuf, I guess. Also the JSON serialization plays the same role more or less.

Maybe something like Avro Data Files is better for a comparison with columnar formats.

mike_hearn · 2024-09-22T14:55:26 1727016926

It'd be interesting to see Cap'n'Proto be profiled here, as the whole idea of that format is to eliminate deserialization overhead entirely.

marginalia_nu · 2024-09-22T19:11:14 1727032274

A repo with benchmark code is up now:

https://github.com/vlofgren/Serialization1BRCBenchmark/

SirYwell · 2024-09-22T15:59:04 1727020744

Why doesn't it mention the used Java version? And a few flame graphs would be interesting as well.

twoodfin · 2024-09-22T13:46:52 1727012812

I’m probably missing something obvious, but what’s wrong with Apache parquet-java for this use case?

marginalia_nu · 2024-09-22T14:03:45 1727013825

The implementation is inexorably merged with hadoop, to the point where it is not useful outside of it.

Parquet-floor is a shim that replaces the hadoop depenencies with drop in java.io-ones.

krackers · 2024-09-22T20:54:50 1727038490

What's the reason why reading 3GB via duckdb is _faster_ than raw HDD speed? Is it compression/caching?

marginalia_nu · 2024-09-22T21:07:09 1727039229

That's an in memory benchmark, to help give context to the figures in the duckdb parquet benchmarks.

pestatije · 2024-09-22T13:36:00 1727012160

are the benchmarks done properly? whats the actual test code?

Jaxan · 2024-09-22T13:38:59 1727012339

I don’t think the conclusion need a lot of precision in the benchmark. When the suggested code from the standard library (or some tutorial) is two orders of magnitude slower, something is not right.

The author is right that we are wasting something somewhere when we are only operating at 2% of the possible speed of the hard disk.

antonhag · 2024-09-22T14:14:30 1727014470

From the code samples it's hard to tell whether or not this has to do with de-serialization though. It would have been fun to see profiling results for tests such as these.

marginalia_nu · 2024-09-22T14:26:05 1727015165

Author here, I'm away from my computer atm, but I can cook up a repo with each test in a few hours when I get home.

I designed the tests as a drag race because that mimics my real world usage.

antonhag · 2024-09-22T15:55:35 1727020535

That's nice - I'd encourage you to play around with attaching e.g. JMC [1] to the process to better understand why things are as they are.

I tried recreating your DataInputStream + BufferedInputStream (wrote the 1brc data to separate output files, read using your code - I had to guess at ResultObserver implementation though). On my machine it roughly in the same time frame as yours - ~1min.

According to Flight Recorder:

  - ~49% of the time is spent in reading the strings (city names). Almost all of it in the DataInputStream.readUTF/readFully methods.
  - ~5% of the time is spent reading temperature (readShort)
  - ~41% of the time is spent doing hashmap look-ups for computeIfAbsent()
  - About 50GB of memory is allocated - %99.9 of it for Strings (and the wrapped byte[] array in them). This likely causes quite a bit of GC pressure.

Hash-map lookups are not de-serialization, yet the lookup likely affected the benchmarks quite a bit. The rest of the time is mostly spent in reading and allocating strings. I would guess that that is true for some of the other implementations in the original post as well.

[1] https://github.com/openjdk/jmc

edit: better link to JMC

marginalia_nu · 2024-09-22T18:21:23 1727029283

JMC is indeed a valuable tool, though what you see in any java profiler is to be taken with a grain of salt. The string parsing and hash lookups are present in most of the implementations, yet some of them are up to 10 times faster than the DataInputStream + BufferedInputStream code.

It doesn't seem like it can be true that 90% of the time is spent in string parsing and hash lookups if the same operation takes 10% of the time when reading from a filechannel and bytebuffer.

antonhag · 2024-09-22T18:31:45 1727029905

Aren't the versions that take 10% of the time only reading each city name once, and then doing an array lookup rather than a hashmap lookup?

marginalia_nu · 2024-09-22T18:41:34 1727030494

Nope, see for example "Custom 1":

  var buffer = ByteBuffer.allocate(4096);
  try (var fc = (FileChannel) Files.newByteChannel(tempFile, 
                        StandardOpenOption.READ)) 
  {

    buffer.flip();

    for (int i = 0; i < records; i++) {

        if (buffer.remaining() < 32) {
            buffer.compact();
            fc.read(buffer);
            buffer.flip();
        }

        int len = buffer.get();
        byte[] cityBytes = new byte[len];
        buffer.get(cityBytes);
        String city = new String(cityBytes);
        int temperature = buffer.getShort();

        stats.computeIfAbsent(city, k -> new ResultsObserver())
             .observe(temperature / 100.);
    }
  }

antonhag · 2024-09-22T19:30:41 1727033441

My bad - I got confused as the original DIS+BIS took ~60s on my machine. I reproducing the Custom 1 implementation locally (before seeing your repo) and it took ~48s on the same machine. JFR (which you honestly can trust most of the time) says that the HashMap lookup now is ~50% of the time and the String constructor call being ~35%.

haglin · 2024-09-22T21:46:00 1727041560

JFR only samples running Java methods.

I would guess at least some of the bottlenecks are in hardware, the operating system or in native code (including the JVM) in this case.

cowwoc · 2024-09-22T14:42:04 1727016124

Hi,

Please add https://github.com/apache/fury to the benchmark. It claims to be a drop-in replacement for the built-in serialization mechanism so it should be easy to try.

marginalia_nu · 2024-09-22T14:48:15 1727016495

Will do!