
I noticed:

1. Small read buffers. No reason to sequentially read and parse gigabytes only 4kb at a time.

2. parseDelimitedFrom creates a new CodedInputStream for every message, and each stream has its own internal buffer; that's why you don't see a buffered stream wrapper in the examples. Every iteration of the loop allocates a fresh 4kb byte[].

3. The nio protobuf code creates wrappers for the allocated ByteBuffer on every iteration of the loop.

But the real sin with the protobuf code is serializing the same city names over and over: reading, parsing, and hashing them repeatedly. Writing a header that maps each city string to an integer would dramatically shrink the file and speed up parsing. With that done, your cost would essentially be the cost of decoding varints.
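
A minimal sketch of what that could look like, using protobuf's CodedOutputStream (the file name, cities, and record layout here are made up for illustration):

    import com.google.protobuf.CodedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DictEncode {
        public static void main(String[] args) throws IOException {
            // Assign each city a small integer id (insertion order = id).
            Map<String, Integer> ids = new LinkedHashMap<>();
            ids.put("Hamburg", 0);
            ids.put("Bulawayo", 1);

            try (FileOutputStream out = new FileOutputStream("measurements.bin")) {
                CodedOutputStream cos = CodedOutputStream.newInstance(out);
                // Header: entry count, then each name once (length-prefixed).
                cos.writeUInt32NoTag(ids.size());
                for (String city : ids.keySet()) {
                    cos.writeStringNoTag(city);
                }
                // Each record is then just two varints.
                cos.writeUInt32NoTag(ids.get("Hamburg")); // city id
                cos.writeSInt32NoTag(-23);                // temperature (zigzag)
                cos.flush();
            }
        }
    }

Each record costs two varints instead of a length-prefixed string plus framing.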




If you're writing out protobufs, isn't the recommended approach to use some record-oriented wrapper? Something like TFRecord. Or is parseDelimitedFrom how that's implemented in the first place?

But Google must surely have some optimized Java libraries for this given that they use protobufs everywhere.

Edit: I found https://beam.apache.org/releases/javadoc/current/org/apache/... which, given it was developed by Google, is probably close to what they use. I wonder if the logic from this could be pulled out for use in non-distributed workflows.


> Making a header with the city string mapped to an integer would dramatically shrink the file and speed up parsing.

Indeed, Parquet will do this for you if you let it.

I also wonder how good Java is at optimizing closure creation in a loop, as in:

    k -> new ResultObserver()
The vast majority of those closures are created and never called. C++ might optimize this well if the callee is a template, but it has basically no chance if the parameter is a std::function. Java probably has more latitude to optimize it, but I don't know whether it actually does.


I believe the term for this is "partial escape analysis" (PEA).

GraalVM is the SOTA in the JVM world for this.

Here's an oldish blog on this: https://chrisseaton.com/truffleruby/seeing-escape-analysis/

I'm sure PEA is better now, but I'm not sure whether it's moved beyond scalars (int, float, etc.).


Wouldn't the lambda here always escape? It gets created in the function and passed into another one. Looking at the bytecode, it looks like it always allocates it when it's running in the interpreter:

    invokedynamic #16,  0             // InvokeDynamic #0:apply:()Ljava/util/function/Function;


I would expect the relevant optimization to be inlining and/or specialization. An inlined function could allow the JIT to detect that the hot path creates and destroys an object without ever using it. Alternatively, if the JIT could specialize the callee for the specific function that is passed in, the overhead would go away.

(One thing I like about less heavily JIT-reliant languages is that it can be more obvious when something will incur a runtime cost.)


Reading more about how invokedynamic works, it appears that if your lambda doesn't capture anything, it shouldn't be allocated more than once.


The closure itself is only created once; it's essentially a singleton. Only if it captured variables would it have to be recreated on every iteration.
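
A quick way to see this (a sketch; the singleton behavior for non-capturing lambdas is a HotSpot/LambdaMetafactory implementation detail, not a language guarantee):

    import java.util.function.IntSupplier;

    public class LambdaCaching {
        // Non-capturing: the invokedynamic call site links once and, on
        // current HotSpot, hands back the same instance on every evaluation.
        static IntSupplier nonCapturing() {
            return () -> 42;
        }

        // Capturing: each evaluation must hold its own copy of x, so a
        // fresh instance is allocated every time.
        static IntSupplier capturing(int x) {
            return () -> x;
        }

        public static void main(String[] args) {
            System.out.println(nonCapturing() == nonCapturing()); // true on HotSpot
            System.out.println(capturing(7) == capturing(7));     // false
        }
    }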


So what does the performant way of doing this look like? Your final suggestion brought the nio version down to about 32s.

I was looking for ways of re-using the wrappers in the NIO code, but couldn't figure out how to do that.


Well, if you aren't constrained in the format, write it out as fixed-width integers as small as you can, mmap the file, madvise it as MADV_SEQUENTIAL, and scan with a branch-free loop the compiler will vectorize. You'll saturate your read bandwidth at almost any drive speed.
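
Java can't madvise, but a memory map gets you most of the way. A rough sketch, assuming a made-up layout of two little-endian 4-byte ints per record (note FileChannel.map is limited to 2 GB per mapping, so a real multi-gigabyte file would be mapped in chunks or via the newer MemorySegment API):

    import java.io.IOException;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class FixedWidthScan {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("measurements.bin"),
                    StandardOpenOption.READ)) {
                MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                buf.order(ByteOrder.LITTLE_ENDIAN);
                long sum = 0;
                while (buf.remaining() >= 8) {
                    int cityId = buf.getInt(); // fixed 4-byte city id
                    int temp = buf.getInt();   // fixed 4-byte temperature
                    sum += temp;               // stand-in for real aggregation
                }
                System.out.println(sum);
            }
        }
    }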

Another trick I've seen is to set up a ring buffer by mapping the same physical pages twice, back to back, which lets you always read a contiguous region even when it wraps around the end of your buffer. This would let you keep a traditional read loop without having to shift bytes.

In Java, these tools aren't all available. Java protobuf uses an object-oriented style that comes with performance costs. Making the message objects immutable avoids bugs and makes threading easier. If you like the object-oriented style, you can have it, but you'll pay in allocation and cache misses if your objects are all tiny.

The protobuf wire format is optimized for flexibility: if you're storing only two ints per message, only a third of what you parse is your actual content. You pay for the tag number, the message length delimiter, and then the tag numbers for each of your fields: four varints of framing for two varints of content. This lets you add and remove fields of any type in the message safely, but if you're optimizing for pure speed over many tiny snippets of data, you're paying for flexibility you may not need.
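
Concretely, one such record holding two small ints comes out to something like this on the wire (values made up for illustration):

    0x0A        tag: field 1, wire type 2 (the embedded message)
    0x04        length: message body is 4 bytes
    0x08 0x2A   tag: field 1, varint; value 42
    0x10 0x07   tag: field 2, varint; value 7

Six bytes, of which only 0x2A and 0x07 are your data.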

But if you like protobuf, you should be able to get respectable performance in Java by making a single CodedInputStream with a large buffer (16kb at least) and using pushLimit/popLimit yourself to do what parseDelimitedFrom does, repeatedly, without making new stream objects or buffer wrappers every time. At that point I'd expect your bottleneck to be allocating and eventually GCing the message objects, but maybe escape analysis has gotten good enough for those to be stack-allocated nowadays.
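
Something in this direction (a sketch only; "Measurement" stands in for whatever generated message class you're actually parsing):

    import com.google.protobuf.CodedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class DelimitedScan {
        public static void main(String[] args) throws IOException {
            try (FileInputStream in = new FileInputStream("measurements.pb")) {
                // One stream with one large buffer for the whole file,
                // instead of a fresh 4kb CodedInputStream per message.
                CodedInputStream cis = CodedInputStream.newInstance(in, 1 << 16);
                while (!cis.isAtEnd()) {
                    int size = cis.readRawVarint32(); // length prefix
                    int limit = cis.pushLimit(size);  // scope reads to one message
                    Measurement m = Measurement.parseFrom(cis);
                    cis.popLimit(limit);              // restore the outer limit
                    // ... aggregate m here ...
                }
            }
        }
    }

This mirrors what parseDelimitedFrom does internally, minus the per-message stream and buffer allocation.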



