Saving Millions by Dumping Java Serialization (quantcast.com)
86 points by jnewhouse on April 12, 2017 | 45 comments



Was using Thrift or Protobuf an option?


I'd like to know this too. As a passerby, those seem to have solved serialization, so I'm curious why you need rowfiles instead of e.g. protobuf.


One reason off the top of my head: using such a communication protocol would require changes to the other services consuming it.


So did switching to their homebrew serialization format -- in fact, most of the article is about how they managed the changes (which touched codebases at multiple sites in a fairly large organization).


Those switches all occurred at the pipeline level, leaving the map-reduce platform untouched. Switching our base logs to something like Parquet, Thrift or Protobuf would be a much larger project. We do support writing and reading Parquet to allow us to interface with other big data systems.


We started developing rowfiles around 2005. Thrift wasn't open sourced until 2007. I couldn't find a date for protobuf's release, but I don't think it was standard outside of Google at that time. We use protobufs internally, and have a number of Rows whose field values are byte[]s containing protobufs. One big thing our rowfiles give us is fast indexing. The only other big data format I know of that gives that is Kudu, which uses the same indexing scheme.


> Kudu

Do you mean Apache Arrow?


Nope, Kudu https://kudu.apache.org/. Although from Arrow's homepage it looks like it works with Kudu. "Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics."


Former Kudu developer here.

Kudu was designed to be a columnar data store. Which means that if you had a schema like table PERSON(name STRING, age INT), you would store all the names together, and then all the ages together. This lets you do fancy tricks like run-length encoding on the age fields, and so forth.
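For anyone who hasn't worked with a columnar store, here's a toy Java sketch (not Kudu's actual code) of the run-length-encoding trick a column-wise layout enables:

    // Toy illustration: with PERSON(name STRING, age INT) stored column-wise,
    // all the ages sit together and can be run-length encoded as (value, count) runs.
    import java.util.ArrayList;
    import java.util.List;

    public class ColumnarSketch {
        static List<long[]> runLengthEncode(int[] ages) {
            List<long[]> runs = new ArrayList<>();        // each entry is {value, count}
            for (int age : ages) {
                long[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
                if (last != null && last[0] == age) {
                    last[1]++;                            // extend the current run
                } else {
                    runs.add(new long[] {age, 1});        // start a new run
                }
            }
            return runs;
        }

        public static void main(String[] args) {
            // The name column and the age column are stored separately.
            String[] names = {"alice", "bob", "carol", "dave"};
            int[] ages = {34, 34, 34, 41};
            System.out.println(names.length + " names in their own column");
            for (long[] run : runLengthEncode(ages)) {
                System.out.println("age " + run[0] + " x " + run[1]);  // 34 x 3, 41 x 1
            }
        }
    }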

There is also a Kudu RPC format which uses protocol buffers in places. But Kudu also sends some data over the wire that is not encoded as protocol buffers, to avoid the significant overhead of protocol buffer serialization / deserialization.

Apache Arrow is a separate project which started taking shape later. Essentially it was conceived as a way of allowing multiple applications to share columnar data that was in memory. I remember there being a big focus on sharing memory locally in a "zero-copy" fashion by using things like memory mapping. I never actually worked on the project, though. Developers of projects like Impala, Kudu, etc. thought this was a cool idea since it would speed up their projects. I don't know how far along the implementations are, though. Certainly Kudu did not have any integration with Arrow when I worked on it, although that may have changed.

Protocol buffers is simple and has a lot of language bindings, but its performance is poor on Java. The official PB Java libraries generate a lot of temporary objects, which is a big problem in projects like HDFS. It works better on C++, but it's still not super-efficient or anything. It's... better than ASN.1, I guess? It has optional fields, and XDR didn't...


What is Thrift? Is it a service?


Thrift is a software library, not a BaaS...


Some interesting benchmarks of various Java serialization libraries:

https://github.com/eishay/jvm-serializers/wiki


Author here, let me know if you have any questions/want more details.


Thanks for writing about your experiences. Why not use an existing serialization framework, such as Protobuf, instead of building something in-house?


I don't think protobuf was around for public use when we came up with this format, which began around 2005. We use Protobuf internally, and some of our columns are actually byte[]'s containing protobuf data. We now support Parquet and are doing more work with other big data tools, but we've had a hard time matching the performance of our custom stuff.


1) Can you provide any more details about how Rowfiles are structured and/or implemented? Specifically, how does it handle nested objects? Does it support `transient`? Do `writeObject` and/or `readObject` come into play?

2) Do you feel this is a generic enough solution that you would consider submitting it as a JSR?


It natively supports a limited set of Columns: basically boxed primitives, java.util.Date, joda.time.DateTime, and arrays and double arrays of both the boxed and unboxed versions of the preceding. The list of Columns in use determines how a Row is read from and written to a byte buffer. The byte buffer is almost entirely the field data, with one or two bytes describing how the subsequent field is encoded. Nested objects aren't handled out of the box, but there is the capability to define a UserRowField that allows any Serializable class to be serialized/deserialized to bytes. This gets used a lot for our SQL map-reduce function. The downside is that you need the UserRowField implementation in your classpath in order to read the Row, which isn't generally the case.
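To make that concrete, here's a rough sketch of the general idea (a tag byte per field followed by the raw value bytes); the tags and layout below are made up for illustration and aren't our actual format:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class RowBufferSketch {
        // Hypothetical encoding tags, purely for illustration.
        static final byte TAG_LONG = 1;
        static final byte TAG_DOUBLE = 2;
        static final byte TAG_STRING = 3;

        static void writeLong(ByteBuffer buf, long v)     { buf.put(TAG_LONG).putLong(v); }
        static void writeDouble(ByteBuffer buf, double v) { buf.put(TAG_DOUBLE).putDouble(v); }
        static void writeString(ByteBuffer buf, String s) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            buf.put(TAG_STRING).putInt(bytes.length).put(bytes);  // length-prefixed, no escaping
        }

        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocate(64);
            writeLong(buf, 1491955200000L);    // e.g. a timestamp column
            writeDouble(buf, 0.25);            // e.g. a price column
            writeString(buf, "example.com");   // e.g. a domain column
            System.out.println("encoded row is " + buf.position() + " bytes");
        }
    }

The buffer is almost entirely field data; the per-field overhead is just the tag (plus a length for variable-width fields).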


So... what's quantcast?


We're a big data advertising and measurement company based in San Francisco. We run online display ad campaigns for marketers across realtime bidding (RTB) exchanges, such as those run by Google and AppNexus. We also provide a publisher product that gives site owners insights into their audience. Stack Overflow's profile is at https://www.quantcast.com/stackoverflow.com.


I am by no means an expert, but I always wonder why people don't adopt ASN.1 for serialization. I know it's not pretty, but writing machine-readable stuff never is.


Portable Object Format from 10 years ago? https://docs.oracle.com/cd/E24290_01/coh.371/e22837/api_pof....


Can someone explain to an amateur why serialization is faster than say passing raw JSON?

It seems like parsing JSON would be faster than the serialize -> deserialize process, but with the popularity of things like Protobuf it's clear that JSON is slower.


Serialization is the process of writing arbitrary data out into a blob of some sort (binary, text, whatever) that can be read in later and processed back into the original data, possibly not by the same system. This should be considered to include even the degenerate case of just writing the content of an expanse of RAM out, as that still raises issues related to serialization.

"JSON Serialization" and "Java Serialization" are two different things that can accomplish that goal. It sounds to me from your question that you think they have some fundamental difference, because your second paragraph implies you believe there is some sort of fundamental difference between Java serialization and JSON serialization, but there isn't. There is a whole host of non-fundamental differences that you always have to consider with a serialization format (speed, what can be represented, circular data structure handling, whether untrusted data can be used), but there's not a fundamental difference.


JSON is a serialization format, just one that at least nods in the direction of human-readability. Formats which don't worry about human readability (e.g. Protobuf) can gain various degrees of efficiency.


For one thing, all numbers can be written in binary, saving the lexing and conversion time. For another, strings can be written by first writing the length (in binary, of course), then writing the raw contents; there's no need to scan the input looking for the closing quote, handle backslash escapes, or do UTF-8 conversion.

That's probably most of the gain right there, but more things can be done along those lines.
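A minimal sketch of that point, using Java's DataOutputStream for the binary side and a hand-written JSON string for comparison:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class BinaryVsJson {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(123456789);          // 4 fixed bytes, no digit parsing on read
                out.writeUTF("hello \"world\"");  // 2-byte length prefix + raw bytes, no escape scanning
            }
            String json = "{\"n\":123456789,\"s\":\"hello \\\"world\\\"\"}";
            System.out.println("binary: " + bytes.size() + " bytes, JSON: " + json.length() + " chars");
        }
    }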


And no whitespace or curly braces taking up room, so the serialized data is smaller, and thus faster to transmit/store. Downside: Legibility? Future-proofing? What's that?


>Downside: Legibility? Future-proofing? What's that?

There's nothing in this practice that is against future-proofing.

Legibility, yes, but those formats are not meant to be human readable.


Without the ability to future-proof being inherent in the format (like XML, which is self-describing), the sad reality of development practices in programming shops means that one day, someone will make an undocumented change or take a shortcut that tightly couples the binary format to the specific version of the code used to produce and read it. Which is fine, as long as you know that coupling will happen when you're planning things. Not so fun when you have to go back and read a 3-year-old file, only to discover that you can't.

Something that comes to mind is the old COM-based formats that MS Office used to use. Eventually they had to abandon them (and not just because of the EU lawsuit) because they were unmaintainable, and no one understood how they worked well enough to not break backwards compatibility for the next release.


Either way, it's serialization: object serialization or JSON serialization.

However, independently of how the data is represented (JSON or one of the many binary formats), the issue is to only encode/decode what you actually care about. From the little I understand about java object serialization, there's a lot of extra stuff that gets encoded, which may not be needed at all for the application at hand.

For an example of efficient serialization techniques, take a look at some of the MPEG formats (the older ones are easier to grok). They have a neat way of representing what is needed and dealing with optional data.


JSON must also be serialized or deserialized. Parsing it is slow and hard and not cache friendly.

Protobuf has the benefit of being extremely compact and, in some important languages, fast and friendly to serialize and deserialize.


Thanks! You and the others are right: I didn't know JSON was serialized.

I can see why something in binary would be faster than structured text (think assembler vs Python).

Thanks again.


>I didn't know JSON was serialized.

Think of it like this: anytime you get stuff from the memory of your program (arrays, lists, strings, etc) and export it in a textual or binary format that can be exchanged between programs, moved over the network, saved to a file, etc, that's serialization.


JSON is a serialization format itself. Even Javascript, which has JSON-like structures as native objects, has to serialize it (JSON.stringify) and deserialize it (JSON.parse) into text.
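The same round trip looks much the same in Java; a minimal sketch, assuming the Jackson library is on the classpath:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Map;

    public class JsonRoundTrip {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // Serialize an in-memory map to JSON text, then parse it back.
            String text = mapper.writeValueAsString(Map.of("name", "alice", "age", 34));
            Map<?, ?> back = mapper.readValue(text, Map.class);
            System.out.println(text + " -> " + back);
        }
    }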


> Secondly, Java serialization produces very bulky outputs. Each serialization contains all of the data required to deserialize. When you’re writing billions of records at a time, recording the schema in every record massively increases your data size.

Sounds to me like you shouldn't be storing objects in your database.

Why not just write the data into tables, and then create new POJOs when necessary, using the selected data?


A standard database table isn't large enough to handle our large datasets. For example, the Hercules dataset was over 2 petabytes and even after optimization is almost 1 petabyte. Big data systems like Spark, Impala, Presto, etc. are designed to make the data look like a table, even though it is spread out across many files in a distributed filesystem. This is what we do. It's pretty common to reimplement some database features on top of these big data file formats. In our case we have very fast indexes that let us quickly fetch data, similar to an index on a PostgreSQL table.


Well, you understand your system and requirements better than I, obviously, but...

    A standard database table isn't large enough to handle our large datasets
... isn't much of an answer as to why you're storing objects in your database.

As you already mentioned in your post, serialized objects are big - they contain all of their data, plus everything necessary to deserialize the object into something usable.

I imagine your objects have the standard amount of strings, characters, numbers, booleans, etc... why not just store those in the database and select them back out when needed? Less data in the database, and faster retrieval time since you skip serialization in both steps (storage and retrieval). Even if you have nested objects within nested objects, you can write-out a "flat" version of the data to a couple of joined tables surely.

On the other hand, serializing the object is probably more "simple" to implement and use... but then you get the classical tradeoff of performance vs. convenience.


What's "the database" that you have in mind?

Start out with the idea that you have hundreds of machines in your cluster, with 1000s of TB of data. Suppose the current data efficiency is on the order of 80% - that is, 80% of the 1000s of TB is the actual bytes of the data fields. What database do you have in mind to store this data, still on the order of 1000s of TB?

You say: a couple of joined tables. So you have hundreds of machines, and the tables are not all going to fit on one machine; each table is going to be scattered across hundreds of machines. How do you efficiently do a join across two distributed tables?

It's no picnic.

If each row in one table only has a few related rows in the other table, it's much, much better to store the related data inline. Locality is key; you want data in memory right now, not somewhere on disk across the network.


The article explains that they did, in fact, promote the data into columns of the data store.


Perhaps I could share my project, which is trying to 'fix' Java serialization?

It was originally part of a database engine, but was extracted into a separate project. It handles things like cyclic references, non-recursive graph traversal and incremental serialization of large object graphs.

https://github.com/jankotek/elsa/


TLDR: we had shitty code, optimized it, now it runs well. No code examples, nothing.


If you want more details: we were packing a Row class into a base64 encoded string using an ObjectOutputStream. This is a fine thing for small scale serialization but sucks at scale, for the reasons mentioned in the post. Sorry we don't have code examples, but it's unclear how useful they'd be given that no one else uses our file format.

If you want a bit more detail on how the format works: each part's metadata contains a list of typed columns defining the schema of that part. Our map-reduce framework has a bunch of internal logic that tries to reconcile the written Row class with the one the Mapper class is asking for. This allows us to do things like ingest different versions of a row within the context of a single job.

I think questions of serialization at this scale are generally interesting, although ymmv. I know of one company using Avro, which doesn't let you cleanly update or track schemas. They've ended up storing every schema in an HBase table and reserving the first 8 bytes to do a lookup into this table to know the row's schema.
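For anyone who wants to picture the old approach: a minimal sketch of packing a Serializable row into a base64 string with ObjectOutputStream. The Row class below is a made-up stand-in, not our actual Row:

    import java.io.*;
    import java.util.Base64;

    public class LegacyPackingSketch {
        static class Row implements Serializable {       // stand-in, not the real Row class
            long timestamp = 1491955200000L;
            String domain = "example.com";
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Row());              // writes class descriptor + field names + values
            }
            String packed = Base64.getEncoder().encodeToString(bytes.toByteArray());
            System.out.println(packed.length() + " base64 chars for two small fields");
        }
    }

Every record carries the full class and field metadata, which is what blows up the output size at billions of records.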


Avro can store the schema inline or out of line; with inline schemas, it's at the start of the file (embedded JSON), and it describes the schema for all the rows in that file. If you're working with Hive, the schema you put in the Hive metastore is cross-checked with each Avro file read; if any given Avro file doesn't contain a particular column, it just turns up as null for that subset of rows. Spark and Impala work similarly.
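A minimal sketch of that resolution behaviour, assuming the standard Avro Java API: write a file without a "city" column, read it back with a newer reader schema, and the missing column shows up as its default (null):

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.*;
    import java.io.File;

    public class AvroEvolutionSketch {
        public static void main(String[] args) throws Exception {
            Schema writer = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Person\",\"fields\":"
                + "[{\"name\":\"name\",\"type\":\"string\"}]}");
            Schema reader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"city\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

            File file = new File("person.avro");
            try (DataFileWriter<GenericRecord> w =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(writer))) {
                w.create(writer, file);                  // the writer schema is embedded in the file header
                GenericRecord rec = new GenericData.Record(writer);
                rec.put("name", "alice");
                w.append(rec);
            }
            try (DataFileReader<GenericRecord> r = new DataFileReader<>(
                     file, new GenericDatumReader<GenericRecord>(null, reader))) {
                System.out.println(r.next());            // {"name": "alice", "city": null}
            }
        }
    }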

I agree serialization at scale is interesting. My particular interest right at this moment is in efficiently doing incremental updates of HDFS files (Parquet & Avro) from observing changes in MySQL tables - not completely trivial because some ETL with joins and unions is required to get data in the right shape.


What do you mean when you say Avro doesn't let you "cleanly update or track schema"?

From what I've read about Avro: 1. It can transform data between two compatible schemas. 2. It can serialize/load schemas off the wire, so you can send the schema in the header.

If schema serialization causes too much overhead, you can set things up so you only send the schema version identifier, as long as the receiver can use that to get access to the full schema.


I think what I'd heard about was likely a poorly implemented use of Avro. I haven't actually worked with it.


"We had shitty code, optimized it, now it runs [maybe] better" is probably the only fixture in software development.

I'm not sure what kind of code samples one would want in the context of an article as abstract as this one.

I did not take anything away from the article, but had a lot of "been there, done that" moments when skimming it. The difference between the OP and me: the OP wrote about it so others can learn, something I never did.



