Hacker News

Could someone explain the difference between this and Avro or Parquet? Do they serve the same purpose?



Parquet is designed specifically to store large amounts of data efficiently on disk. As such it defaults to compressing and encoding data to save space. Arrow is designed for immediate consumption without any materialization into a different in-memory data structure. It is already in a format well suited to be used for sending over the wire or reading directly from an API.

I don't know as much about the internals of Avro, but I know it differs from Parquet in that it can also be used to serialize and deserialize smaller amounts of data. It is used to store large datasets in files too, although in most cases it will be less space efficient than Parquet. It has also been used as a way of embedding complex structures into other systems (similarly to how JSON can be embedded in a database), or for serializing individual structures between systems. The binary representation of Avro needs to be read into a system-specific format like a C/C++ struct/object, Java object, etc. for consumption.

In contrast, Arrow is designed to represent a list of objects/records efficiently. It is designed to allow a chunk of memory to be handed to a lightweight language-specific container that can immediately reference into the memory to grab a specific value, without reading each of the records into its own individual object or structure.
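
To make the contrast concrete, here is a rough sketch using only the Python standard library. It is not Arrow's or Avro's actual format: the record layout, the field names (`id`, `price`), and the helper `read_rows` are all made up for illustration. The row-oriented half has to decode every record into its own tuple before anything can be read, while the columnar half wraps existing buffers in a `memoryview` and indexes straight into them.

```python
import struct
from array import array

# Hypothetical records (id, price); not a real Arrow or Avro schema.
records = [(1, 9.5), (2, 12.25), (3, 7.0)]

# Row-oriented: each record is serialized on its own, one after another.
row_bytes = b"".join(struct.pack("<qd", rid, price) for rid, price in records)

def read_rows(buf):
    """Decode every 16-byte record into its own Python tuple."""
    return [struct.unpack_from("<qd", buf, off) for off in range(0, len(buf), 16)]

# Column-oriented: one contiguous buffer per field (native byte order).
ids_buf = array("q", [r[0] for r in records]).tobytes()
prices_buf = array("d", [r[1] for r in records]).tobytes()

# A "lightweight container": memoryview.cast reinterprets the existing
# bytes as typed values without copying or building per-record objects.
ids = memoryview(ids_buf).cast("q")
prices = memoryview(prices_buf).cast("d")

third_price_rows = read_rows(row_bytes)[2][1]  # decodes all records first
third_price_cols = prices[2]                   # indexes directly into the buffer
```

Both lookups return the same value; the difference is how much work happens before you get it. Real Arrow adds validity bitmaps, offsets for variable-length data, and a schema on top of this basic idea.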


Parquet is also designed to store nested data efficiently on disk (by efficient I mean it can retrieve a field at arbitrary depth without needing to walk from the root of the record).
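
A toy illustration of that idea, assuming every record has the same shape and no repeated or optional fields (real Parquet handles those with repetition and definition levels, which this sketch skips entirely). The function `shred` and the sample records are hypothetical. Each leaf field gets its own column keyed by its dotted path, so a deeply nested field can be read without touching the rest of the record.

```python
def shred(records):
    """Split nested dict records into one flat list per leaf field path."""
    columns = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + (key,))
        else:
            # Leaf value: append to the column named by its full path.
            columns.setdefault(".".join(path), []).append(value)

    for record in records:
        walk(record, ())
    return columns

# Hypothetical sample data, made up for this example.
records = [
    {"user": {"name": "a", "address": {"city": "Oslo", "zip": "0150"}}},
    {"user": {"name": "b", "address": {"city": "Lima", "zip": "15001"}}},
]

columns = shred(records)

# Reading one deeply nested field touches only its own column.
cities = columns["user.address.city"]
```

Here the walk from the root happens once at write time; a reader that only wants `user.address.city` never has to parse `user.name` or `user.address.zip`.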



