Apache Arrow: A new open source in-memory columnar data format (apache.org)
170 points by jkestelyn on Feb 17, 2016 | 44 comments



If you don't speak press-release: this is a cool project to create an in-memory interop format for columnar data, so various tools can share data without a serialize/deserialize step, which is very expensive compared to copying or sharing memory on the same machine or within a local network.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f...

(edited post because I failed at reading git and didn't notice the Java implementation)


Disclosure: I am a committer on Apache Drill and Apache Arrow.

This isn't actually true. The Java implementation has been complete and in use in Apache Drill, a distributed SQL engine, for the past few years. While we anticipate a few small changes to make sure the standard works well across new systems, this is by no means an announcement without tested code.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=commit...


I stumbled upon Google's whitepaper on Dremel. IIRC it explained how to store the data in columnar format, but I didn't quite get how that translated into quick queries. Happen to know where I can look to better understand how it works?


I'll take a stab at explaining it myself: "transactional queries" are faster in a traditional format because they access many columns in few rows. For instance, if you want to log in to a website, you access the username, password, and possibly other authentication factors for a single user: this ends up being faster if you can go to the user's row, and then read all those fields in a contiguous scan.

"Analytical queries" are faster in a column-based format, because you're doing things like computing the correlation between 2 variables. Instead of looking at many columns in a specific row, you're looking at most of the data in a few columns. So instead of reading a whole row at once, it would be nice to skip the columns you don't care about and just grab big chunks of two or three columns.

Does that make sense?

edit: For a long time I was confused about why HBase was described as a columnar data store when access was still pretty row-based. I think the reason is because you group columns into column families which can be stored and retrieved separately, so you still get some of the benefit of a true column-oriented store.


> For a long time I was confused about why HBase was described as a columnar data store when access was still pretty row-based

The term was overloaded by column-family stores, which were often referred to as just 'column stores', probably by people who were not aware of systems like Vertica and MonetDB.

http://dbmsmusings.blogspot.co.uk/2010/03/distinguishing-two...


Column-based tables store the values of each column together. Two properties make them suitable for fast queries against a column: data locality and compressible data.

Since a column's values are stored together, more of them can be crammed into a data page. Querying a column can therefore process much more data per page load than in a row-based layout, where each data page also carries all the other columns.

The values of a column can also be stored in sorted order, which is highly compressible, cramming even more data into each data page. E.g. the column Name is stored sorted, along with the associated row ids:

    Name, (row id)
    --------------
    Joan, (101)
    Joanna, (307)
    John, (15)
    John, (32)
    John, (6)
    Johnson, (31)
    Johnson, (44)
The duplicate names can be stored as one:

    Name, (row id)
    --------------
    Joan, (101)
    Joanna, (307)
    John, (15,32,6)
    Johnson, (31,44)
Prefix encoding can further compress the sorted data:

    Name, (row id)
    --------------
    Joan, (101)
    4+na, (307)
    2+hn, (15,32,6)
    4+son, (31,44)
Now imagine the query: select count(Name) from Table. Scanning the Name column only touches 4 entries sitting right next to each other in a data page and sums the number of row ids each entry references.

Select Name from Table where Name like "John%" would do a binary search down the sorted names, load two records 2+hn and 4+son, and expand them into 5 names.
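
A minimal Python sketch of the dedup plus prefix-encoding steps above (illustrative only, not how any particular engine implements it):

    def prefix_encode(sorted_column):
        # sorted_column: list of (name, [row ids]) with duplicates already merged
        out, prev = [], ""
        for name, row_ids in sorted_column:
            shared = 0
            while shared < min(len(prev), len(name)) and prev[shared] == name[shared]:
                shared += 1
            encoded = "%d+%s" % (shared, name[shared:]) if shared else name
            out.append((encoded, row_ids))
            prev = name
        return out

    column = [("Joan", [101]), ("Joanna", [307]), ("John", [15, 32, 6]), ("Johnson", [31, 44])]
    print(prefix_encode(column))
    # [('Joan', [101]), ('4+na', [307]), ('2+hn', [15, 32, 6]), ('4+son', [31, 44])]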


This is a good high level overview from Twitter (who created Parquet): https://blog.twitter.com/2013/dremel-made-simple-with-parque...


To be fair, this isn't the best UI for git. We were waiting for the GitHub mirror to be set up, so we decided to just include the link to the Apache git repo UI to be sure the link was live when the press announcements went out.

The GitHub mirror is now live; you can see it here: https://github.com/apache/arrow


As best I can tell, you don't have a "git clone" URL visible anywhere on the site. I have the vague impression that gitweb is capable of showing one but you have to configure it(?).


We are just using the site as set up by the Apache Infrastructure team. I'll file a ticket to see if they can add the clone URL there.

Here is a page with the clone URLs: https://git.apache.org/


Does it mean that other engines could/should use that format?


Where are the API docs? This seems rather useless: https://github.com/apache/arrow


Nice initiative. Cheap serde and cross-language compatibility, with an eye toward data-scan-intensive workloads, are important components!

Have you folks considered the Supersonic engine from Google, which was designed with similar (though not as extensive as Arrow's) goals in mind?

https://github.com/google/supersonic


Do you have any experience with Supersonic? It appears to be abandoned (at least there's no activity on the mailing list as of 2014 - https://groups.google.com/forum/#!forum/supersonic-query-eng...)


In-memory only... How is this better than the SFrame implementation from Dato (2015) that was posted here a couple of days ago?

https://news.ycombinator.com/item?id=11106501


Even I had that question, especially when people are talking about using SFrame as the underlying structure for Julia. But then I saw this:

"Arrow's cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack," said Wes McKinney, creator of Pandas.

Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging Open Source projects such as Pandas and Ibis.

this is pretty much nuke-from-orbit.


> this is pretty much nuke-from-orbit

That analogy might imply overkill, thus highlighting the tactical advantages of the SFrame approach in processing a month's worth of 1-10GB daily-generated SQLite files, for instance.


Well, no. It's the same situation as git vs. Mercurial. One may be better than the other (as you mentioned), but it really doesn't matter.

If Pandas and other high profile products are endorsing it (and may adopt it), it's going to be very hard for 99% of people to choose something else.


Python's Dask out-of-core dataframe can also do that.


Dask's out-of-core dataframes are just a thin wrapper around pandas dataframes (aided by the recent improvement in pandas to release the GIL on a bunch of operations).


Uh, no they are not. They lazily scale pandas to on-disk and distributed data.

http://dask.pydata.org/en/latest/dataframe.html

"Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads."

http://blaze.pydata.org/blog/2015/09/08/reddit-comments/


Why doesn't Pandas have anything to save the entire workspace to disk (like .RData)? There are all these cool file formats like Castra, HDF5, even vanilla pickle, but I don't see anything with a one-shot save of the workspace (something like Dill).

Is this an antipattern for Pandas?


You haven't refuted anything I said. Internally the dask dataframe operations sit on top of pandas dataframes. All dask does is automatically handle the chunking into in-memory pandas dataframes and interpret dask workflows as a series of pandas operations.
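
For example, with a recent dask + pandas install (hypothetical column names):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(1000000)})
    pdf["y"] = 1.0
    ddf = dd.from_pandas(pdf, npartitions=8)

    # Each partition is an ordinary pandas DataFrame under the hood.
    print(type(ddf.get_partition(0).compute()))  # <class 'pandas.core.frame.DataFrame'>

    # The dask graph just schedules per-partition pandas operations.
    result = ddf.y.sum().compute()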


I had a hard time parsing out the technical bits behind the facade of titles and endorsements. I didn't know you could be a Vice President of an open source product. Two roles I'm familiar with: committer and PMC chair. Now VPs. Just saying...


An open source project with a succession plan? What's wrong with that?


PMC Chair = Apache VP


Good to know, thanks.


Not to completely denigrate Apache, but I have found contributing to Apache projects (not just anything under the license, but real Apache projects) to be somewhat of a hassle. They generally do not do PRs but rather patches via email or attached to bugs (perhaps some do, but I have yet to see one that does), they require signatures for contribution, and JIRA is really getting slow these days.

I'm somewhat ignorant, as I don't run any Apache projects, but I'm curious as to why people choose Apache to back their project these days. Why choose a committee instead of just leaving it on GitHub? I suppose it's the whole voting and board stuff.


Projects at Apache have varying levels of integration with Github. For projects with good integration (and there are lots: Spark, Cordova, etc.), a pull request is fine.

As for why projects choose Apache over Github, there are lots of reasons. Some of them apply to choosing any foundation (Apache, Eclipse, etc.) over Github: legal rigor, known quantity for enterprisey consumers, and so on. Probably the biggest reason many company sponsored projects end up at the ASF is because the ASF has a reputation as a good place for competing companies to collaborate on a common code base.

Personally, I have chosen to donate a great deal of my own time and energy to the ASF because I greatly treasure its emphasis on governance by individual contributors rather than corporations. (That the ASF is a 501(c)(3) non-profit rather than a 501(c)(6) like some of the more slick, consortium-like foundations is related.)


Could someone explain the difference between this and Avro or Parquet? Do they serve the same purpose?


Parquet is designed specifically to store large amounts of data efficiently on disk. As such it defaults to compressing and encoding data to save space. Arrow is designed for immediate consumption without any materialization into a different in-memory data structure. It is already in a format well suited to be used for sending over the wire or reading directly from an API.

I don't know as much about the internals of Avro, but I know it is a bit different from Parquet, in that it can be used to serialize and deserialize smaller amounts of data. It is used to store large datasets in files, although it will in most cases be less space efficient than Parquet. It has also been used as a way of embedding complex structures into other systems (similarly to how JSON can be embedded in a database), or for serializing individual structures between systems. The binary representation of Avro needs to be read into a system-specific format like a C/C++ struct/object, Java object, etc. for consumption.

In contrast, Arrow is designed to represent a list of objects/records efficiently. It is designed to allow a chunk of memory to be handed to a lightweight language-specific container that can immediately reference into that memory to grab a specific value, without reading each record into its own individual object or structure.
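
As a rough illustration of that access pattern in plain Python (a fixed-width int64 column in one buffer; this is not Arrow's actual layout or API):

    import struct

    # One contiguous buffer holding a column of little-endian int64 values.
    values = [7, 42, -3, 1000]
    buf = memoryview(struct.pack("<%dq" % len(values), *values))

    def get(column_buf, i):
        # Reference directly into the shared memory; no per-record object is built.
        return struct.unpack_from("<q", column_buf, i * 8)[0]

    print(get(buf, 2))  # -3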


Parquet is also designed to efficiently store nested data on disk (by efficient I mean it can retrieve a field at arbitrary depth without needing to walk from the root of the record)


Nice to see a new columnar data format alternative. Just a quick question though.

The existing columnar data formats such as Parquet and ORC aim to be space-efficient since the data is stored on disk and IO operations are usually the bottleneck. Columnar data formats shine in the big-data area, so the amount of data will be huge. Given that columnar data formats can be compressed efficiently, and that's one of the main points of formats such as Parquet and ORC, I'm not sure that I understand the main point of in-memory columnar data formats.

Once the data is in memory and we can access any column of a row in constant time, what's the difference between a row-oriented data format and a columnar one?


> that's one of the main points of formats such as Parquet and ORC, I'm not sure that I understand the main point of in-memory columnar data formats.

ORC has had its own in-memory columnar data-format for the last couple of years - VectorizedRowBatch.

This doesn't use any Unsafe access, but uses pure JVM arrays for layout since it allows the Java JIT to unroll a lot of loops internally.

Here's an example of a patch from Intel for unrolling the 64-bit operations into 256-bit operations and allowing the Java code to be auto-vectorized to use ymm<n> registers:

https://issues.apache.org/jira/browse/HIVE-10180

Secondly, we maintain isRepeating=true independently for each column, which means that operations for 1024 rows can sometimes be as fast as operations for 1 row.

A row-based model, in contrast, would have no way of maintaining that kind of per-column duplicate information directly.
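
A toy sketch of the isRepeating idea (conceptual Python; ORC's actual VectorizedRowBatch is Java):

    class LongColumnVector:
        def __init__(self, values, is_repeating=False):
            # When is_repeating is True, only values[0] is meaningful for the whole batch.
            self.values = values
            self.is_repeating = is_repeating

    def add_constant(col, k):
        if col.is_repeating:
            # One addition covers all 1024 rows in the batch.
            return LongColumnVector([col.values[0] + k], is_repeating=True)
        return LongColumnVector([v + k for v in col.values])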

I have looked at the ValueVectors impl in Drill and Tungsten in Spark, which are more cache-efficient than Hive's internal columnar structure. However, erasing type information into a byte[] structure prevents the JIT from auto-vectorizing inner loops.


Random memory access isn't really constant time when you factor in hardware prefetch and cache lines. See "What Every Programmer Should Know About Memory"


That, and regardless of row or column orientation, it would be nice to have a common in-memory format that is as well established across Big Data projects as this one looks like it will be.


Exactly this. The most basic example is an operation on a single column. If the data for a single column is mixed in with other data in a row-oriented layout, you have to bring new chunks of memory into the CPU cache sooner to process a given amount of data. If all of a column's data is packed together tightly, you can read many more values out of the cache before you need to go back to main memory.
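
A quick way to see the effect with numpy (illustrative; actual numbers depend on hardware, sizes, and strides):

    import numpy as np
    from timeit import timeit

    rows = np.random.rand(1000000, 20)       # row-major: each row's 20 fields are contiguous
    col = np.ascontiguousarray(rows[:, 3])   # the same field packed into one contiguous array

    print(timeit(lambda: rows[:, 3].sum(), number=100))  # strided reads waste most of each cache line
    print(timeit(lambda: col.sum(), number=100))         # contiguous reads, typically noticeably faster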


I have a similar question. I'm building a relational engine with columnar storage, but it is in-memory only. All the material I have read is about improving IO, but I wonder about in-memory storage that is performant (currently I just have an array per column).

My understanding is that inserts will suffer badly with things like compressed columns. My relational engine is used like Datasets, where it is common to iterate by rows AND do joins, unions, projections, and filters, so how do I balance the two?


Is it streamable? Could I use this as an intermediate format to send columnar data between two processes via a pipe?


Better: if the processes are on the same machine, you could use it to share the data via shared memory or a common memory mapping, avoiding copies of the data on each end of the pipe.
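
For instance, two processes on the same machine could map the same file and index into the columnar buffer in place (a generic mmap sketch, not Arrow-specific IPC):

    import mmap, struct

    # Writer: lay out a column of int64s in a file.
    with open("/tmp/column.buf", "wb") as f:
        f.write(struct.pack("<4q", 7, 42, -3, 1000))

    # Reader (possibly another process): map it and read values directly, no copy, no deserialize.
    with open("/tmp/column.buf", "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(struct.unpack_from("<q", buf, 2 * 8)[0])  # -3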


I'm interpreting this as saying they want to use the same representation in memory (for querying) and for 'serialization' (sending the same thing over the wire). That raises the question of why separate serialized representations ever became a thing in the first place.

My understanding is that serialization became a thing because in-memory representations tend to use pointers to shared data structures that may thus be referenced multiple times while being stored only once. This would not translate 1:1 to serialized representations (where memory offsets would no longer hold meaning) -- much less in any language-agnostic way.

So I have this suspicion that Apache Arrow would not support reusing duplicate data while storing it only once. Would anyone mind clarifying on this point?


"Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."

How will Arrow use vectorized instructions on the JVM? That seems to be only available to the JIT and JNI, which is a frustrating limitation.


"All systems utilize the same memory format"

Will Cassandra Java drivers support this ?




