Show HN: Vaex - Out of Core Dataframes for Python and Fast Visualization (medium.com/vaex)
126 points by maartenbreddels on Dec 14, 2018 | 32 comments



First of all, great to see more power tools to choose from for my DS workflow!

However, I am surprised to see no mention of Dask in the article. How do these libraries compare?


Dask and vaex are not 'competing'; they are orthogonal. Vaex could use Dask to do the computations, but when this part of vaex was built, Dask didn't exist. I recently tried using Dask instead of vaex's internal computation model, but it gave a serious performance hit.

There is some overlap with dask.dataframe, but I think it is closer to pandas than vaex is. Vaex has a strong focus on large datasets, statistics on N-d grids, and visualization. For instance, calculating a 2d histogram for a billion rows can be done in < 1 second, which can be used for visualization or exploration. The expression system is really nice: it allows you to store the computations themselves, calculate gradients, do Just-In-Time compilation, and it will be the backbone for our automatic pipelines for machine learning. So vaex feels like pandas for the basics, but adds new ideas that are useful for really large datasets.
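To give a feel for the API described above, here is a rough sketch (the file and column names are made up, and the argument names are from the vaex docs as I recall them, so treat the exact signatures as approximate):

    import numpy as np
    import vaex

    df = vaex.open('big_catalogue.hdf5')    # memory-mapped, nothing is loaded yet
    df['r'] = np.sqrt(df.x**2 + df.y**2)    # virtual column: stored as an expression, not materialized

    # binned statistics on a 2d grid, the building block for fast visualization
    counts = df.count(binby=[df.x, df.y], shape=(256, 256), limits='99.7%')
    mean_r = df.mean(df.r, binby=[df.x, df.y], shape=(256, 256), limits='99.7%')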


How could I have missed that you're the author. Thanks for your extensive answer, I will definitely try the library! And thanks again for ipyvolume, it has been very useful so far.


thanks!


Such phenomenal work.

BTW, for anyone on a Windows machine, getting this to work is very trivial.

There is a Unix-only library for locking files (fcntl) which prevents it from working on Windows. I mocked it on the path and made a function that returns 0 to test it.

Obviously adding a check for the OS and switching to a cross-platform file locker would be a great contribution. I'll see if I can make that happen in the next week.
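Something along these lines is what I have in mind (a sketch only; the function name is a placeholder, not vaex's actual internals):

    import sys

    if sys.platform != 'win32':
        import fcntl

        def lock_file(f):
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    else:
        def lock_file(f):
            # no-op stand-in on Windows; a cross-platform locker
            # (e.g. the portalocker package) would be the real fix
            return 0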


There is an issue open for this: https://github.com/vaexio/vaex/issues/93 It should have been fixed already; a more detailed report (installed version numbers) would be good to have.


Oh, and thanks for the kind words!


It looks quite nice, and I will have to explore the performance comparisons with Dask more.

I have recently started using Xarray for some projects, and really appreciate the usability of multidimensional labelled data. Are the memory mapping techniques used for speedup here only applicable to tabular data?

The support for Apache Arrow is quite nice. Have you considered any other formats, such as Zarr?


Thank you. Memory mapping could be used for other data as well, and I have looked into zarr (I even opened an issue for that: https://github.com/zarr-developers/zarr/issues ). Memory mapping of contiguous data makes life much easier (for the application as well as the OS); chunked data could be supported, but it is more bookkeeping.
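To illustrate why contiguous data is the easy case: a contiguous column on disk can be exposed as an ordinary numpy array with a single mmap, while chunked (zarr-style) storage needs per-chunk bookkeeping. A minimal sketch with plain numpy (not vaex's actual internals, file name made up):

    import numpy as np

    # write one contiguous column to disk
    np.arange(1_000_000, dtype=np.float64).tofile('column_x.bin')

    # memory-map it: no data is read until elements are touched,
    # and the OS page cache is shared between processes
    x = np.memmap('column_x.bin', dtype=np.float64, mode='r')
    print(x.mean())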


I'll need to have a closer look later, but would vaex fit in with somewhat indexed mapped files?

E.g. parquet supports column indexes now: https://issues.apache.org/jira/browse/PARQUET-1201


Vaex uses HDF5, which is a great file format, well suited for big tables of numbers. It is good for similar reasons as SQLite3, but for different applications: it is not a relational database, and columns are more strongly typed. It is better suited when you have hundreds or thousands of columns, and worse when you're trying to query a particular row.
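For example, with a dataset-per-column layout (an illustration using h5py, not vaex's exact schema), reading one column of a wide table is a single contiguous read, while reconstructing a single row costs one read per column:

    import h5py
    import numpy as np

    with h5py.File('table.hdf5', 'w') as f:                   # illustrative file name
        f.create_dataset('columns/x', data=np.random.rand(1_000_000))
        f.create_dataset('columns/y', data=np.random.rand(1_000_000))

    with h5py.File('table.hdf5', 'r') as f:
        x = f['columns/x'][:]                                 # cheap: one column, one read
        row = [f['columns/' + c][42] for c in ('x', 'y')]     # a single "row" touches every column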


Very interesting! I will share it with my DS friends.

One thing I have struggled to optimize is visualization and coordinate calculation for network graphs with tens of millions of edges and nodes, using NetworkX and most visualization tools. Have you looked into this use case for Vaex? Reading your article, it sounds like it would be well suited for it.


The bigger question is what you want to achieve by visualizing so many nodes. If you want a map that can be zoomed in on to view individual nodes, you mainly need to compute coordinates for every node. Finding the arrangement of the nodes is probably what gets you into trouble, so you probably need a custom layout algorithm which scales better (and probably produces worse layouts).

More interesting may be to identify clusters and either group them together or visualize these clusters as nodes themselves.


I have not looked into it; maybe datashader can do this, which is a package focusing purely on visualization, while vaex is more all-round (although there is overlap). If you think vaex can be useful here, feel free to ask questions/open issues: https://github.com/vaexio/vaex


Gephi?


Great to see that you're supporting Apache Arrow! That makes it so much easier to gradually switch over.


Note: Vaex has its own memory model. If you input Arrow, it converts to the Vaex data representation. Details here:

https://github.com/vaexio/vaex/blob/master/packages/vaex-arr...

One of the primary objectives of Apache Arrow is to have a common data representation for computational systems, and avoid serialization / conversions altogether.


That is not correct; I just refer to the buffers/memory, there is zero copying going on. Vaex is not really opinionated about the memory model, actually. The only exception is the bitmasks, which are being copied for now because of an incompatibility with numpy. But if I get a 50GB Arrow dataset, vaex leaves the structure intact. Thanks for your work on Arrow, I hope to support and contribute more to it in the future.
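In other words, the Arrow data buffers can be wrapped as numpy arrays without copying; roughly like this (my own sketch, not the actual vaex-arrow code):

    import numpy as np
    import pyarrow as pa

    arr = pa.array(np.arange(5, dtype=np.float64))
    validity, data = arr.buffers()            # primitive array: [validity bitmap (None if no nulls), values]
    values = np.frombuffer(data, dtype=np.float64)[:len(arr)]  # zero-copy view onto the Arrow buffer
    print(values)                             # [0. 1. 2. 3. 4.]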


I'm looking at the code I linked, and you are serializing in the general case, it is not zero copy. Unpacking a bitmap is not free.


The 'convert' name is perhaps misleading; maybe we can agree the proof is in the execution time: https://youtu.be/TlTcQJPUL3M?t=478 Anyway, let us celebrate a wider adoption of Arrow! :)


Nice work. This looks like it could add a lot of value to a DS's toolbox.

Exploratory data analysis of large (but not huge) datasets has always been a slow and frustrating experience.

In the enterprise, we have plenty of datasets that are 100s of millions to a few billion rows (and many columns), so big enough to make conventional tools sluggish but not quite big enough for distributed computing. It sounds like vaex can help with EDA of these types of datasets on a single machine. I'd be interested in exploring the out-of-core functionality, which I hope means it will continue chugging along without throwing "out of memory" errors.


That is exactly the sweet spot for vaex, and with a familiar DataFrame API (read: pandas-like) the transition does not hurt so much. It may sound cool to set up a cluster, but in many cases it is overkill, and vaex can get these kinds of jobs done.


> For example, it takes about a second to calculate the mean of a column in regular bins even when the dataset contains a billion rows (yes, 1 billion rows per second!).

A billion 32-bit floating point numbers is 4 gigabytes. How can that be processed in one second unless there was some preprocessing?


Desktop PCs have about 35 GB/s of memory bandwidth and can do compute at ~200 Gflops, so this is just ~10% of peak bandwidth and leaves you a budget of ~200 flops of computation per float value. If all 4 columns are accessed, there is still enough bandwidth (no idea whether the data here was in a columnar layout or not).
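Spelled out (using the rough figures above, not measurements):

    rows_per_s    = 1e9      # claimed: a billion rows per second
    bytes_per_row = 4        # one float32 column
    peak_bw       = 35e9     # ~35 GB/s desktop memory bandwidth
    peak_flops    = 200e9    # ~200 Gflops

    print(rows_per_s * bytes_per_row / peak_bw)   # ~0.11 -> roughly 10% of peak bandwidth
    print(peak_flops / rows_per_s)                # ~200 flops of budget per value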

The relevance to big data or out-of-core computation is left hazy; wouldn't that make this I/O bound in most cases? 4 GB fits easily in memory and is just mmap'ed from the OS disk cache if the data was recently touched. I guess with 4 columns you get to 16 GB, which might be pushing it on a laptop.


You are right, I'm actually underselling it. 1 second is the typical performance for doing a 2d histogram (or other binned statistics) since it involves writing to memory as well.

I just ran a quick benchmark:

    In [7]: %timeit -r3 -n3 df.mean(df.ra)
    330 ms +- 5.46 ms per loop (mean +- std. dev. of 3 runs, 3 loops each)

    In [11]: f'{len(df):,}'
    Out[11]: '1,692,919,135'

    In [12]: 330/len(df)*1e9
    Out[12]: 194.92957057278463

so it is ~0.2 seconds for 1.7 billion rows, which is:

    In [15]: (len(df)*8/1024**3)/0.2
    Out[15]: 63.066152296960354

63 GB/s (this is a high-end machine; on my laptop I get ~12 GB/s).

We do not use float32 much in science, since you really need to know what you are doing not to screw up the precision. It does give some extra performance boost (not much though), and it also saves you memory and cache.


Is this cold data? Or already in RAM? What about a billion rows that are not in RAM yet?

How does it compare to plain numpy or pandas?


My thought was about first access, or out-of-memory sizes. That would always be bound by I/O, which makes it kind of a meaningless statistic.

Don't get me wrong, this seems like a project I will use, but that marketing speak is weird.


Don't expect this performance on a 4GB machine. Most machines now have 16GB or more. Let us assume you have 32GB and take a 24GB dataset. Most libraries load this into memory (allocating 24GB leaves at most 8GB for the OS, including the disk cache). The next process that wants to do the same cannot, without getting a memory error. Also, when you restart your program, the OS will not have the data in its cache, so it will only be as fast as your hard drive.

Vaex is much smarter with memory: it memory-maps the data, nothing is allocated, and all the memory is left to the OS for the disk cache. This means you can have 10 users regularly restarting/running their programs without any (or with minimal) hard drive activity. So I think it is a fair statement, and actually a very conservative one (see my other comments). An important message is to make people aware that working with a dataset this large is a pip install away; no need to spin up a cluster (yet).
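Concretely, opening such a file allocates essentially nothing (the file and column names here are made up):

    import vaex

    df = vaex.open('taxi_24gb.hdf5')     # memory-mapped: the 24GB stays on disk / in the OS page cache
    print(len(df))                       # instant, no data has been read yet
    print(df.mean(df.trip_distance))     # streams through only the column actually used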


This is big news.

I've used similar proprietary libraries before, and virtual operations can be really powerful.


Thank you, yes they give much more flexibility: optimization (JIT), derivatives, checking your calculations afterwards, sending them to a remote server etc. Glad you like that :)


Does it have Python 3 support? I tried installing it in a Python 3.7 environment and it failed.

EDIT: I then tried a Python 3.6 environment and it worked. I guess that answers my question.


Absolutely, I think nowadays the question should be: 'does it still support Python 2?' (it does, btw)

My question to you is: would you be so kind as to open an issue describing the failure at https://github.com/vaexio/vaex/issues ? Please share which OS, which Python distribution (Anaconda maybe), and/or the installation steps and error message.



