Metaprogramming Python For Big Data (tuulos.github.io)
104 points by fixxer on Dec 12, 2013 | hide | past | favorite | 27 comments



Author here: I just checked this morning - we have 100B+ rows in our Deliroll matrices, hosted on a single machine.

I'm happy to answer any questions.


Btw, if you would be interested in hacking Deliroll and other related things, feel free to contact me - my email is on the slides. We are hiring.


What did you use to make these slides?



What data format are you using for your sparse matrices, and what's the INSERT performance like?


Matrices are read-only. The set of matrices is updated once a day to include the latest data.

The data format is a collection of pre-aggregated row and column vectors, encoded with variable length integers and run-length encoding. I should give a separate presentation about this.
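The slides don't spell out the exact encoding, but the two techniques named here are standard; a minimal pure-Python sketch of both (LEB128-style varints and simple run-length encoding - not Deliroll's actual code) looks like:

```python
def encode_varint(n):
    """Encode a non-negative int as LEB128-style variable-length bytes:
    7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit
        else:
            out.append(byte)
            return bytes(out)

def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, c) for v, c in runs]

# Small values stay small on disk; long runs collapse to one pair.
print(encode_varint(300))       # b'\xac\x02' - two bytes instead of a fixed 4 or 8
print(rle([7, 7, 7, 0, 0, 3]))  # [(7, 3), (0, 2), (3, 1)]
```

Pre-aggregated row/column vectors tend to be sparse and repetitive, which is exactly where these two encodings pay off.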


> I should give a separate presentation about this.

Please do. Also, any chance of releasing this?


Awesome to see tech like numba being used like this. Are you guys going to write a more 'techy' version of how you accomplish all of this?


Yes, definitely. We might be able to open-source some parts of the system eventually as well.


Excellent. For sparse storage, do you use Scipy Sparse, or you have your own custom built solution?

I know continuum was working on a version of memmapped numpy (blaze, iirc) arrays which looked really interesting.
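For reference, plain NumPy already ships memory-mapped arrays via np.memmap, which gives a taste of what a disk-backed array feels like (this is stock NumPy, not Blaze or Deliroll):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.dat")

# Create a disk-backed array; data lives in the file, not in RAM.
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000,))
arr[:] = np.arange(1000)
arr.flush()

# Reopen read-only; the OS pages data in on demand as you slice it.
ro = np.memmap(path, dtype=np.float64, mode="r", shape=(1000,))
print(ro[:5].sum())  # 10.0
```

The memmap behaves like a normal ndarray, so NumPy-consuming code (including Numba-compiled functions) can operate on files larger than memory.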


I started with Scipy Sparse. I ditched it for a custom solution so I could support variable-length integers, run-length encoding etc.


Good work Ville, let me know when you post those slides/writeup. :-)


Thanks Peter! Let me know if you happen to visit SF. It would be nice to catch up.


This is next level. I've been doing this type of stuff manually for years. I don't know why it never occurred to me to build out the general solution.


Numba is impressive! If you use NumPy regularly, check it out: http://numba.pydata.org/


This is awesome. While it's great how performant the approach is, I also really dig how elegant the whole solution is -- using Postgres FDW with Numba is very pragmatic and clean, while at the same time potentially extensible to GPGPU. I might try and give this a go for some DSP stuff at some point.


Thanks for sharing. It's interesting when people walk through their thought process.


I did something similar with Ruby. I used GCC and on-the-fly linking for JIT compilation: http://www.wedesoft.de/hornetseye-api/


It looks really neat. Isn't 660 GB not that much, really? I grant that the slides say they use an optimized binary format for storage, but how does this compare to pandas?


660GB was just a small benchmark. The real thing uses more than a petabyte of raw data.

Pandas uses NumPy internally. You could use Deliroll as a replacement for NumPy in Pandas to get a nice interactive environment for amounts of data that can't be easily handled with plain NumPy.


This is interesting. My current project is a fraud detection system. We currently leverage Cascading/Hadoop. But I wanted to make sure the system is not Hadoop-centric. So I made a point of having the system be language agnostic. It looks like there might be a fit for this tool.

I passed the slides along to my team to see what they think. If they just impress upon the team that we need to store something other than just 0x0A delimited text files, I'll consider it a win.


So is numba more like a Cython replacement? I know nothing about it, just assumed it was a NumPy replacement?


You can think of it like that. It's a Python-to-machine-code compiler based on LLVM, and it uses NumPy types (and Blaze types) to do type inference on numerical and data transformation functions.

http://numba.pydata.org/


Really impressive work; I like how you guys (apparently) began with a blank page, and set aside at least a few stale assumptions that most consider inviolate principles in DW design -- e.g., denormalization, star/snowflake schema.


Very cool. Impressive that they got that kind of efficiency with Python.


Beautiful work. Also, I read lisp on the last two slides.


Or just use FastBit-Python directly...



