Metaprogramming Python For Big Data (tuulos.github.io)
104 points by fixxer on Dec 12, 2013 | hide | past | favorite | 27 comments



Author here: I just checked this morning - we have 100B+ rows in our Deliroll matrices, hosted on a single machine.

I'm happy to answer any questions.


Btw, if you would be interested in hacking Deliroll and other related things, feel free to contact me - my email is on the slides. We are hiring.


What did you use to make these slides?



What data format are you using for your sparse matrices, and what's the INSERT performance like?


Matrices are read-only. The set of matrices is updated once a day to include the latest data.

The data format is a collection of pre-aggregated row and column vectors, encoded with variable length integers and run-length encoding. I should give a separate presentation about this.
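The slides don't spell out the exact encoding, but the two techniques named here are standard; a minimal pure-Python sketch of both (LEB128-style varints and simple run-length encoding - not Deliroll's actual code) looks like:

```python
def encode_varint(n):
    """Encode a non-negative int as LEB128-style variable-length bytes:
    7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit
        else:
            out.append(byte)
            return bytes(out)

def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, c) for v, c in runs]

# Small values stay small on disk; long runs collapse to one pair.
print(encode_varint(300))       # b'\xac\x02' - two bytes instead of a fixed 4 or 8
print(rle([7, 7, 7, 0, 0, 3]))  # [(7, 3), (0, 2), (3, 1)]
```

Pre-aggregated row/column vectors tend to be sparse and repetitive, which is exactly where these two encodings pay off.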


> I should give a separate presentation about this.

Please do. Also, any chance of releasing this?


Awesome to see tech like numba being used like this. Are you guys going to write a more 'techy' version of how you accomplish all of this?


Yes, definitely. We might be able to open-source some parts of the system eventually as well.


Excellent. For sparse storage, do you use Scipy Sparse, or you have your own custom built solution?

I know continuum was working on a version of memmapped numpy (blaze, iirc) arrays which looked really interesting.
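For reference, plain NumPy already ships memory-mapped arrays via np.memmap, which gives a taste of what a disk-backed array feels like (this is stock NumPy, not Blaze or Deliroll):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.dat")

# Create a disk-backed array; data lives in the file, not in RAM.
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000,))
arr[:] = np.arange(1000)
arr.flush()

# Reopen read-only; the OS pages data in on demand as you slice it.
ro = np.memmap(path, dtype=np.float64, mode="r", shape=(1000,))
print(ro[:5].sum())  # 10.0
```

The memmap behaves like a normal ndarray, so NumPy-consuming code (including Numba-compiled functions) can operate on files larger than memory.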


I started with Scipy Sparse. I ditched it for a custom solution so I could support variable-length integers, run-length encoding etc.


Good work Ville, let me know when you post those slides/writeup. :-)


Thanks Peter! Let me know if you happen to visit SF. It would be nice to catch up.


This is next level. I've been doing this type of stuff manually for years. I don't know why it never occurred to me to build out the general solution.


Numba is impressive! If you use NumPy regularly, check it out: http://numba.pydata.org/


This is awesome. While it's great how performant the approach is, I also really dig how elegant the whole solution is -- using Postgres FDW with Numba is very pragmatic and clean, while at the same time potentially extensible to GPGPU. I might try and give this a go for some DSP stuff at some point.


Thanks for sharing. It's interesting when people walk through their thought process.


I did something similar with Ruby. I used GCC and on-the-fly linking for JIT compilation: http://www.wedesoft.de/hornetseye-api/


It looks really neat. Isn't 660 GB not that much, really? I grant that the slides say they use an optimized binary format for storage, but how does this compare to pandas?


660GB was just a small benchmark. The real thing uses more than a petabyte of raw data.

Pandas uses NumPy internally. You could use Deliroll as a replacement for NumPy in Pandas to get a nice interactive environment for amounts of data that can't be easily handled with plain NumPy.


This is interesting. My current project is a fraud detection system. We currently leverage Cascading/Hadoop. But I wanted to make sure the system is not Hadoop-centric. So I made a point of having the system be language agnostic. It looks like there might be a fit for this tool.

I passed the slides along to my team to see what they think. If they just impress upon the team that we need to store something other than just 0x0A delimited text files, I'll consider it a win.


So is numba more like a Cython replacement? I know nothing about it, just assumed it was a NumPy replacement?


You can think of it like that. It's a Python-to-machine-code compiler based on LLVM, and it uses NumPy types (and Blaze types) to do type inference on numerical and data transformation functions.

http://numba.pydata.org/


Really impressive work; I like how you guys (apparently) began with a blank page, and set aside at least a few stale assumptions that most consider inviolate principles in DW design -- e.g., denormalization, star/snowflake schema.


Very cool. Impressive that they got that kind of efficiency with Python.


Beautiful work. Also, I read lisp on the last two slides.


Or just use FastBit-Python directly...



