Matrices are read-only. The set of matrices is updated once a day to include the latest data.
The data format is a collection of pre-aggregated row and column vectors, encoded with variable-length integers and run-length encoding. I should give a separate presentation about this.
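The slides don't spell out the byte layout, so the following is just a sketch of the two techniques with hypothetical helper names: LEB128-style varints for the integers, and (value, run-length) pairs for the RLE.

    def encode_varint(n):
        # LEB128-style varint: 7 data bits per byte, high bit set
        # on every byte except the last (assumes n >= 0).
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def run_length_encode(values):
        # Collapse consecutive repeats into (value, run-length)
        # pairs, each stored as a varint.
        out = bytearray()
        i = 0
        while i < len(values):
            j = i
            while j < len(values) and values[j] == values[i]:
                j += 1
            out += encode_varint(values[i]) + encode_varint(j - i)
            i = j
        return bytes(out)

    # Pre-aggregated vectors tend to have long runs, which is
    # where this pays off:
    print(run_length_encode([0, 0, 0, 0, 5, 5, 9]).hex())  # 000405020901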
This is awesome. While it's great how performant the approach is, I also really dig how elegant the whole solution is -- using Postgres FDW with Numba is very pragmatic and clean, while at the same time potentially extensible to GPGPU. I might give this a go for some DSP stuff at some point.
It looks really neat. 660 GB isn't really that much, though, is it? I grant that the slides say they used an optimized binary format for storage, but how does this compare to pandas?
660 GB was just a small benchmark. The real thing uses more than a petabyte of raw data.
Pandas uses NumPy internally. You could use Deliroll as a replacement for NumPy in Pandas to get a nice interactive environment for amounts of data that can't be easily handled with plain NumPy.
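For context, pandas columns are thin wrappers around NumPy arrays, which is what makes that kind of substitution plausible. A quick illustration with plain NumPy (Deliroll's actual API isn't shown in the slides):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    arr = df["x"].to_numpy()        # the column's data as an ndarray
    print(type(arr), arr.dtype)     # <class 'numpy.ndarray'> float64
    print(np.log1p(df["x"]))        # NumPy ufuncs apply directly to a Series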
This is interesting. My current project is a fraud-detection system that currently leverages Cascading/Hadoop, but I wanted to make sure the system isn't Hadoop-centric, so I made a point of keeping it language-agnostic. It looks like there might be a fit for this tool.
I passed the slides along to my team to see what they think. If the slides do nothing more than impress upon the team that we need to store something other than 0x0A-delimited text files, I'll consider it a win.
You can think of it like that. It's a Python-to-machine-code compiler based on LLVM, and it uses NumPy types (and Blaze types) to do type inference on numerical and data-transformation functions.
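A minimal sketch with Numba's public API -- illustrative only, not Deliroll's actual code -- showing the compile-on-first-call behavior and type inference from a NumPy argument:

    import numpy as np
    from numba import njit

    @njit  # compiled to machine code via LLVM on first call
    def sumsq(xs):
        # The element type is inferred from the argument: a float64
        # array specializes this loop for float64.
        total = 0.0
        for x in xs:
            total += x * x
        return total

    xs = np.arange(1_000_000, dtype=np.float64)
    print(sumsq(xs))         # first call triggers compilation, then runs natively
    print(sumsq.signatures)  # the specializations compiled so far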
Really impressive work; I like how you guys (apparently) began with a blank page and set aside at least a few stale assumptions that most consider inviolate principles in DW design -- e.g., denormalization and star/snowflake schemas.
I'm happy to answer any questions.