Data Brewery (open-source data processing + OLAP in python) (databrewery.org)
74 points by thibaut_barrere on June 8, 2011 | 11 comments



I'm the author of Brewery/Cubes. Both projects are very young - started last year, in December 2010.

As for Cubes: the goal is to create a light-weight framework with pluggable backends. Currently a simple SQL backend and a MongoDB backend are implemented.

Some public projects that are using Cubes for OLAP are:

Donations for sport and culture:

http://granty.transparency.sk/en/

Public procurements of Slovakia (still under development):

http://vestnik-test.democracyfarm.org/en/report/all?cut=date...

If you are asking about performance, my answer is: I do not know yet, I haven't stressed it too much. I would very much like to hear any feedback and/or recommendations. The focus so far has been on simplicity and ease of use; performance will come later.

For brewery, here are some blog notes:

http://blog.databrewery.org/

Presentation where data brewery was used in a project:

http://slidesha.re/i9O4kC

I hope to prepare more information soon, with examples. I want Data Brewery to be more distributed, with customisable nodes (for example, you would be able to use a remote server as a processing node or as part of a processing stream).

The goal of Data Brewery is to provide a "way of working with data streams", focusing more on data analysis than on data transformation. However, that does not mean you would not be able to use it for the latter.

Anyway, I would appreciate any feedback and will gladly answer any questions. I am also looking for cooperation; if you are interested, drop me a line.

Stefan - @Stiivi on twitter, author of Data Brewery/Cubes


As always, if you know other open-source OLAP or data-processing stuff, I'd love to hear from you.


Check out Mondrian: http://mondrian.pentaho.com/ I used it a couple of years back at a startup. It's written in Java and IIRC works (only?) with MySQL. Mondrian took a bit of work to get set up, mostly due to my lack of OLAP knowledge at the time, but once set up, it was pretty nice and fast. I think it uses materialized views for the cube data. I'm actually working on a new project where OLAP would be nice. Thanks for posting Brewery, I'll definitely give it a spin. OLAP tools don't get much love, but they are very useful for certain types of problems.


I will definitely look at it soon, thanks!

One thing I wonder is how easy it would be to integrate this with a MongoDB or Redis backend.

Other useful links I found on my quest:

- https://github.com/rsim/mondrian-olap (jruby olap queries on mondrian)

- http://www.slideshare.net/rsim/multidimensional-data-analysi...

- http://www.amazon.com/Pentaho%C2%AE-Solutions-Intelligence-W...


We use Mondrian currently. It supports just about every database that is supported by JDBC (we use it with PostgreSQL). Just an FYI.


Have a look at KNIME.

http://www.knime.org/


A few questions:

  * What kind of limits and performance does this implementation have?
  * Will data be fetched from the database for each query?
  * Is it possible to have dimensions with millions of
    values and expect reasonable query times?
  * Looks like it supports advanced topologies and hierarchies.
    How will dimensions with a high cardinality affect performance?


See my post about the projects: they are very young, just a little over half a year old, and performance has not been a focus yet. I would definitely like them to handle more data more efficiently; however, the goal is more simplicity of use than the ability to process really huge amounts of data (like telco data, which is the background I come from).

Before I answer your questions (I assume that you are referring to Cubes, the OLAP framework), I think it is worth noting that Cubes has pluggable backends. Currently a simple denormalisation-based SQL backend and a MongoDB backend are implemented. I want to make them more advanced.
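To make "denormalisation-based SQL backend" a bit more concrete, here is a minimal sketch of the general idea using Python's built-in sqlite3 module. It is not Cubes' actual code or API: the table names, columns, sample rows and the `aggregate` helper are all made up for illustration. A small star schema is flattened into one wide table once, and each aggregation then becomes a plain GROUP BY over that table, optionally narrowed by a cut.

```python
import sqlite3

# Hypothetical dimension columns of the flattened table (not from Cubes).
ALLOWED_COLUMNS = {"year", "month", "category"}


def build_denormalised_table(conn):
    """Create a tiny star schema and flatten it into one wide table."""
    conn.executescript("""
        CREATE TABLE dim_date (id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_category (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_grants (date_id INTEGER, category_id INTEGER, amount REAL);

        INSERT INTO dim_date VALUES (1, 2010, 12), (2, 2011, 1), (3, 2011, 6);
        INSERT INTO dim_category VALUES (1, 'sport'), (2, 'culture');
        INSERT INTO fact_grants VALUES
            (1, 1, 100.0), (2, 1, 50.0), (2, 2, 70.0), (3, 2, 30.0);

        -- The denormalisation step: join facts with dimensions once, up front.
        CREATE TABLE flat_grants AS
        SELECT d.year AS year, d.month AS month,
               c.name AS category, f.amount AS amount
        FROM fact_grants f
        JOIN dim_date d ON d.id = f.date_id
        JOIN dim_category c ON c.id = f.category_id;
    """)


def aggregate(conn, drilldown, cut=None):
    """Sum and count facts grouped by one dimension, optionally filtered by a cut."""
    cut = cut or {}
    # Column names cannot be bound as parameters, so check them against a whitelist.
    for column in [drilldown, *cut]:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown dimension: {column}")
    sql = f"SELECT {drilldown}, SUM(amount), COUNT(*) FROM flat_grants"
    if cut:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in cut)
    sql += f" GROUP BY {drilldown} ORDER BY {drilldown}"
    return conn.execute(sql, list(cut.values())).fetchall()
```

With this sketch, `aggregate(conn, "category", cut={"year": 2011})` hits the database on every call, which matches the "fetched for each query" behaviour discussed in the answers below.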

* Will data be fetched from the database for each query?

- currently yes; however, we did some experiments with plain HTTP caching of the Cubes/Slicer server and it worked pretty nicely for our current needs
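As a rough illustration of why caching at the HTTP level can help here, the sketch below keeps responses in memory, keyed by the request URL, for a fixed time. It is only an assumption about the general technique, not a description of the setup that was actually tried: the `TTLCache` class, the `fetch` callable and the `/aggregate` path in the usage note are hypothetical.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit


class TTLCache:
    """Cache response bodies by URL for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable that really runs the query
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # key -> (expiry time, body)

    @staticmethod
    def normalise(url):
        """Sort query parameters so equivalent URLs share one cache key."""
        parts = urlsplit(url)
        query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
        return f"{parts.path}?{query}" if query else parts.path

    def get(self, url):
        key = self.normalise(url)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: skip the database entirely
        body = self._fetch(url)
        self._entries[key] = (now + self._ttl, body)
        return body
```

Two requests such as `/aggregate?cut=date:2011&drilldown=category` and `/aggregate?drilldown=category&cut=date:2011` would then share one cached result until it expires. The trade-off is staleness: results can lag behind the data by up to `ttl_seconds`.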

* Is it possible to have dimensions with millions of values and expect reasonable query times?

- not tested yet

* Looks like it supports advanced topologies and hierarchies. How will dimensions with a high cardinality affect performance?

- right, it supports hierarchies; however, same as above: not tested yet for performance

I am open to any comments/suggestions regarding the framework(s).

Stefan Urbanek, @Stiivi on Twitter (author of Cubes)


I have no idea (I just came across this link and thought I would share).

If you find out, I'd like to know too.


Is this distributed?


Not yet. I would like Brewery to be distributed, and I had that in mind while designing it. See my general post about Brewery.

@Stiivi - author of Brewery/Cubes



