Bonobo – A data processing toolkit for Python 3.5+ (bonobo-project.org)
227 points by rdorgueil on April 22, 2017 | 77 comments



I'm annoyed that I bothered to read the tutorial for this. The TLDR: "Write some generators or functions, put them in a list, and Bonobo will call them all for you in order. Look at the example files for more." The example files are all basic string transformations. The docs are mostly blank pages and missing sections. What little is written has more jokes and conversational tics than information.

What does this even do? There's mention of DAGs and different execution strategies if you really dig through the docs, but is that it? If so, why would you use this instead of joblib or some other established parallelism lib?


Bonobo runs each function in the pipeline in parallel and makes the FIFO queue plumbing and thread pool management completely transparent.

The TLDR would then be: "Write some generators or functions, link them in a graph, and they get called in order on each line of data as soon as the previous transformation node's output is ready." For example, if you have a database cursor that yields each line of a query as its output, the next step(s) in the graph start running as soon as the first result is ready (without the cursor having to stop yielding while the graph finishes the current row). I did not find this easy to do with the libraries I tried.
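
A minimal sketch of what that looks like with the current API (node signatures may still evolve):

    import bonobo

    def extract():
        # stand-in for a database cursor: yields one row at a time
        for i in range(10):
            yield i

    def transform(value):
        # starts running in its own thread as soon as extract() yields
        yield value * 2

    def load(value):
        print(value)

    graph = bonobo.Graph(extract, transform, load)

    if __name__ == '__main__':
        bonobo.run(graph)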

The docs are clearly incomplete, to say the least, and need an example with a big dataset, one with long-running individual operations, and one with a non-linear graph, so it's more obvious that, of course, it's not made to uppercase strings twice in a row.

Stay tuned. I'm very happy HN brought it to the homepage; I did not really think it could happen at this stage, though, and I understand your reaction. But it's a good thing for moving the project forward.


This is really cool!

Python is my usual language of choice, but recently I picked up Go for some data processing because there were a lot of benefits to parallelising the task, which Go made easy.


Yeah, why does anyone need something to run some functions in order for them? I can do that, thanks. If it ran them on some... say 'big data' platform, that would be something. As is, this does not deserve to be on the front page. This is vaporware.


Yeah, there seems to be a lot of marketing, but I found a concise definition on the author's personal website: "extract transform load for python 3.5+". It could be noted that some of the earlier commit messages include "more marketing".


They mentioned it is in ALPHA; give it some time.


Pretty snappy website for a project that's in alpha.


I haven't tried this yet, but am praying that it delivers even half of what it promises. For whatever reason I just can't get my head around pandas, despite multiple attempts.

If this also turns out to be inscrutable I may be forced to conclude that I'm stupid...


>For whatever reason I just can't get my head around pandas, despite multiple attempts.

You need to work with pandas consistently for a month or two, and then it'll all click.

pandas is not complex, nor deep. It is, however, very broad. Most of the time it is "Here's what I need to do. I'm sure there's an API or two in pandas that will let me do this," and then you spend an hour or so looking at the documentation to find those APIs.

My first month or two was: "I need to do this. Let me Google". Pretty much every time someone had asked that same question on SO.

If you stick to it for 2 months, you'll eventually "learn" all the routine tasks and Googling stuff becomes only occasional.

And it does help if you're familiar with NumPy.


Tried this video series? The guy explains it really nicely and is really bright.

https://pythonprogramming.net/search/?q=pandas


Thanks for this! It's funny in a way: I'm trying to learn the basics, but don't have a clear idea of what the basics actually are. This looks like it could be just the ticket. Cheers!


In case you're looking for more, this tutorial series hit the front page of HN a few years ago: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-stru...

Modern pandas is a bit more idiomatic now though: https://tomaugspurger.github.io/modern-1.html


Are you trying to learn pandas just to learn pandas or do you have a motivating example?


What have you wanted to use it for?

Pandas is basically an R data frame for Python. A sloppy description of that is a text mode spreadsheet.

The description of Bonobo doesn't immediately invite the comparison to Pandas, to me anyway.


You're very right, as I'm using both pandas and bonobo for different reasons.

Mostly, when I want a quasi-mathematical look over a dataset, pandas is my tool of choice. For all those data pipeline things that reasonably fit on one computer, I do use bonobo.


I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...


No, I don't have real-life public code available. I'm gonna see what I can extract from an old commercial project for publication, but I can't guarantee anything.


Luigi is a simple tool from Spotify that seems to solve a similar data workflow problem (with DAGs) to Bonobo. Airflow, from Airbnb, is a more complex tool, and I understand that Spotify has lately moved from Luigi to Airflow.

Can anybody comment how Bonobo compares to Luigi?


Usually just simple data analysis, really nothing far outside of the 'statistics' lib. Currently it's more the exploration and discovery part of the exercise I'm struggling with. I've got a few hundred thousand csv files representing various aspects of Australia's national energy market (e.g. outcomes of 5 minute supply auctions). I'm trying to make my way through that, figure out what's relevant and wrangle the relevant stuff in some organised fashion.

Is pandas the wrong kind of tool for this type of thing? Going off what rdorgueil has said, I'm beginning to suspect so. Is there a data-wrangling 'gold standard' library for python?


I'm just learning pandas as well, but I think it is the right tool for the job. I am using django-pandas so I can do easy ORM stuff. If I were to sketch out your use case:

Create a model class like:

    from django.db import models

    class AuctionResult(models.Model):
        timestamp = models.DateTimeField()
        value = models.FloatField()

Then you'd query it:

    qs = AuctionResult.objects.all()

then load it into a pandas dataframe:

    from django_pandas.io import read_frame

    df = read_frame(qs)

After that you can do all sorts of fun stuff, I imagine.


I don't see why pandas wouldn't work for your case. It sounds like most, if not all, of the CSVs contain the same columns and types of data. You could easily create a pandas dataframe that combines them all, then use any plotting library like matplotlib and/or seaborn to plot. If you need help, provide some examples of the CSVs you are trying to parse.
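
For instance, assuming the files really do share a schema (paths and columns here are made up):

    import glob

    import pandas as pd

    # lazily read every CSV in the directory and stack them into one frame
    frames = (pd.read_csv(path) for path in glob.glob('data/*.csv'))
    df = pd.concat(frames, ignore_index=True)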


Pandas is definitely the most popular (and imo best) data wrangling library for Python.


You should also have a look at this book (http://www.goodreads.com/book/show/14744694-python-for-data-...); it helps that the author of the pandas library is also the author of the book. From the description of your use case, you seem to be doing exploratory data analysis, and pandas can definitely handle that.


Wait a little before buying it. There is a new edition in the oven.


McKinney's book is good. Unfortunately there have been several reasonably important breaking API changes since it was published, so it's now to be taken with some salt.


You're not alone! I think pandas made some design decisions around their transformation functions that make it a lot more cumbersome to use than R's dplyr. It's not obvious from the documentation, though.

As an example from the pandas docs [1], in dplyr you can do

> gdf <- group_by(df, col1)

> summarise(gdf, avg=mean(col1))

In pandas this is similar to

> df.groupby('col1').agg({'col1': 'mean'})

But dplyr's summarise is much more flexible than agg, as you can do all kinds of things to any number of columns, e.g.

> summarise(gdf, some_name = f1(col1) + f2(col2))

But in pandas, agg only lets you apply one function to one column at a time.

[1] http://pandas.pydata.org/pandas-docs/stable/comparison_with_...


Swings and roundabouts. I'm a big fan of dplyr, and R definitely does some things better than pandas, but I've never found anything as flexible as pd.pivot_table for cross tabulations. For instance, the lack of multi-indexing in R is a big drawback.


you can supply a map, i.e.:

    import numpy as np

    gdf = df.groupby('col1').agg({'col2': np.mean,
                                  'col3': np.std,
                                  'col4': lambda x: np.mean(x) / np.std(x)})

once you've got your aggregated dataframe, go nuts:

    gdf['some_name'] = gdf['col2'].apply(f1) + gdf['col3'].apply(f2)


The summarise example I gave creates a single new column (some_name) that is a function of two columns from the grouped data frame. Passing a map to agg just creates multiple columns, each a function of at most a single column in the dataframe.

(I should have used a better example, like summarise(gdf, some_col = f(cola / colb)).)


Totally a preference thing. I strongly prefer pandas to dplyr having worked with both.


You can supply a map to agg, something like {'col1': 'sum', 'col2': 'mean', ...}


Yes, but it is still applying that function to a single column (the summarise example I gave could aggregate multiple columns into a single result, e.g. sum(col1 / col2))
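
The closest pandas equivalent I know of is a full apply over each group, which rather proves the 'cumbersome' point (toy frame for illustration):

    import pandas as pd

    df = pd.DataFrame({'key':  ['a', 'a', 'b'],
                       'col1': [1.0, 2.0, 3.0],
                       'col2': [2.0, 4.0, 6.0]})

    # dplyr: summarise(group_by(df, key), some_col = sum(col1 / col2))
    result = df.groupby('key').apply(lambda g: (g['col1'] / g['col2']).sum())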


I enjoyed this free Edx course:

https://courses.edx.org/courses/course-v1:Microsoft+DAT208x+...

There are some practical exercises that you do in your browser that really help you get the grasp of it.

Don't miss pandas, it's really cool!


Don't feel bad. Pandas is powerful but it has (at least when I used it) some truly abominable documentation.


Agreed, the documentation could be a lot better. I wonder about a visual approach to using DataFrames and Series, with all the different methods, to demonstrate more clearly what's being done.


I really liked this visual explanation of pivot and reshape:

http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pi...

(I'm not affiliated with the site)


You mean Excel?


And some annoying interface decisions, like d['x'] is a column but d[1:3] is a row slice.
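
For example:

    import pandas as pd

    d = pd.DataFrame({'x': range(5)})

    d['x']   # the column named 'x'
    d[1:3]   # rows 1 and 2, a positional slice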


pandas does a lot. This does nothing. That's the difference.


I'm trying to figure out if this is "all hat, no cattle". There seems to be a lot of "framework" here, without much core functionality.

Stated more precisely: if I'm stitching together things that process data and 'yield' results, why can't I just do that in pure Python? What does this framework add?


Short answer : parralel execution.


Multiprocessing or multithreading? Why don't you market it as parallel coroutine processing? That would get me interested, because there are dozens of frameworks with overgeneral descriptions.


Today, as a default, multithreading. But that's an implementation detail. Actually, Bonobo does not support coroutines (as in asyncio coroutines), so it would be a lie to market it this way. The plan, though, is to allow coroutines/futures in the future, for specific cases (like long-running/blocking operations where keeping output order tied to input order is of no importance). Still, there is a lot on the roadmap before this becomes a priority.

I note that I still have a lot of work to do explaining in simple terms what Bonobo actually is, without falling into the trap of "overgeneral description".


"parallel" :)


Remember, the double l's make parallel lines.


The docs say nothing about this. If you've implemented pmap, maybe that is useful. But the framework itself doesn't seem to do anything.


It'd be good to see some comparisons: why this and not one of the other currently available systems? Why should I use it over, for example, Luigi?

What scale is this intended for?

Is it intended to neatly solve a simple problem over my 20TB of data on S3? Big complex graphs? Or more for transitioning a small local report system that's currently in three Excel files into a tested Python script?


It's indeed intended for «small data», as opposed to «big data». I know that does not say much, but I basically wanted to handle small flows of data without having to install the "big guns".

I'm preparing explanation pages for a lot of the questions I got, including comparisons, volumes of data, where it is good and where it is not ...

All of that will be ready well before 1.0, but for now, we're at 0.2 ...

Thanks for all the hackerlove, though!


With the ancestor of Bonobo, I was processing 5M lines of data in around 1 hour, including extraction, joins, API calls and a few loads. That should give a first idea of the target size.


From looking at their examples and interfaces, it's clearly for simple, small scale processing.


All those references to monkeys hurt my head. Bonobos are not monkeys. If they wanted to name it after monkeys, they should've called it Capuchin or something.


Noted, sorry for that. I'll get more info about bonobos.


The picture looks more like a gorilla than a bonobo, too.


Came here to post just that. It's called 'Bonobo', there's a picture of a gorilla, and the page keeps saying 'monkey'- as petty as it sounds, you're probably losing potential users to zoological nerdrage.


Yes, Hacker News and Twitter brutally told me I should take animal kingdom classes ASAP ...

This being said, if any of you have a good picture of bonobos that I can use instead of the current one, I'd be really glad to replace it! It needs to be released under a free license, though.

Thanks HN


Currently realizing that we only have one word in French for both apes and monkeys ...


Oh... now it makes more sense. Didn't mean to sound harsh. Thanks for sharing your framework! :)


It didn't sound harsh at all. I'm really laughing a lot right now about how ignorant I am about apes and monkeys. ^^


I came here to say that! The bonobo is an indigenous ape of the left bank of the Congo river, in the Congo rainforest in the DR Congo. They look indistinguishable from chimpanzees to the untrained eye.

Gorillas are a whole different genus, with at least 4 subspecies, none of which look like chimps or bonobos.


They have some behavioral features that tell them apart from chimpanzees. Apparently they use sex as a greeting. It is (sort of) anthropomorphized in a hilarious way in the Will Self novel 'Great Apes'. However, that may be colouring my memory of how common this is in the real creatures...


Monkey just isn't a good term to fight over. There are plenty of respectable people arguing that monkey means simian, not just non-hominoid simian (as you take it). Either way, fighting over where the taxonomical set ends is likely a waste of breath.

But yeah, a CGI gorilla for a site called bonobo. Le sigh.


Interesting. Right now we use petl [1] for what we used to do with SSIS; Bonobo for some reason reminds me of the Bubbles library.

[1]: https://petl.readthedocs.io/en/latest/


+1 for petl, I'm using it right now on a project that deals with a lot of tabular data and it's been a huge time saver.


Ah, interesting. The example on the 'on-boarding' page reminds me of what we used to do at work.

We used itertools chains to write producers and consumers, creating 'Chain' objects that process data much like bonobo.Graph.
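
Roughly this pattern (a from-memory sketch, not our actual code):

    from itertools import chain

    def producer_a():
        yield from range(5)

    def producer_b():
        yield from range(5, 10)

    def transform(stream):
        for item in stream:
            yield item * 2

    def consume(stream):
        for item in stream:
            print(item)

    # stitch producers and transforms into one lazy pipeline
    consume(transform(chain(producer_a(), producer_b())))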

Can't wait to try this.


Sweet! Generator-based utilities for ETL. I think this is really a good use of generators and coroutines. Reminds me of the Stackless-based https://bitbucket.org/diji/pypes/src (backing video http://pyvideo.org/pycon-us-2011/pycon-2011--large-scale-dat... )


Before there was pandas, I wrote a website (using Django) to transform grid data into a denormalized CSV file; in other words, a reverse pivot. Basically it converts multiple header rows and header columns into separate fields for each intersecting value.

I've written this basic routine several times over in my career (once in Access VBA!) for different reasons. The current version of it is used to convert a store/item/quantity grid into per-store pick/pack slips.

Pandas has a built-in function that can de-pivot a table. I'm not sure it can handle my use case, however, with multiple header rows. Mine also has extra goodies like populating blank row or column values with the previous value in the row or column, among other bizarre features written to grapple with the inconsistent ways our clients make their distro spreadsheets. Trying to break them of their reliance on Excel for this type of planning has proven futile.
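
That built-in is pd.melt, if memory serves. A toy example with made-up columns:

    import pandas as pd

    grid = pd.DataFrame({'store':   ['A', 'B'],
                         'widgets': [3, 0],
                         'gadgets': [1, 2]})

    # de-pivot: one row per store/item/quantity intersection
    long = pd.melt(grid, id_vars='store',
                   var_name='item', value_name='quantity')

Filling blanks with the previous value would be a separate ffill pass, I'd guess.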

I'll have to spend some time with Bonobo and Pandas before I take on refactoring our grid tool. It needs a refactor mostly because I'm the only one who understands it. The new data munging libraries would surely simplify some very gnarly logic, and make it accessible to other developers should I get hit by a bus or leave the company.


Hmm. So what would be the advantage over Dask, which lets me scale out over a cluster?


This name is already taken.

https://wiki.gnome.org/Attic/Bonobo


There's a syntax error in mutate_my_dict_like_crazy at http://docs.bonobo-project.org/en/0.2/guide/purity.html.

Seems the documentation is still quite WIP.


So how is this different from Airflow, other than Windows compatibility and a lack of dashboard?


That's my question too. I've come to heavily rely on Airflow. As an Apache project now it's becoming mature.

From what I can tell browsing the site, Bonobo looks like it's designed to do data processing within the framework. Airflow insists that it's really a task coordinator/scheduler...however, tasks can be Python function calls. So it seems like Bonobo is a specific use case, where Airflow is the more general case (tasks can be SQL queries, bash commands, etc).


So, what is the advantage of using this over existing workflow management systems, such as Airflow, Azkaban and Luigi?


Seems like sklearn pipelines with a more generalized use case, plus additional helpful features for ETL. Very interested.
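
For contrast, a toy sklearn pipeline; sklearn fits estimators over a whole dataset at once, where Bonobo streams rows through a graph of callables:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # a chain of estimators applied dataset-wide, not row by row
    pipe = Pipeline([('scale', StandardScaler()),
                     ('clf', LogisticRegression())])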


Who is behind this?


Me (as an individual), and a few great people who helped me along the way. Not commercially endorsed or supported.


How does this compare to Dask, Luigi or Airflow?


As soon as I can, I'll add comparison pages to the documentation, trying to keep them as objective as possible. I can't seriously answer this question in depth here, but it is planned, so experts in the other systems can jump in and complement/correct my understanding of each one. I've used a bunch of them, but I'm by no means an expert user of each, so making it collaborative sounds like a better idea than just giving my point of view.



