I'm annoyed that I bothered to read the tutorial for this. The TLDR: "Write some generators or functions, put them in a list, and Bonobo will call them all for you in order. Look at the example files for more." The example files are all basic string transformations. The docs are mostly blank pages and missing sections. What little is written has more jokes and conversational tics than information.
What does this even do? There's mention of DAGs and different execution strategies if you really dig through the docs, but is that it? If so, why would you use this instead of joblib or some other established parallelism lib?
Bonobo runs each function in the pipeline in parallel and makes the FIFO-queue plumbing and thread-pool management completely transparent.
The TLDR would then be: "Write some generators or functions, link them in a graph, and each one is called on each line of data as soon as the previous transformation node's output is ready." For example, if you have a database cursor that yields each row of a query, the next step(s) in the graph start running as soon as the first result is ready (while the cursor keeps yielding from the database until the graph is done). I did not find that easy to do with the libraries I tried.
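To make the streaming behaviour concrete, here is a minimal pure-stdlib sketch of the idea (this is not Bonobo's actual internals, just an illustration): each stage runs in its own thread and consumes a FIFO queue, so downstream stages start working as soon as the first row arrives.

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream on each queue

def run_pipeline(source, *stages):
    """Chain stages with FIFO queues; each stage runs in its own thread,
    so downstream stages start as soon as the first row is ready."""
    results = []
    queues = [queue.Queue() for _ in stages]

    def feed():
        for row in source():
            queues[0].put(row)
        queues[0].put(SENTINEL)

    def worker(i, fn):
        while True:
            row = queues[i].get()
            if row is SENTINEL:
                if i + 1 < len(queues):
                    queues[i + 1].put(SENTINEL)
                break
            out = fn(row)
            if i + 1 < len(queues):
                queues[i + 1].put(out)
            else:
                results.append(out)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=worker, args=(i, fn))
                for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Rows flow through both stages as they are produced:
rows = run_pipeline(lambda: iter(range(3)),
                    lambda x: x * 10,
                    lambda x: x + 1)
```

With one thread per stage and FIFO queues, output order stays tied to input order, which matches the behaviour described above.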
The docs clearly lack completeness, to say the least, and would need an example with a big dataset, one with long individual operations, and one with a non-linear graph, so it's more obvious that, of course, it's not made for uppercasing strings twice in a row.
Stay tuned. I'm very happy HN brought it to the homepage; I did not really think that could happen at this stage, and I understand your frustration. But it's a good thing for the project to move forward.
Python is my usual language of choice, but recently I picked up Go for some data processing because there were a lot of benefits to parallelising the task - which Go made easy.
Yeah, why does anyone need something to run some functions in order for them? I can do that, thanks. If it ran them on some... say 'big data' platform, that would be something. As is, this does not deserve to be front page. This is vaporware.
Yeah there seems to be a lot of marketing but I found a concise definition on the author's personal website: "extract transform load for python 3.5+". It could be noted that some of the earlier commit messages include "more marketing".
I haven't tried this yet, but am praying that it delivers even half of what it promises. For whatever reason I just can't get my head around pandas, despite multiple attempts.
If this also turns out to be inscrutable I may be forced to conclude that I'm stupid...
>For whatever reason I just can't get my head around pandas, despite multiple attempts.
You need to work with pandas consistently for a month or two, and then it'll all click.
pandas is not complex, nor deep. It is, however, very broad. Most of the time it is "Here's what I need to do. I'm sure there's an API or two in pandas that will let me do this," and then you spend an hour or so looking at the documentation to find those APIs.
My first month or two was: "I need to do this. Let me Google". Pretty much every time someone had asked that same question on SO.
If you stick to it for 2 months, you'll eventually "learn" all the routine tasks and Googling stuff becomes only occasional.
Thanks for this! It's funny in a way: I'm trying to learn the basics, but don't have a clear idea of what the basics actually are. This looks like it could be just the ticket. Cheers!
You're very right, as I'm using both pandas and bonobo for different reasons.
Mostly, when I want a quasi-mathematical look over a dataset, pandas is my tool of choice. For all those data pipeline things that reasonably fit on one computer, I do use bonobo.
I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...
No, I don't have real-life public code available. I'm going to see what I can extract from an old commercial project for publication, but I can't guarantee anything.
Luigi is a simple tool from Spotify that seems to solve a similar data-workflow (DAG) problem to Bonobo. Airflow from AirBnB is a more complex tool, and I've understood that Spotify has lately moved from Luigi to Airflow.
Usually just simple data analysis, really nothing far outside of the 'statistics' lib. Currently it's more the exploration and discovery part of the exercise I'm struggling with. I've got a few hundred thousand csv files representing various aspects of Australia's national energy market (e.g. outcomes of 5 minute supply auctions). I'm trying to make my way through that, figure out what's relevant and wrangle the relevant stuff in some organised fashion.
Is pandas the wrong kind of tool for this type of thing? Going off what rdorgueil has said, I'm beginning to suspect so. Is there a data-wrangling 'gold standard' library for python?
I'm just learning pandas as well but I think it is the right tool for the job. I am using django-pandas so I can do easy ORM stuff. If I were to sketch out your use case:
Create a model class (assuming django-pandas is installed):

    from django.db import models
    from django_pandas.io import read_frame

    class AuctionResult(models.Model):
        timestamp = models.DateTimeField()
        value = models.FloatField()

Then you'd query it and load the queryset into a pandas dataframe:

    qs = AuctionResult.objects.all()
    df = read_frame(qs)

After that you can do all sorts of fun stuff, I imagine.
I don't see why pandas won't work for your case. It sounds like most, if not all, of the CSVs contain the same columns and types of data. You could easily create a pandas dataframe that combines them all, then use any plotting library like matplotlib and/or seaborn to plot. If you need help, provide some examples of the CSVs you are trying to parse.
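For what it's worth, combining same-shaped CSVs is basically a one-liner with pd.concat. A minimal sketch (the column names and values here are invented, not from the actual dataset):

```python
import io
import pandas as pd

# Two toy CSVs with identical columns, standing in for files on disk;
# with real files you'd use pd.read_csv(path) for each path instead.
csv_a = "settlement_time,region,price\n2017-01-01 00:05,NSW,65.2\n"
csv_b = "settlement_time,region,price\n2017-01-01 00:05,VIC,58.1\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]
df = pd.concat(frames, ignore_index=True)

# Average price per region, ready to hand to matplotlib/seaborn:
avg = df.groupby("region")["price"].mean()
```

With a few hundred thousand files you'd glob the directory and may want to concatenate in batches, but the shape of the code stays the same.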
You should also have a look at this book (http://www.goodreads.com/book/show/14744694-python-for-data-...); it helps that the author of the pandas library is also the author of the book. From the description of your use case, you seem to be doing exploratory data analysis, and pandas can definitely handle that.
McKinney's book is good. Unfortunately there have been several reasonably important breaking API changes since it was published, so it's now to be taken with some salt.
You're not alone! I think pandas made some design decisions around their transformation functions that make it a lot more cumbersome to use than R's dplyr. It's not obvious from the documentation, though.
As an example from the pandas docs [1], in dplyr you can do
> gdf <- group_by(df, col1)
> summarise(gdf, avg=mean(col1))
In pandas this is similar to
> df.groupby('col1').agg({'col1': 'mean'})
But dplyr's summarise is much more flexible than agg, as you can do all kinds of things to any number of columns. E.g.
> summarise(gdf, some_name = f1(col1) + f2(col2))
But in pandas, agg only lets you apply a function to one column at a time.
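You can get the multi-column behaviour in pandas by dropping down to apply, which sees the whole sub-frame per group, at some cost in speed and readability. A toy sketch (invented data):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": [10.0, 20.0, 30.0]})

# agg maps one function to one column at a time:
per_col = df.groupby("col1").agg({"col2": "mean"})

# Something like dplyr's summarise(gdf, x = sum(col1 / col2))
# needs apply, whose lambda receives each group as a sub-frame:
combined = df.groupby("col1").apply(lambda g: (g["col1"] / g["col2"]).sum())
```

So the dplyr expression is expressible, it's just not as direct as summarise.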
Swings and roundabouts. I'm a big fan of dplyr, and R definitely does some things better than pandas, but I've never found anything as flexible as pd.pivot_table for cross tabulations. For instance, the lack of multiindexing in R is a big drawback.
The summarise example I gave creates a single new column (some_name) that is a function of two columns from the grouped data frame. Passing a map to agg just creates multiple columns, each a function of at most a single column in the dataframe.
(I should have used a better example, like summarise(gdf, some_col = f(cola / colb))
Yes, but it is still applying that function to a single column (the summarise example I gave could aggregate multiple columns into a single result, e.g. sum(col1 / col2))
Agreed, the documentation could be a lot better. I wonder about a visual approach to using DataFrames and Series, with all the different methods, to demonstrate more clearly what's being done.
I'm trying to figure out if this is "all hat, no cattle". There seems to be a lot of "framework" here, without much core functionality.
Stated more precisely: if I'm stitching together things that process data and 'yield' results, why can't I just do that in pure Python? What does this framework add?
Multiprocessing or multithreading? Why don't you market it as parallel coroutine processing? That would get me interested, because there are dozens of frameworks with over-general descriptions.
Today, the default is multithreading, but that's an implementation detail. Bonobo does not actually support coroutines (as in asyncio coroutines), so it would be a lie to market it that way. The plan, though, is to eventually allow coroutines/futures for specific cases (like long-running/blocking operations where keeping output order tied to input order doesn't matter). Still, there is a lot on the roadmap before this becomes a priority.
I realize I still have a lot of work to do explaining in simple terms what bonobo actually is, without falling into the trap of the "over-general description".
It'd be good to see some comparisons, why this and not one of the other currently available systems? Why should I use this over, for example, Luigi?
What scale is this intended for?
Is it intended for nearly solving a simple problem over my 20TB of data on S3? Big complex graphs? Or more for transitioning a small local reporting system, currently in three Excel files, into a tested Python script?
It's indeed intended for «small data», as opposed to «big data». I know, that does not say much, but I basically wanted to handle small flows of data without having to install the "big weapons".
I'm preparing explanation pages for a lot of the questions I got, including comparisons, data volumes, where it is good and where it is not...
All that will be ready well before 1.0, but for now, we're at 0.2...
With the ancestor of bonobo, I was processing 5M lines of data in around 1 hour, including extraction, joins, API calls and a few loads. That should give a first idea of the target size.
All those references to monkeys hurt my head. Bonobos are not monkeys. If they wanted to name it after monkeys, they should've called it Capuchin or something.
Came here to post just that. It's called 'Bonobo', there's a picture of a gorilla, and the page keeps saying 'monkey' - as petty as it sounds, you're probably losing potential users to zoological nerdrage.
Yes, hackernews and twitter brutally told me I should take animal-kingdom classes asap...
This being said, if any of you have a good picture of bonobos that I can use instead of the current one, I'd be really glad to replace it! It needs to be released under a free license, though.
I came here to say that! The bonobo is an ape indigenous to the left bank of the Congo river, in the Congo rainforest in the DR Congo. They look indistinguishable from chimpanzees to the untrained eye.
Gorillas are a different genus entirely, with at least 4 subspecies, none of which look like chimps or bonobos.
They have some behavioral features that tell them apart from a chimpanzee. Apparently they use sex as a greeting. It is (sort of) anthropomorphized in a hilarious way in the Will Self novel 'Great Apes'. However that may be colouring my memory of how common this is in the real creatures...
Monkey just isn't a good term to fight over. There are plenty of respectable people arguing that monkey means simian, not just non-hominoid simian (as you take from it). Either which way, fighting over where the taxonomical set ends is likely a waste of breath.
But yeah, a CGI gorilla for a site called bonobo. Le sigh.
Before there was pandas, I wrote a website (using Django) to transform grid data into a denormalized CSV file - in other words, a reverse pivot. Basically it converts multiple header rows and header columns into separate fields for each intersecting value.
I've written this basic routine several times over in my career (once in Access VBA!) for different reasons. The current version of it is used to convert a store/item/quantity grid into per-store pick/pack slips.
Pandas has a built-in function that can de-pivot a table; I'm not sure it can handle my use case with multiple header rows, however. Mine also has extra goodies like filling blank row or column values with the previous value in the row/column, among other bizarre features written to grapple with the inconsistent ways our clients make their distro spreadsheets. Trying to break them of their reliance on Excel for this type of planning has proven futile.
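For the multiple-header-rows part, pandas can actually cope by putting a MultiIndex on the columns and then stacking. A toy sketch with invented store/item data (a real spreadsheet export would be read with something like pd.read_csv(path, header=[0, 1], index_col=0) instead of being built by hand):

```python
import pandas as pd

# Toy quantity grid with two header rows: store on top, item below.
cols = pd.MultiIndex.from_tuples(
    [("S1", "A"), ("S1", "B"), ("S2", "A"), ("S2", "B")],
    names=["store", "item"],
)
grid = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.Index(["week1", "week2"], name="week"),
    columns=cols,
)

# Stack both header levels: one row per (week, store, item) value,
# i.e. the "reverse pivot" into a denormalized flat table.
flat = grid.stack(["store", "item"]).rename("quantity").reset_index()
```

The fill-blanks-with-previous-value goodies map reasonably well to DataFrame.ffill, though the truly bizarre client-specific features would still need custom code.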
I'll have to spend some time with Bonobo and Pandas before I take on refactoring our grid tool. It needs a refactor mostly because I'm the only one who understands it. The new data munging libraries would surely simplify some very gnarly logic, and make it accessible to other developers should I get hit by a bus or leave the company.
That's my question too. I've come to heavily rely on Airflow. As an Apache project now it's becoming mature.
From what I can tell browsing the site, Bonobo looks like it's designed to do data processing within the framework. Airflow insists that it's really a task coordinator/scheduler...however, tasks can be Python function calls. So it seems like Bonobo is a specific use case, where Airflow is the more general case (tasks can be SQL queries, bash commands, etc).
As soon as I can, I'll add comparison pages to the documentation, trying to keep them as objective as possible. I can't seriously answer this question in depth here, but it is planned, so experts in other systems can jump in and complement/correct my understanding of each one. I've used a bunch of them, but I'm by no means an expert user of each, so making it collaborative sounds like a better idea than just giving my point of view.