I'm annoyed that I bothered to read the tutorial for this. The TLDR: "Write some generators or functions, put them in a list, and Bonobo will call them all for you in order. Look at the example files for more." The example files are all basic string transformations. The docs are mostly blank pages and missing sections. What little is written has more jokes and conversational tics than information.
What does this even do? There's mention of DAGs and different execution strategies if you really dig through the docs, but is that it? If so, why would you use this instead of joblib or some other established parallelism lib?
Bonobo runs each function in the pipeline in parallel and makes the FIFO-queue plumbing and thread-pool management completely transparent.
The TLDR would then be: "Write some generators or functions, link them in a graph, and each one is called on each line of data as soon as the previous transformation node's output is ready." For example, if you have a database cursor that yields each row of a query, the next step(s) in the graph start running as soon as the first result is ready (while the cursor keeps yielding from the database until the graph is done). I did not find that easy to do with the libraries I tried.
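To make the streaming behaviour concrete, here is a minimal pure-stdlib sketch of the idea (this is not Bonobo's actual internals, just an illustration): each stage runs in its own thread and consumes a FIFO queue, so downstream stages start working as soon as the first row arrives.

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream on each queue

def run_pipeline(source, *stages):
    """Chain stages with FIFO queues; each stage runs in its own thread,
    so downstream stages start as soon as the first row is ready."""
    results = []
    queues = [queue.Queue() for _ in stages]

    def feed():
        for row in source():
            queues[0].put(row)
        queues[0].put(SENTINEL)

    def worker(i, fn):
        while True:
            row = queues[i].get()
            if row is SENTINEL:
                if i + 1 < len(queues):
                    queues[i + 1].put(SENTINEL)
                break
            out = fn(row)
            if i + 1 < len(queues):
                queues[i + 1].put(out)
            else:
                results.append(out)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=worker, args=(i, fn))
                for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Rows flow through both stages as they are produced:
rows = run_pipeline(lambda: iter(range(3)),
                    lambda x: x * 10,
                    lambda x: x + 1)
```

With one thread per stage and FIFO queues, output order stays tied to input order, which matches the behaviour described above.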
The docs clearly lack completeness, to say the least, and would need an example with a big dataset, one with long individual operations, and one with a non-linear graph, so it's more obvious that, of course, it's not made for uppercasing strings twice in a row.
Stay tuned. I'm very happy HN brought it to the homepage; I did not really think that could happen at this stage, and I understand your frustration. But it's a good thing for the project to move forward.
Python is my usual language of choice, but recently I picked up Go for some data processing because there were a lot of benefits to parallelising the task - which Go made easy.
Yeah, why does anyone need something to run some functions in order for them? I can do that, thanks. If it ran them on some... say 'big data' platform, that would be something. As is, this does not deserve to be front page. This is vaporware.
Yeah there seems to be a lot of marketing but I found a concise definition on the author's personal website: "extract transform load for python 3.5+". It could be noted that some of the earlier commit messages include "more marketing".
I haven't tried this yet, but am praying that it delivers even half of what it promises. For whatever reason I just can't get my head around pandas, despite multiple attempts.
If this also turns out to be inscrutable I may be forced to conclude that I'm stupid...
>For whatever reason I just can't get my head around pandas, despite multiple attempts.
You need to work with pandas consistently for a month or two, and then it'll all click.
pandas is not complex, nor deep. It is, however, very broad. Most of the time it is "Here's what I need to do. I'm sure there's an API or two in pandas that will let me do this," and then you spend an hour or so looking at the documentation to find those APIs.
My first month or two was: "I need to do this. Let me Google". Pretty much every time someone had asked that same question on SO.
If you stick to it for 2 months, you'll eventually "learn" all the routine tasks and Googling stuff becomes only occasional.
Thanks for this! It's funny in a way: I'm trying to learn the basics, but don't have a clear idea of what the basics actually are. This looks like it could be just the ticket. Cheers!
You're very right, as I'm using both pandas and bonobo for different reasons.
Mostly, when I want a quasi-mathematical look over a dataset, pandas is my tool of choice. For all those data pipeline things that reasonably fit on one computer, I do use bonobo.
I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...
No, I don't have real-life public code available. I'm going to see what I can extract from an old commercial project for publication, but I can't guarantee anything.
Luigi is a simple tool from Spotify that seems to solve a similar data-workflow (DAG) problem to Bonobo. Airflow from AirBnB is a more complex tool, and I've understood that Spotify has lately moved from Luigi to Airflow.
Usually just simple data analysis, really nothing far outside of the 'statistics' lib. Currently it's more the exploration and discovery part of the exercise I'm struggling with. I've got a few hundred thousand csv files representing various aspects of Australia's national energy market (e.g. outcomes of 5 minute supply auctions). I'm trying to make my way through that, figure out what's relevant and wrangle the relevant stuff in some organised fashion.
Is pandas the wrong kind of tool for this type of thing? Going off what rdorgueil has said, I'm beginning to suspect so. Is there a data-wrangling 'gold standard' library for python?
I'm just learning pandas as well but I think it is the right tool for the job. I am using django-pandas so I can do easy ORM stuff. If I were to sketch out your use case:
Create a model class (assuming django-pandas is installed):

    from django.db import models
    from django_pandas.io import read_frame

    class AuctionResult(models.Model):
        timestamp = models.DateTimeField()
        value = models.FloatField()

Then you'd query it and load the queryset into a pandas dataframe:

    qs = AuctionResult.objects.all()
    df = read_frame(qs)

After that you can do all sorts of fun stuff, I imagine.
I don't see why pandas won't work for your case. It sounds like most, if not all, of the CSVs contain the same columns and types of data. You could easily create a pandas dataframe that combines them all, then use any plotting library like matplotlib and/or seaborn to plot. If you need help, provide some examples of the CSVs you are trying to parse.
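For what it's worth, combining same-shaped CSVs is basically a one-liner with pd.concat. A minimal sketch (the column names and values here are invented, not from the actual dataset):

```python
import io
import pandas as pd

# Two toy CSVs with identical columns, standing in for files on disk;
# with real files you'd use pd.read_csv(path) for each path instead.
csv_a = "settlement_time,region,price\n2017-01-01 00:05,NSW,65.2\n"
csv_b = "settlement_time,region,price\n2017-01-01 00:05,VIC,58.1\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]
df = pd.concat(frames, ignore_index=True)

# Average price per region, ready to hand to matplotlib/seaborn:
avg = df.groupby("region")["price"].mean()
```

With a few hundred thousand files you'd glob the directory and may want to concatenate in batches, but the shape of the code stays the same.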
You should also have a look at this book (http://www.goodreads.com/book/show/14744694-python-for-data-...); it helps that the author of the pandas library is also the author of the book. From the description of your use case, you seem to be doing exploratory data analysis, and pandas can definitely handle that.
McKinney's book is good. Unfortunately there have been several reasonably important breaking API changes since it was published, so it's now to be taken with some salt.
You're not alone! I think pandas made some design decisions around their transformation functions that make it a lot more cumbersome to use than R's dplyr. It's not obvious from the documentation, though.
As an example from the pandas docs [1], in dplyr you can do
> gdf <- group_by(df, col1)
> summarise(gdf, avg=mean(col1))
In pandas this is similar to
> df.groupby('col1').agg({'col1': 'mean'})
But dplyr's summarise is much more flexible than agg, as you can do all kinds of things to any number of columns. E.g.
> summarise(gdf, some_name = f1(col1) + f2(col2))
But in pandas, agg only lets you apply a function to one column at a time.
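You can get the multi-column behaviour in pandas by dropping down to apply, which sees the whole sub-frame per group, at some cost in speed and readability. A toy sketch (invented data):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": [10.0, 20.0, 30.0]})

# agg maps one function to one column at a time:
per_col = df.groupby("col1").agg({"col2": "mean"})

# Something like dplyr's summarise(gdf, x = sum(col1 / col2))
# needs apply, whose lambda receives each group as a sub-frame:
combined = df.groupby("col1").apply(lambda g: (g["col1"] / g["col2"]).sum())
```

So the dplyr expression is expressible, it's just not as direct as summarise.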
Swings and roundabouts. I'm a big fan of dplyr, and R definitely does some things better than pandas, but I've never found anything as flexible as pd.pivot_table for cross tabulations. For instance, the lack of multiindexing in R is a big drawback.
The summarise example I gave creates a single new column (some_name) that is a function of two columns from the grouped data frame. Passing a map to agg just creates multiple columns, each a function of at most a single column in the dataframe.
(I should have used a better example, like summarise(gdf, some_col = f(cola / colb))
Yes, but it is still applying that function to a single column (the summarise example I gave could aggregate multiple columns into a single result, e.g. sum(col1 / col2))
Agreed, the documentation could be a lot better. I wonder about a visual approach to using DataFrames and Series, with all the different methods, to demonstrate more clearly what's being done.
I'm trying to figure out if this is "all hat, no cattle". There seems to be a lot of "framework" here, without much core functionality.
Stated more precisely: if I'm stitching together things that process data and 'yield' results, why can't I just do that in pure Python? What does this framework add?
Multiprocessing or multithreading? Why don't you market it as parallel coroutine processing? That would get me interested, because there are dozens of frameworks with over-general descriptions.
Today, the default is multithreading, but that's an implementation detail. Bonobo does not actually support coroutines (as in asyncio coroutines), so it would be a lie to market it that way. The plan, though, is to eventually allow coroutines/futures for specific cases (like long-running/blocking operations where keeping output order tied to input order doesn't matter). Still, there is a lot on the roadmap before this becomes a priority.
I realize I still have a lot of work to do explaining in simple terms what bonobo actually is, without falling into the trap of the "over-general description".
It'd be good to see some comparisons, why this and not one of the other currently available systems? Why should I use this over, for example, Luigi?
What scale is this intended for?
Is it intended for nearly solving a simple problem over my 20TB of data on S3? Big complex graphs? Or more for transitioning a small local reporting system, currently in three Excel files, into a tested Python script?
It's indeed intended for «small data», as opposed to «big data». I know, that does not say much, but I basically wanted to handle small flows of data without having to install the "big weapons".
I'm preparing explanation pages for a lot of the questions I got, including comparisons, data volumes, where it is good and where it is not...
All that will be ready well before 1.0, but for now, we're at 0.2...
With the ancestor of bonobo, I was processing 5M lines of data in around 1 hour, including extraction, joins, API calls and a few loads. That should give a first idea of the target size.
All those references to monkeys hurt my head. Bonobos are not monkeys. If they wanted to name it after monkeys, they should've called it Capuchin or something.
Came here to post just that. It's called 'Bonobo', there's a picture of a gorilla, and the page keeps saying 'monkey' - as petty as it sounds, you're probably losing potential users to zoological nerdrage.
Yes, hackernews and twitter brutally told me I should take animal-kingdom classes asap...
This being said, if any of you have a good picture of bonobos that I can use instead of the current one, I'd be really glad to replace it! It needs to be released under a free license, though.
I came here to say that! The bonobo is an ape indigenous to the left bank of the Congo river, in the Congo rainforest in the DR Congo. They look indistinguishable from chimpanzees to the untrained eye.
Gorillas are a different genus entirely, with at least 4 subspecies, none of which look like chimps or bonobos.
They have some behavioral features that tell them apart from a chimpanzee. Apparently they use sex as a greeting. It is (sort of) anthropomorphized in a hilarious way in the Will Self novel 'Great Apes'. However that may be colouring my memory of how common this is in the real creatures...
Monkey just isn't a good term to fight over. There are plenty of respectable people arguing that monkey means simian, not just non-hominoid simian (as you take from it). Either which way, fighting over where the taxonomical set ends is likely a waste of breath.
But yeah, a CGI gorilla for a site called bonobo. Le sigh.
Before there was pandas, I wrote a website (using Django) to transform grid data into a denormalized CSV file - in other words, a reverse pivot. Basically it converts multiple header rows and header columns into separate fields for each intersecting value.
I've written this basic routine several times over in my career (once in Access VBA!) for different reasons. The current version of it is used to convert a store/item/quantity grid into per-store pick/pack slips.
Pandas has a built-in function that can de-pivot a table; I'm not sure it can handle my use case with multiple header rows, however. Mine also has extra goodies like filling blank row or column values with the previous value in the row/column, among other bizarre features written to grapple with the inconsistent ways our clients make their distro spreadsheets. Trying to break them of their reliance on Excel for this type of planning has proven futile.
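For the multiple-header-rows part, pandas can actually cope by putting a MultiIndex on the columns and then stacking. A toy sketch with invented store/item data (a real spreadsheet export would be read with something like pd.read_csv(path, header=[0, 1], index_col=0) instead of being built by hand):

```python
import pandas as pd

# Toy quantity grid with two header rows: store on top, item below.
cols = pd.MultiIndex.from_tuples(
    [("S1", "A"), ("S1", "B"), ("S2", "A"), ("S2", "B")],
    names=["store", "item"],
)
grid = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.Index(["week1", "week2"], name="week"),
    columns=cols,
)

# Stack both header levels: one row per (week, store, item) value,
# i.e. the "reverse pivot" into a denormalized flat table.
flat = grid.stack(["store", "item"]).rename("quantity").reset_index()
```

The fill-blanks-with-previous-value goodies map reasonably well to DataFrame.ffill, though the truly bizarre client-specific features would still need custom code.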
I'll have to spend some time with Bonobo and Pandas before I take on refactoring our grid tool. It needs a refactor mostly because I'm the only one who understands it. The new data munging libraries would surely simplify some very gnarly logic, and make it accessible to other developers should I get hit by a bus or leave the company.
That's my question too. I've come to heavily rely on Airflow. As an Apache project now it's becoming mature.
From what I can tell browsing the site, Bonobo looks like it's designed to do data processing within the framework. Airflow insists that it's really a task coordinator/scheduler...however, tasks can be Python function calls. So it seems like Bonobo is a specific use case, where Airflow is the more general case (tasks can be SQL queries, bash commands, etc).
As soon as I can, I'll add comparison pages to the documentation, trying to keep them as objective as possible. I can't seriously answer this question in depth here, but it is planned, so experts in other systems can jump in and complement/correct my understanding of each one. I've used a bunch of them, but I'm by no means an expert user of each, so making it collaborative sounds like a better idea than just giving my point of view.