Show HN: Bistro – A light-weight column-oriented data processing engine (github.com/asavinov)
95 points by asavinov on Jan 16, 2018 | 29 comments



The core looks close enough to dataframes that I'd be curious to know how you compare to tablesaw: https://github.com/jtablesaw/tablesaw

This looks neat but I'm not sure why I would care about this. There are already a ton of solutions out there in the ecosystem with a columnar-like interface.

Granted, we wrote our own as well [1]. It uses the builder pattern to build up a series of transforms that you then toss to an executor (our main backend for this is Spark). One reason we wrote it is persistence: being able to encode and persist a series of transforms that you can then load remotely has been very helpful for us in machine learning.

We've since migrated this project to the Eclipse Foundation and intend to rewrite the interface as well as integrate our baked-in tensor library [2] into certain parts of the pipeline, both for speed and for handling things like computer vision workloads.

In general, I always like seeing new takes on columnar processing, but I'm just not seeing anything novel here. Clarification of intent would be great!

[1]: https://github.com/deeplearning4j/DataVec [2]: https://github.com/deeplearning4j/nd4j


Bistro is not about the physical model or columnar (physical) representation, although it relies on one. It is about the logical level of representation and processing.

There are at least three major questions when we want to introduce a new logical data model:

* How to define columns within one table. Conceptually it is easy, e.g., SELECT x, y, c = a+b FROM T. Yet even in this simple case we see a mismatch: this statement will create a table, but our goal is not to create a table - we want to create a column (a function). Bistro uses the calc operation for that purpose. This is of course not new; in pandas, for example, one can use df.apply.

* How to connect several tables. The relational model, map-reduce and other set-oriented approaches use join, which produces a new table. Here we have a similar mismatch [1]: I do not want to produce a new table - that is not my goal. My goal is to link these two tables, and this means creating a new column. Bistro introduces the link operation for such columns.

* How to aggregate data. Here the typical operation is group-by. It is an eclectic operation that combines several other operations, and such an approach also has some problems [2]. Bistro changes the way data is aggregated by introducing accumulate functions, which take one input element (not a whole subset) and return one output value. An accumulate function is called for each element of the group and updates the current aggregate, instead of the aggregate being computed from the whole group at once.

So linking and aggregation are what distinguish Bistro from other approaches and frameworks for data processing, including pandas, SQL and map-reduce. A minimal sketch of all three operations follows the references below.

[1] https://www.researchgate.net/publication/301764816_Joins_vs_...

[2] https://www.researchgate.net/publication/316551218_From_Grou...
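
To make the semantics concrete, here is a minimal, self-contained sketch in plain Java over arrays (this is not Bistro's actual API, just an illustration of the three column kinds):

    // Two "tables" as sets of parallel column arrays
    double[] factAmount = { 10, 20, 30 };     // Facts.amount
    String[] factCity   = { "a", "b", "a" };  // Facts.city (a value, not a reference)
    String[] groupName  = { "a", "b" };       // Groups.name

    // calc: a derived column computed row by row within one table
    double[] factDoubled = new double[factAmount.length];
    for (int i = 0; i < factAmount.length; i++)
        factDoubled[i] = 2 * factAmount[i];

    // link: a column of row references into Groups - no new (joined) table is produced
    int[] factGroup = new int[factCity.length];
    for (int i = 0; i < factCity.length; i++)
        for (int g = 0; g < groupName.length; g++)
            if (groupName[g].equals(factCity[i])) factGroup[i] = g;

    // accumulate: fold facts into the group aggregate one element at a time,
    // instead of collecting each whole group and computing over it (group-by)
    double[] groupTotal = new double[groupName.length];
    for (int i = 0; i < factAmount.length; i++)
        groupTotal[factGroup[i]] += factAmount[i];  // accumulate fn: (agg, x) -> agg + x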


Ok, so you're defining new operations on top of existing primitives. Makes sense! The concepts look more interesting than the library right now (it doesn't move the needle for me production-wise yet), but it has potential! Every project starts somewhere. I'm glad you wrote this in Java at least.

There's a ton of things I'd need before I could seriously start looking at this.

1. Backend agnosticism: let me run on different backends like Flink/Spark.

2. Give me off-heap memory, please. Let me play dangerous and use pointers directly to optimize interactions with transforms. We wrote our own GC, among other things, for our tensor lib due to GC bottlenecks, copying and the like. That's personally what I kind of like about tablesaw.
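
For a sense of what I mean, plain NIO already gets you most of the way for a flat column; a minimal sketch (not tablesaw's or Bistro's API):

    import java.nio.ByteBuffer;

    // An off-heap column of doubles: the backing memory lives outside
    // the Java heap, so the GC never scans, copies or compacts it.
    ByteBuffer column = ByteBuffer.allocateDirect(1_000_000 * Double.BYTES);
    column.putDouble(42 * Double.BYTES, 3.14);        // write row 42
    double v = column.getDouble(42 * Double.BYTES);   // read row 42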

A library for one-off ad hoc analysis in memory isn't a bad start though, especially since most folks don't actually have that large of a problem.


You are right - it is an MVP, and the goal is to choose a direction. In fact, I am still not sure in which direction to go:

* JavaScript data processing framework for in-browser data processing

* Python framework like pandas

* Big data processing framework like Spark

* Database management system

* Data integration system like typical ETL and BI

* Stream analytics like Kafka Streams

* IoT (light-weight) stream processing engine

* Something else?

I would be very thankful for any suggestions from people who know the market and the (acute) needs of customers. What is the best niche for this kind of technology?


I'm working on a columnar Python DSL right now; I think of it as SQLAlchemy for Pandas/Spark/Flink. My goal is to create a language that makes the cumbersome parts of the PySpark API much easier to express. I started off intending to replicate R's dataframe API because it feels more fluid - but what you're doing feels eerily familiar, because I came to the same conclusions as you about focusing on a language for columnar manipulations and letting the "linking" become implicit. Then I want to move on to rethinking the ML pipeline. I like Spark's more than sklearn's, but there are still cumbersome parts that my intuition says can be solved by a columnar API.

So if I were you, I would do Python DSL, because it addresses my immediate needs of increasing productivity. :)


Arrow integration, persistence, and connecting this to JDBC would make an MVP for me - it would at least allow folks to imitate pandas. While you do have new ideas here, if folks can express those ideas in terms they're already familiar with, that would likely help.

Maybe you could use Calcite [1] as an engine, or as a base, kind of like Arrow.

If you do Python, maybe look into pyjnius [2].

There are a lot of things that already have connectors. I would continue along your MVP route, allowing folks to do basic things with your framework first; then you can improve it as you go. SQL databases aren't a bad initial target - most folks can do SQL.

[1]: https://calcite.apache.org/ [2]: https://github.com/kivy/pyjnius


Is it in-memory? Does it support replication or sharding? What's the main use-case? How does it differ from ORC, Parquet or Arrow? The repository doesn't have any information.


At the physical (storage) level it is in-memory and organized as a column store, that is, internally it stores a list of tables (with no data) and a list of columns (each being a Java array).
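
Something like this, in simplified form (a sketch of the layout described above, not the actual classes):

    import java.util.ArrayList;
    import java.util.List;

    class Table {              // a table holds no data, only identity and row count
        String name;
        int length;
    }
    class Column {             // a column belongs to one table and owns its data
        String name;
        Table input;           // the table whose rows it maps from
        Object[] values;       // one Java array per column
    }
    class Schema {             // the store: a list of tables and a list of columns
        List<Table> tables = new ArrayList<>();
        List<Column> columns = new ArrayList<>();
    }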

It is definitely interesting and important to implement persistence (for example using Parquet or Arrow) as well as other mechanisms like sharding or replication (for big data processing, fault tolerance, etc.). Yet currently this direction has lower priority, because the next task I want to focus on is in-stream analytics (an alternative to Kafka Streams).

In general, the whole approach is focused on the logical level of data modeling and processing; the goal is to increase the performance and simplicity of development. The general idea (and hypothesis) is that defining how data is processed using column operations is easier, more intuitive, less error-prone and easier to maintain than using purely set operations.

In other words, at the logical level it is an alternative to map-reduce, SQL, pandas and other models and frameworks where set operations are used to process data.


Can you define the column operations by reducing them to a set of analogous steps performed after the query planner has run in an SQL database?


> defining how data is being processed using column operations is easier, more intuitive, less error-prone and easier to maintain than using purely set-operations.

And every APL programmer just nodded in agreement.


All 6 of us. :(

Really though, I spent the time reading the readme thinking “This looks very cool, shame it doesn't have a nice language to go with it…”


Interesting idea. Columnar ETL can be quite efficient in some scenarios, because frequently an ETL transformation (e.g. calculating a new column) effectively modifies an existing table rather than creating a new one. This allows calculating only the delta instead of rebuilding the table from scratch, which helps optimize performance and do calculations in memory without slow disk I/O.

Another advantage is that it allows performing many transformations (e.g. filtering) directly on dictionary-compressed data, without decompressing it. This works well in Vertica [1] (based on the C-Store DB [2]), which was our inspiration for building a light-weight ETL tool for business users that also uses a columnar in-memory data transformation engine [3].

[1] https://www.vertica.com/

[2] http://db.csail.mit.edu/projects/cstore/

[3] http://easymorph.com/in-memory-engine.html
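
For illustration, filtering dictionary-compressed data without decompressing it boils down to comparing codes instead of values (a generic sketch, not EasyMorph's implementation):

    // Dictionary compression: the column stores small integer codes;
    // the dictionary maps each code to the actual (string) value.
    String[] dict  = { "Berlin", "London", "Paris" };
    int[]    codes = { 0, 2, 1, 0, 2, 2 };  // the compressed column

    // Filter city = 'Paris': look the code up once in the dictionary,
    // then scan the column comparing ints - no row is ever decompressed.
    int target = java.util.Arrays.asList(dict).indexOf("Paris");
    for (int row = 0; row < codes.length; row++)
        if (codes[row] == target)
            System.out.println("match at row " + row);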


Sorry for being that guy, but I just clicked into a random file in src to read the code, and found the code style (indentation etc.) to be quite weird https://github.com/asavinov/bistro/blob/master/core/src/main....

Might I suggest using https://github.com/google/google-java-format for formatting?


This is what happens when tabs and spaces get mixed.


So you’re a spacist?


If not, I run a space supremacist group out of my basement. Weekly meetings devoted entirely to tab-shaming.


An example would be great. Can you show how to do a given task with SQL, map/reduce, and your framework?

Because right now I have no idea why I’d choose to learn this new stuff over using google-able tools I already know.

Make your value proposition really clear.


Agreed -- a motivating example in the Readme would really help people totally new to this concept grasp the how/why of the project.


On http://conceptoriented.org there is a link to http://conceptoriented.com, which seems to be a demo.


Source code for this web-app and previous projects:

* UI (Angular 2): https://github.com/asavinov/sc-web

* REST server: https://github.com/asavinov/sc-rest

* Core engine: https://github.com/asavinov/sc-core - Bistro is a complete rewrite of this project

There is an (old) implementation of this idea in C#:

* UI (WPF): https://bitbucket.org/asavinov/dc-wpf

* Core engine (C#): https://bitbucket.org/asavinov/dce-csharp

And there is also an old Java implementation of the engine:

* https://bitbucket.org/asavinov/dc-core


Might be a cool idea, but it's not nearly fleshed out enough. I think a larger example, instead of just individual lines of code, would be useful. Show a toy widget-sales spreadsheet.

What is the use case? Does it support time series? How would you do a moving average or a pivot table?


> What is the use case?

This is what I am trying to understand myself :) Previously I implemented this approach as a web app (a self-service tool for working with tables where users can define columns as formulas, similar to spreadsheets), discussed here: https://news.ycombinator.com/item?id=14351461 But it is too difficult to implement (many more resources are needed), so I switched to developing a library.

Now I want to implement a server for in-stream analytics (an alternative to Kafka Streams), hopefully in the next version of Bistro. It will put quite significant portions of the data processing logic in UDFs (retention policy, when to evaluate, how to add data, when to do what, etc.).

> Does it support time series? How would you do a moving average or pivot table?

I am designing it now as a tool for stream processing (Bistro Streams), and hence it will support time series (each new row will get a timestamp and the system will "know" how to deal with time).

The current approach to user-defined functions (the Evaluator interface) does not support moving averages or other rolling aggregations. A new API will be defined; this task has high priority since it is very important for stream analytics.
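
For what it's worth, a moving average fits the same incremental style - update state per element rather than recompute over the whole window. A sketch of the idea (not the planned API):

    // Moving average over a fixed window, updated one value at a time
    double[] window = new double[5];  // ring buffer of the last 5 values
    int next = 0, count = 0;
    double sum = 0;
    for (double x : new double[] { 1, 2, 3, 4, 5, 6, 7 }) {
        if (count == window.length) sum -= window[next];  // evict the oldest value
        else count++;
        window[next] = x;
        next = (next + 1) % window.length;
        sum += x;
        System.out.println("moving average: " + sum / count);
    }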

Pivoting is conceptually more difficult because it is not an operation with data - it is an operation with schema (using data). Maybe some kind of ad-hoc solution will work.


Honestly, there are probably solid use cases for simplifying data exploration, especially when dealing with large amounts of data. I myself (like probably every developer in finance and many others) have tried to think of ways to build upon the intuitiveness and usefulness of the spreadsheet paradigm while making it more powerful.

And if you can find a good way to do this along with a solid interface that non-programmers can use (Excel is the master at getting non-programmers to program), there is probably a lot of money to be made.

Honestly, this looks like a query and analysis system that could be built on top of a column store, so you don't need to deal with the storage part of the system yet. (I'm a big kdb+ fan.)

Good luck. I starred and watched the repo just to see how it progresses.


> Honestly, this looks like a query and analysis system that could be built on top of a column store, so you don't need to deal with the storage part of the system yet. (I'm a big kdb+ fan.)

I like this idea because it allows reusing an existing engine and, importantly, integrating with many different engines (which may already have quite sophisticated data management mechanisms). Yet such an engine has to expose quite low-level operations on data. Also, there has to be support for user-defined functions (lambdas).


I skimmed the readme but didn't see the answer to what I regard as a basic question: how is this different from a view? I can easily make derived columns based on functions and reference those in other views (performance issues aside).


How does that compare to SAS software (https://en.wikipedia.org/wiki/SAS_(software))? Particularly the "DATA" steps.


SAS DATA steps create new datasets (tables) from one or more input datasets. As far as I understand the author, Bistro creates just a new column in an existing table.


> Bistro creates just a new column in an existing table.

What is new is that this column can use data from other tables, not necessarily its own. And eventually Bistro gets rid of joins and group-bys, which are not difficult to understand but are frequently used inappropriately; for example, in most cases joins are used where we actually want to link two tables (a new relation is not needed).

Note also that we cannot avoid set operations and creating tables, therefore Bistro also supports table creation. Yet it does it in a functional way, that is, by defining new columns.


This is neat. Vector-space-based AI calculations will benefit from this approach. Great work!



