To Be Continuous

perplexes · on July 7, 2015

I'm surprised no one has mentioned Esper yet: http://www.espertech.com/esper/

Esper does exactly this - you run streams of events over it and it continuously executes SQL to see if it matches. If so you can:

- run code

- make new streams

- store the results

Esper's been doing this kind of thing for 9 years now.

ak4g · on July 8, 2015

Esper is really curious. I looked at it while getting a feel for the general landscape of streaming data tools in late 2013 - to be clear, it was a very cursory look, on the timescale of a few hours at most. I quote what my final take-away was at the time:

> Seems good but... such a weird project. Codehaus, svn, not high activity, but consistent, stable releases for five years. Maybe just not the kind of thing webdevs get into? Not sure if that's a strike against or not.

With the demise of Codehaus it looks like they've moved to github:

https://github.com/espertechinc/esper

But oddly they don't seem to have migrated their svn history, and the README implies they don't plan to... I certainly hope they didn't lose it when Codehaus shut down. There was, as noted, already 9 years worth of code changes in that repo. That would be unfortunate.

res0nat0r · on July 7, 2015

I downloaded the OSX .pkg installer and didn't see anything in /Applications or /opt after running it and telling it to install to my root drive. Just glancing at some docs on your site I see pipeline-init, so doing a find on / to find out where it placed the binaries see it installed to:

/usr/lib/pipelinedb/usr/lib/pipelinedb/bin/pipeline-init

Is this intentional?

EDIT:

After playing around with the .pkg file it looks like the packed Payload contains '/usr/bin/pipelinedb/usr/lib/pipelinedb' which is probably the problem. I see broken symlinks for pipeline-init etc in /usr/bin pointing to /usr/lib/pipelinedb, so I'm guessing this repetition of the path above is a mistake.

Also I see a postinstall script creating a symlink from pipeline to psql. This seems like a bad idea as psql is pretty universal already as the name for the PostgreSQL CLI binary, maybe 'pipesql' might be better?

grammr · on July 7, 2015

Sorry about that, OSX packaging is a bit finicky right now. We'll publish a package with more sensible defaults very shortly.

mfenniak · on July 8, 2015

A Homebrew (http://brew.sh/) package might be a better approach that a .pkg. It's easier to maintain for you, and, definitely easier for a user. Not all OSX users use Homebrew, of course, but I think many in your audience would.

kawera · on July 8, 2015

Homebrew would be nice too.

res0nat0r · on July 8, 2015

Great thanks! Looking forward to checking it out.

grammr · on July 8, 2015

We just updated http://www.pipelinedb.com/download/0.7.7/osx.

Please shoot me an email (I'm Derek) if you have any issues installing this package. Thanks for your patience!

cuttinedg · on July 8, 2015

There was an error with init and created an issue in github

https://github.com/pipelinedb/pipelinedb/issues/1025

tuckermi · on July 7, 2015

How does the PipelineDB differ or build on the ideas from Aurora/Borealis/StreamBase? At least at a high level, something like LiveView[1] seems to provide similar functionality to PipelineDB's concept of a Continuous View.

I was under the impression that the academic projects had proposed StreamSQL as a general language, though since StreamBase's acquisition it now seems to have been branded as TIBCO StreamSQL[2]. Have you guys been part of any efforts to make sure that there is an open language standard?

[1] http://streambase.typepad.com/streambase_stream_process/2013...

[2] http://www.streambase.com/developers/docs/latest/streamsql/

anonetal · on July 7, 2015

There is also TelegraphCQ, a competing project at Berkeley around the same time; TelegraphCQ was also built on top of PostgreSQL and its support for Continuous Queries seems essentially identical to Continuous Views here ("materialized views", "triggers", and "continuous queries" are all quite similar to each other in terms of the underlying technology needed). TelegraphCQ was commercialized as Truviso, bought out by CISCO a few years ago.

http://telegraph.cs.berkeley.edu/telegraphcq/v0.2/

tuckermi · on July 7, 2015

Thanks for the link, I hadn't seen TelegraphCQ previously. Following the trail, I also came across a couple other similar research projects relating to Stream-oriented DBs. Specifically, STREAM from Stanford and Cougar from Cornell, though it appears that all of these academic projects are dormant at this point.

Fergi · on July 7, 2015

PipelineDB certainly builds on work from Aurora, TelegraphCQ, Truviso, Streambase, and many other projects and companies. We have interacted heavily with many people from these projects to learn what worked (and didn't work) for them over the years, and hopefully to build on that in a pragmatic way.

To your point about promoting language standards, we've intentionally kept the syntax as close to SQL as possible in order to keep things simple. The goal has always been to give the broadest range of developers the simplest way possible to develop realtime applications using only SQL.

assface · on July 8, 2015

All of the academic stream systems from the early 2000s are long over. The students have all graduated.

Truviso got bought by Cisco and disappeared into their internal projects.

StreamBase got bought by TIBCO and is still available today.

ScottBurson · on July 8, 2015

And Coral8 got bought by Sybase and became Sybase CEP, also still available. (Sybase is now owned by SAP.)

chadthenderson · on July 7, 2015

This looks very cool. Although, I'm not sure I totally understand how it can be used to replace batch ETL processes. So, PipelineDB eliminates ETL batch processing by incrementally inserting data into continuous views, but the documentation says that it's not meant for ad-hoc data warehouses as the raw data is discarded. So, does that leave me still using batch processes to load my data warehouse? Is PipelineDB going to be my data warehouse as long as I only want the resulting streamed data? Just trying to figure out what this would look like and where its place is in a data warehouse environment.

grammr · on July 7, 2015

Hey Chad, PipelineDB co-founder here. PipelineDB certainly isn't intended to be the only tool in your data infrastructure. But whenever the same queries are being repeatedly run on granular data, those are the types of situations in which it often makes a lot sense to just compute the condensed result incrementally with a continuous view, because that's the only lens it's ever viewed through anyways (dashboards are a great example of this). Continuous views can be further aggregated and queried like regular tables too.

In terms of not requiring that raw data be stored, a typical setup is to keep raw data somewhere cheap (like S3) so that it's there when you need it. But granular data is often overwhelmingly cold and never looked at again so it may not always be necessary to store it all in an interactively queryable datastore.

As I mentioned, PipelineDB certainly doesn't aim to be a monolithic replacement for all adjacent data processing technologies, but there are areas where it can definitely introduce significant efficiency.

chadthenderson · on July 7, 2015

Great. Thank you for the clarification. What you just described definitely sounds like something PipelineDB would be great for. I can see it being especially useful for quickly standing up dashboards and maybe even datamarts when considering new data sources. I just wanted to make sure that I wasn't missing something.

reubano · on July 8, 2015

So what's the best practice for when you want a real time dashboard but also want the ability to compare data overtime. E.g., ave. bounce rate this month vs last? Is Pipeline still ideal in this case?

Fergi · on July 8, 2015

Jeff (PipelineDB Co-Founder, here) - Yes, PipelineDB is great for this use case. One powerful aspect of PipelineDB is that it is a fully functional relational database (a superset of PostgreSQL 9.4) in addition to a streaming-SQL engine we have integrated the notion of 'state' into stream processing, for use cases exactly like this.

You can do anything with PipelineDB that you can do with PostgreSQL 9.4, but with the addition of continuous SQL queries, sliding windows, probabilistic data structures, uniques counting, and stream-table JOINs (what you're looking for here, I believe.)

WaxProlix · on July 7, 2015

As someone who's made a lot of use of `tail` and similar, this is appealing.

But I don't have a lot of use cases in personal projects, and am unlikely to find a good use-case at work in the near future. What's the 'adoption path' for something like this?

I think a really robust sample data set with example queries (think the neo4j imdb examples) would be a great way to show how powerful and easy something like this can be.

Fergi · on July 7, 2015

Jeff from PipelineDB here. Great suggestion! We've actually created this using sample Wikipedia data. See:

http://docs.pipelinedb.com/quickstart.html

therealmocker · on July 7, 2015

How similar is this to something like http://riemann.io for processing events from a stream?

Fergi · on July 7, 2015

I'm not familiar with riemann.io but the main differences seem to be that Riemann is clojure-based and would require external storage of some sort. PipelineDB is SQL-based and has integrated storage. It really depends on what kind of event processing you're looking to do.

The main tradeoff with PipelineDB and other stream processing frameworks like Riemann, Storm, Spark Streaming, Samza, and others is mainly flexibility for simplicity. Not all streaming computation lends itself to SQL, but in scenarios where it does continuous SQL queries and a relational database can be simpler. But as with all data processing endeavors, you have to find the right tool for the job.

zallarak · on July 7, 2015

Very cool that it is open sourced - seems like there would be a lot to learn from the code. Link: https://github.com/pipelinedb/pipelinedb

djupblue · on July 7, 2015

This is awesome, thanks for making it open source!

Would it be possible to set triggers or something on the continuous views? Lets say I want to take action (immediately) when a value calculated over sliding window goes above a limit.

It's a bit late here but I'll definitely play with PipelineDB tomorrow.

usman-m · on July 7, 2015

Not right now, but it's definitely a feature we're thinking about.

Awesome--let us know what you think about it!

djupblue · on July 7, 2015

Will do, I love PostgreSQL so I'm pretty excited to se what the possibilities are!

Found this gem in the docs :D

Unsupported Aggregates: xmlagg ( xml )

:(

asgard1024 · on July 8, 2015

This claim about ETL not needed in the future sounds dubious. I work on a large application that is all about ETL. If we wanted to use this new method instead, I am not sure how it would deal with the following:

- State in the data. In many sources we have, processing depends on some internal state, which must be kept along the time. For example some process has started and we will know when it ended, and we must keep its state so we could correctly process the ending event (to match it up). I am not clear how this will work with continuous views. I would say this is actually the major reason of what makes ETL processing non-trivial.

- Processing failure. Let's say something goes wrong and the data processing fails (or it can actually be even planned downtime). How do we know where to restart, to avoid processing data twice or miss data? Does the continuous stream take care of this metadata? And how does it deal with the state information per above? If you do data processing in batches, there is an obvious point of restart. Again, I think the extra complexity that "continuous" approach says is unnecessary relates to the fact that you want to be able to checkpoint the state of processing for various reasons.

buremba · on July 7, 2015

It seems PipelineDB doesn't have a clustered version, all the data must be sent to one server similar to Postgresql. Considering the fact that stream processing feature is usually useful in big data (if the data size is not that big and the data can fit in memory, complex aggregation queries usually don't take more than 1 second using a columnar database), is it possible to use PipelineDB for millions of events per second?

moatra · on July 7, 2015

Do Continuous Views work with table-table joins, or must there always be at least one stream present? The documentation[1] doesn't specify.

If so, this could be an interesting alternative to RethinkDB's changefeeds, as RethinkDB doesn't support joins on the change stream.

[1] http://docs.pipelinedb.com/joins.html

grammr · on July 7, 2015

Founder @ PipelineDB here (hi Slava!)

Currently continuous views must read from a stream. However, in the very near future it will possible to write to streams from triggers, which would probably give you enough flexibility to model the behavior you want if you could conceptualize a table as a stream of changes.

coffeemug · on July 7, 2015

Founder @ RethinkDB here.

Our next release (2.1) is due in about three weeks and includes automatic failover/high availability. Feeds on table joins (and other greatly expanded feed functionality) will be in 2.2, which should happen ~6-8 weeks after 2.1.

(Sorry to jump in with a shameless plug; what PipelineDB is doing is super-cool; I also met the founders a few times, and they're awesome, smart, and very driven people -- I'm really excited about what PipelineDB has to offer!)

clebio · on July 7, 2015

Would be mind-expanding if shell redirection could be used for that, e.g.

> diff <(cat a.txt) <(cat b.txt)

mfenniak · on July 7, 2015

Cool. Very cool.

My first thought (aside from "Cool") was that the current time would be the tricky thing that can't be incorporated into a continuous view. But even that seems to be handled! http://docs.pipelinedb.com/sliding-windows.html

Looks pretty impressive. :-)

buremba · on July 7, 2015

We needed to implement continuous queries in our application code. (It's actually hard to do it right in Postgresql so it's very limited) https://github.com/buremba/rakam/wiki/Postgresql-Backend#con... Since stream processing and real-time analytics are quite hot topics nowadays, I think real-time databases will get much more attention in a near future.

anarazel · on July 8, 2015

Postgres actually does expose a commit log mechanism since 9.4, expanded with 9.5.

http://www.postgresql.org/docs/devel/static/logicaldecoding.... and 9.5's track_commit_timestamp = on.

buremba · on July 18, 2015

Sorry for the late response. HN just sent me notification mail.

The documentation states it as "by default". It's required to change wal settings in config file and write a custom logical decoding output plugin in order to take advantage of that feature.

Currently, PaaS provisers such as RDS and Heroku Postgres don't support and it would not be easy to setup it manually. AFAIK, that feature is intended to be to used for backups.

jxramos · on July 7, 2015

Well said! Good timing too, I'm beginning to sketch out how to tackle this large file set processing that has to stitch together data from corresponding files. The magnitude I'm imagining is such that I can't just read all the files into memory and do the matching, number crunching, and what not against. I like the concepts and terminology in this article. Definitely worth keeping in the back pocket going forward if not diving into it all outright. Thanks so much.

infodroid · on July 7, 2015

It looks like PipelineDB is implemented as a fork of PostgreSQL. I would be interested to understand what is different about the architecture of PipelineDB that it couldn't be integrated into upstream PostgreSQL.

usman-m · on July 7, 2015

It's not so much as how different it is from an architectural standpoint as it is about the sheer magnitude of such a feature. All open-source communities have processes which help maintain high quality, but also add a bit of red tape. I don't think we'd be able to operate at the pace we want if we were pushing every change upstream.

However, we love Postgres and plan on actively merging upstream releases!

jchrisa · on July 7, 2015

Cofounder at Couchbase here.

Is the license decision driven by business or is there some dependency that pushes you to GPL? For us Apache 2.0 has been worth it even when other companies use our code in their products.

ak4g · on July 8, 2015

And not just the GPL, but the AGPL. That one caught me by surprise. Does this mean there is going to be a MySQL-style dual-licensing situation in the future? (or is this already the case?)

Fergi · on July 8, 2015

PipelineDB is actually licensed under the GPLv3. We accidentally updated our files with the AGPL earlier today by mistake but have since rectified that. Apologies for the mixup!

reitanqild · on July 8, 2015

As much as thank you comments are downvoted here I'll still say: Thank you! Same goes to everyone else who avoids hasseling developers everywhere with AGPL.

sheraz · on July 8, 2015

Can PipelineDB be used to run projections for an EventStore?

I'm experimenting with the EventStore pattern for a side project, and I have struggled to implement projections. Could PipelineDB be a way to deliver that?

jpitz · on July 8, 2015

In the example for sf_proximity_count, you state the view covers a 5 minute sliding window, but the WHERE clause does not reference clock_timestamp(). Is 5 minutes an implicit default?

grammr · on July 8, 2015

There was a bug in the example. Fixed, thanks!

pradn · on July 7, 2015

What does ETL mean in this context?

oxymoron · on July 7, 2015

Extract, Transform, Load

maslam · on July 7, 2015

Wonderful! Great job.