Introducing Complex Event Processing (CEP) with Apache Flink

scott_s · on April 6, 2016

For those interesting in CEP in streaming systems, I invite you to read a paper by my colleague, Martin Hirzel, "Partition and Compose: Parallel Complex Event Processing": http://hirzels.com/martin/papers/debs12-cep.pdf

His work has more of a declarative, regex-inspired take, and is implemented in SPL, the language for IBM Streams. His prototype was adapted into the product, and is a part of Streams: http://www.ibm.com/support/knowledgecenter/SSCRJU_4.1.0/com....

ignoramous · on April 6, 2016

For anyone wondering what CEP is... (from the blog) "in contrast to traditional DBMSs where a query is executed on stored data, CEP executes data on a stored query."

And

"Apache Flink with its true streaming nature and its capabilities for low latency as well as high throughput stream processing is a natural fit for CEP workloads."

Anyone looking at generic low latency, high throughput systems can look at Disruptor by LMAX [0]. Spring's Reactor framework supports it out-of-the-box.

[0] https://news.ycombinator.com/item?id=3173993

sgdread · on April 6, 2016

Some CEPs even have query languages which look very like SQL (Esper [1]). But... it's different: 1) it's push model vs pull in database (you toss data in meatgrinder and see outcome on other side as stuff goes in vs going to data and picking what you need); 2) CEPs usually keep track of data only in specified windows (time/distance); you explicitly have to manage manual data windows; this means queries to join everything with everything between two data streams will result in memory explosion on large data sets; 3) CEPs trigger evaluation on message inserts/removals - this means you can't trigger re-evaluation at will without any stimulus event and you can't do joins with absence of something; you can't simply ask CEP to fire an event if no new events happened in last 5 minutes - that needs some query trickery.

[1] http://www.espertech.com/products/esper.php

haddr · on April 6, 2016

I'm watching closely Apache Flink recently and it seems to me that it's more universal that Apache Spark, and with less overhead. Today if somebody asked me for general purpose stream processor engine I would recommend Flink. Of course Spark has many advantages too, but unless you don't need explicitly need Spark's RDD, then go for Flink.

What I'm still missing is better support for machine learning libraries, but I suppose it's a matter of time.

jkot · on April 6, 2016

There is a quite good technical description of difference between Spark and Flink

http://blog.madhukaraphatak.com/

slagfart · on April 6, 2016

Here's the full link: http://blog.madhukaraphatak.com/introduction-to-flink-for-sp...

kod · on April 6, 2016

So you'd make recommendations based on "watching" something, as opposed to actually using it?

Good call.

x0rg · on April 6, 2016

For those interested in a comparison between Spark and Flink, a recent article from our blog here: https://tech.zalando.com/blog/apache-showdown-flink-vs.-spar...

nxzero · on April 6, 2016

Very possible it's obvious and I'm just not seeing it, but are there examples of (static) streams and related code the maybe processes to get a sense of the workflow, data, code, etc.?

stsffap · on April 6, 2016

As input you can basically use any Flink DataStream<T>. There are multiple ways to ingest this data into your Flink program. The most common way is to use a Flink connector for Kafka or RabbitMQ. Or you can also write your own Flink source function. This repository https://github.com/tillrohrmann/cep-monitoring contains a complete working example for the monitorig use case. Instead of reading from an external source, we've defined a custom source to generate the input data. The output data is written to stdout for demonstration purposes.

uceuceuce · on April 6, 2016

If you are referring to Flink in general, you could have a look at the Concepts documentation here: https://ci.apache.org/projects/flink/flink-docs-release-1.0/...

oh_sigh · on April 6, 2016

From my shallow perusal, this seems like it is strongly connected with reactive style programming? Is this a library to enable that style of programming, or is there something different here?

vardump · on April 6, 2016

What kind of throughput can I expect?

How many millions of events per second per contemporary Intel CPU core?

jtagx · on April 6, 2016

I took the code from the blog post, set the parallelism to one and measured how many events the data generator is pushing into the CEP pattern matcher of Flink.

My CPU: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz Thoughput on one core ~ 650k events / second. Code: https://github.com/rmetzger/cep-monitoring/tree/throughput

jamesblonde · on April 6, 2016

To install flink on aws or gce, check out karamel.io