So, a newbie question: when do I need this? The only thing I understand is that Hadoop is for batch processing while Storm is for real-time big-data processing.
If you're a Python user, I gave a talk at PyData called "Beating the GIL with Python" that covers all the tooling for parallel computing in Python, from the simplest (e.g. multiprocessing) to the more complex (Storm/Spark), with many tools in between, and what the trade-offs are. That may help you out.
The rough summary is, if you need to do distributed multi-node computation with very low latency -- that is, end-to-end data processing time measured in millis or seconds -- then Storm will help you out. The "competitive" product in open source right now is pyspark + Spark Streaming.
The benefit of Storm is that it is an older (more mature) project with many production deployments. My team at Parse.ly wrote the open source Python + Storm integration libraries, pystorm and streamparse, which help Python programmers make use of Storm for large-scale stream processing: https://github.com/Parsely/streamparse. We also wrote pykafka, which is a Kafka client for Python: https://github.com/Parsely/pykafka -- these projects are related since Kafka is often the streaming data source used as the input to a Storm topology.
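To give a feel for the Python side of that pipeline, here is a minimal pykafka consumer sketch based on its quickstart; the broker address and topic name are placeholders, and in a real deployment a Storm spout would sit in this position rather than a bare loop:

    from pykafka import KafkaClient

    # Placeholder broker address and topic name.
    client = KafkaClient(hosts="127.0.0.1:9092")
    topic = client.topics[b"pageviews"]

    # Read the raw event stream. In a Storm deployment, a Kafka spout would
    # play this role and feed tuples into the topology's bolts.
    consumer = topic.get_simple_consumer()
    for message in consumer:
        if message is not None:
            print(message.offset, message.value)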
After the recent comparisons of real-time processing between Flink and Spark, isn't Flink the actual "competitive" product? In the sense that it does true per-event "real time" processing, rather than the micro-batching approach Spark Streaming takes (not dismissing it; it's what I actually use).
That's the gist. I am a Flink fan, but I get the impression it's still in the research-and-development phase, although it's rapidly approaching production-level quality.
That said, Storm does event-at-a-time stream processing, which is usually the focal point of most comparisons between Flink and Spark Streaming.
> The rough summary is, if you need to do distributed multi-node computation with very low latency -- that is, end-to-end data processing time measured in millis or seconds -- then Storm will help you out.
I still don't really know what kind of computation that is. Any specific, real-world examples? I'm genuinely curious.
Compression for live video streams maybe? You need to take a high quality input, continuously recompress it to several different qualities (for varying connection speeds and screen sizes) and then kick the data out to a CDN for viewers?
You would use this where you need to take in a lot of data from multiple sources, and create some kind of information out of that, with soft real-time constraints. So, for instance, web analytics. Or Natural Language Processing, if the communication was between humans and machines, and the responses needed to be real-time.
Think about getting a terabyte of data a day, or more.
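To make the web-analytics case concrete, here is a hedged sketch of a streamparse bolt that keeps a running pageview count per URL; the tuple layout and field names are assumptions for illustration, not anyone's production code:

    from collections import Counter

    from streamparse import Bolt


    class PageviewCountBolt(Bolt):
        """Keeps a running pageview count per URL (illustrative only)."""

        outputs = ["url", "count"]

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            # Assumes the upstream spout emits the URL as the first tuple value.
            url = tup.values[0]
            self.counts[url] += 1
            self.emit([url, self.counts[url]])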
Or you can take a different approach: look at a system that does not use Storm, and imagine at what scale that system would break. Consider the system that Matthias Nehlsen wrote about:
Using Redis as a universal bus is a fine architecture for small amounts of data, where "small amounts of data" means a gigabyte a day, or maybe a little bit more.
But what about 10 gigabytes of data per day?
What about 100 gigabytes of data per day?
Universal-bus architectures break down when each node sees too many messages that are not meant for it. There is a limit to how far you can go with an architecture where every node gets every message: if each message is only supposed to be read by certain nodes, there is an inherent inefficiency in delivering it to nodes that don't want it.
But there is a wonderful flexibility to this system. And I've created similar systems. Nevertheless, at a certain point, you need to give up that flexibility for the sake of scale.
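As a rough sketch of the pattern being described -- every consumer subscribes to one channel and discards most of what it receives -- assuming the redis-py client, with the channel name and "target" field made up for illustration:

    import json

    import redis  # assumes the redis-py client


    def handle(event):
        """Placeholder for whatever this particular node actually does."""
        print("processing", event)


    # Universal-bus pattern: every node subscribes to the same channel
    # and therefore sees every message, relevant or not.
    r = redis.Redis(host="localhost", port=6379)
    pubsub = r.pubsub()
    pubsub.subscribe("events")  # hypothetical channel name

    for raw in pubsub.listen():
        if raw["type"] != "message":
            continue
        event = json.loads(raw["data"])
        # Per-node filtering: each node throws away messages meant for other
        # nodes. Harmless at a gigabyte a day; this is the inefficiency that
        # grows with volume.
        if event.get("target") != "analytics":  # field name is illustrative
            continue
        handle(event)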
"What comes to mind immediately when regurgitating the requirements above is Storm and the Lambda Architecture. First I thought, great, such a search could be realized as a bolt in Storm. But then I realized, and please correct me if I’m wrong, that topologies are fixed once they are running. This limits the flexibility to add and tear down additional live searches. I am afraid that keeping a few stand-by bolts to assign to queries dynamically would not be flexible enough."
I would do (and have done) exactly what Matthias Nehlsen suggests: use something like Redis for as long as you can. But I would also do what Montalenti eventually did and adopt Storm to deal with a certain level of scale -- when you reach the point where a universal-bus architecture no longer works, sacrifice flexibility for performance and switch to Apache Storm.
If I recall correctly, Montalenti and Parse.ly experimented with MongoDB and Cassandra and a bunch of other technologies before they found that Kafka/Storm was what they needed. Maybe he's written about the whole journey somewhere; it would certainly be interesting to read why all the other systems failed and why moving to Kafka/Storm was the right choice for Parse.ly.
At what point is using Storm or any other alternative inevitable? I am always reminded of [1] in these cases. Have you found instances where an implementation of the core logic in C/C++ (maybe with thread parallelism) is fast enough?
It's a valid question -- when trying to handle large data volumes, it's always worth trying to scale "up" (a bigger single machine) and "in" (faster code) before scaling "out" (more machines).
In our case, we had pushed the limits of the largest EC2 box we could find. We have to keep up with tens of thousands of events per second, and do non-trivial amounts of CPU processing on each of them. Whether in Python, Java, C++, or assembler, it wasn't going to happen on a single box. We proved this to ourselves by getting the biggest box we could and rewriting the hotspots of our code in Cython. It still wasn't enough.
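For readers wondering what "scaling up and in" looks like in Python before going distributed, here is a minimal single-machine sketch using stdlib multiprocessing (not the Cython rewrite mentioned above; process_event is a stand-in for real per-event CPU work):

    from multiprocessing import Pool


    def process_event(event):
        # Stand-in for non-trivial per-event CPU work.
        return sum(hash((event, i)) % 1000 for i in range(10_000))


    if __name__ == "__main__":
        events = range(100_000)
        with Pool() as pool:  # one worker process per core by default
            results = pool.map(process_event, events, chunksize=1_000)
        print(len(results), "events processed on a single machine")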
> In our case, we had pushed the limits of the largest EC2 box we could find. We have to keep up with tens of thousands of events per second, and do non-trivial amounts of CPU processing on each of them.
I understand there are many reasons for sticking with AWS, but looking at reserved instances[1], I find an r3.8xlarge, which has 32 cores, 244GB of RAM, and 2 x 320 GB of SSD storage, for $2.66 per hour -- or just under $2K a month.
For just a little more (USD 2099/month, no minimum term, no setup), you can get something like:
Dell R930, 4x Intel Xeon E7-4820 v3 (that's 40 cores at a base frequency of 1.9 GHz), 384GB DDR4, 2x480GB SSD in hardware RAID 1 (although I suspect 6x120GB, possibly in RAID 0, might be better), with 1Gbps full-duplex and 10 TB of egress bandwidth included, from Leaseweb: https://www.leaseweb.com/dedicated-server/configure/22651
Granted, one might want 10Gbps -- and managing your own server isn't free, even when the hosting company takes care of the physical hardware (not that managing an EC2 instance is free either). But I'm curious whether you tested on dedicated hardware as well? 10,000 events at 4KB/event is "just" ~300 Mbps (call it 1 Gbps including overhead). I'm certainly not claiming it's easy to do any kind of processing at a sustained ~1 Gbps -- but I'm curious whether the kind of workload you're discussing could be handled by a single (relatively cheap) dedicated server?
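A quick back-of-the-envelope check of that figure, treating 4KB as kilobytes:

    # 10,000 events/s at 4 KB each, in megabits per second.
    events_per_second = 10_000
    bytes_per_event = 4 * 1024
    megabits_per_second = events_per_second * bytes_per_event * 8 / 1_000_000
    print(round(megabits_per_second), "Mbps")  # ~328 Mbps, i.e. roughly 300 Mbps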
We did run our own colo for about 18 months from 2012-2014. You are right: dedicated hardware is cheap. It saved us some money for a while. We used Dell machines not unlike the one you spec'd.
But even when we ran our own colo and hardware, we couldn't do all of our processing on a single node.
Also, in 2014 the business became critical enough that even if we could run everything on one node, we wouldn't want to risk it. So even when we had dedicated hardware, we still had to run Storm to spread the load onto multiple nodes.
The processing we do in real-time these days lights up over 300 EC2 vCores to 70%+ utilization during peak processing hours. A nightly job we run using pyspark lights up about 3x as many cores to near-100% utilization for several hours. And jobs we run during schema changes or backup restores can use more than 3,000 cores at once to churn through hundreds of terabytes of data.
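For a sense of shape only, here is a hypothetical pyspark sketch of a nightly rollup of that kind; the S3 paths and column name are made up and not our actual job:

    from pyspark.sql import SparkSession

    # Purely illustrative nightly batch rollup.
    spark = SparkSession.builder.appName("nightly-rollup").getOrCreate()

    events = spark.read.parquet("s3://example-bucket/events/")    # hypothetical path
    daily_counts = events.groupBy("url").count()                  # hypothetical column
    daily_counts.write.mode("overwrite").parquet("s3://example-bucket/rollups/")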
I know it's very easy to say "YAGNI" to tools like Storm/Spark, and in most cases, I am with you. But when you outgrow the alternatives, as much as I'd love to just "throw a single machine at it", it simply isn't an option. YAGNI doesn't apply when you've exhausted all alternatives!
Thank you for the detailed reply. I didn't mean to imply AWS/Spark was the wrong tool for your workload -- I was just curious about your comment on trying a big EC2 node without mentioning dedicated hardware.
Judging from comments I've seen online, many people appear to be unaware of the overhead AWS introduces. That doesn't mean one shouldn't use AWS, just that one should be aware of the trade-offs :-)
I know you get this, but when you're on AWS you don't evaluate a single piece of standalone hardware for virtually any reason. It's not even worth bothering to think about. People (again, probably not you) always have this misconception that AWS is just an expensive VPS. But anyone who's really invested in the platform (and I'm guessing they are, if they have this scale as a business problem) is using a dozen different services that play really well together in the AWS ecosystem. And that's what your engineers know, and your ops team knows. It's what finance knows. You'd need to be saving massive amounts of money to even bother considering it.
It doesn't seem like it can handle processing of real-time data streams. They don't really seem comparable.
From the readme:
"Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space, it is more comparable to Oozie or Azkaban."
Spark, Flink, and a few proprietary tools can operate in both batch and streaming modes, which means you can share your codebase between the two kinds of job if you do both. Before these tools, it used to be Hadoop for batch and Storm for streaming, but I suspect those days are ending now (which goes some way to explaining Spark's success), except when you require exceptional throughput, in which case do some research and benchmarks.
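As a hedged illustration of that code sharing, here is a pyspark (Structured Streaming) sketch where one transformation function is applied to both a batch and a streaming DataFrame; the paths, columns, and schema are invented for the example:

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("shared-logic").getOrCreate()

    schema = StructType([
        StructField("url", StringType()),
        StructField("ts", TimestampType()),
    ])


    def enrich(df: DataFrame) -> DataFrame:
        # The same business logic, whether df came from a batch or streaming read.
        return df.withColumn("hour", F.date_trunc("hour", F.col("ts")))


    batch = enrich(spark.read.schema(schema).json("s3://example-bucket/history/"))
    stream = enrich(spark.readStream.schema(schema).json("s3://example-bucket/incoming/"))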
Thanks for that link. However, they appear to conclude Flink and Storm are quite similar in performance.
If you really need low latencies, it is quite likely that none of these will work for you anyway and you would have to build a specialized CEP-style system.
Absolutely, I find Google's "The Dataflow Model" paper about this to be a good read: http://research.google.com/pubs/pub43864.html (it talks about generalizing streaming vs. batch vs. micro-batch frameworks so it becomes an easy cost-based decision).
Financial trading algorithms, where low latency is important, were the initial big market for stream processing/CEP. As another example, imagine you're trying to do real-time fraud detection based on click-stream analysis. If your latency is low enough, you can potentially prevent suspicious transactions from ever happening instead of having to allow them and then recover somehow later.
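A toy sketch of that kind of per-event check -- a sliding-window click counter in plain Python, with all thresholds and field names made up -- just to show why low end-to-end latency matters: the decision has to be ready before the transaction completes.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60          # made-up threshold
    MAX_CLICKS_PER_WINDOW = 50   # made-up threshold

    recent_clicks = defaultdict(deque)  # user_id -> timestamps of recent clicks


    def is_suspicious(event, now=None):
        """Return True if this user has clicked too often in the last minute."""
        now = time.time() if now is None else now
        clicks = recent_clicks[event["user_id"]]
        clicks.append(now)
        # Evict clicks that have fallen out of the sliding window.
        while clicks and clicks[0] < now - WINDOW_SECONDS:
            clicks.popleft()
        return len(clicks) > MAX_CLICKS_PER_WINDOW

    # In a low-latency topology, a bolt could call is_suspicious() per event and
    # block the transaction before it completes rather than clawing it back later.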
For example, Twitter used to use Storm to process analytics on tweet data. They now use Heron. [1] Storm is valuable when your data is constantly being generated as a stream, on which you would like to perform processing immediately. I've previously also used Storm's distributed RPC cluster, where I fire off a task that I would like processed in the background and wait for the response to arrive.
As with most tools, it depends on what you want to do with it. Storm is meant to handle streaming data. It's wonderful for low-latency big-data applications. You can check out the benchmarks for it on the Yahoo Developer Tumblr: https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...
Excellent work, guys! I had developed a custom caching solution for Storm v0.9.6. It's great that distributed caching is being supported now. I would love to see some more documentation and examples on it.
"Profiling a simple speed of light topology, shows that a good chunk of time of the SpoutOutputCollector.emit()
is spent in the clojure reduce() function.. which is part of the ACK-ing logic. Re-implementing this reduce() logic in java gives a big performance boost in both in the Spout.nextTuple() and Bolt.execute()"
Apache Storm 1.0.0 introduces batching in the internal Disruptor queue, which adds slightly to latency (but with the flush timeout set to 1 ms per queue, it doesn't hurt much) and gives a huge throughput improvement.