Apache Storm 1.0 released (apache.org)
160 points by wener on April 12, 2016 | 42 comments



A newbie question: when do I need this? The only thing I understand is that Hadoop is for batch processing while Storm is for real-time big-data processing.


If you're a Python user, I gave a talk at PyData called "Beating the GIL with Python" that covers all the tooling for parallel computing in Python, from the simplest (e.g. multiprocessing) to the more complex (Storm/Spark), with many tools in between, and what the trade-offs are. That may help you out.

https://www.youtube.com/watch?v=gVBLF0ohcrE
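
If it helps, the simplest end of that spectrum looks something like this -- a multiprocessing.Pool fanning CPU-bound work across local cores (a toy sketch; the score function is invented for illustration):

    from multiprocessing import Pool

    def score(record):
        # Stand-in for CPU-bound work on one record (hypothetical).
        return sum(ord(c) for c in record) % 100

    if __name__ == '__main__':
        records = ['alpha', 'beta', 'gamma', 'delta']
        with Pool(processes=4) as pool:
            # Spreads records across 4 worker processes,
            # sidestepping the GIL for CPU-bound work.
            print(pool.map(score, records))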

The rough summary is, if you need to do distributed multi-node computation with very low latency -- that is, end-to-end data processing time measured in millis or seconds -- then Storm will help you out. The "competitive" product in open source right now is pyspark + Spark Streaming.

The benefit of Storm is that it is an older (more mature) project with many production deployments. My team at Parse.ly wrote the open source Python + Storm integration libraries, pystorm and streamparse, which help Python programmers make use of Storm for large-scale stream processing: https://github.com/Parsely/streamparse. We also wrote pykafka, which is a Kafka client for Python: https://github.com/Parsely/pykafka -- these projects are related since Kafka is often the streaming data source used as the input to a Storm topology.
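
To give a flavor of those libraries, here's a minimal, untested sketch of a streamparse spout that reads from Kafka via pykafka (the broker address and topic name are placeholders):

    from pykafka import KafkaClient
    from streamparse import Spout

    class KafkaEventSpout(Spout):
        outputs = ['event']

        def initialize(self, conf, ctx):
            # Placeholder broker address and topic name.
            client = KafkaClient(hosts='127.0.0.1:9092')
            self.consumer = client.topics[b'events'].get_simple_consumer()

        def next_tuple(self):
            # Non-blocking read; Storm calls next_tuple() in a loop.
            msg = self.consumer.consume(block=False)
            if msg is not None:
                self.emit([msg.value.decode('utf-8')])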


After the recent comparisons of real-time processing between Flink and Spark, isn't Flink the actual "competitive" product? In the sense that it's real "real time", unlike the way Spark Streaming (not dismissing it, it's what I actually use) works.


That's an interesting question. Unfortunately I haven't had a chance to evaluate Flink in-depth, it being relatively new.


That's the gist. I am a Flink fan, but I get the impression it's still in the research and development phase...although it's rapidly approaching production-level quality.

That said, Storm is event-stream processing, which is usually the focal point of most comparisons between Flink and Spark Streaming.


> The rough summary is, if you need to do distributed multi-node computation with very low latency -- that is, end-to-end data processing time measured in millis or seconds -- then Storm will help you out.

I still don't really know what kind of computation that is. Any specific, real-world examples? I'm genuinely curious.


We use it for real-time web/content analytics. You can see our product tour @ http://parse.ly/tour.

Other examples of "common" streaming data with real-time use cases:

- Twitter firehose

- Network packet analysis

- Financial market data

- Device and sensor data


Compression for live video streams maybe? You need to take a high quality input, continuously recompress it to several different qualities (for varying connection speeds and screen sizes) and then kick the data out to a CDN for viewers?


You would use this where you need to take in a lot of data from multiple sources, and create some kind of information out of that, with soft real-time constraints. So, for instance, web analytics. Or Natural Language Processing, if the communication was between humans and machines, and the responses needed to be real-time.

Think about getting a terabyte of data a day, or more.

Or you can take a different approach, and look at a system that does not use Storm, and then imagine at what scale that system would break. Do consider the system that Matthias Nehlsen wrote about here:

http://matthiasnehlsen.com/blog/2014/11/07/Building-Systems-...

Using Redis as a universal bus is a fine architecture for small amounts of data, where "small amounts of data" means a gigabyte a day, or maybe a little bit more.

But what about 10 gigabytes of data per day?

What about 100 gigabytes of data per day?

Universal bus architectures break down when each node sees too many messages that are not meant for it. There is a limit to how far you can go with an architecture where every node gets every message: assuming each message is supposed to be read only by certain nodes, sending it to nodes that don't want it is pure overhead.
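
Concretely, the universal-bus pattern with redis-py pub/sub looks roughly like this (a sketch with invented channel and field names) -- note that every subscriber receives, deserializes, and filters every message:

    import json
    import redis

    r = redis.Redis()
    pubsub = r.pubsub()
    pubsub.subscribe('events')  # one shared channel for everything

    for raw in pubsub.listen():
        if raw['type'] != 'message':
            continue
        event = json.loads(raw['data'])
        # Every node pays the cost of receiving and inspecting
        # every message, even those meant for some other node.
        if event.get('kind') != 'pageview':
            continue
        # ... actual work on the small subset we care about ...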

But there is a wonderful flexibility to this kind of system, and I've built similar ones. Nevertheless, at a certain point, you have to give up that flexibility for the sake of scale.

Again, Matthias Nehlsen says this well:

http://matthiasnehlsen.com/blog/2014/10/30/Building-Systems-...

"What comes to mind immediately when regurgitating the requirements above is Storm and the Lambda Architecture. First I thought, great, such a search could be realized as a bolt in Storm. But then I realized, and please correct me if I’m wrong, that topologies are fixed once they are running. This limits the flexibility to add and tear down additional live searches. I am afraid that keeping a few stand-by bolts to assign to queries dynamically would not be flexible enough."

I would do (and have done) exactly what Matthias Nehlsen suggests: use something like Redis for as long as you can. But I would also do what Montalenti eventually did and adopt Storm to deal with a certain level of scale -- when you reach the point where a universal bus architecture no longer works, sacrifice flexibility for performance and switch to Apache Storm.

If I recall correctly, Montalenti and Parse.ly experimented with MongoDB, Cassandra, and a bunch of other technologies before they found that Kafka/Storm was what they needed. Maybe he's written about the whole journey somewhere. Certainly, it would be interesting to read why all the other systems failed, and why moving to Kafka/Storm was the right choice for Parse.ly.


At what point is using Storm or any other alternative inevitable? I am always reminded of [1] in these cases. Have you found instances where an implementation of the core logic in C/C++ (maybe with thread parallelism) is fast enough?

1: http://aadrake.com/command-line-tools-can-be-235x-faster-tha...


It's always a valid question -- when trying to handle large data volumes, it's worth trying to scale "up" (a bigger single machine) and "in" (faster code) before scaling "out" (more machines).

In our case, we had pushed the limits of the largest EC2 box we could find. We have to keep up with tens of thousands of events per second, and do non-trivial amounts of CPU processing on each of them. Whether Python, Java, C++, or assembler, it wasn't going to happen on a single box. We proved this to ourselves by getting the biggest box we could and rewriting the hotspots of our code in Cython. It still wasn't enough.


> In our case, we had pushed the limits of the largest EC2 box we could find. We have to keep up with tens of thousands of events per second, and do non-trivial amounts of CPU processing on each of them.

I understand there are many reasons for sticking with AWS, but looking at reserved instances[1], I find the r3.8xlarge, which has 32 cores, 244GB of RAM, and 2 x 320GB of SSD storage, for $2.66 per hour -- or just under $2K a month.

For just a little more (USD 2099/month, no minimum term, no setup), you can get something like:

Dell R930, 4x Intel Xeon E7-4820v3 (that's 40 cores at a base frequency of 1.9 GHz), 384GB DDR4, 2x480GB SSD HW RAID 1 (although I suspect 6x120GB, possibly in RAID 0, might be better) with 1Gbps full duplex and 10 TB of egress bandwidth included from Leaseweb: https://www.leaseweb.com/dedicated-server/configure/22651

Granted, one might want 10Gbps -- and managing your own server isn't free, even when the hosting company takes care of the physical hardware (not that managing an EC2 instance is free either). But I'm curious whether you tested on dedicated hardware as well? 10,000 events at 4KB/event is "just" ~330 Mbps (call it 1 Gbps including overhead). I'm certainly not claiming it's easy to do any kind of processing at a sustained ~1 Gbps -- but I'm curious whether the kind of workload you're discussing could be handled by a single (relatively cheap) dedicated server?
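
For reference, the back-of-the-envelope bandwidth math checks out (using the assumed figures of 10,000 events/s at 4KB each):

    events_per_sec = 10000
    bytes_per_event = 4 * 1024            # 4 KB
    mbps = events_per_sec * bytes_per_event * 8 / 1e6
    print(mbps)                           # ~327.7 Mbps; ~1 Gbps with protocol overhead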

[1] http://aws.amazon.com/ec2/pricing/#reserved-instances


We did run our own colo for about 18 months from 2012-2014. You are right: dedicated hardware is cheap. It saved us some money for a while. We used Dell servers not unlike the one you spec'ed.

But even when we ran our own colo and hardware, we couldn't do all of our processing on a single node.

Also, in 2014 the business became critical enough that even if we could run everything on one node, we wouldn't want to risk it. So even when we had dedicated hardware, we still had to run Storm to spread the load onto multiple nodes.

The processing we do in real-time these days lights up over 300 EC2 vCores to 70%+ utilization during peak processing hours. A nightly job we run using pyspark lights up about 3x as many cores to near-100% utilization for several hours. And jobs we run during schema changes or backup restores can use more than 3,000 cores at once to churn through hundreds of terabytes of data.

I know it's very easy to say "YAGNI" to tools like Storm/Spark, and in most cases, I am with you. But when you outgrow the alternatives, as much as I'd love to just "throw a single machine at it", it simply isn't an option. YAGNI doesn't apply when you've exhausted all alternatives!


Thank you for the detailed reply. I didn't mean to imply AWS/Spark was the wrong tool for your workload -- I was just curious about your comment on trying a big EC2 node without mentioning dedicated hardware.

Judging from comments I've seen online, many people appear to be unaware of the overhead AWS introduces. That doesn't mean one shouldn't use AWS, just that one should be aware of the tradeoffs :-)


I know you get this, but when you're on AWS you don't evaluate a single piece of standalone hardware for virtually any reason. It's not even worth bothering to think about. People (again, probably not you) always have this misconception that AWS is just an expensive VPS. But anyone who's really invested in the platform (and I'm guessing they are, if they have this scale as a business problem) is using a dozen different services that play really well together in the AWS ecosystem. And that's what your engineers know, and your ops team knows. It's what finance knows. You'd need to be saving massive amounts of money to even bother considering it.


What are your thoughts on Dask? I saw a talk on it at Strata, and it seemed fairly promising.


Very promising IMO. Lots of new stuff going on in their 'distributed' project:

- https://github.com/dask/distributed

- http://matthewrocklin.com/blog/work/2016/02/17/dask-distribu...
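
For context, the distributed scheduler exposes a futures-style API. A minimal sketch (assuming a local scheduler; the square function is just for illustration):

    from dask.distributed import Client

    client = Client()  # spins up a local scheduler and workers

    def square(x):
        return x ** 2

    # Futures-style API: submit work to the cluster, gather results.
    futures = client.map(square, range(10))
    print(client.gather(futures))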


I prefer Celery for Python. Far more lightweight, and intended for distributed task processing.

http://celery.readthedocs.org/
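
For comparison, a minimal Celery app looks like this (the broker URL and task are placeholders):

    from celery import Celery

    # Placeholder broker URL; RabbitMQ and Redis are common choices.
    app = Celery('tasks', broker='redis://localhost:6379/0')

    @app.task
    def resize_image(url):
        # Stand-in for background work, e.g. image processing.
        return 'resized:' + url

A worker runs via "celery -A tasks worker", and callers enqueue work with resize_image.delay(url) -- a distributed task queue rather than an always-on stream topology.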


Are those really competing though? Celery rocks for some background image processing. Not for multinode analytics of huge data sets.


Actually, that's not true. Have a look at the Airflow project by Airbnb.

https://github.com/airbnb/airflow


Doesn't seem like it can handle processing on real-time streams of data. They don't really seem comparable.

From the readme:

"Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space, it is more comparable to Oozie or Azkaban."


I was replying to a comment, not saying Airflow is capable of stream processing.

> Not for multinode analytics of huge data sets.

Airflow does this just fine with Celery.
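
To illustrate, a minimal DAG that fans work out over a Celery worker pool might look like this (a sketch assuming the CeleryExecutor is configured in airflow.cfg; names are invented):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('nightly_rollup', start_date=datetime(2016, 4, 1),
              schedule_interval='@daily')

    def aggregate(**context):
        # Stand-in for a batch analytics step; with the CeleryExecutor,
        # tasks like this are distributed across many worker machines.
        pass

    rollup = PythonOperator(task_id='aggregate', python_callable=aggregate,
                            provide_context=True, dag=dag)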


That doesn't really seem the same at all.


> when do I need this

Probably never.

Spark, Flink, and a few proprietary tools can operate in both batch and streaming modes, which means you can share your codebase between both tasks if you do them. Before these tools it used to be Hadoop for batch and Storm for streaming, but I suspect those days are dying now (which goes some way to explaining Spark's success), except if you require exceptional throughput, in which case do some research & benchmarks.


I disagree. If low latency is a priority, Storm has little competition. Check out the benchmarks done by Yahoo engineers.

https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...


Thanks for that link. However, they appear to conclude Flink and Storm are quite similar in performance.

If you really need low latencies, it is quite likely that none of these will work for you anyway and you would have to build a specialized CEP-style system.


Yes, this release is all about performance. Storm should thus now be far ahead.


Absolutely, I find Google's "The Dataflow Model" paper about this to be a good read: http://research.google.com/pubs/pub43864.html (it talks about generalizing streaming vs. batch vs. micro-batch frameworks so it becomes an easy cost-based decision).


Storm is a few years late. It has largely been eclipsed by the projects you mentioned and others.

In theory, Storm can be set up to run with much lower latency than the others, but it is rare that an app will need that kind of response time.


What's the range of latencies here? What types of projects really need the lowest latencies?


Financial trading algos, where low latency is important, were the initial big market for stream processing/CEP. As another example, imagine you're trying to do real-time fraud detection based on click-stream analysis. If your latency is low enough, you can potentially prevent suspicious transactions from ever happening instead of having to allow them and then recover them somehow later.


Storm is actually a few years earlier. It's the first real-time processing framework; it was there before the other projects showed up.


True, it showed up first. And then it stagnated for several years. Now it is behind the other alternatives.


I don't see why this was downvoted. The points may be debatable, but isn't that the point of a discussion?


The home page has a 'Why use Storm' section that you might find helpful: https://storm.apache.org/index.html

For example, Twitter used to use Storm to process analytics on tweet data. They now use Heron. [1] Storm is valuable when your data is constantly being generated as a stream, on which you would like to perform processing immediately. I've previously also used Storm's distributed RPC cluster, where I fire off a task that I would like processed in the background and wait for the response to arrive.

[1] https://blog.twitter.com/2015/flying-faster-with-twitter-her...


As with most tools, it depends on what you want to do with it. Storm is meant to handle streaming data and is wonderful for low-latency big-data applications. You can check out the benchmarks for it on the Yahoo Engineering Tumblr: https://yahooeng.tumblr.com/post/135321837876/benchmarking-s...


Excellent work, guys! I had developed a custom caching solution for Storm in v0.9.6. It's great that distributed caching is supported now. I would love to see some more documentation and examples on it.


Wow. Good work Storm people! Some nice features here.

Anyone privy to what changes they made to yield the performance improvements?


From the issue log [1]:

"Profiling a simple speed of light topology, shows that a good chunk of time of the SpoutOutputCollector.emit() is spent in the clojure reduce() function.. which is part of the ACK-ing logic. Re-implementing this reduce() logic in java gives a big performance boost in both in the Spout.nextTuple() and Bolt.execute()"

[1]: https://issues.apache.org/jira/browse/STORM-1539


Apache Storm 1.0.0 introduces batching in the internal Disruptor queue, which adds slightly to latency (but with the timeout set to 1ms per queue, it doesn't hurt) and gives a huge throughput improvement.

https://github.com/apache/storm/pull/765


More is coming on the roadmap. We are also working on performance improvements around internal queuing.


Spark || Storm?

Your take, guys?



