I think I get the concept (and it's interesting); but I'm a bit baffled by the terminology:
> That led me to the idea of "spouts" and "bolts" – a spout produces brand new streams, and a bolt takes in streams as input and produces streams as output.
Spout I can grasp; why "bolt"? Why these two, together? I'd have really appreciated an extra sentence or two here as to why he chose these terms -- naming is so crucial, and the fact that both "spouts" and "bolts" sound like sources/producers to me (one of water, another of electricity...?) made it harder to grasp how Storm works.
I always assumed that "bolt" refers to the threaded fasteners, not electrical bolts. The analogy they are going for is that they are the components responsible for "bolting" spouts together (although, the mixing of metaphors bothers me a little bit... why are we using bolts to fasten fluids together?)
To be super clear about it, a "topology" in Storm is a directed acyclical graph. The nodes are called "bolts" and the edges are called "spouts".
> a "topology" in Storm is a directed acyclical graph. The nodes are called "bolts" and the edges are called "spouts".
Not really. Edges are streams and nodes are either spouts or bolts. Spouts have no predecessors and inject external data into the system. Bolts gather one or more input streams and produce zero or more output streams.
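To make the distinction concrete, here is a minimal toy model in Python. It is not Storm's API: the `Topology` class, its method names, and the word-count example are all made up for illustration. It only shows the shape described above: spouts have no inputs, bolts subscribe to upstream nodes, and a bolt may emit zero tuples.

```python
from collections import defaultdict


class Topology:
    """Toy model (hypothetical, not Storm's API): nodes are spouts or bolts, edges are streams."""

    def __init__(self):
        self.spouts = {}  # name -> callable returning an iterable of tuples
        self.bolts = {}   # name -> callable(tuple) returning a list of output tuples
        self.subscribers = defaultdict(list)  # upstream node name -> downstream bolt names

    def add_spout(self, name, source):
        # A spout has no predecessors; it injects external data.
        self.spouts[name] = source

    def add_bolt(self, name, process, inputs):
        # A bolt subscribes to one or more upstream nodes (spouts or bolts).
        self.bolts[name] = process
        for upstream in inputs:
            self.subscribers[upstream].append(name)

    def run(self):
        emitted = defaultdict(list)  # node name -> tuples that node emitted

        def emit(node, tup):
            emitted[node].append(tup)
            for bolt in self.subscribers[node]:
                for out in self.bolts[bolt](tup):
                    emit(bolt, out)

        for name, source in self.spouts.items():
            for tup in source():
                emit(name, tup)
        return emitted


def sentences():
    # Stand-in for an external source such as a queue.
    return ["the cow", "the dog"]


def split(sentence):
    # One input tuple, several output tuples.
    return sentence.split()


counts = defaultdict(int)


def count(word):
    # A terminal bolt: updates state and emits nothing.
    counts[word] += 1
    return []


topo = Topology()
topo.add_spout("sentences", sentences)
topo.add_bolt("split", split, inputs=["sentences"])
topo.add_bolt("count", count, inputs=["split"])
emitted = topo.run()
```

Here the edges ("sentences" to "split", "split" to "count") are the streams, and the only thing that makes `sentences` a spout is that nothing feeds it.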
These names are a bit confusing, but only during the very first steps of a project. Every developer quickly grasps the concepts.
I nevertheless have a concern regarding spouts and bolts. In practice, when a topology is refactored, bolts frequently have to be transformed into spouts. This occurs when a topology is split in two to ease deployment, and when a persistent queue is added along a stream. This highlights that the distinction between spouts and bolts is a bit artificial.
That was my assumption; but if you're going to name parts of your system after natural phenomena, you have to choose ones that make some sense, not just names out of a hat that are vaguely in the same domain.
"Spout" makes sense as something that's a source of a stream. I haven't been able to figure out the metaphor for "bolt" at all, particularly as something that processes water (stream/spout...) in some way.
Spout? Bolt? Firehose? No wonder confusion reigns within and especially outside programmers' worlds. I think I'll bring the campaign for plain English to meetings and ask programmers and marketing people to leave their Gibson paperbacks at the door. Everything doesn't have to sound like a new script for Keanu Reeves.
An aside: when Nathan started work on Storm, there was a competing streaming platform being worked on inside Yahoo, called "S4". It had many similarities to Storm, and some key differences.
It is still in the incubator: http://incubator.apache.org/s4/
(I have no connection with S4, just know some people who worked on it)
The key advantage Storm brought to the table was, as briefly mentioned in the article, the at-least-once processing guarantee Storm offers thanks to its efficient tracking of tuples and their descendants (see https://storm.incubator.apache.org/documentation/Guaranteein...). To my knowledge S4 offers no such guarantees. I think this made Storm more attractive for many use cases; it did for me.
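For anyone wondering how the tracking can be "efficient": as I understand the linked documentation, Storm keeps a single running XOR of tuple ids per root tuple, and the tree counts as fully processed when that value returns to zero. Below is a rough Python sketch of the idea. The `Acker` class and its method names are my own simplification, not Storm's actual code, and the small fixed ids stand in for the random 64-bit ids a real system would use.

```python
class Acker:
    """Simplified sketch (hypothetical names): one XOR checksum per tuple tree."""

    def __init__(self):
        self.pending = {}  # root id -> running XOR of ids emitted but not yet acked

    def track(self, root_id):
        # The spout emitted a new root tuple.
        self.pending[root_id] = root_id

    def ack(self, root_id, tuple_id, child_ids=()):
        """A bolt finished tuple_id and emitted child_ids anchored to it.

        Returns True once every tuple in the tree has been acked.
        """
        value = tuple_id
        for child in child_ids:
            value ^= child
        # Acked ids cancel themselves out; new children stay in the checksum.
        self.pending[root_id] ^= value
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True
        return False


acker = Acker()
acker.track(0xA1)
after_root = acker.ack(0xA1, 0xA1, child_ids=[0xB2, 0xC3])  # root done, two children in flight
after_first = acker.ack(0xA1, 0xB2)                         # first child done
after_second = acker.ack(0xA1, 0xC3)                        # second child done, tree complete
```

The state per root tuple is a single integer regardless of how large the tree gets. If the checksum does not reach zero before a timeout, the spout can replay the root tuple, which is where the "at least once" comes from.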
Slides 42 and 43 describe an architecture with all three ... hadoop, storm and kafka. Seems like a beast and I don't have applications but pretty cool combo of technologies.
http://manning.com/marz/ - this is often the trifecta used in the "lambda architecture", a term coined by Nathan Marz. This is also the system they built at BackType for their real-time analytics.
Complementary technologies, often used together. In fact, in Storm 0.9.x, Storm actually bundles a JVM implementation of a Storm Spout that plugs directly into Kafka:
In no way -- apples vs. oranges. Kafka is a queue, Storm is a processing engine. A better question is how it compares to other processing engines like Spark Streaming, Samza, etc. But just Google it :)
Nice read. At my previous job we used storm for all sorts of cool things. It worked great and it was a pleasure to be the one at the company that got to really cozy up with it. Deployment (with or without storm-deploy) was also well thought out.
As I understood it, Samza ties in with Hadoop/HBase, so if you already have that, it might be easier (or even possible at all) to integrate, whereas Storm has nothing to do with that and only does stream processing.