"We will continuously keep an eye on different messaging and streaming systems in the ecosystem and ensure that our team is making the right decision for our customers and Twitter, even if it’s a hard decision"
Rather than setting industry direction, they seem inclined to change the underlying platform every few years, which brings a lot of challenges. Maybe it creates work for the platform teams, but it's not useful to do this often.
I'm curious whether there are some conflicting underlying motives among their senior-level technology folks.
They had a game-changing tool in Apache Storm... once Nathan Marz (creator of Storm) left Twitter, another senior engineer took over that group and created his OWN variant of Twitter Storm called Heron. After much engineering PR and GitHub fanfare [0], a couple of years later he leaves to start his own company to do 'enterprise Heron' [1].
So is the goal here to use something that works well for their end-users? Or is it to create a shiny new tool that they can spin off into a new company?
The joke that I've heard is "performance (review) driven development".
If the way to get promoted is to launch a shiny new product, then your most senior people will be the best at finding shiny new products to launch, even if that's not the right technical decision to make.
"Oh but our senior engineers would _never_ make such a selfish choice to make themselves look better at the expense of the company, everyone here _believes_ in the mission."
Speaking of underlying motives... you strongly imply that Heron was not a significant improvement on Storm and doesn’t “work well for end users.” Without evidence. Why?
Re: motives,
“Heron grew out of real concerns Twitter was having with its large deployment of Storm topologies. These included difficulties with profiling and reasoning about Storm workers when scaled at the data level and at a topology level, the static nature of resource allocation in comparison to a system that runs on Mesos or YARN, lack of back-pressure support, and more.”
“Although Twitter could have adopted Apache Spark or Apache Flink, that would have involved rewriting all of Twitter's existing code.”[1]
I just question the intentions when they do a huge PR campaign for an internal project...then the leader of that project leaves to start his own company shortly after.
And no...more PR about heron is not going to convince me that 'it's a huge improvement over Kafka'. Kafka has massive adoption, and almost no Kafka users I've come across have expressed a burning desire for Heron or Pulsar.
I’m interested in pulsar but also wary of it. On paper it sounds like an amazing system, the benefits of kafka without the downsides. But if it is that good, why isn’t it used more?
I really like Pulsar, and it definitely overlaps with Kafka for some use cases, but it's important to understand that they are two different kinds of systems.
Pulsar is a classic Pub/Sub system. It offers persistence but the basic concepts are what you'd find in most any Pub/Sub system.
Kafka is a distributed log (data structure, not tracing) system. Its basic concepts are what you would expect from something that persists events. In many ways it's better to think of Kafka as an event datastore than as a messaging system.
In particular, Kafka is much better in use cases where you want to provide all events to many different consumers on their own time frames, over and over again. Use cases like building distributed caches and replication formats will likely fit Kafka better than Pulsar.
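To make the distributed-log model concrete, here is a minimal in-memory sketch (not real Kafka client code; the class names are illustrative): a partition is an append-only list, and each consumer tracks its own offset, so independent consumers can replay the same events at their own pace.

```python
class LogPartition:
    """Toy append-only log, standing in for one Kafka partition."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record


class Consumer:
    """Each consumer owns its position; the log itself tracks nothing."""
    def __init__(self, partition, offset=0):
        self.partition = partition
        self.offset = offset

    def poll(self, max_records=10):
        batch = self.partition.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewinding is cheap: the log keeps every record until retention.
        self.offset = offset


log = LogPartition()
for event in ["created", "updated", "deleted"]:
    log.append(event)

cache_builder = Consumer(log)          # reads everything from the start
replicator = Consumer(log, offset=2)   # a second consumer, further along

assert cache_builder.poll() == ["created", "updated", "deleted"]
assert replicator.poll() == ["deleted"]
cache_builder.seek(0)                  # re-read the whole log for a rebuild
assert cache_builder.poll() == ["created", "updated", "deleted"]
```

The point is that re-reading costs the log nothing extra, which is why cache-building and replication use cases fit this model so well.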
Also, Kafka has much higher industry adoption. You are much more likely to have third-party support for Kafka than for Pulsar. And each has a separate Apache project dependency: Zookeeper has... warts, but they are well known, and operating Zookeeper clusters is a much more common skillset than operating BookKeeper.
Again, I really like Pulsar but claiming it is better in basically every way is just false.
I don't see the differences that you list. Both systems produce and subscribe to named topics, with multiple consumers which can read and re-read at their own pace, with data sharded by key to scale, and configurable retention.
Pub/sub is the interface to the underlying log storage in both, and you can layer an event sourcing, service bus, CEP, MQ, or any other messaging buzzword on top.
However, Pulsar goes further than Kafka. It supports millions of topics, multi-tenant namespacing, more consumer options (exclusive, shared/group), per-message acknowledgements instead of a single offset, non-persistent topics for broadcast or ephemeral messaging, geo-replication, tiering to cloud storage (useful for that event store), and a functions platform for lightweight processing. Pulsar also uses Zookeeper for coordination and has official Kubernetes deployments for easy operations.
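The per-message acknowledgement point is worth unpacking. Here is a toy contrast (not the real client APIs of either system) between a Kafka-style single committed offset and Pulsar-style individual acks:

```python
class OffsetTracker:
    """Kafka-style: one committed offset; everything before it is done."""
    def __init__(self):
        self.committed = 0

    def commit(self, offset):
        self.committed = offset

    def unacked(self, log_len):
        return list(range(self.committed, log_len))


class AckTracker:
    """Pulsar-style: acknowledge individual message ids, in any order."""
    def __init__(self):
        self.acked = set()

    def ack(self, msg_id):
        self.acked.add(msg_id)

    def unacked(self, log_len):
        return [i for i in range(log_len) if i not in self.acked]


# Five messages 0..4; message 2 is slow to process.
offsets = OffsetTracker()
offsets.commit(2)              # can only commit up to the stuck message
assert offsets.unacked(5) == [2, 3, 4]   # 3 and 4 get re-delivered too

acks = AckTracker()
for i in (0, 1, 3, 4):         # ack everything except the stuck message
    acks.ack(i)
assert acks.unacked(5) == [2]            # only message 2 is outstanding
```

With a single offset, one slow message holds back the commit of everything behind it; with per-message acks, only the slow message is redelivered.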
You're right that Kafka has much wider industry usage, and perhaps that is a significant advantage for most, but the constant announcement of yet another suite of helper tools by every major company only seems to show that it's not quite polished. Other than community and integrations marketshare, I cannot think of any functional area where Kafka is better than Pulsar. Can you name any?
Actually, I hadn't used Pulsar since they added the reader interface or Kafka wrappers. It likely now does support the event store use cases it didn't previously. Very cool.
The additional requirement of BookKeeper is then the only major technical negative I see, leaving market share as the major remaining one. I think calling Kafka's abundance of helper tools a negative is pretty unfair, though. Pulsar has such lower adoption that there just isn't the rationale for creating the helper tools. That shouldn't be used as a metric to decide which is more polished.
Most of the tools seem to do the same thing while they would be unneeded for Pulsar so I'm going by that. But yes, Kafka has a much bigger community which is probably what most users want.
Pulsar Functions appears to be a stream processing engine similar to Flink or Kafka Streams. In other words, it is more like a runtime for streaming apps.
The relation Pulsar ~ Pulsar Functions appears similar to Kafka ~ Kafka Streams. One is a message broker, the other a streaming engine.
Exactly-once (or effectively-once) in Pulsar Functions depends on assigning unique sequence ids on the producer side. When consuming from a Pulsar topic those can be taken from the consumed message, but AFAIK Pulsar doesn't assign those unique ids on its own, so it needs to be done externally.
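For what it's worth, the deduplication side of that scheme is simple to sketch. This is a toy model of broker-side dedup on producer sequence ids, as described above (the class is illustrative, not part of the Pulsar API):

```python
class DedupTopic:
    """Toy topic that drops messages whose sequence id was already seen."""
    def __init__(self):
        self.messages = []
        self.last_seq = {}  # producer name -> highest sequence id seen

    def publish(self, producer, seq_id, payload):
        if seq_id <= self.last_seq.get(producer, -1):
            return False  # duplicate (e.g. a retry after a timeout): drop
        self.last_seq[producer] = seq_id
        self.messages.append(payload)
        return True


topic = DedupTopic()
assert topic.publish("p1", 0, "a")
assert topic.publish("p1", 1, "b")
assert not topic.publish("p1", 1, "b")   # retried send is deduplicated
assert topic.messages == ["a", "b"]
```

The hard part the comment points at is exactly the ids themselves: if the application can't derive a stable sequence id for each logical message, retries can't be distinguished from new messages, and "effectively once" degrades to "at least once".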
That's true. KSQL is another thing that Kafka wins at, which makes a pretty smooth experience for straightforward stream processing and general management over the command line tools.
Pulsar is working on Presto integration for querying which should be interesting.
> Pulsar is a classic Pub/Sub system. ...
> Kafka is a distributed log
This is true but I’ll wager that 99% of real world deployments of Kafka are in the simple pubsub use case. Most people just aren’t operating at LinkedIn scale or complexity.
Or maybe, just maybe, engineering is keeping tech debt in check and replacing pieces that are working, but not working well, with better pieces. Also, unless we are working at a similar scale, I don't think we are qualified to comment on their engineering challenges.
Engineering is keeping your knives sharp and your tools clean.
One of the oft unspoken things in software is that we let the talent do things that are prima facie suboptimal for the organization in order to keep them here and keep them vibrant.
Some minor boondoggles I used to think harshly about make more sense in hindsight when I think about my own mental processes when I leave a place or even when I’m just feeling demotivated.
I love to solve a problem and be done with it. Probably more than most. But... you have to keep moving, and it’s probably even true that there can be no sacred cows.
Twitter (the company) always comes across as constantly bailing out their leaky ship, and then running around plugging holes with bits of scrap that 'might' just fit. There never seems to be any sort of proper roadmap that you'd expect for a tech company with such a large user-base.
From the article: "For single consumer use cases, we saw a 68% resource savings, and for fanout cases with multiple consumers we saw a 75% resource savings."
At twitter's server spend it makes perfect sense to make such an investment.
Most of the deployments of Kafka I've seen have been desperate attempts to work around poorly designed data architecture. You shouldn't need Kafka to handle a few hundred MB or a few GB of data per day. Twitter's scale is way beyond that, though and I can easily see the case for it there.
> You shouldn't need Kafka to handle a few hundred MB or a few GB of data per day.
I wouldn't link the two problems (size, bad architecture). Kafka is often used to bail out poorly designed data architectures, but it doesn't matter whether the volume is small or large at that point: it's still a pile of crap.
Even for smaller cases it's a nice way to decouple producers from multiple consumers. I wouldn't suggest building Kafka infrastructure if you don't also have at least one bigger use-case, but if it's there it's convenient. Especially if you need replay or the ability to consume the same messages for days afterwards.
The case for Kafka I've seen the most is "decoupling". I.e. two pieces of a system, both of which are often owned by one team, are talking to each other over Kafka because this way they are decoupled, which keeps the architecture clean. So, to make the system simpler, we make it more complicated... I was never sold on this.
Interesting. By Kafka here, do you mean Kafka specifically, or any sort of event queue (RabbitMQ/NSQ too)? And what would you suggest instead for an event processing pipeline?
I submitted a question in the previous discussion [0] but didn't get an answer from the author. Hopefully someone will answer here:
How did you replace the producer fencing mechanism that was built into BookKeeper when you moved to Kafka (assume I need to allow only one client to write to a topic)?
Sorry, I didn't fully understand your question. I suggest you ask it in the BookKeeper community Slack [link].
[link]http://bookkeeper.apache.org/community/slack/
Sorry for my bad English. Here is a better explanation:
In BookKeeper, a producer can fence the currently active producer from appending to the log. If a majority of BookKeeper nodes respond OK to a FENCE request, the current producer can no longer append to the log. This makes implementing master-slave database replication quite trivial: the master keeps appending to the log, and slaves keep reading from it. If the master fails, one of the slaves sends a FENCE request to the BookKeeper nodes, which reliably blocks the malfunctioning master from appending to the db log.
However in Kafka, this isn't a native operation. Any client can push to a topic. There is no easy way to safely do this.
I was looking forward to see how Twitter team actually handled this specific use case.
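The majority-quorum fencing idea described above can be sketched in a few lines. This is a toy model, not the actual BookKeeper protocol: each storage node remembers the highest writer epoch it has accepted, and a new writer takes over by getting a majority to bump the epoch, after which the old writer's appends are rejected.

```python
class Node:
    """Toy storage node that tracks the highest writer epoch it accepted."""
    def __init__(self):
        self.epoch = 0

    def fence(self, new_epoch):
        if new_epoch > self.epoch:
            self.epoch = new_epoch
            return True
        return False

    def append(self, epoch, entry, log):
        if epoch < self.epoch:
            return False  # this writer has been fenced off
        log.append(entry)
        return True


def fence_cluster(nodes, new_epoch):
    acks = sum(node.fence(new_epoch) for node in nodes)
    return acks > len(nodes) // 2  # a majority must accept the new epoch


nodes = [Node(), Node(), Node()]
log = []

# The old master (epoch 1) writes successfully at first.
assert fence_cluster(nodes, 1)
assert nodes[0].append(1, "entry-1", log)

# A slave takes over at epoch 2 and fences the old master out.
assert fence_cluster(nodes, 2)
assert not nodes[0].append(1, "entry-2", log)  # old master is blocked
assert log == ["entry-1"]
```

Because the old master can't reach a majority that still honors its epoch, it can't sneak writes past the quorum even if it's still alive, which is exactly the split-brain protection the question is about.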
We use NATS (and STAN) for real-time IoT metrics and reporting with a high cardinality of subscriptions. The throughput from a single NATS box is very nice (we currently send 100K+ messages/sec using a single EC2 instance running NATS).
Both Pulsar and LogDevice solve the problem of far-behind consumers by segmenting the log and storing each segment on a separate machine, which reduces disk seeks and page-cache pollution (when c1 reads the head of the log and c2 reads the tail).
So technically they are better than Kafka. But as the article mentions, Twitter just used SSDs to get around this issue in Kafka. Judging by the article, the only valid reason to switch to Kafka seems to be Kafka Streams.
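To illustrate the segmentation idea in the parent comment, here's a toy sketch (the segment size, node names, and round-robin placement are made up for the example): the log is split into fixed-size segments stored on different nodes, so a tailing reader and a far-behind reader hit different machines and different page caches.

```python
SEGMENT_SIZE = 100
NODES = ["node-a", "node-b", "node-c"]

def node_for_offset(offset):
    """Map a log offset to the node holding its segment (round-robin)."""
    segment = offset // SEGMENT_SIZE
    return NODES[segment % len(NODES)]

# c2 replays old data near the head while c1 tails recent data:
# with the log segmented, their reads land on different machines.
assert node_for_offset(5) == "node-a"     # catch-up reader
assert node_for_offset(250) == "node-c"   # tailing reader
```

In a single-machine-per-partition design like Kafka's, both readers would contend for the same disk and page cache instead.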
Pulsar has a much better storage system that scales independently so latency stays low regardless of consumer offset. It can also now tier to S3/cloud storage natively and it has many other features like supporting millions of topics and per-message acknowledgements.
If they're looking for a mature solution, Kafka is more along those lines. Plus, beyond pub/sub, Kafka is well supported as a streaming data source in the Hadoop-esque world.
It all depends. For example with Apache Pulsar, tailing readers are served from an in-memory cache in the serving layer (the Pulsar brokers) and only catch-up readers end up having to be served from the storage layer (Apache BookKeeper). This is a little different from DistributedLog which always required going to BookKeeper for reads.
Apache BookKeeper can add additional latency to catch-up readers, on top of the extra hop, because the data of multiple topics are combined into each ledger. This means that we lose some performance from sequential reads. This is mitigated in BookKeeper by writing to disk in batches and sorting a ledger by topic so messages of the same topic are found together, but it still involves more jumping around on disk.
Also, BookKeeper allows the nice separation of disk IO. The read and write path are separate and can be served by different disks so you can scale your reads and writes separately to a certain extent.
For all those reasons, I would have loved to have seen Twitter look at Apache Pulsar and compare performance profiles with Apache Kafka.
There seems to be a fundamental technical misunderstanding of these systems in the blog post. Unless you're running in the (generally quite dangerous) scenario where replication factor = 1, both EventBus and Kafka have a network hop (Kafka to the replica and EventBus to the storage layer) and JVM traversal, so the argument that Kafka is more efficient there doesn't make sense.
As you say, it would be trivial to run EventBus storage and serving on the same nodes if the network hop were an issue.
The argument that a decoupled storage layer costs more than co-locating storage and serving also seems dubious. The cluster is basically doing the same amount of work when you decouple those layers, it's just separating them into components that can be optimized for specific roles. So if you pick the right hardware/instances for each layer and tune for that, you really shouldn't need more hardware. Otherwise we'd all still be running full-stack applications on mainframes to minimize the number of servers. :)
Yes, agreed. It seemed odd because Kafka isn't split into layers so there's no other option, and it confuses resource utilization for number of servers.
Slightly off topic, but I feel that when naming one's technical creation it's best to avoid using the last names of famous people especially when that person has become so famous it's become standard to refer to him/her simply by last name. Just google "Kafka" and check out the first page of search results and you'll see what I mean. (No, googling "Franz Kafka" isn't the same)
Results about Apache Kafka are of interest only to a subgroup of software engineers. Results about Franz Kafka are of interest to a much larger group of people (Frankly, there is simply no comparison). It's the considerate thing to do.
And it's not simply a matter of the ordinary person having to type an extra word. There exist tons of valuable web resources (literary analysis, criticism, journal articles) out there about the novelist that does not contain "Franz" anywhere.
Finally, it's not like there's an absolute need for any piece of software to be given any particular name.
I agree completely. Similarly, I have no idea why the creators of GIMP and CockroachDB (Spencer Kimball and Peter Mattis, and Ben Darnell for the latter) thought those names were the best possible options.
CockroachDB at least refers to their marketing line of being "almost impossible to kill/take down" and the way it scales to many servers. There are a few less gross insects they could have used, though, like bees or ants.