"We will continuously keep an eye on different messaging and streaming systems in the ecosystem and ensure that our team is making the right decision for our customers and Twitter, even if it’s a hard decision"
Rather than setting industry direction, they seem inclined to change the underlying platform every few years, which brings a lot of challenges. Maybe it creates work for the platform teams, but it's not useful to do this often.
I'm curious whether there are some conflicting underlying motives among their senior-level technology folks.
They had a game-changing tool in Apache Storm... once Nathan Marz (creator of Storm) left Twitter, another senior engineer took over that group and created his OWN variant of Twitter Storm called Heron. After much engineering PR and GitHub fanfare [0], a couple of years later he leaves to start his own company to do 'enterprise Heron' [1].
So is the goal here to use something that works well for their end-users? Or is it to create a shiny new tool that they can spin off into a new company?
The joke that I've heard is "performance (review) driven development".
If the way to get promoted is to launch a shiny new product, then your most senior people will be the best at finding shiny new products to launch, even if that's not the right technical decision to make.
"Oh but our senior engineers would _never_ make such a selfish choice to make themselves look better at the expense of the company, everyone here _believes_ in the mission."
Speaking of underlying motives... you strongly imply that Heron was not a significant improvement on Storm and doesn’t “work well for end users.” Without evidence. Why?
Re: motives,
“Heron grew out of real concerns Twitter was having with its large deployment of Storm topologies. These included difficulties with profiling and reasoning about Storm workers when scaled at the data level and at a topology level, the static nature of resource allocation in comparison to a system that runs on Mesos or YARN, lack of back-pressure support, and more.”
“Although Twitter could have adopted Apache Spark or Apache Flink, that would have involved rewriting all of Twitter's existing code.”[1]
I just question the intentions when they do a huge PR campaign for an internal project...then the leader of that project leaves to start his own company shortly after.
And no...more PR about heron is not going to convince me that 'it's a huge improvement over Kafka'. Kafka has massive adoption, and almost no Kafka users I've come across have expressed a burning desire for Heron or Pulsar.
I’m interested in pulsar but also wary of it. On paper it sounds like an amazing system, the benefits of kafka without the downsides. But if it is that good, why isn’t it used more?
I really like Pulsar, and it definitely overlaps with Kafka for some use cases, but it's important to understand that they are two different kinds of systems.
Pulsar is a classic Pub/Sub system. It offers persistence but the basic concepts are what you'd find in most any Pub/Sub system.
Kafka is a distributed log (data structure, not tracing) system. Its basic concepts are what you would expect from something that persists events. In many ways it's better to think of Kafka as an event datastore than as a messaging system.
In particular, Kafka is much better in use cases where you want to provide all events to many different consumers on their own time frames, over and over again. Use cases like building distributed caches and replication formats will likely fit Kafka better than Pulsar.
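To make the distributed-log model concrete, here is a minimal in-memory sketch (not real Kafka client code; the class names are illustrative): a partition is an append-only list, and each consumer tracks its own offset, so independent consumers can replay the same events at their own pace.

```python
class LogPartition:
    """Toy append-only log, standing in for one Kafka partition."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record


class Consumer:
    """Each consumer owns its position; the log itself tracks nothing."""
    def __init__(self, partition, offset=0):
        self.partition = partition
        self.offset = offset

    def poll(self, max_records=10):
        batch = self.partition.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewinding is cheap: the log keeps every record until retention.
        self.offset = offset


log = LogPartition()
for event in ["created", "updated", "deleted"]:
    log.append(event)

cache_builder = Consumer(log)          # reads everything from the start
replicator = Consumer(log, offset=2)   # a second consumer, further along

assert cache_builder.poll() == ["created", "updated", "deleted"]
assert replicator.poll() == ["deleted"]
cache_builder.seek(0)                  # re-read the whole log for a rebuild
assert cache_builder.poll() == ["created", "updated", "deleted"]
```

The point is that re-reading costs the log nothing extra, which is why cache-building and replication use cases fit this model so well.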
Also, Kafka has much higher industry adoption. You are much more likely to have third-party support for Kafka than for Pulsar. And each has a separate Apache project dependency: Zookeeper has... warts, but they are well known, and operating Zookeeper clusters is a much more common skillset than operating BookKeeper.
Again, I really like Pulsar but claiming it is better in basically every way is just false.
I don't see the differences that you list. Both systems produce and subscribe to named topics, with multiple consumers which can read and re-read at their own pace, with data sharded by key to scale, and configurable retention.
Pub/sub is the interface to the underlying log storage in both, and you can layer an event sourcing, service bus, CEP, MQ, or any other messaging buzzword on top.
However, Pulsar goes further than Kafka. It supports millions of topics, multi-tenant namespacing, more consumer options (exclusive, shared/group), per-message acknowledgements instead of a single offset, non-persistent topics for broadcast or ephemeral messaging, geo-replication, tiering to cloud storage (useful for that event store), and a functions platform for lightweight processing. Pulsar also uses Zookeeper for coordination and has official Kubernetes deployments for easy operations.
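The per-message acknowledgement point is worth unpacking. Here is a toy contrast (not the real client APIs of either system) between a Kafka-style single committed offset and Pulsar-style individual acks:

```python
class OffsetTracker:
    """Kafka-style: one committed offset; everything before it is done."""
    def __init__(self):
        self.committed = 0

    def commit(self, offset):
        self.committed = offset

    def unacked(self, log_len):
        return list(range(self.committed, log_len))


class AckTracker:
    """Pulsar-style: acknowledge individual message ids, in any order."""
    def __init__(self):
        self.acked = set()

    def ack(self, msg_id):
        self.acked.add(msg_id)

    def unacked(self, log_len):
        return [i for i in range(log_len) if i not in self.acked]


# Five messages 0..4; message 2 is slow to process.
offsets = OffsetTracker()
offsets.commit(2)              # can only commit up to the stuck message
assert offsets.unacked(5) == [2, 3, 4]   # 3 and 4 get re-delivered too

acks = AckTracker()
for i in (0, 1, 3, 4):         # ack everything except the stuck message
    acks.ack(i)
assert acks.unacked(5) == [2]            # only message 2 is outstanding
```

With a single offset, one slow message holds back the commit of everything behind it; with per-message acks, only the slow message is redelivered.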
You're right that Kafka has much wider industry usage, and perhaps that is a significant advantage for most, but the constant announcement of yet another suite of helper tools by every major company only seems to show that it's not quite polished. Other than community and integrations marketshare, I cannot think of any functional area where Kafka is better than Pulsar. Can you name any?
Actually, I hadn't used Pulsar since they added the reader interface or Kafka wrappers. It likely now does support the event store use cases it didn't previously. Very cool.
The additional requirement of BookKeeper is then the only major technical negative I see, leaving market share as the major remaining one. I think calling Kafka's abundance of helper tools a negative is pretty unfair, though. Pulsar has such lower adoption that there just isn't the rationale for creating the helper tools. That shouldn't be used as a metric to decide which is more polished.
Most of the tools seem to do the same thing while they would be unneeded for Pulsar so I'm going by that. But yes, Kafka has a much bigger community which is probably what most users want.
Pulsar Functions appears to be a stream processing engine similar to Flink or Kafka Streams. In other words, it is more like a runtime for streaming apps.
The relation Pulsar ~ Pulsar Functions appears similar to Kafka ~ Kafka Streams. One is a message broker, the other a streaming engine.
Exactly-once (or effectively-once) in Pulsar Functions depends on assigning unique sequence ids on the producer side. When consuming from a Pulsar topic those can be taken from the consumed message, but AFAIK Pulsar doesn't assign those unique ids on its own, so it needs to be done externally.
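For what it's worth, the deduplication side of that scheme is simple to sketch. This is a toy model of broker-side dedup on producer sequence ids, as described above (the class is illustrative, not part of the Pulsar API):

```python
class DedupTopic:
    """Toy topic that drops messages whose sequence id was already seen."""
    def __init__(self):
        self.messages = []
        self.last_seq = {}  # producer name -> highest sequence id seen

    def publish(self, producer, seq_id, payload):
        if seq_id <= self.last_seq.get(producer, -1):
            return False  # duplicate (e.g. a retry after a timeout): drop
        self.last_seq[producer] = seq_id
        self.messages.append(payload)
        return True


topic = DedupTopic()
assert topic.publish("p1", 0, "a")
assert topic.publish("p1", 1, "b")
assert not topic.publish("p1", 1, "b")   # retried send is deduplicated
assert topic.messages == ["a", "b"]
```

The hard part the comment points at is exactly the ids themselves: if the application can't derive a stable sequence id for each logical message, retries can't be distinguished from new messages, and "effectively once" degrades to "at least once".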
That's true. KSQL is another thing that Kafka wins at, which makes a pretty smooth experience for straightforward stream processing and general management over the command line tools.
Pulsar is working on Presto integration for querying which should be interesting.
> Pulsar is a classic Pub/Sub system. ...
> Kafka is a distributed log
This is true but I’ll wager that 99% of real world deployments of Kafka are in the simple pubsub use case. Most people just aren’t operating at LinkedIn scale or complexity.
Or maybe, just maybe, engineering is keeping tech debt in check and replacing pieces that are working, but not working well, with better pieces. Also, unless we are working at a similar scale, I don't think we are qualified to comment on their engineering challenges.
Engineering is keeping your knives sharp and your tools clean.
One of the oft unspoken things in software is that we let the talent do things that are prima facie suboptimal for the organization in order to keep them here and keep them vibrant.
Some minor boondoggles I used to think harshly about make more sense in hindsight when I think about my own mental processes when I leave a place or even when I’m just feeling demotivated.
I love to solve a problem and be done with it. Probably more than most. But... you have to keep moving, and it’s probably even true that there can be no sacred cows.
Twitter (the company) always comes across as constantly bailing out their leaky ship, and then running around plugging holes with bits of scrap that 'might' just fit. There never seems to be any sort of proper roadmap that you'd expect for a tech company with such a large user-base.
From the article: "For single consumer use cases, we saw a 68% resource savings, and for fanout cases with multiple consumers we saw a 75% resource savings."
At twitter's server spend it makes perfect sense to make such an investment.
Most of the deployments of Kafka I've seen have been desperate attempts to work around poorly designed data architecture. You shouldn't need Kafka to handle a few hundred MB or a few GB of data per day. Twitter's scale is way beyond that, though and I can easily see the case for it there.
> You shouldn't need Kafka to handle a few hundred MB or a few GB of data per day.
I wouldn't link the two problems (size, bad architecture). Kafka is often used to bail out poorly designed data architectures, but it doesn't matter whether the volume is small or large at that point: it's still a pile of crap.
Even for smaller cases it's a nice way to decouple producers from multiple consumers. I wouldn't suggest building Kafka infrastructure if you don't also have at least one bigger use-case, but if it's there it's convenient. Especially if you need replay or the ability to consume the same messages for days afterwards.
The case for Kafka I've seen the most is "decoupling". I.e. two pieces of a system, both of which are often owned by one team, are talking to each other over Kafka because this way they are decoupled, which keeps the architecture clean. So, to make the system simpler, we make it more complicated... I was never sold on this.
Interesting. By Kafka here, do you mean Kafka specifically, or any sort of event queue (RabbitMQ/NSQ too)? And what would you suggest instead for an event processing pipeline?
I submitted a question in the previous discussion [0] but didn't get an answer from the author. Hopefully someone will answer here:
How did you replace the producer fencing mechanism that was built into BookKeeper when you moved to Kafka (assume I need to allow only one client to write to a topic)?
Sorry, I didn't fully understand your question. I suggest you ask it in the BookKeeper community Slack [link].
[link]http://bookkeeper.apache.org/community/slack/
Sorry for my bad English. Here is a better explanation:
In BookKeeper, a producer can fence the currently active producer from appending to the log. If a majority of BookKeeper nodes respond OK to a FENCE request, the current producer can no longer append to the log. This makes implementing master-slave database replication quite trivial: the master keeps appending to the log, and slaves keep reading from it. If the master fails, one of the slaves sends a FENCE request to the BookKeeper nodes, which reliably blocks the malfunctioning master from appending to the db log.
However in Kafka, this isn't a native operation. Any client can push to a topic. There is no easy way to safely do this.
I was looking forward to see how Twitter team actually handled this specific use case.
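The majority-quorum fencing idea described above can be sketched in a few lines. This is a toy model, not the actual BookKeeper protocol: each storage node remembers the highest writer epoch it has accepted, and a new writer takes over by getting a majority to bump the epoch, after which the old writer's appends are rejected.

```python
class Node:
    """Toy storage node that tracks the highest writer epoch it accepted."""
    def __init__(self):
        self.epoch = 0

    def fence(self, new_epoch):
        if new_epoch > self.epoch:
            self.epoch = new_epoch
            return True
        return False

    def append(self, epoch, entry, log):
        if epoch < self.epoch:
            return False  # this writer has been fenced off
        log.append(entry)
        return True


def fence_cluster(nodes, new_epoch):
    acks = sum(node.fence(new_epoch) for node in nodes)
    return acks > len(nodes) // 2  # a majority must accept the new epoch


nodes = [Node(), Node(), Node()]
log = []

# The old master (epoch 1) writes successfully at first.
assert fence_cluster(nodes, 1)
assert nodes[0].append(1, "entry-1", log)

# A slave takes over at epoch 2 and fences the old master out.
assert fence_cluster(nodes, 2)
assert not nodes[0].append(1, "entry-2", log)  # old master is blocked
assert log == ["entry-1"]
```

Because the old master can't reach a majority that still honors its epoch, it can't sneak writes past the quorum even if it's still alive, which is exactly the split-brain protection the question is about.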
We use NATS (and STAN) for real-time IoT metrics and reporting with a high cardinality of subscriptions. The throughput from a single NATS box is very nice (we currently send 100K+ messages/sec using a single EC2 instance running NATS).
Both Pulsar and LogDevice solve the problem of far-behind consumers by segmenting the log and storing each segment on a separate machine, which reduces disk seeks and page-cache pollution (when c1 reads the head of the log and c2 reads the tail).
So technically they are better than Kafka. But as the article mentions, Twitter just used SSDs to get around this issue in Kafka. Judging by the article, the only valid reason to switch to Kafka seems to be Kafka Streams.
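To illustrate the segmentation idea in the parent comment, here's a toy sketch (the segment size, node names, and round-robin placement are made up for the example): the log is split into fixed-size segments stored on different nodes, so a tailing reader and a far-behind reader hit different machines and different page caches.

```python
SEGMENT_SIZE = 100
NODES = ["node-a", "node-b", "node-c"]

def node_for_offset(offset):
    """Map a log offset to the node holding its segment (round-robin)."""
    segment = offset // SEGMENT_SIZE
    return NODES[segment % len(NODES)]

# c2 replays old data near the head while c1 tails recent data:
# with the log segmented, their reads land on different machines.
assert node_for_offset(5) == "node-a"     # catch-up reader
assert node_for_offset(250) == "node-c"   # tailing reader
```

In a single-machine-per-partition design like Kafka's, both readers would contend for the same disk and page cache instead.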
Pulsar has a much better storage system that scales independently so latency stays low regardless of consumer offset. It can also now tier to S3/cloud storage natively and it has many other features like supporting millions of topics and per-message acknowledgements.
If they're looking for a mature solution, Kafka is more along those lines. Plus, beyond pub/sub, Kafka is well supported as a streaming data source in the Hadoop-esque world.
It all depends. For example with Apache Pulsar, tailing readers are served from an in-memory cache in the serving layer (the Pulsar brokers) and only catch-up readers end up having to be served from the storage layer (Apache BookKeeper). This is a little different from DistributedLog which always required going to BookKeeper for reads.
Apache BookKeeper can add additional latency to catch-up readers, on top of the extra hop, because the data of multiple topics are combined into each ledger. This means that we lose some performance from sequential reads. This is mitigated in BookKeeper by writing to disk in batches and sorting a ledger by topic so messages of the same topic are found together, but it still involves more jumping around on disk.
Also, BookKeeper allows the nice separation of disk IO. The read and write path are separate and can be served by different disks so you can scale your reads and writes separately to a certain extent.
For all those reasons, I would have loved to have seen Twitter look at Apache Pulsar and compare performance profiles with Apache Kafka.
There seems to be a fundamental technical misunderstanding of these systems in the blog post. Unless you're running in the (generally quite dangerous) scenario where replication factor = 1, both EventBus and Kafka have a network hop (Kafka to the replica and EventBus to the storage layer) and JVM traversal, so the argument that Kafka is more efficient there doesn't make sense.
As you say, it would be trivial to run EventBus storage and serving on the same nodes if the network hop were an issue.
The argument that a decoupled storage layer costs more than co-locating storage and serving also seems dubious. The cluster is basically doing the same amount of work when you decouple those layers, it's just separating them into components that can be optimized for specific roles. So if you pick the right hardware/instances for each layer and tune for that, you really shouldn't need more hardware. Otherwise we'd all still be running full-stack applications on mainframes to minimize the number of servers. :)
Yes, agreed. It seemed odd because Kafka isn't split into layers so there's no other option, and it confuses resource utilization for number of servers.
Slightly off topic, but I feel that when naming one's technical creation it's best to avoid using the last names of famous people especially when that person has become so famous it's become standard to refer to him/her simply by last name. Just google "Kafka" and check out the first page of search results and you'll see what I mean. (No, googling "Franz Kafka" isn't the same)
Results about Apache Kafka are of interest only to a subgroup of software engineers. Results about Franz Kafka are of interest to a much larger group of people (Frankly, there is simply no comparison). It's the considerate thing to do.
And it's not simply a matter of the ordinary person having to type an extra word. There exist tons of valuable web resources (literary analysis, criticism, journal articles) out there about the novelist that does not contain "Franz" anywhere.
Finally, it's not like there's an absolute need for any piece of software to be given any particular name.
I agree completely. Similarly, I have no idea why the creators of GIMP and CockroachDB (Spencer Kimball and Peter Mattis, and Ben Darnell for the latter) thought those names were the best possible options.
CockroachDB at least refers to their marketing line of being "almost impossible to kill/take down" and the way it scales to many servers. There are a few less gross insects they could have used, though, like bees or ants.