Spotify's Kafka-Based Event Delivery System (spotify.com)
122 points by vgt on Feb 25, 2016 | 14 comments



Interesting blog post.

I don't have a good idea of the requirements at Spotify, but it looks to me like using a streaming system like Storm or Spark Streaming would solve the 30 min event delivery delay they are experiencing with the unstructured text -> Avro conversion. The latency for delivery would go down to sub-second levels.
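
To make the conversion step concrete, here is a minimal sketch of doing the text -> Avro conversion per record as events arrive, using a plain Kafka consumer rather than Storm or Spark themselves. The topic name, schema, and field layout are invented for illustration and are not from the article.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class TextToAvro {
        // Hypothetical schema for the converted events.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" +
            "{\"name\":\"userId\",\"type\":\"string\"}," +
            "{\"name\":\"action\",\"type\":\"string\"}]}");

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "text-to-avro");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("raw-events"));
                while (true) {
                    for (ConsumerRecord<String, String> rec :
                            consumer.poll(Duration.ofMillis(500))) {
                        // Parse a tab-separated syslog-style line into an Avro record.
                        String[] parts = rec.value().split("\t", 2);
                        GenericRecord event = new GenericData.Record(SCHEMA);
                        event.put("userId", parts[0]);
                        event.put("action", parts.length > 1 ? parts[1] : "");
                        // hand `event` to the next stage (Avro file writer, Hadoop, etc.)
                    }
                }
            }
        }
    }

Because each line is converted as it is consumed, there is no hourly batch boundary to wait for, which is where the sub-second latency claim comes from.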


Also, the article doesn't say WHY they didn't use Kafka to persist the events. Kafka is designed to do exactly that.

Once persisted, the consumers can just read the Kafka data and send it to Hadoop, with less latency. Or you can plug Storm or Spark in as you said and do the analysis there in real time. Or both.
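
A rough sketch of the "consumers read the Kafka data and send it to Hadoop" side, using the Hadoop FileSystem API. The output path and one-file-per-batch layout are assumptions for illustration, not anything the article describes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class HdfsSink {
        public static void writeBatch(List<String> events) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml on the classpath
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(
                     new Path("/events/" + System.currentTimeMillis() + ".log"))) {
                for (String event : events) {
                    out.write((event + "\n").getBytes(StandardCharsets.UTF_8));
                }
                // Closing the stream is the point at which the batch is durable on HDFS.
            }
        }
    }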

I'm just intrigued why.


They do talk about this:

> When the system was built, one of the missing features from Kafka 0.7 was the ability of the Kafka Broker cluster to behave as a reliable persistent storage. This influenced a major design decision to not keep persistent state between the producer of data, Kafka Syslog Producer, and Hadoop. An event is considered reliably persisted only when it gets written to a file on HDFS.


Stay tuned for Parts 2 + 3 :)


Interesting to look at the chart of event volumes at the end. Did Spotify just accidentally reveal the effect Apple Music has had on their growth rate?

Steep drop around mid-2015, some recovery since then. Hard to glean anything concrete, but interesting nonetheless.


I would avoid reading into that, since the text surrounding that chart indicates that the system was struggling with the load. That drop could simply have been them prioritizing and removing certain event sources in the clients while they figured out how to fix things.


I wish they had gone more in-depth on the particular events. They seem to imply the events are user actions, which I took to mean "user played this song", "user viewed this playlist", and other analytics.

However, do they, or would you, use a system like Kafka as an event source instead of the db? So you would also capture events like "user added this song to their playlist", which would get persisted to the event source db and then eventually to a view model, instead of directly to a relational db. It doesn't sound like they do that due to the delays, but I can't find a lot of real usage blog posts about how much to put into something like Kafka.
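
For what it's worth, a hedged sketch of what capturing such an event in Kafka could look like. The topic name and JSON fields are made up here; nothing below is taken from Spotify's system.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class PlaylistEvents {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            // Hypothetical domain event, serialized as JSON.
            String event = "{\"type\":\"song_added_to_playlist\"," +
                           "\"userId\":\"u123\",\"playlistId\":\"p456\"," +
                           "\"songId\":\"s789\",\"ts\":" + System.currentTimeMillis() + "}";

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Keying by user keeps one user's events ordered within a partition;
                // downstream consumers can build the playlist "view model" from these.
                producer.send(new ProducerRecord<>("playlist-events", "u123", event));
            }
        }
    }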


You certainly can use a db for an event source. This article does a really good job of explaining how: http://www.confluent.io/blog/turning-the-database-inside-out...

As mentioned in the post, we've pushed Kafka to at least 700,000 events per second. We have room to push it much further, but stay tuned for posts 2 and 3 to see what we're doing instead.


That is actually the post that got me into event sourcing / streams. As far as user-analytics-type events go, this makes complete sense to me. What I haven't been able to discern is whether it's useful to use this architecture at a much, much smaller scale for things that may not be user events.

I love the thought of throwing everything into a stream and populating the read models, analytics, search index, etc. with the data. However, if you had, for example, a CMS / e-commerce system for a smaller organization, should the admin actions also be events? If you have an event source db, they would have to be, and you get all the benefits outlined in the article.

At what point do you decide what to put in the stream and what to build without? Are there events that should never be in a stream? Those are the questions I have been researching but I haven't found a lot of resources or discussions around making these decisions.


My current thought process is that you use a relational db like Postgres with JSON support to go from hobby / early startup to traction, where you would need to start being concerned with scaling. At that point you switch to Kafka or related hosted tools.

As far as the data you put into the stream goes, I would think it could be everything, even admin actions, if you treated all data as immutable. The only thing that seems up in the air is transactions.
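
A minimal sketch of the "Postgres with JSON support as an early event store" idea: a single append-only table where events are only ever inserted, never updated or deleted. The table and column names are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class PostgresEventStore {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/app", "app", "secret")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS events (" +
                               "  id BIGSERIAL PRIMARY KEY," +
                               "  type TEXT NOT NULL," +
                               "  payload JSONB NOT NULL," +
                               "  created_at TIMESTAMPTZ NOT NULL DEFAULT now())");
                }
                // Append only: inserts, never UPDATE or DELETE, so admin actions
                // are just another event type.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO events (type, payload) VALUES (?, ?::jsonb)")) {
                    ps.setString(1, "admin_changed_price");
                    ps.setString(2, "{\"productId\":\"p1\",\"newPrice\":999}");
                    ps.executeUpdate();
                }
            }
        }
    }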

That is as far as I got though. I don't work with a company that has that kind of scale to use this, but I'd like to start working with it.


I built a CRM/CMS application where every single controller action call is event sourced.

The whole application lives in memory as a single object aggregate, which gets rebuilt on startup. I started off writing JSON to the file system, moved on to compressing and appending to a log file, and then to using Azure cloud tables.
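
A rough sketch of that rebuild-on-startup idea (shown in Java rather than the commenter's .NET stack): replay every stored event against an in-memory model, then serve requests from that model. The event format and fields are made up.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InMemoryModel {
        // e.g. pageId -> title; the whole "database" kept in memory
        private final Map<String, String> pages = new HashMap<>();

        // Apply one event; the same method is used at startup and at request time.
        public void apply(String event) {
            String[] parts = event.split("\t", 3);   // type \t pageId \t title
            if (parts[0].equals("PageCreated") || parts[0].equals("PageRenamed")) {
                pages.put(parts[1], parts[2]);
            } else if (parts[0].equals("PageDeleted")) {
                pages.remove(parts[1]);
            }
        }

        // Rebuild the aggregate by replaying the append-only log from disk.
        public static InMemoryModel replay(String logFile) throws Exception {
            InMemoryModel model = new InMemoryModel();
            List<String> events = Files.readAllLines(Paths.get(logFile));
            for (String event : events) {
                model.apply(event);
            }
            return model;
        }
    }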

It's awesomely fast to respond to requests (15ms) and to add new features, but you do get interesting new problems; e.g., along the way I had to:

- come up with a way of migrating events (as my storage formats changed as I improved my frameworks)
- find a good way to do fast full-text search against in-memory objects, as I had no SQL or Elasticsearch infrastructure (ended up using Linq against in-memory Lucene RamDirectories; see the sketch below)
- deal with concurrency issues in a fairly novel manner (as all users are acting against a single in-memory aggregate)
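
A sketch of the in-memory full-text-search idea from the second point, using Java Lucene rather than the commenter's .NET/Linq stack: the index lives in a RAMDirectory, so no external SQL or Elasticsearch service is needed. Field names and documents are invented.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.RAMDirectory;

    public class InMemorySearch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();   // index lives entirely in memory

            // Index an in-memory object as a Lucene document.
            try (IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Quarterly sales report", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search the in-memory index.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher.search(
                        new QueryParser("title", new StandardAnalyzer()).parse("sales"),
                        10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }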

I'm hoping this architecture will start to become more popular - I think we are in need of a framework equivalent to Rails to take it mainstream.


That is very interesting. I am guessing this is a closed-source application? Did you do something along the lines of CQRS (separate command and query sides) or just write directly to the event source? At what point did appending to a log file stop working, prompting the switch to the cloud (or was that for unrelated reasons)?

I am also hoping it will become more popular, as the pros seem to vastly outweigh the cons. But I think you are right about the framework. From my research it seems to be medium to large enterprises that would typically be best suited to using and developing something like Kafka, and those enterprises typically would not open source their applications. So I definitely think a framework from a company using it at scale would be huge.

Until then, I suppose I will keep reading up and learning all I can and figure out how to implement this on a much smaller scale.


Cloud storage was just used so I didn't have to manage backups myself.

I absolutely didn't separate command and query - the commands themselves are actions which execute against the domain model, and that domain model is used to build responses.

My project is here: [Sourcery](https://github.com/mcintyre321/Sourcery), but I think a more mature project you might like to look into is [OrigoDB](http://origodb.com/).

Another thing that gets tricky is making your application deterministic - any calls to the current time, random number or GUID generation, or to 3rd party services have to be recorded and replayable in the correct order for when you reconstruct your application instance. This can get tricky if you refactor your application or change its logic later.
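
A small sketch of one way to handle that: capture the non-deterministic inputs (time, generated IDs) inside the stored event at write time, so replay reads them back instead of calling the clock or UUID generator again. The class and field names are just for illustration.

    import java.time.Instant;
    import java.util.UUID;

    public class RecordedEvent {
        public final String type;
        public final String payload;
        public final Instant occurredAt;   // captured once, not re-read on replay
        public final UUID id;              // captured once, not re-generated

        private RecordedEvent(String type, String payload, Instant occurredAt, UUID id) {
            this.type = type;
            this.payload = payload;
            this.occurredAt = occurredAt;
            this.id = id;
        }

        // Live path: sample the non-deterministic inputs exactly once, then store them.
        public static RecordedEvent record(String type, String payload) {
            return new RecordedEvent(type, payload, Instant.now(), UUID.randomUUID());
        }

        // Replay path: rehydrate from storage; nothing non-deterministic is called.
        public static RecordedEvent replay(String type, String payload,
                                           Instant occurredAt, UUID id) {
            return new RecordedEvent(type, payload, occurredAt, id);
        }
    }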

It's worth reading up on Prevalence/MemoryImage, and looking into NEventStore also.


Nice!



