Roshi: a CRDT system for timestamped events (soundcloud.com)
108 points by sagichmal on May 12, 2014 | 29 comments



> The tl;dr on CRDTs is that by constraining your operations to only those which are associative, commutative, and idempotent [...]

Interesting. We have an event stream database implemented in Haskell, and this looks like an excellent way to index it. In particular, associativity, commutativity, and idempotence can (probably) all be encoded in the type system!
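
For instance, a first sketch of what that might look like (a hypothetical Merge class of my own; the laws here live in comments and tests rather than being enforced by the compiler, which would take something like LiquidHaskell):

    import qualified Data.Set as Set

    -- Laws we require of 'merge' (stated, not compiler-enforced):
    --   merge a (merge b c) == merge (merge a b) c   (associativity)
    --   merge a b           == merge b a             (commutativity)
    --   merge a a           == a                     (idempotence)
    class Merge a where
      merge :: a -> a -> a

    -- max over numbers satisfies all three laws.
    newtype MaxInt = MaxInt Int deriving (Eq, Show)

    instance Merge MaxInt where
      merge (MaxInt x) (MaxInt y) = MaxInt (max x y)

    -- So does set union.
    instance Ord a => Merge (Set.Set a) where
      merge = Set.union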


Interesting. I have mostly seen CRDTs in the Erlang context (monitoring and following the work Basho has been doing for Riak).

Basically a lot of this:

http://basho.com/tag/crdt

But using Haskell's type system is an interesting idea.

You can start one level higher than just looking for associative, commutative, and idempotent operations, because those properties actually come from the definition of a semilattice:

http://en.wikipedia.org/wiki/Semilattice

The crucial part is that you have an appropriate meet or join operator (and that it satisfies the above requirements).

Informally, think of functions like max() over the natural numbers or union() over sets. Those are some examples.

I don't know Haskell, but I found this interesting gist from someone who has tried this. Frankly, a lot of the stuff there is above my head, but it might help you:

https://gist.github.com/pozorvlak/6540639


That is interesting (but sadly over my head too).

I think some of the theory there is starting to get into operational transformation territory [1]. If that's the case, then a really interesting application might be to apply the same semantics used for indexing a stream of events to expressing events as diffs against state, and propagating them to clients...

[1] http://en.wikipedia.org/wiki/Operational_transformation


This is definitely aligned with OT. OT typically comes in two steps: merge and simplify. Merge takes two divergent histories and puts them together into a collective view. Simplify takes that collective view and reduces it to a single "consistent" view according to domain rules.

With CRDTs, the merge operation is constructed so that the simplify operation is the identity.

It's also interesting to note that the merge operation is just a pullback in the appropriate category.


You probably want to look at LVish if you want to see what this looks like in Haskell.

http://www.cs.indiana.edu/~lkuper/

has tons of papers, and

https://hackage.haskell.org/package/lvish

is the library.


I've never used it, but there is a lattices package on Hackage which provides typeclasses for semilattices:

http://hackage.haskell.org/package/lattices-1.2.1/docs/Algeb...


It's an idempotent commutative monoid, so you can encode that quite straightforwardly as a type class.
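
For example (a hypothetical sketch of my own, not from Algebird): a grow-only set, about the simplest CRDT there is, forms exactly such a monoid, with merge as union and the empty set as the identity.

    import qualified Data.Set as Set

    -- A grow-only set (G-Set): the only operation is adding elements,
    -- and merging two replicas is set union.
    newtype GSet a = GSet (Set.Set a) deriving (Eq, Show)

    instance Ord a => Semigroup (GSet a) where
      GSet x <> GSet y = GSet (Set.union x y)  -- associative, commutative, idempotent

    instance Ord a => Monoid (GSet a) where
      mempty = GSet Set.empty                  -- the identity element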

Twitter's Algebird[1] has a wide range of monoid implementations (in Scala). It's a surprisingly useful type class for something so simple.

I have a talk[2] and some code [3] that goes into more detail on CRDTs and the connection to monoids.

[1] https://github.com/twitter/algebird

[2] http://noelwelsh.com/programming/2013/12/20/crdts-for-fun-an...

[3] https://github.com/noelwelsh/crdt


Thanks Noel, I had seen Algebird before - it looks great!


It's only a monoid if there's an identity.


They can be tested with QuickCheck as well: http://yellerapp.com/posts/2014-05-08-testing-riak.html
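
In Haskell's QuickCheck the laws come out as one-liners; for example, for set union (a sketch, building sets from arbitrary lists so no extra Arbitrary instances are needed):

    import Test.QuickCheck
    import qualified Data.Set as Set

    mkSet :: [Int] -> Set.Set Int
    mkSet = Set.fromList

    prop_assoc :: [Int] -> [Int] -> [Int] -> Bool
    prop_assoc a b c =
      Set.union (mkSet a) (Set.union (mkSet b) (mkSet c))
        == Set.union (Set.union (mkSet a) (mkSet b)) (mkSet c)

    prop_comm :: [Int] -> [Int] -> Bool
    prop_comm a b = Set.union (mkSet a) (mkSet b) == Set.union (mkSet b) (mkSet a)

    prop_idem :: [Int] -> Bool
    prop_idem a = Set.union (mkSet a) (mkSet a) == mkSet a

    main :: IO ()
    main = do
      quickCheck prop_assoc
      quickCheck prop_comm
      quickCheck prop_idem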


Relevant work I've done in the past:

http://christophermeiklejohn.com/coq/2013/06/11/distributed-...

Also, see related papers from PaPEC '14:

http://dl.acm.org/citation.cfm?id=2596631


Cool, what's your database?


I don't know why, but this post gave me warm and fuzzy feelings all over. It's like they actually thought out a novel solution to a problem rather than just picking up a trending tool in the startup dev sphere. (And their README looks like it was written by an experienced ops engineer or open source dev! How retro :3) I wonder if SoundCloud is hiring.


Good news, friend! http://soundcloud.com/jobs :)


So I am a little unsure about where all the data is. Am I right in thinking that a user has a cached merge of all the events of the users they follow? So when they first read, they get an inconsistent view, which can be populated in the background on demand. The massive data gain being: most people don't read all their data, so fan-out-on-write wastes a lot of effort on data that is never used.

The read view just has to be caught up from when it was last accessed. You don't do this all the time, only when a user requests their timeline. So an individual CRDT set still has to be persisted for each individual user, a bit like an out-of-date inbox. You don't create the whole inbox from scratch on each request... or do you?

Maybe you do, because you only need recent events?

EDIT:

I think I worked it out... the inboxes are created dynamically. This is what was meant by stateless. All the time series data is able to fit into one server's memory, so the IO overhead of assembly is low.


You got it on your edit. The inbox is generated dynamically from multiple outboxes. The time series data, however, is not able to fit in one server's memory; it's sharded over multiple non-communicating Redis instances, collectively known as a pool. And multiple pools (complete replica sets) constitute the system.


Ahhhhhhh. yes yes. Nice.

Weirdly enough, I am implementing a CRDT list today (not a set). The TreeDoc example in the main paper has a problem, I think: if two writes with the same key come in, the disambiguator keeps an ordering. But once that has happened, you can't insert between those two edits, because the disambiguators are not dense.

The set stuff looks good though.


I'm trying to understand CRDTs, and Roshi provides an implementation of how CRDTs can be put to use in real-world applications. I've assessed the documentation on Roshi and would like confirmation, or clarification if I'm wrong.

After going over the README of github.com/soundcloud/roshi, I came away with the impression that Roshi's LWW-element-set implementation does not strictly adhere to the qualities of a CRDT, namely that operations must be commutative.

Roshi documents two use cases:

If we first apply an add operation to the set, then apply a remove operation with the same tuple, the resulting set will ignore the remove operation: A(a,1) R() + remove(a,1) = A(a,1) R()

If we apply the remove operation first, then the add operation, the resulting set will ignore the add operation: A() R(a,1) + add(a,1) = A() R(a,1)
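
Here is a minimal sketch of how I read that tie-breaking rule (hypothetical Haskell of my own, not Roshi's actual Go implementation): an incoming operation only takes effect if its timestamp is strictly greater than whatever is already recorded for that element.

    import qualified Data.Map as Map

    data Op = Add | Remove deriving (Eq, Show)

    -- Each element carries the timestamp and operation that last won.
    type LWWSet k = Map.Map k (Int, Op)

    -- An operation is ignored unless it is strictly newer.
    apply :: Ord k => Op -> k -> Int -> LWWSet k -> LWWSet k
    apply op k t s = case Map.lookup k s of
      Just (t', _) | t <= t' -> s                      -- not strictly newer: ignored
      _                      -> Map.insert k (t, op) s

    example1, example2 :: LWWSet String
    -- add(a,1) then remove(a,1): the remove is ignored.
    example1 = apply Remove "a" 1 (apply Add "a" 1 Map.empty)     -- [("a",(1,Add))]
    -- remove(a,1) then add(a,1): the add is ignored.
    example2 = apply Add "a" 1 (apply Remove "a" 1 Map.empty)     -- [("a",(1,Remove))]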

This means the resulting state of the set depends on the order of operations, which violates the commutative property of CRDTs.

As a consequence, if we assume this CRDT is replicated on each Redis node in a Roshi cluster, then there is a case (admittedly rare and short-lived) where a Roshi cluster cannot be eventually consistent:

Given no more future operations on a set with key K, if node 1 contains key K with set A(a,1) R(), and node 2 contains key K with set A() R(a,1), then the system cannot be eventually consistent.

Roshi seems to have implemented a weak form of CRDT, which under rare and short-lived circumstances impedes eventual consistency. But given the operational realities, this is totally acceptable.


Great stuff, but:

> Of course, reads are difficult. If you follow thousands of users, making thousands of simultaneous reads, time-sorting, merging, and cutting within a typical request-response deadline isn't trivial.

Even within a "typical request-response deadline" blasting through some thousands of "events" does sound kind of trivial? (Which I suppose is reflected in the relative simplicity and elegance of this solution :-)

Now, if you absolutely need the inboxes of user A and user B, who follow the same thousand users, to end up identical when they are accessed in a similar time-frame, then things do indeed sound a bit hairy -- but do you really need that? (Or even demand that both A and B see X re-sharing track N before user Y re-shares track M, if the sharing events are within the threshold of clock-skew internal to your systems?)

I suppose it's great to be able to say that these guarantees will hold, but I'm not sure I see them as needed for (all of) soundcloud's events?


    > Even within a "typical request-response deadline" 
    > blasting through some thousands of "events" does sound 
    > kind of trivial?
If they were all on the same physical machine, sure. But the events are spread across multiple nodes, so a single stream read can potentially touch every Redis instance.

    > I suppose it's great to be able to say that these 
    > guarantees will hold, but I'm not sure I see them as 
    > needed for (all of) soundcloud's events?
I'm all about cheating, if it nets me a big win. How would you cheat, here? Honest question; it's not apparent to me what corners we could cut, given the whole point, more-or-less, is to avoid pre-materializing views.


Why do you want to avoid pre-materializing views? Storage issues? Concerns about not-live-enough information? Both?


Both, due to the Power Law property/problem inherent in natural networks (of which social networks are an obvious example).

For example, (I'm told that) Twitter has effectively two separate systems for distributing updates for users, one for normal users (you and I), and one for celebrities. Propagating Lady Gaga's tweets to all 30 million followers can take something like five minutes to accomplish. It's even more fun when these celebrities start tweeting back and forth with each other.

Maybe SoundCloud's CRDT approach would work better for natural networks? At some point, network communication could start to be an issue with a compile-the-list-at-read design, but I could be wrong.

I personally don't see why a CRDT is necessary if you used something like Twitter's snowflake approach to assign ids. Pull in events in sorted order from each machine, merge, then push to the user. It's not clear to me what the CRDT abstraction adds to that besides overhead. Maybe there are other benefits I'm not grasping?
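
To make that concrete, here is a rough sketch of the merge step being described (my own illustration, assuming each shard hands back ids already sorted newest-first):

    type EventId = Int  -- stand-in for a snowflake-style, time-ordered id

    -- Merge two event streams that are each sorted newest-first.
    merge2 :: [EventId] -> [EventId] -> [EventId]
    merge2 xs [] = xs
    merge2 [] ys = ys
    merge2 (x:xs) (y:ys)
      | x >= y    = x : merge2 xs (y:ys)
      | otherwise = y : merge2 (x:xs) ys

    -- Fold the per-shard streams together and cut to one page.
    timeline :: Int -> [[EventId]] -> [EventId]
    timeline pageSize = take pageSize . foldr merge2 []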


    > At some point, network communication could start to be 
    > an issue with a compile-the-list-at-read design, but I 
    > could be wrong.
You're absolutely right. Optimizing for that bottleneck occupied most of the development time.


I suppose it just wasn't immediately obvious how bad the read amplification problem was.


Something I learned working in engineering at Formspring (which also used Cassandra to solve a similar problem) was that fan-out is expensive, both in terms of the horizontally scaled infrastructure it requires AND in terms of working with the complexity.


For those interested in CRDTs and using Node.js, check out:

https://github.com/dominictarr/scuttlebutt

https://github.com/dominictarr/crdt


Great work! Any chance of seeing it extended to non-time-series events, like an HN over Roshi?

At a very superficial level it seems to me that you can't implement a reorder operation within the model, but perhaps you can circumvent the problem with some tricks...


You're right: it's strictly coupled to time order. All of the CRDT semantics rely on that invariant. Arbitrary reordering would turn it into a different beast indeed...


sagichmal, could you elaborate on "we were able to map a fan-in-on-read stream product to a data model that could be implemented with a specific type of CRDT"? The post (which is awesome btw) glosses over exactly how Roshi is used to solve the problem presented in the first half of the post - or at least it seems that way to me as I am unfamiliar with CRDTs.



