Introducing Riak 2.0: Data Types, Strong Consistency, Full-Text Search (basho.com)
141 points by siculars on Oct 29, 2013 | 22 comments



On a related note, if you're not listening to the Think Distributed podcast[1], you're missing a terrific combination of research and engineering discussion around distributed systems. It routinely features a few of the Basho guys. The episode on causality, especially, was first rate.

[1] http://thinkdistributed.io/


From what I heard of the podcast, it's meh.


Can anyone comment on what they've found is the sweet spot (in terms of features and functionality) where using Riak is the #1 choice? I love the technology and have wanted to use it for a while but haven't had a need for something not better suited to a relational database.


I think the sweet spot is when you realize that MongoDB has lost your data.


I am not a MongoDB fan (I am a user but not a fan), but what you are implying is FUD; MongoDB is unlikely to lose your data, but it is not as idiot-proof as traditional DBs. Anyone who bothers to read the documentation would know how to avoid data loss. Having said that, I am not happy about Mongo's DB locking, but it hasn't affected my applications' throughput yet.


Any particular advice on how to not lose data when map-reducing from a sharded collection into another sharded collection? Because we had to migrate away from MongoDB in a hurry when ~20-40% of our data (basically 1-2 shards' worth) just didn't get written in that scenario. And by "in a hurry", I mean we spent a significant amount of time debugging, troubleshooting and tracking down what was causing the issue (we were on 2.2 at the time, not sure if it's been fixed since...) and realized there was no way to fix the bug unless we were willing to delve into how Mongo did the writes from map-reduces. So once we found the underlying issue, we quickly migrated away.

The number of conditions where it will silently lose data, with no control over write consistency, is absurd. However, every time it's brought up, people shout it down because they assume you're talking about the known write issues (the 32-bit version, dataset size vs. RAM size, not setting the driver to confirm the write, etc.), and not completely different ones that aren't resolvable.
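For context, the operation in question was roughly of this shape (a sketch with the pymongo of that era; the host, collection names and map/reduce functions here are placeholders, not our actual code):

    from pymongo import MongoClient
    from bson.code import Code
    from bson.son import SON

    client = MongoClient("mongodb://mongos-host:27017")  # placeholder mongos router
    db = client.mydb

    mapper = Code("function () { emit(this.user_id, this.amount); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    # Output merged into another *sharded* collection -- the case where we saw
    # 1-2 shards' worth of results silently go missing.
    db.events.map_reduce(
        mapper,
        reducer,
        out=SON([("merge", "totals"), ("sharded", True)]),
    )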


@ismarc, I have never encountered your particular use case. Is there a bug report with details for reproducing this? Like I said, I am not a fan of Mongo, and were I to encounter an issue like yours in my use cases, I would bite the bullet and migrate to something else.


There may still be one. The first two I opened were closed pointing to docs on how to set the write options. The reproduction is pretty easy: if a shard tries to write its results to a shard that is write-locked, none of that shard's map-reduced data after that point is written to any shard. The more evenly distributed your data, and the more shards you have, the more likely you are to hit the condition. Combined with the fact that all unsharded collections always go to the same shard, the whole system becomes useless unless you can fit your entire dataset, across all collections, in RAM on a single box.


No, MongoDB's persistence issues are more than just the fact that drivers previously did not default to telling the server to fsync. Here's a list of open questions I have about MongoDB in production:

- How do I avoid the ~10-30 seconds of cluster downtime when the primary goes down? (http://docs.mongodb.org/manual/core/replica-set-elections/)

- How do I ensure that the quorum of replicas have fsynced my write (not the same as ensuring that a quorum of replicas have received my write but not written it to disk)? (http://stackoverflow.com/questions/19385256/how-to-set-write...)

- How did they manage to release a production version of pymongo that truncated GridFS files larger than 256KB? (https://jira.mongodb.org/browse/PYTHON-271)
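For reference, the closest knobs I know of look like this with a recent pymongo (a sketch; host, database, and collection names are placeholders). The gap the second question describes is that w="majority" counts acknowledgements while j=True is about the journal, and the two guarantees don't obviously compose into "a majority has fsynced":

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://primary-host:27017")  # placeholder host

    # w="majority": a majority of replica set members must acknowledge the write.
    # j=True: the write must be committed to the on-disk journal.
    events = client.mydb.get_collection(
        "events",
        write_concern=WriteConcern(w="majority", j=True),
    )
    events.insert_one({"user_id": 42, "amount": 10})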


I use it for managing high-throughput sequencing data. It's a good fit because a) the items are inherently key-value oriented and b) items have heterogeneous schemas that depend on the platform used to perform the sequencing and on the experimental origin of the samples.
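The access pattern really is just keys and opaque values; with the Basho Python client it looks roughly like this (bucket, key, and fields below are made up for illustration):

    import riak

    # Assumes a local node with the protocol buffers port on 8087.
    client = riak.RiakClient(protocol="pbc", pb_port=8087)
    runs = client.bucket("sequencing-runs")

    # Heterogeneous, platform-dependent metadata goes in as-is.
    obj = runs.new("run-lane3-20131029", data={
        "platform": "example-sequencer",
        "sample": "sample-007",
        "read_length": 100,
    })
    obj.store()

    print(runs.get("run-lane3-20131029").data)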


I use it for unstructured document storage in JSON format, coupled with a liberal search schema that lets me search on just about every element. I need to test more with the new Riak Search, but it looks good so far.


Looks like it is going to be a great release. The first thing I wonder about with search is facets, especially when Solr is involved. I assume that being compatible with Solr clients means supporting faceted search, but has there been any public discussion of memory management with large, distributed indexes?

Edit: Discussion of facets starts around 25 minutes into the Yokozuna talk at RICON East 2013
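If the compatibility really is at the Solr query level, a facet request should just be ordinary Solr parameters over Riak's HTTP port, something like the sketch below (the /search/<index> path and the index/field names are my assumptions):

    import requests

    # Hypothetical index ("products") and field ("category_s"); the endpoint
    # path is assumed to be the Solr pass-through Yokozuna exposes.
    resp = requests.get(
        "http://localhost:8098/search/products",
        params={
            "q": "*:*",
            "wt": "json",
            "facet": "true",
            "facet.field": "category_s",
        },
    )
    print(resp.json()["facet_counts"]["facet_fields"]["category_s"])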


We're streaming talks live today and tomorrow from RICON West in San Francisco.

ricon.io/live.html

There are no less than six talks dedicated to this release.


Want more details about Riak's new Data Types? They're on GitHub: https://github.com/basho/riak/issues/354


That issue was created before we did the work, and it is a bit out of date. A fuller treatment is here; it includes references to papers that helped us, as well as how we overcame some of the hurdles in the issue you linked to: https://gist.github.com/russelldb/f92f44bdfb619e089a4d


Is this version compatible with 1.4.2 or will it require a full data dump/restore?


If you have a 1.4.2 cluster you can do a rolling upgrade to 2.0 without doing a "dump/restore". Please remember, this is a tech preview; don't upgrade your production cluster to 2.0 until it's been released for production use. Feel free to test drive this preview, and if you do, please send us feedback! (Disclaimer: I work for Basho.)


OK, thanks for answering!


IIRC, the previous implementation of Riak Search had some gotchas when it came to rebalancing and adding/removing nodes. Does anyone know if this new arch fixes or helps with those issues?


This was the first time I'd seen the C in CRDT expanded as "conflict-free" instead of "commutative", but after some googling it seems to be a reasonably popular choice.


CRDTs can be either commutative (CmRDT) or convergent (CvRDT); much of industry and academia has settled on the more opaque "conflict-free" instead, for simplicity.
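The convergent (state-based) flavour is easy to see with the classic toy example, a grow-only counter: each replica only ever bumps its own slot, and merge is an element-wise max, so merges commute and replicas converge regardless of ordering. A minimal sketch:

    class GCounter:
        """State-based (convergent) grow-only counter."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.counts = {}  # node_id -> count contributed by that node

        def increment(self, n=1):
            # A replica only ever advances its own entry.
            self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Element-wise max is commutative, associative and idempotent,
            # so merges can arrive in any order and still converge.
            for node, count in other.counts.items():
                self.counts[node] = max(self.counts.get(node, 0), count)


    # Two replicas accept increments independently, then exchange state.
    a, b = GCounter("a"), GCounter("b")
    a.increment(3)
    b.increment(2)
    a.merge(b)
    b.merge(a)
    assert a.value() == b.value() == 5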


That is an incredibly well written announcement.



