Is it me, or does the hand waving at the beginning of the article between "write" and "update" smell of bad spin? As a developer I consider both "creates"/"updates" as "writes".
"Riak is designed to accomodate (sp) server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers."
As such losing a write to me when I read documentation is losing either a create, update or a delete. Any side affecting operation essentially. Anything that needs to write to disk to record a change...
Thanks for catching the spelling error; as much as I pride myself on my spelling, I should let 2013-era tools do their job.
I was concerned that might be interpreted as spin, but I hoped the rest of the article would reinforce the point that there is no way to guarantee an update is preserved in a distributed system without an approach more sophisticated than blindly trusting clocks.
Writes to a new object are inherently less problematic; while it's possible to temporarily receive a negative response about the presence of an object, the data will always be there, barring catastrophic multiple server failure.
Updates can be entirely lost, and that's something that developers and operations people need to be aware of.
My apologies for being a little gruff. I am coming at this from not being a user of Riak and currently exploring options for distributed processing of data as our companies data needs have gotten a bit big. I was just expressing my concern over the complexity of the problem and our understanding of a technical term. It makes it harder to consume documentation on systems for evaluation, to get an idea of how they fail and how to adjust to failure. It may not be rational but in my gut it causes me concern.
I absolutely understand your concern, and I'll be more cautious in the future. I tend to write with a very casual, informal tone, and data safety is not something to be overly breezy about.
More broadly, as someone who helps write our documentation, it's very difficult to figure out how to present enough detail about the proper ways to use Riak without forcing everyone to become an expert on distributed systems. Unfortunately there are incredibly subtle tradeoffs inherently involved in running a distributed database.
Distributed systems design aside, the core of the problem is that they relied on ntp (as they probably should), and in their case ntp was not working properly.
Even if NTP had been working properly, you would not have clocks synchronised at the level of individual ticks - only to the level of time intervals. If two updates happened at roughly the same time, and fell into the same time interval, there would be no way to tell which one happened before the other. A paper by Cilia et al on timing of composite events in distributed event-based systems using NTP deals with this issue.
Ntp is good. Assuming that time is coordinated, much less monotonically increasing, is a bad plan. Just the other week i got paged in the middle of the night because a clock moved backwards.
> If your distributed database relies on clocks to pick a winner, you’d better have rock-solid time synchronization, and even then, it’s unlikely your business needs are served well by blindly selecting the last write that happens to arrive.
The point is not that time synchronization is inherently bad, only that it's usually not the correct thing for a distributed database to resolve update conflicts with.
Yes I completely agree. I fail to see how anyone would think that using a clock as a source of truth in a distributed system would be in anyway a good idea. As far as PTP it would be too expensive to deploy at large scale which was some of the motivation (i believe) behind truetime.
They are when you know that the leap-(nanosecond/second/minute/day) is coming up. When you know it is coming, you can "smear" the time difference over, let's say, the entire year, so when it happens, every system behaves correctly.
Google's Spanner [1] uses something it calls TrueTime:
"The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google’s cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks)."
Cassandra offers a mix of commutative operations (sets, maps, increments), an eventlog model, and lightweight (paxos-based) transactions. Unlike a key/value database like Riak, Cassandra can update individual fields of a row or document independently, which simplifies things enormously.
Unfortunately, not in a particularly clever way. CP systems such as MongoDB, HBase, etc. don't have this problem since each datum has an authoritative master. As you can imagine, this can result in some operational...unpleasantness due to the lack of liveness guarantees in the presence of a network partition.
Out of the well known open-source AP systems, Riak is probably the leader here since they implement well understood techniques from the literature such as CRDTs and vclocks.
EDIT: removed my statement about Cassandra since it was a bit misleading and jbellis answered above in greater detail.
FoundationDB provides real ACID transactions and external consistency, and definitely does NOT rely on clock accuracy for soundness! (Google Spanner, which we are often compared to, does use a trusted clock, but Google went to extreme measures to make it accurate, including installing atomic clocks and GPS hardware.)
As for how, it's a long story. At bottom we rely on Paxos for consistency across failures, but we only actually do Paxos when there are failures. (We use less costly synchronous techniques for replication in "happy times".)
Dumb question: What breaks with the following approach?
1) set last_write_wins=true (so all updates, always apply, as described in the article)
2) avoid the "partition/rejoin may cause old values to stomp on new" issue by having "rejoin detection" which refuses to rejoin if clocks are "too out of sync"
Same here. On Windows/Chrome 30 with a window size of about 900x950px, I can't scroll down. Increasing the windows's width or decreasing the height makes it work again.
"Riak is designed to accomodate (sp) server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers."
As such losing a write to me when I read documentation is losing either a create, update or a delete. Any side affecting operation essentially. Anything that needs to write to disk to record a change...