Clocks Are Bad, Or, Welcome to the World of Distributed Systems

mey · on Nov 13, 2013

Is it me, or does the hand waving at the beginning of the article between "write" and "update" smell of bad spin? As a developer I consider both "creates"/"updates" as "writes".

"Riak is designed to accomodate (sp) server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers."

As such losing a write to me when I read documentation is losing either a create, update or a delete. Any side affecting operation essentially. Anything that needs to write to disk to record a change...

macintux · on Nov 13, 2013

Thanks for catching the spelling error; as much as I pride myself on my spelling, I should let 2013-era tools do their job.

I was concerned that might be interpreted as spin, but I hoped the rest of the article would reinforce the point that there is no way to guarantee an update is preserved in a distributed system without an approach more sophisticated than blindly trusting clocks.

Writes to a new object are inherently less problematic; while it's possible to temporarily receive a negative response about the presence of an object, the data will always be there, barring catastrophic multiple server failure.

Updates can be entirely lost, and that's something that developers and operations people need to be aware of.

rurounijones · on Nov 13, 2013

Is there any reason you do not use "create" terminology instead of the possibly-confusing "write".

I am with op in that I consider an update a write.

"create/update" are both writes

"write/update" ... eh?

mey · on Nov 13, 2013

My apologies for being a little gruff. I am coming at this from not being a user of Riak and currently exploring options for distributed processing of data as our companies data needs have gotten a bit big. I was just expressing my concern over the complexity of the problem and our understanding of a technical term. It makes it harder to consume documentation on systems for evaluation, to get an idea of how they fail and how to adjust to failure. It may not be rational but in my gut it causes me concern.

macintux · on Nov 13, 2013

I absolutely understand your concern, and I'll be more cautious in the future. I tend to write with a very casual, informal tone, and data safety is not something to be overly breezy about.

More broadly, as someone who helps write our documentation, it's very difficult to figure out how to present enough detail about the proper ways to use Riak without forcing everyone to become an expert on distributed systems. Unfortunately there are incredibly subtle tradeoffs inherently involved in running a distributed database.

marshray · on Nov 13, 2013

Possibly even more folks are familiar with the terms 'insert' and 'update'.

Also, s/side affecting/side effecting/.

flavien_bessede · on Nov 13, 2013

Distributed systems design aside, the core of the problem is that they relied on ntp (as they probably should), and in their case ntp was not working properly.

scottdw2 · on Nov 13, 2013

The key take away from the article SHOULD be: don't rely on ntp if you don't have to.

There are people who have to. They run their own atomic clocks, and worry about things such as precision delivery of nuclear ordanance.

Then there's you. You should use vector clocks, with a builtin conflict resolution mechanism based on domain knowledge.

That's the point of the article.

devicenull · on Nov 13, 2013

And this is precisely why a thing that is not monitored is not actually a thing.

olefoo · on Nov 13, 2013

> "A thing that is not monitored is not actually a thing."

That should be on a cross-stitch sampler on the wall of every NOC.

specialist · on Nov 13, 2013

Nice. Much stronger than the "you can only manage what you measure" adage I learned from accounting.

globalpanic · on Nov 13, 2013

Even if NTP had been working properly, you would not have clocks synchronised at the level of individual ticks - only to the level of time intervals. If two updates happened at roughly the same time, and fell into the same time interval, there would be no way to tell which one happened before the other. A paper by Cilia et al on timing of composite events in distributed event-based systems using NTP deals with this issue.

Dylan16807 · on Nov 13, 2013

But this is not a problem in many situations. Whereas successor writes failing within an entire 30 second span is a pretty big problem.

donavanm · on Nov 13, 2013

Ntp is good. Assuming that time is coordinated, much less monotonically increasing, is a bad plan. Just the other week i got paged in the middle of the night because a clock moved backwards.

dustinupdyke · on Nov 13, 2013

I don't get how clocks are bad this from the article.

I get that syncing clocks across systems is hard and when it goes awry, unintended consequences are incurred.

RickHull · on Nov 13, 2013

There is, in fact, a TL;DR at the end:

> If your distributed database relies on clocks to pick a winner, you’d better have rock-solid time synchronization, and even then, it’s unlikely your business needs are served well by blindly selecting the last write that happens to arrive.

sseveran · on Nov 13, 2013

In fact I would recommend GPS calibrated hardware clocks with PTP.

marshray · on Nov 13, 2013

The point is not that time synchronization is inherently bad, only that it's usually not the correct thing for a distributed database to resolve update conflicts with.

sseveran · on Nov 13, 2013

Yes I completely agree. I fail to see how anyone would think that using a clock as a source of truth in a distributed system would be in anyway a good idea. As far as PTP it would be too expensive to deploy at large scale which was some of the motivation (i believe) behind truetime.

marshray · on Nov 13, 2013

To be fair, time was considered to provide a pretty universal total ordering up until fairly recently, i.e., 1903.

donavanm · on Nov 13, 2013

As last summers negative leap second fiascos demonstrated even a trusted source isnt enough.

oh_sigh · on Nov 13, 2013

They are when you know that the leap-(nanosecond/second/minute/day) is coming up. When you know it is coming, you can "smear" the time difference over, let's say, the entire year, so when it happens, every system behaves correctly.

duaneb · on Nov 13, 2013

Ok, google.

zschoche · on Nov 13, 2013

Decades ago albert einstein introduced the general theory of relativity, which is already telling us that timestamps are bad for synchronisation.

neolefty · on Nov 13, 2013

How do other distributed databases handle this?

krilnon · on Nov 13, 2013

Google's Spanner [1] uses something it calls TrueTime:

"The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google’s cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks)."

[1] Spanner: Google's globally-distributed database https://www.usenix.org/system/files/conference/osdi12/osdi12...

theatrus2 · on Nov 13, 2013

With TrueTime you are trading some latency on concurrent operations for correctness.

Other structures such as CRDTs/lattices might be more appropriate for your use case.

madhusudancs · on Nov 13, 2013

By correctness you mean consistency? You don't have to be consistent all the time, i.e. you can trade consistency, but never correctness.

If we could have traded correctness, we could have optimized everything and gone home by now :)

jbellis · on Nov 13, 2013

Cassandra offers a mix of commutative operations (sets, maps, increments), an eventlog model, and lightweight (paxos-based) transactions. Unlike a key/value database like Riak, Cassandra can update individual fields of a row or document independently, which simplifies things enormously.

http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-v...

http://www.datastax.com/dev/blog/cql3_collections

http://www.datastax.com/dev/blog/lightweight-transactions-in...

sseveran · on Nov 13, 2013

Cassandra suffers from the same problem and can drop updates. The Paxos transactions were and maybe still are an absolute joke as exposed by Aphyr.

hendzen · on Nov 13, 2013

Unfortunately, not in a particularly clever way. CP systems such as MongoDB, HBase, etc. don't have this problem since each datum has an authoritative master. As you can imagine, this can result in some operational...unpleasantness due to the lack of liveness guarantees in the presence of a network partition.

Out of the well known open-source AP systems, Riak is probably the leader here since they implement well understood techniques from the literature such as CRDTs and vclocks.

EDIT: removed my statement about Cassandra since it was a bit misleading and jbellis answered above in greater detail.

voidmain · on Nov 13, 2013

FoundationDB provides real ACID transactions and external consistency, and definitely does NOT rely on clock accuracy for soundness! (Google Spanner, which we are often compared to, does use a trusted clock, but Google went to extreme measures to make it accurate, including installing atomic clocks and GPS hardware.)

As for how, it's a long story. At bottom we rely on Paxos for consistency across failures, but we only actually do Paxos when there are failures. (We use less costly synchronous techniques for replication in "happy times".)

jbert · on Nov 13, 2013

Dumb question: What breaks with the following approach?

1) set last_write_wins=true (so all updates, always apply, as described in the article)

2) avoid the "partition/rejoin may cause old values to stomp on new" issue by having "rejoin detection" which refuses to rejoin if clocks are "too out of sync"

tych0 · on Nov 13, 2013

I can't scroll down.

rossy · on Nov 13, 2013

Same here. On Windows/Chrome 30 with a window size of about 900x950px, I can't scroll down. Increasing the windows's width or decreasing the height makes it work again.

joaomsa · on Nov 13, 2013

How would you get around the unreliability of clocks in VMs? Seems like deploying Riak in the cloud could be problematic.

hectcastro · on Nov 13, 2013

You'd change the settings detailed in the "Stopping Last Write Wins" section of the original post.

crb002 · on Nov 13, 2013

Mars design. Assume one server is on Mars, with associated time dilation on it's clocks and latency.