Clocks Are Bad, Or, Welcome to the World of Distributed Systems (basho.com)
99 points by pharkmillups on Nov 13, 2013 | 38 comments



Is it me, or does the hand-waving at the beginning of the article between "write" and "update" smell of bad spin? As a developer I consider both creates and updates to be "writes".

"Riak is designed to accomodate (sp) server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers."

As such, when I read documentation, losing a write to me means losing a create, an update, or a delete. Any side affecting operation, essentially. Anything that needs to write to disk to record a change...


Thanks for catching the spelling error; as much as I pride myself on my spelling, I should let 2013-era tools do their job.

I was concerned that might be interpreted as spin, but I hoped the rest of the article would reinforce the point that there is no way to guarantee an update is preserved in a distributed system without an approach more sophisticated than blindly trusting clocks.

Writes to a new object are inherently less problematic; while it's possible to temporarily receive a negative response about the presence of an object, the data will always be there, barring catastrophic multiple server failure.

Updates can be entirely lost, and that's something that developers and operations people need to be aware of.


Is there any reason you do not use "create" terminology instead of the possibly-confusing "write"?

I am with the OP in that I consider an update a write.

"create/update" are both writes

"write/update" ... eh?


My apologies for being a little gruff. I am coming at this from not being a user of Riak and currently exploring options for distributed processing of data, as our company's data needs have gotten a bit big. I was just expressing my concern over the complexity of the problem and our understanding of a technical term. It makes it harder to consume documentation on systems for evaluation, to get an idea of how they fail and how to adjust to failure. It may not be rational, but in my gut it causes me concern.


I absolutely understand your concern, and I'll be more cautious in the future. I tend to write with a very casual, informal tone, and data safety is not something to be overly breezy about.

More broadly, as someone who helps write our documentation, it's very difficult to figure out how to present enough detail about the proper ways to use Riak without forcing everyone to become an expert on distributed systems. Unfortunately there are incredibly subtle tradeoffs inherently involved in running a distributed database.


Possibly even more folks are familiar with the terms 'insert' and 'update'.

Also, s/side affecting/side effecting/.


Distributed systems design aside, the core of the problem is that they relied on NTP (as they probably should), and in their case NTP was not working properly.


The key takeaway from the article SHOULD be: don't rely on NTP if you don't have to.

There are people who have to. They run their own atomic clocks, and worry about things such as precision delivery of nuclear ordnance.

Then there's you. You should use vector clocks, with a built-in conflict-resolution mechanism based on domain knowledge.

That's the point of the article.
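
To make the vector-clock part concrete, here's a minimal sketch in Python; the names and the resolve helper are mine for illustration, not Riak's API:

    class VectorClock:
        # Maps node id -> logical counter for that node.
        def __init__(self, counters=None):
            self.counters = dict(counters or {})

        def increment(self, node):
            # A node bumps its own counter on each local update.
            self.counters[node] = self.counters.get(node, 0) + 1

        def merge(self, other):
            # Pairwise max combines the histories of two clocks.
            merged = dict(self.counters)
            for node, n in other.counters.items():
                merged[node] = max(merged.get(node, 0), n)
            return VectorClock(merged)

        def descends(self, other):
            # True if self has seen everything other has seen.
            return all(self.counters.get(node, 0) >= n
                       for node, n in other.counters.items())

    def resolve(a, b, value_a, value_b, merge_values):
        # Keep the causal winner; on truly concurrent updates, fall
        # back to a domain-specific merge instead of a wall clock.
        if a.descends(b):
            return value_a, a
        if b.descends(a):
            return value_b, b
        return merge_values(value_a, value_b), a.merge(b)

    # Concurrent updates to a set-valued key, resolved by union:
    a, b = VectorClock({"n1": 1}), VectorClock({"n2": 1})
    value, clock = resolve(a, b, {"x"}, {"y"}, lambda v, w: v | w)
    print(sorted(value))  # -> ['x', 'y']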


And this is precisely why a thing that is not monitored is not actually a thing.


> "A thing that is not monitored is not actually a thing."

That should be on a cross-stitch sampler on the wall of every NOC.


Nice. Much stronger than the "you can only manage what you measure" adage I learned from accounting.


Even if NTP had been working properly, you would not have clocks synchronised at the level of individual ticks - only to the level of time intervals. If two updates happened at roughly the same time, and fell into the same time interval, there would be no way to tell which one happened before the other. A paper by Cilia et al. on timing of composite events in distributed event-based systems using NTP deals with this issue.
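
To make that concrete, a toy Python sketch (the uncertainty figure is an assumption, not from the paper): if each timestamp really denotes an interval of plus or minus the sync uncertainty, overlapping intervals simply cannot be ordered:

    def order(t1, u1, t2, u2):
        # Order two events given timestamp +/- uncertainty (seconds).
        if t1 + u1 < t2 - u2:
            return "before"
        if t2 + u2 < t1 - u1:
            return "after"
        return "concurrent"  # intervals overlap: indistinguishable

    # With, say, 10 ms of NTP-grade uncertainty, two updates 3 ms
    # apart fall into overlapping intervals:
    print(order(0.000, 0.010, 0.003, 0.010))  # -> "concurrent"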


But this is not a problem in many situations. Whereas successor writes failing within an entire 30-second span is a pretty big problem.


NTP is good. Assuming that time is coordinated, much less monotonically increasing, is a bad plan. Just the other week I got paged in the middle of the night because a clock moved backwards.


I don't get how clocks are bad from the article.

I get that syncing clocks across systems is hard and when it goes awry, unintended consequences are incurred.


There is, in fact, a TL;DR at the end:

> If your distributed database relies on clocks to pick a winner, you’d better have rock-solid time synchronization, and even then, it’s unlikely your business needs are served well by blindly selecting the last write that happens to arrive.


In fact I would recommend GPS-calibrated hardware clocks with PTP.


The point is not that time synchronization is inherently bad, only that it's usually not the correct thing for a distributed database to resolve update conflicts with.


Yes, I completely agree. I fail to see how anyone would think that using a clock as a source of truth in a distributed system would be in any way a good idea. As for PTP, it would be too expensive to deploy at large scale, which was (I believe) some of the motivation behind TrueTime.


To be fair, time was considered to provide a pretty universal total ordering up until fairly recently, i.e., 1905.


As last summer's leap-second fiascos demonstrated, even a trusted source isn't enough.


They are when you know that the leap-(nanosecond/second/minute/day) is coming up. When you know it is coming, you can "smear" the time difference over, let's say, the entire year, so when it happens, every system behaves correctly.
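
A toy version of the linear smear in Python (the window length is arbitrary here; real deployments used much shorter windows than a year):

    LEAP = 1.0  # one extra second to absorb

    def smeared_offset(t, smear_start, window):
        # Spread the leap second linearly over `window` seconds, so
        # no clock ever steps or runs backwards when the leap lands.
        if t <= smear_start:
            return 0.0
        if t >= smear_start + window:
            return LEAP
        return LEAP * (t - smear_start) / window

    # Halfway through a 20-hour smear window, clocks run half a
    # second ahead of unsmeared UTC:
    print(smeared_offset(36000, 0, 72000))  # -> 0.5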


Ok, Google.


Decades ago, Albert Einstein introduced the general theory of relativity, which was already telling us that timestamps are bad for synchronisation.


How do other distributed databases handle this?


Google's Spanner [1] uses something it calls TrueTime:

"The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google’s cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks)."

[1] Spanner: Google's globally-distributed database https://www.usenix.org/system/files/conference/osdi12/osdi12...
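
For a rough feel of the "wait out the uncertainty" part, here's a toy Python sketch of the commit-wait idea from the paper; this is not Google's API, and the uncertainty bound is faked:

    import time

    def now_interval(uncertainty=0.010):
        # TrueTime-style now(): an interval [earliest, latest] that
        # is guaranteed to contain true time; here the bound is
        # faked with a constant.
        t = time.time()
        return (t - uncertainty, t + uncertainty)

    def commit():
        # Pick the latest possible current time as the commit
        # timestamp, then block until every correct clock must be
        # past it, so timestamp order matches real-time order.
        ts = now_interval()[1]
        while now_interval()[0] < ts:
            time.sleep(0.001)
        return ts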


With TrueTime you are trading some latency on concurrent operations for correctness.

Other structures such as CRDTs/lattices might be more appropriate for your use case.
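
For example, a grow-only counter, about the simplest CRDT going (a minimal Python sketch, one slot per node):

    class GCounter:
        # Each node increments only its own slot; merge is pairwise
        # max, so replicas converge no matter the delivery order.
        def __init__(self):
            self.slots = {}

        def increment(self, node, n=1):
            self.slots[node] = self.slots.get(node, 0) + n

        def value(self):
            return sum(self.slots.values())

        def merge(self, other):
            for node, n in other.slots.items():
                self.slots[node] = max(self.slots.get(node, 0), n)

    # Two replicas diverge, then merge to the same total:
    a, b = GCounter(), GCounter()
    a.increment("a"); b.increment("b"); b.increment("b")
    a.merge(b)
    print(a.value())  # -> 3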


By correctness you mean consistency? You don't have to be consistent all the time, i.e. you can trade consistency, but never correctness.

If we could have traded correctness, we could have optimized everything and gone home by now :)


Cassandra offers a mix of commutative operations (sets, maps, increments), an eventlog model, and lightweight (paxos-based) transactions. Unlike a key/value database like Riak, Cassandra can update individual fields of a row or document independently, which simplifies things enormously.

http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-v...

http://www.datastax.com/dev/blog/cql3_collections

http://www.datastax.com/dev/blog/lightweight-transactions-in...


Cassandra suffers from the same problem and can drop updates. The Paxos transactions were, and maybe still are, an absolute joke, as exposed by Aphyr.


Unfortunately, not in a particularly clever way. CP systems such as MongoDB, HBase, etc. don't have this problem since each datum has an authoritative master. As you can imagine, this can result in some operational...unpleasantness due to the lack of liveness guarantees in the presence of a network partition.

Out of the well known open-source AP systems, Riak is probably the leader here since they implement well understood techniques from the literature such as CRDTs and vclocks.

EDIT: removed my statement about Cassandra since it was a bit misleading and jbellis answered above in greater detail.


FoundationDB provides real ACID transactions and external consistency, and definitely does NOT rely on clock accuracy for soundness! (Google Spanner, which we are often compared to, does use a trusted clock, but Google went to extreme measures to make it accurate, including installing atomic clocks and GPS hardware.)

As for how, it's a long story. At bottom we rely on Paxos for consistency across failures, but we only actually do Paxos when there are failures. (We use less costly synchronous techniques for replication in "happy times".)


Dumb question: What breaks with the following approach?

1) set last_write_wins=true (so all updates, always apply, as described in the article)

2) avoid the "partition/rejoin may cause old values to stomp on new" issue by having "rejoin detection" which refuses to rejoin if clocks are "too out of sync" (a sketch of such a check follows below)
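
Something like this hypothetical check, in Python (the threshold and names are made up for illustration):

    MAX_SKEW = 0.050  # seconds; arbitrary illustrative threshold

    def may_rejoin(local_clock, cluster_clock):
        # Refuse to rejoin the cluster if this node's clock disagrees
        # with the cluster's reference clock by more than the allowed
        # skew at the moment of rejoin.
        return abs(local_clock - cluster_clock) <= MAX_SKEW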


I can't scroll down.


Same here. On Windows/Chrome 30 with a window size of about 900x950px, I can't scroll down. Increasing the window's width or decreasing the height makes it work again.


How would you get around the unreliability of clocks in VMs? Seems like deploying Riak in the cloud could be problematic.


You'd change the settings detailed in the "Stopping Last Write Wins" section of the original post.


Mars design. Assume one server is on Mars, with associated time dilation on its clocks and latency.





