Does a Cassandra cluster stay write-available in the event of a network partition?
If so, how does it reconcile writes when the partition heals? Last I looked, Cassandra doesn't use vector/logical clocks - doesn't that potentially cause data loss when the partition heals if you're using a simple last-write-wins based on physical timestamps for a reconciliation policy? Does Cassandra use merkle trees for anti-entropy?
From what I can tell, although Cassandra claims to be write-fault-tolerant, the dependence on physical timestamps and the lack of the self-healing properties that merkle trees provide make me nervous about data loss and inconsistency when deploying it at scale.
> Does a Cassandra cluster stay write-available in the event of a network partition?
The client can specify whether it wants consistency (refuse the write if too few write targets are reachable) or availability.
If it chooses availability, then Cassandra sends extra copies to nodes it _can_ reach, each tagged with who the "real" destination is. When that destination is reachable again, the write is forwarded to it. ("Hinted handoff.")
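To make the consistency/availability choice and the hint mechanism concrete, here is a minimal sketch in Python. It is not Cassandra's code: `Node`, `write`, `replay_hints`, and `required_acks` are hypothetical names, and parking the hint on the first reachable replica is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)
    # Hints held on behalf of unreachable nodes: (target_name, key, value)
    hints: list = field(default_factory=list)


def write(key, value, replicas, required_acks):
    """Write to every reachable replica and leave hints for the rest."""
    live = [n for n in replicas if n.up]
    if not live or len(live) < required_acks:
        # The client asked for consistency: refuse the write.
        raise RuntimeError("not enough replicas reachable")
    for node in live:
        node.data[key] = value
    for down in (n for n in replicas if not n.up):
        # Assumption: the first live replica holds the hint,
        # tagged with the name of the real destination.
        live[0].hints.append((down.name, key, value))
    return len(live)


def replay_hints(holder, nodes_by_name):
    """Forward stored hints to destinations that are reachable again."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = nodes_by_name[target_name]
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining
```

With `required_acks=1` the write succeeds as long as any replica is up, which is the "availability" choice. Raising it to the full replica count gives the "refuse writes" behavior.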
> how does it reconcile writes when the partition heals?
As you said, last-write-wins. The experience with Dynamo showed that most apps don't want to deal with explicit conflict resolution, and don't need it. But I suspect we will end up adding it as an option for those apps that do. In the meantime, if Cassandra isn't a good fit, we're not trying to hard-sell anyone. :)
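A rough illustration of what last-write-wins means when two sides of a partition are merged, and why the concern about physical timestamps is real for some apps. This is a sketch under assumptions: cells are modeled as `(timestamp, value)` tuples, `reconcile` and `merge_replicas` are hypothetical names, and ties simply keep the first argument, which is not necessarily how Cassandra breaks ties.

```python
def reconcile(a, b):
    """Keep the (timestamp, value) cell with the higher timestamp.

    Assumption: on a tie the first argument wins.
    """
    return a if a[0] >= b[0] else b


def merge_replicas(left, right):
    """Merge two replicas' views of the data after a partition heals."""
    merged = dict(left)
    for key, cell in right.items():
        merged[key] = reconcile(merged[key], cell) if key in merged else cell
    return merged


# Both sides accepted a write to the same key during the partition.
side_a = {"cart": (100, ["book"])}
side_b = {"cart": (105, ["pen"])}
healed = merge_replicas(side_a, side_b)
# The write stamped 100 is discarded without any conflict being reported.
```

Whether losing the older write matters depends on the app: for a "latest status" value it is harmless, for something like a shopping cart it silently drops data, and a client with a skewed clock can win or lose regardless of real ordering.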
> Does Cassandra use merkle trees for anti-entropy?
Not yet, but my co-worker Stu Hood is working on this. Should be part of the 0.5 release.
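For readers unfamiliar with the technique, this is roughly how Merkle-tree anti-entropy works in general. It is a generic sketch, not the code being written for Cassandra: `build_tree` and `differing_leaves` are hypothetical names, each leaf stands for the serialized contents of one key range, and the leaf count is assumed to be a power of two and equal on both replicas.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(tree_a) - 1, 0, -1):
        suspects = [i for i in suspects if tree_a[depth][i] != tree_b[depth][i]]
        suspects = [child for i in suspects for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]


replica_a = build_tree([b"range0", b"range1", b"range2", b"range3"])
replica_b = build_tree([b"range0", b"range1", b"rangeX", b"range3"])
out_of_sync = differing_leaves(replica_a, replica_b)
```

The point of the tree is that two replicas in sync only need to exchange the root hash, and when they differ, only the mismatched ranges have to be streamed.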
> the dependence on physical timestamps and the lack of the self-healing properties
Whether the first is an issue is app-specific. As to the latter, I'm excited to get the merkle tree code in, too.
In the meantime, Cassandra _does_ do read repair and hinted handoff, so in practice it's what I would call "barely adequate." :)
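Read repair can be sketched in a few lines as well. Again this is an illustration under assumptions, not Cassandra's implementation: replicas are modeled as plain dicts mapping a key to a `(timestamp, value)` cell, and `read_with_repair` is a hypothetical name.

```python
def read_with_repair(key, replicas):
    """Read a key from all replicas, return the newest value, and fix stale copies."""
    found = [r[key] for r in replicas if key in r]
    if not found:
        return None
    newest = max(found, key=lambda cell: cell[0])
    for r in replicas:
        if r.get(key) != newest:
            # Push the newest version to replicas that are stale or missing the key.
            r[key] = newest
    return newest[1]


r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}
r3 = {}
value = read_with_repair("k", [r1, r2, r3])
```

The limitation is visible in the sketch: only keys that actually get read are repaired, which is why a background mechanism such as Merkle-tree comparison is still wanted for data that is rarely touched.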