Fault tolerance is really the hardest part of the implementation. The documentation contains a detailed explanation of the features and how to use them. A future blog post will explain all the algorithms in detail (it's simply too long to post in a comment). In the meantime, you can take a look at the code on GitHub if you feel like it.
Any hints you want to give on how to handle node failures? For me, that's the potential weak point.