
Another one: don't use distributed locks built on Redis (Redlock) as if they were just another mutex.

Someone on the team decided to use Redlock to guard a section of code that accessed a third-party API. The code was racy when run from several concurrently running app instances, so access to it had to be serialized. A property of distributed locks is that they have timeouts (based on Redis' TTL, if I remember correctly) - other instances assume the lock is released after N seconds, so that an app instance which died doesn't leave the lock in the acquired state forever. Then one day, responses from the third-party API started taking longer than Redlock's timeout. Other app instances assumed the lock was released and basically started accessing the API simultaneously, without any synchronization. Data corruption ensued.
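For concreteness, the acquire pattern this failure mode rests on looks roughly like the sketch below (redis-py against a single node; the key name and call_third_party_api are made up):

    import uuid
    import redis

    r = redis.Redis(decode_responses=True)
    token = uuid.uuid4().hex

    # The lock auto-expires after 100 seconds whether or not the holder is done.
    if r.set("lock:third-party-api", token, nx=True, px=100_000):
        try:
            call_third_party_api()  # hypothetical; if it runs >100s, the
                                    # lock silently expires underneath us
        finally:
            # Release only if it's still ours. Real code should make this
            # check-and-delete atomic, e.g. with a Lua script.
            if r.get("lock:third-party-api") == token:
                r.delete("lock:third-party-api")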




All distributed locking systems have a liveness problem: what should you do when a participant fails? You can block forever, which is always correct but not super helpful. You can assume after some time that the process is broken, which preserves liveness. But what if it comes back? What if it was healthy all along and you just couldn't talk to it?

The classic solution is leases: assume bounded clock drift, and make lock holders promise to stop work some time after taking the lock. This is only correct if all clients play by the rules, and your clock drift hypothesis is right.
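A sketch of that lease discipline, assuming a 100-second lease and a made-up 500ms drift bound (work_remaining and do_one_unit_of_work are placeholders):

    import time

    LEASE_SECONDS = 100
    MAX_CLOCK_DRIFT = 0.5  # assumed bound; the whole scheme hinges on it

    deadline = time.monotonic() + LEASE_SECONDS - MAX_CLOCK_DRIFT

    # The holder promises to stop before the lease can possibly have
    # expired on the server, even under worst-case drift.
    while work_remaining() and time.monotonic() < deadline:
        do_one_unit_of_work()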

The other solution is to validate that the lock holder hasn't changed on every call - for example, with a lock generation epoch number. This needs to be enforced by the callee, or by a middle layer, which might seem like you've just pushed the fault-tolerance problem to somebody else. In practice, pushing it to somebody else, like a DB, is super useful! A sketch is below.
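Sketched as a SQL-ish compare-and-set (table and column names are invented; any store with atomic conditional writes works):

    # The DB only accepts a write carrying a newer fencing token than the
    # last one it has seen for that key, so a stale lock holder's write
    # fails instead of corrupting data.
    def guarded_write(conn, key, value, token):
        cur = conn.execute(
            "UPDATE records SET value = ?, fence = ? "
            "WHERE key = ? AND fence < ?",
            (value, token, key, token),
        )
        if cur.rowcount == 0:
            raise RuntimeError("stale lock epoch; write rejected")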

Finally, you can change call semantics to offer idempotency (or other race-safe semantics). Nice if you can get it.
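E.g. the idempotency-key style many payment APIs offer (endpoint and names are illustrative): two racing instances sending the same key get one execution and one deduplicated replay, instead of two effects.

    import requests

    def charge(amount_cents, idempotency_key):
        # Same key => the server deduplicates, so a concurrent or
        # retried call cannot double-charge.
        return requests.post(
            "https://api.example.com/charges",  # hypothetical endpoint
            json={"amount": amount_cents},
            headers={"Idempotency-Key": idempotency_key},
        )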


I found this blog post about Redlock quite interesting: https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...


That doesn't make any sense. The timeout is how long to block for and retry, not how long to block for and continue.


Instance A grabs the lock and makes an API call that takes 120 seconds. Instance B sees the lock but considers it expired once the lock times out at 100 seconds. Instance B falsely concludes A died, overwrites A's lock so the system doesn't deadlock waiting for A, and makes its own request. Unfortunately, A's request was still processing, and B's accidentally concurrent request caused corruption.


> falsely concludes

I disagree here. Instance B did the right thing given the information it had. Instance A should realize it no longer owns the lock and stop proceeding. But in reality it also signifies concurrency-based limitations in the API itself (no ability to perform a "do" and then a "commit" call). https://microservices.io/patterns/data/saga.html


I think we both agree that A did something wrong and that B followed the locking algorithm correctly. "Falsely" refers to a matter of fact: A is not dead.

You're right that A could try to stop, but I think it's more complicated than that. A is calling a third-party API, which may not have a way to cancel an in-flight request. If A can't cancel, then A should refresh its claim on the lock. A must have done neither in the example.
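A cheap (if imperfect) version of "realize it no longer owns the lock", sketched with redis-py - commit() is hypothetical, and note there's still a small window between the check and the commit that only fencing truly closes:

    import redis

    r = redis.Redis(decode_responses=True)

    def commit_if_still_owner(token, result):
        # Best-effort: drop the result if our lock expired mid-call.
        if r.get("lock:third-party-api") != token:
            raise RuntimeError("lock expired during the API call")
        commit(result)  # hypothetical downstream write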


I have implemented a distributed lock using DynamoDB, and the timeout for the lock release needs to be slightly greater than the time taken to process the thing under lock. Otherwise, things like what you mention will happen.
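For reference, the DynamoDB-style acquire is a conditional put that only succeeds if no unexpired lock item exists. A sketch with boto3 (table and attribute names are mine, not from any real lock library):

    import time
    import boto3

    table = boto3.resource("dynamodb").Table("locks")

    def acquire(lock_id, ttl_seconds):
        now = int(time.time())
        try:
            table.put_item(
                Item={"lock_id": lock_id, "expires_at": now + ttl_seconds},
                # Succeed only if no lock exists or the old one expired.
                ConditionExpression=(
                    "attribute_not_exists(lock_id) OR expires_at < :now"
                ),
                ExpressionAttributeValues={":now": now},
            )
            return True
        except table.meta.client.exceptions.ConditionalCheckFailedException:
            return False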


You should do a few things to combat this. One is to carefully monitor third-party API timings and lock acquisition timings. Knowing when you approach your distributed locking timeouts (and alerting if they time out more than occasionally) is key to... well, using distributed locks at all. There are distributed locking systems that require active unlocking without timeouts, but they break pretty easily if your process crashes and require manual intervention.

The second is to use a Redis client that has its own thread - your application blocking on a third-party API response shouldn't prevent you from updating/reacquiring the lock. You want a short timeout on the lock for liveness, but a longer maximum lock-acquire time, so that a task which takes several lock periods can still complete.
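A sketch of that watchdog idea with redis-py and a plain thread (call_third_party_api is a stand-in, and it assumes "lock:api" was already acquired with the value "token"; a production version should make the check-and-extend atomic with a Lua script):

    import threading
    import redis

    r = redis.Redis(decode_responses=True)

    def keep_lock_alive(key, token, ttl_ms, stop):
        # Refresh at ~1/3 of the TTL while the main thread is blocked.
        while not stop.wait(ttl_ms / 3000):
            if r.get(key) == token:
                r.pexpire(key, ttl_ms)
            else:
                return  # lost the lock; the main thread must notice

    stop = threading.Event()
    threading.Thread(target=keep_lock_alive,
                     args=("lock:api", "token", 10_000, stop),
                     daemon=True).start()
    try:
        call_third_party_api()  # may block far longer than one TTL
    finally:
        stop.set()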

The third is to not use APIs without idempotency. :)


That doesn’t make sense - they can't assume the lock is freed after the timeout. They have to retry to get the lock again, because another process might have taken it. Also, Redis is single-threaded, so access to Redis is by definition serialized.


The lock is explicitly released by the Redis server itself after the TTL. It's not that the client will assume that the lock is released.


In the past I used SQS, where the client can extend the TTL of a given message (or lock, in this case?) while it is still alive. Isn't that possible with Redis?


It is, with quite a bit of flexibility: https://redis.io/commands/expire/
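For example, refreshing a lock's TTL from redis-py (key name made up; the gt flag needs Redis 7+ and a recent client):

    import redis

    r = redis.Redis()
    r.pexpire("lock:my-lock", 100_000)           # reset TTL to 100 seconds
    # Redis >= 7.0 can be told to only ever lengthen the TTL:
    r.pexpire("lock:my-lock", 100_000, gt=True)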


As the other guy says, the lock is released by the server. If you don't have a mechanism to release it after a timeout, what happens if a node fails?


Redlock automatically releases a lock after a given timeout. The client can also release it early or refresh it.


Yes, which is what I said.

But now you have a released lock and a client that thinks they have the lock.


The problem here is that the request timeout is greater than the lock timeout.


While this might make this situation more likely to occur, you can never prevent concurrent accesses from happening in a distributed system.



