Another one: don't use Redis-based distributed locks (Redlock) as if they were just another mutex.
Someone on the team decided to use Redlock to guard a section of code which accessed a third-party API. The code was racy when accessed from several concurrently running app instances, so access to it had to be serialized. A property of distributed locks is that they have timeouts (based on Redis' TTL if I remember correctly) - other instances will assume the lock is released after N seconds, so that an app instance which died does not leave the lock in the acquired state forever. Then one day responses from the third-party API started taking longer than Redlock's timeout. Other app instances assumed the lock was released and basically started accessing the API simultaneously, without any synchronization. Data corruption ensued.
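To make the failure mode concrete, here is a minimal sketch of a TTL-based lock, assuming the redis-py client and a single Redis node rather than full Redlock (names are illustrative). If the guarded work outlives the TTL, the key expires and a second instance can acquire the "same" lock while the first is still running:

    import uuid
    import redis

    r = redis.Redis()

    # Delete the lock only if we still own it (compare-and-delete), so a slow
    # holder cannot release a lock that was since re-acquired by someone else.
    RELEASE = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """

    def with_ttl_lock(name, lock_ttl_ms, work):
        token = str(uuid.uuid4())  # identifies this particular holder
        # SET key value NX PX ttl: acquire only if the key does not already exist.
        if not r.set(name, token, nx=True, px=lock_ttl_ms):
            return False  # another instance holds the lock right now
        try:
            # DANGER: if work() takes longer than lock_ttl_ms (say, a slow
            # third-party API call), the key expires mid-flight and another
            # instance will run work() concurrently -- exactly the corruption above.
            work()
        finally:
            r.eval(RELEASE, 1, name, token)
        return True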
All distributed locking systems have a liveness problem: what should you do when a participant fails? You can block forever, which is always correct but not super helpful. You can assume after some time that the process is broken, which preserves liveness. But what if it comes back? What if it was healthy all along and you just couldn't talk to it?
The classic solution is leases: assume bounded clock drift, and make lock holders promise to stop work some time after taking the lock. This is only correct if all clients play by the rules, and your clock drift hypothesis is right.
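A rough sketch of that lease discipline, with made-up values for the lease length and the clock drift bound:

    import time

    LEASE_SECONDS = 100     # how long others will consider us the lock holder
    MAX_CLOCK_DRIFT = 2.0   # assumed bound on clock drift; the scheme breaks if this is wrong

    def do_leased_work(steps):
        # Promise to stop working strictly before the lease can expire elsewhere.
        deadline = time.monotonic() + LEASE_SECONDS - MAX_CLOCK_DRIFT
        for step in steps:
            if time.monotonic() >= deadline:
                # From here on, another instance may legitimately hold the lock.
                raise TimeoutError("lease expiring, aborting remaining work")
            step()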
The other solution is to validate that the lock holder hasn't changed on every call - for example, with a lock generation epoch number. This needs to be enforced by the callee, or by a middle layer, which might seem like you've just pushed the fault tolerance problem to somebody else. In practice, pushing it to somebody else, like a DB, is super useful!
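A hedged sketch of what pushing it to the DB can look like - the store remembers the highest lock epoch it has seen and fences off writes from stale holders (table and column names are made up):

    import sqlite3

    def apply_update(conn: sqlite3.Connection, resource_id: str,
                     writer_epoch: int, new_value: str) -> bool:
        # Accept the write only if no holder with a newer epoch has written yet;
        # equal epochs are allowed so the same holder can write more than once.
        cur = conn.execute(
            "UPDATE resources SET value = ?, epoch = ? WHERE id = ? AND epoch <= ?",
            (new_value, writer_epoch, resource_id, writer_epoch),
        )
        conn.commit()
        return cur.rowcount == 1  # False means a stale lock holder was rejected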
Finally, you can change call semantics to offer idempotency (or other race-safe semantics). Nice if you can get it.
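For example, a sketch in which racing callers pass the same idempotency key, so the provider executes the operation at most once - the endpoint and the Idempotency-Key header are assumptions about the third-party API, not a given:

    import requests

    def create_order(payload: dict, idempotency_key: str) -> requests.Response:
        # If instance A's call is still in flight when instance B retries the
        # same logical operation, the provider deduplicates on the key instead
        # of doing the work twice.
        return requests.post(
            "https://api.example.com/orders",              # hypothetical endpoint
            json=payload,
            headers={"Idempotency-Key": idempotency_key},  # assumed to be supported
            timeout=30,
        )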
Instance A grabs the lock and makes an API call that takes 120 seconds. Instance B sees the lock but considers it expired once the 100-second timeout elapses. Instance B falsely concludes A died, overwrites A's lock so the system doesn't deadlock waiting for A, and makes its own request. Unfortunately, A's request is still processing, and B's accidentally concurrent request causes corruption.
I disagree here. Instance B did the right thing given the information it had. Instance A should realize it no longer owns the lock and stop proceeding. But in reality it also points to concurrency limitations in the API itself (no ability to perform a "do" and then a "commit" call). https://microservices.io/patterns/data/saga.html
I think we both agree that A did something wrong and that B followed the locking algorithm correctly. "Falsely" refers to a matter of fact: A is not dead.
You're right that A could try to stop, but I think it's more complicated than that. A is calling a third party API, which may not have a way to cancel an in-flight request. If A can't cancel, then A should refresh its claim on the lock. A must have done neither in the example.
I have implemented a distributed lock using DynamoDB, and the timeout for the lock release needs to be slightly greater than the time taken to process the thing under lock. Otherwise, things like what you mention will happen.
You should do two things to combat this. One is to carefully monitor third-party API timings and lock acquisition timings. Knowing when you approach your distributed locking timeouts (and alerting if they time out more than occasionally) is key to... well, using distributed locks at all. There are distributed locking systems that require active unlocking without a timeout, but they break pretty easily if your process crashes and require manual intervention.
The second is to use a Redis client that has its own thread - your application blocking on a third-party API response shouldn't prevent you from updating/reacquiring the lock (a minimal sketch follows below). You want a short timeout on the lock for liveness, but a longer maximum lock-acquire time, so that if it takes several periods to complete a task you still can.
The third is to not use APIs without idempotency. :)
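A minimal sketch of the second point, assuming the redis-py client (names and numbers are illustrative): a watchdog thread keeps extending a short TTL while the main thread blocks on the third-party call, with a hard cap on the total hold time.

    import threading
    import time
    import uuid
    import redis

    r = redis.Redis()

    # Extend the TTL only if we still own the lock.
    REFRESH = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('pexpire', KEYS[1], ARGV[2])
    end
    return 0
    """
    # Release only if we still own the lock.
    RELEASE = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """

    def run_with_refreshed_lock(name, work, ttl_ms=10_000,
                                refresh_every_s=3.0, max_hold_s=600.0):
        token = str(uuid.uuid4())
        if not r.set(name, token, nx=True, px=ttl_ms):
            return False  # someone else holds the lock

        stop = threading.Event()
        give_up_at = time.monotonic() + max_hold_s

        def refresher():
            # Short TTL for liveness, refreshed often; the hard cap bounds how
            # long a wedged process can keep the lock.
            while not stop.wait(refresh_every_s):
                if time.monotonic() > give_up_at:
                    break
                r.eval(REFRESH, 1, name, token, ttl_ms)

        t = threading.Thread(target=refresher, daemon=True)
        t.start()
        try:
            work()  # e.g. the slow third-party API call
        finally:
            stop.set()
            t.join()
            r.eval(RELEASE, 1, name, token)
        return True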
That doesn't make sense - they can't assume the lock is freed after the timeout. They have to retry to acquire the lock, because another process might have taken it in the meantime. Also, Redis is single-threaded, so access to Redis is by definition serialized.
In the past I used SQS, where the client can extend the TTL of a given message (or lock, in this case?) while it is still alive. Isn't that possible with Redis?