Another one: don't use Redis-based distributed locks (Redlock) as if they were just another mutex.
Someone on the team decided to use Redlock to guard a section of code which accessed a third-party API. The code was racy when accessed from several concurrently running app instances, so access to it had to be serialized. A property of distributed locks is that they have timeouts (based on Redis' TTL if I remember correctly) - other instances will assume the lock is released after N seconds, so that an app instance which died does not leave the lock in the acquired state forever. Then one day responses from the third-party API started taking longer than Redlock's timeout. Other app instances assumed the lock was released and basically started accessing the API simultaneously, without any synchronization. Data corruption ensued.
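To make the failure mode concrete, here is a minimal sketch of a TTL-based lock, assuming the redis-py client and a single Redis node rather than full Redlock (names are illustrative). If the guarded work outlives the TTL, the key expires and a second instance can acquire the "same" lock while the first is still running:

    import uuid
    import redis

    r = redis.Redis()

    # Delete the lock only if we still own it (compare-and-delete), so a slow
    # holder cannot release a lock that was since re-acquired by someone else.
    RELEASE = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """

    def with_ttl_lock(name, lock_ttl_ms, work):
        token = str(uuid.uuid4())  # identifies this particular holder
        # SET key value NX PX ttl: acquire only if the key does not already exist.
        if not r.set(name, token, nx=True, px=lock_ttl_ms):
            return False  # another instance holds the lock right now
        try:
            # DANGER: if work() takes longer than lock_ttl_ms (say, a slow
            # third-party API call), the key expires mid-flight and another
            # instance will run work() concurrently -- exactly the corruption above.
            work()
        finally:
            r.eval(RELEASE, 1, name, token)
        return True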
All distributed locking systems have a liveness problem: what should you do when a participant fails? You can block forever, which is always correct but not super helpful. You can assume after some time that the process is broken, which preserves liveness. But what if it comes back? What if it was healthy all along and you just couldn't talk to it?
The classic solution is leases: assume bounded clock drift, and make lock holders promise to stop work some time after taking the lock. This is only correct if all clients play by the rules, and your clock drift hypothesis is right.
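A rough sketch of that lease discipline, with made-up values for the lease length and the clock drift bound:

    import time

    LEASE_SECONDS = 100     # how long others will consider us the lock holder
    MAX_CLOCK_DRIFT = 2.0   # assumed bound on clock drift; the scheme breaks if this is wrong

    def do_leased_work(steps):
        # Promise to stop working strictly before the lease can expire elsewhere.
        deadline = time.monotonic() + LEASE_SECONDS - MAX_CLOCK_DRIFT
        for step in steps:
            if time.monotonic() >= deadline:
                # From here on, another instance may legitimately hold the lock.
                raise TimeoutError("lease expiring, aborting remaining work")
            step()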
The other solution is to validate that the lock holder hasn't changed on every call - for example, with a lock generation epoch number. This needs to be enforced by the callee, or by a middle layer, which might seem like you've just pushed the fault tolerance problem to somebody else. In practice, pushing it to somebody else, like a DB, is super useful!
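A hedged sketch of what pushing it to the DB can look like - the store remembers the highest lock epoch it has seen and fences off writes from stale holders (table and column names are made up):

    import sqlite3

    def apply_update(conn: sqlite3.Connection, resource_id: str,
                     writer_epoch: int, new_value: str) -> bool:
        # Accept the write only if no holder with a newer epoch has written yet;
        # equal epochs are allowed so the same holder can write more than once.
        cur = conn.execute(
            "UPDATE resources SET value = ?, epoch = ? WHERE id = ? AND epoch <= ?",
            (new_value, writer_epoch, resource_id, writer_epoch),
        )
        conn.commit()
        return cur.rowcount == 1  # False means a stale lock holder was rejected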
Finally, you can change call semantics to offer idempotency (or other race-safe semantics). Nice if you can get it.
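For example, a sketch in which racing callers pass the same idempotency key, so the provider executes the operation at most once - the endpoint and the Idempotency-Key header are assumptions about the third-party API, not a given:

    import requests

    def create_order(payload: dict, idempotency_key: str) -> requests.Response:
        # If instance A's call is still in flight when instance B retries the
        # same logical operation, the provider deduplicates on the key instead
        # of doing the work twice.
        return requests.post(
            "https://api.example.com/orders",              # hypothetical endpoint
            json=payload,
            headers={"Idempotency-Key": idempotency_key},  # assumed to be supported
            timeout=30,
        )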
Instance A grabs the lock and makes an API call that takes 120 seconds. Instance B sees the lock but considers it expired once the 100-second timeout elapses. Instance B falsely concludes A died, overwrites A's lock so the system doesn't deadlock waiting for A, and makes its own request. Unfortunately, A's request is still processing, and B's accidentally concurrent request causes corruption.
I disagree here. Instance B did the right thing given the information it had. Instance A should realize it no longer owns the lock and stop proceeding. But in reality it also points to concurrency limitations in the API itself (no ability to perform a "do" and then a "commit" call). https://microservices.io/patterns/data/saga.html
I think we both agree that A did something wrong and that B followed the locking algorithm correctly. "Falsely" refers to a matter of fact: A is not dead.
You're right that A could try to stop, but I think it's more complicated than that. A is calling a third party API, which may not have a way to cancel an in-flight request. If A can't cancel, then A should refresh its claim on the lock. A must have done neither in the example.
I have implemented a distributed lock using DynamoDB, and the timeout for the lock release needs to be slightly greater than the time taken to process the thing under lock. Otherwise, things like what you mention will happen.
You should do two things to combat this. One is to carefully monitor third-party API timings and lock acquisition timings. Knowing when you approach your distributed locking timeouts (and alerting if they time out more than occasionally) is key to... well, using distributed locks at all. There are distributed locking systems that require active unlocking without a timeout, but they break pretty easily if your process crashes and require manual intervention.
The second is to use a Redis client that has its own thread - your application blocking on a third-party API response shouldn't prevent you from updating/reacquiring the lock (a minimal sketch follows below). You want a short timeout on the lock for liveness, but a longer maximum lock-acquire time, so that if it takes several periods to complete a task you still can.
The third is to not use APIs without idempotency. :)
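A minimal sketch of the second point, assuming the redis-py client (names and numbers are illustrative): a watchdog thread keeps extending a short TTL while the main thread blocks on the third-party call, with a hard cap on the total hold time.

    import threading
    import time
    import uuid
    import redis

    r = redis.Redis()

    # Extend the TTL only if we still own the lock.
    REFRESH = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('pexpire', KEYS[1], ARGV[2])
    end
    return 0
    """
    # Release only if we still own the lock.
    RELEASE = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """

    def run_with_refreshed_lock(name, work, ttl_ms=10_000,
                                refresh_every_s=3.0, max_hold_s=600.0):
        token = str(uuid.uuid4())
        if not r.set(name, token, nx=True, px=ttl_ms):
            return False  # someone else holds the lock

        stop = threading.Event()
        give_up_at = time.monotonic() + max_hold_s

        def refresher():
            # Short TTL for liveness, refreshed often; the hard cap bounds how
            # long a wedged process can keep the lock.
            while not stop.wait(refresh_every_s):
                if time.monotonic() > give_up_at:
                    break
                r.eval(REFRESH, 1, name, token, ttl_ms)

        t = threading.Thread(target=refresher, daemon=True)
        t.start()
        try:
            work()  # e.g. the slow third-party API call
        finally:
            stop.set()
            t.join()
            r.eval(RELEASE, 1, name, token)
        return True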
That doesn't make sense - they can't assume the lock is freed after the timeout. They have to retry to acquire the lock, because another process might have taken it in the meantime. Also, Redis is single-threaded, so access to Redis is by definition serialized.
In the past I used SQS, where the client can extend the TTL of a given message (or lock, in this case?) while it is still alive. Isn't that possible with Redis?