Ways to shoot yourself in the foot with Redis (philbooth.me)
179 points by philbo on July 29, 2023 | 80 comments



My team manages a handful of clusters at work and I wrote an internal Redis client proxy (it's on my todo list to open source it). A few things I tell other teams to set them up for success (we use ElastiCache):

- Connection pooling / pipelining and circuit breaking are a must at scale (see the sketch below). The clients are a lot better than they used to be, but it's important developers understand the behavior of the client library they are using. Someone suggested using Envoy as a sidecar proxy; I personally wouldn't after our experience using it with Redis, but it's an easy option.

- Avoid changing the cluster topology if the CPU load is over 40%. This is primarily in case of unplanned failures during a change.

- If something goes wrong, shed load application-side as quickly as possible, because Redis won't recover if it's being hammered. You'll need to either have feature flags or be able to scale down your application.

- Having replicas won't protect you from data loss, so don't treat it as a source of truth. Also, don't rely on consistency in clustered mode.

- Remember Redis is single-threaded, so an 8xl isn't going to be super useful with all those unused cores.
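A minimal sketch of the pooling/pipelining point, in Python with redis-py (the endpoint name, pool size, and timeouts are made-up values, not recommendations):

  import redis

  # Bound the pool so a traffic spike can't exhaust the server with connections,
  # and fail fast instead of letting callers pile up behind a slow node.
  pool = redis.ConnectionPool(
      host="my-cluster.example.cache.amazonaws.com",  # hypothetical endpoint
      port=6379,
      max_connections=50,
      socket_timeout=0.25,
      socket_connect_timeout=0.25,
  )
  r = redis.Redis(connection_pool=pool)

  # Pipeline several reads into a single round-trip instead of one RTT per key.
  pipe = r.pipeline(transaction=False)
  for key in ("session:1", "session:2", "session:3"):
      pipe.get(key)
  values = pipe.execute()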

Things we have alarms on by default:

- Engine utilization

- Anomalies in replication lag

- Network throughput (relative to throughput of the underlying EC2 instance)

- Bytes used for cache

- Swap usage (this is the oh-shit alarm)


You need two line breaks after each item so they'll show stacked as a list


Another one: don't treat distributed locks built on Redis (Redlock) as if they were just another mutex.

Someone on the team decided to use Redlock to guard a section of code which accessed a third-party API. The code was racy when accessed from several concurrently running app instances, so access to it had to be serialized. A property of distributed locking is that it has timeouts (based on Redis' TTL if I remember correctly) - other instances will assume the lock is released after N seconds, to make sure an app instance which died does not leave the lock in the acquired state forever. So one day responses from the third party API started taking more time than Redlock's timeout. Other app instances were assuming the lock was released and basically started accessing the API simultaneously without any synchronization. Data corruption ensued.


All distributed locking systems have a liveness problem: what should you do when a participant fails? You can block forever, which is always correct but not super helpful. You can assume after some time that the process is broken, which preserves liveness. But what if it comes back? What if it was healthy all along and you just couldn't talk to it?

The classic solution is leases: assume bounded clock drift, and make lock holders promise to stop work some time after taking the lock. This is only correct if all clients play by the rules, and your clock drift hypothesis is right.

The other solution is to validate that the lock holder hasn't changed on every call, for example with a lock generation epoch number. This needs to be enforced by the callee, or by a middle layer, which might seem like you've just pushed the fault tolerance problem to somebody else. In practice, pushing it to somebody else, like a DB, is super useful!

Finally, you can change call semantics to offer idempotency (or other race-safe semantics). Nice if you can get it.
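A rough sketch of the lock-generation-epoch idea in Python with redis-py (key names and the downstream `conditional_put` are hypothetical, and the acquire-then-increment here is not atomic; it just illustrates the shape):

  import redis

  r = redis.Redis()

  def acquire(lock_key, holder_id, ttl_ms):
      # NX: only acquire if nobody holds it; PX: auto-expire so a dead holder
      # can't wedge the system forever.
      if r.set(lock_key, holder_id, nx=True, px=ttl_ms):
          # Monotonically increasing fencing token handed to the callee.
          return r.incr(lock_key + ":fence")
      return None

  def write_if_current(store, record_id, payload, fence_token):
      # The callee (a DB, storage layer, ...) must enforce the check: reject
      # any write whose token is older than the newest one it has seen.
      return store.conditional_put(record_id, payload, fence_token)  # hypothetical API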


I found this blog post about Redlock quite interesting: https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...


That doesn't make any sense. The timeout is how long to block for and retry, not how long to block for and continue.


Instance A grabs the lock and makes an API call that takes 120 seconds. Instance B sees the lock but considers it expired after the lock times out at 100 seconds. Instance B falsely concludes A died, overwrites A's lock so the system doesn't deadlock waiting for A, and makes its own request. Unfortunately, A's request was still processing and B's accidentally concurrent request caused corruption.


> falsely concludes

I disagree here. Instance B did the right thing given the information it had. Instance A should realize it no longer owns the lock and stop proceeding. But in reality it also signifies concurrency-based limitations in the API itself (no ability to perform a "do" and then a "commit" call). https://microservices.io/patterns/data/saga.html


I think we both agree that A did something wrong and that B followed the locking algorithm correctly. "Falsely" refers to a matter of fact: A is not dead.

You're right that A could try to stop, but I think it's more complicated than that. A is calling a third party API, which may not have a way to cancel an in-flight request. If A can't cancel, then A should refresh its claim on the lock. A must have done neither in the example.


I have implemented a distributed lock using DynamoDB, and the timeout for the lock release needs to be slightly greater than the time taken to process the thing under lock. Otherwise, things like what you mention will happen.


You should do two things to combat this. One is to carefully monitor third party API timings and lock acquisition timings. Knowing when you approach your distributed locking timeouts (and alerting if they time out more than occasionally) is key to... well, using distributed locks at all. There are distributed locking systems that require active unlocking without a timeout, but they break pretty easily if your process crashes and require manual intervention.

The second is to use a redis client that has its own thread - your application blocking on a third party API response shouldn't prevent you from updating/reacquiring the lock. You want a short timeout on the lock for liveness but a longer maximum lock acquire time so that if it takes several periods to complete a task you still can.

The third is to not use APIs without idempotency. :)
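For the second point, a rough sketch in Python with redis-py of a lock whose TTL is refreshed from a separate thread, but only while we still own it (key name, timings, and `call_third_party_api` are made up):

  import threading, uuid
  import redis

  r = redis.Redis()
  token = str(uuid.uuid4())

  # Extend the TTL only if the lock still holds our token (check-and-expire).
  EXTEND = r.register_script("""
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
  end
  return 0
  """)

  def keep_alive(stop):
      while not stop.wait(5):                              # refresh every 5 seconds
          if EXTEND(keys=["lock:thirdparty"], args=[token, 15000]) == 0:
              break                                        # lost the lock; stop refreshing

  if r.set("lock:thirdparty", token, nx=True, px=15000):   # short initial TTL for liveness
      stop = threading.Event()
      threading.Thread(target=keep_alive, args=(stop,), daemon=True).start()
      try:
          call_third_party_api()                           # hypothetical slow call
      finally:
          stop.set()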


That doesn’t make sense, they can’t assume the lock is freed after the timeout. They have to retry to get the lock again, because another process might have taken the lock. Also, redis is single threaded so access to redis is by definition serialized.


The lock is explicitly released by the Redis server itself after the TTL. It's not that the client will assume that the lock is released.


In the past I used SQS, where the client can extend the TTL of a given message (or lock in this case?) while it is still alive. Isn't that possible with Redis?


It is, with quite a bit of flexibility https://redis.io/commands/expire/
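For example, something along these lines (Python with redis-py, hypothetical key name; in practice you'd first check you still own the lock, as in the refresher sketch further up the thread):

  import redis
  r = redis.Redis()
  # Give ourselves another 30 seconds, much like extending an SQS visibility timeout.
  r.pexpire("lock:job:42", 30000)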


As the other guy says the lock is released by the server. If you don't have a mechanism to release it after a timeout, what happens if a node fails?


RedLock automatically releases a lock after a given timeout. The server can just release it early or refresh it also.


Yes, which is what I said.

But now you have a released lock and a client that thinks they have the lock.


The problem here is that the request timeout is greater than the lock timeout.


While this might make this situation more likely to occur, you can never prevent concurrent accesses from happening in a distributed system.


> I wrote a basic session cache using GET, which fell back to a database query and SET to populate the cache in the event of a miss. Crucially, it held onto the Redis connection for the duration of that fallback condition and allowed errors from SET to fail the entire operation. Increased traffic, combined with a slow query in Postgres, caused this arrangement to effectively DOS our Redis connection pool for minutes at a time.

This has nothing to do with the redis server. This is bad application code monopolizing a single connection waiting for an unrelated operation. A stateless request / response to interact with redis for the individual operations does not hold any such locks.
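A sketch of the safer shape in Python with redis-py (`db.load_session` stands in for the slow Postgres query): Redis is only touched for the two short cache commands, and a failed SET is logged rather than failing the request.

  import json, logging
  import redis

  r = redis.Redis()

  def get_session(session_id, db):
      cached = r.get("session:" + session_id)
      if cached is not None:
          return json.loads(cached)

      session = db.load_session(session_id)   # hypothetical slow database query
      try:
          r.set("session:" + session_id, json.dumps(session), ex=300)
      except redis.RedisError:
          # Cache population is best-effort; don't let it fail the request.
          logging.warning("session cache write failed; serving from database")
      return session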


> This has nothing to do with the redis server. This is bad application code monopolizing a single connection waiting for an unrelated operation.

Well, yes. That is why the preceding sentence, which you didn't quote, said "poorly-implemented application logic". So thanks for agreeing with my post, I guess.

The point, in case you missed it, was to advertise ways I'd fucked up and hopefully help others not to fuck up the same way in future. It was never my intention to say Redis was the problem and I'm sorry if it made you think that.



Don't expose your Redis to the internet (please!), even if it is a testing server and you will never put important data on it. Don't whitelist large swathes of your cloud/hosting provider's subnets either. Of course Redis isn't special here; the same goes for mongo, elastic, docker, k8s, etc.


This. Configure private vlans and/or Wireguard or whatever VPN software you prefer.


And what about mTLS?


mTLS doesn't affect this advice at all. You should, where possible, use mTLS because it's good security. You shouldn't leave your Redis server open to the internet anyway, to cut down on logspam.

With mTLS, a good security posture is to log every connection establishment, with basic metadata about the certificate involved - its SAN and public key hash are the best bet. For troubleshooting, do that logging before the authentication decision. But anyone can make their own certificate, so keeping network controls in place keeps that list free of clutter.


Some companies build their trust model on top of mTLS and that's fine. TLS handles the authentication before anything hits redis. I can see people debating the pros and cons of say Wireguard vs mTLS vs ipsec vs other protocols. Such debate might be useful if you are starting from scratch. However, if you have existing infrastructure using either of these, you would need a compelling reason to switch.


mTLS solves the common Redis abuses I know of, but still, don't presume a vuln in Redis won't be discovered in the future. But if you are going out of your way to configure mTLS, then I am guessing you have a good reason to?


> One common mistake is serialising objects to JSON strings before storing them in Redis. This works for reading and writing objects as atomic units but is inefficient for reading or updating individual properties within an object

I would love to see some numbers on this. My intuition says there are probably some workloads where JSON strings are better and some where one key per property is better.
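For reference, these are the two shapes being compared, sketched in Python with redis-py (key and field names made up), which is roughly what you'd benchmark:

  import json
  import redis

  r = redis.Redis()

  # Whole object as one JSON string: a single key, read/written as a blob.
  r.set("user:42", json.dumps({"name": "Ada", "visits": 7}))
  user = json.loads(r.get("user:42"))

  # One hash per object: individual fields can be read or updated in place.
  r.hset("user:42:h", mapping={"name": "Ada", "visits": 7})
  r.hincrby("user:42:h", "visits", 1)   # bump one field without rewriting the rest
  name = r.hget("user:42:h", "name")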


Depends on at which level you need atomic updates: at the entire document level, or at the individual property level.


I think you can get atomicity either way with Lua or transactions. I'm more interested in performance.
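For instance, a read-modify-write on the JSON-string shape can be made atomic with optimistic locking (WATCH/MULTI); a sketch in Python with redis-py, key name made up:

  import json
  import redis

  r = redis.Redis()

  with r.pipeline() as pipe:
      while True:
          try:
              pipe.watch("user:42")
              user = json.loads(pipe.get("user:42"))   # immediate-mode read while watching
              user["visits"] += 1
              pipe.multi()
              pipe.set("user:42", json.dumps(user))
              pipe.execute()
              break
          except redis.WatchError:
              continue   # someone else wrote the key between read and write; retry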


I had a jr dev connect and type 'flushall' because he thought it would refresh the dataset to disk.

thankfully it was on a staging env, I think he's at google now.


It sounds like the subtext is that this dev was incompetent, but if all they did is mess up a staging environment, it sounds like things were working as intended.

If a junior dev can cause catastrophic harm from one wrong command, it's the org's fault for not having safeguards in place, not the dev's fault for an (understandable) error.


yeah. we built many many moats of protections, but this guy was... optimistic, I guess? by and large our redis stuff was ephemeral, but we had a particular key that was a domain table that needed to be loaded separately, and that caused some problems.

incompetent is maybe a bit harsh, but i did say he was junior, and junior devs make mistakes, and this guy was well meaning and messed up. you don't get from junior to senior or principal or staff without some mistakes, and it's the responsibility of the more senior devs to not have them in a position where their mistakes are catastrophic.


You can use rename-command to help avoid these kinds of mistakes:

  # To disable:
  rename-command FLUSHALL ""

  # To rename:
  rename-command FLUSHALL DANGER_WILL_ROBINSON_FLUSH_ALL


Instead, use the Redis ACL commands to remove permissions for commands you don't want executed by a user (e.g. FLUSHALL, FLUSHDB), as opposed to renaming commands.
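For example, with Redis 6+ ACLs (user name and password are placeholders):

  ACL SETUSER app on >s3cret ~* +@all -flushall -flushdb
  ACL SETUSER default off

The same rules can also live in redis.conf as `user ...` lines if you prefer config over runtime commands.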


One time during my internship years ago I took down a production server because of a command I ran on it that I didn’t fully understand.

Since then I treat any prod server terminal like I'm entering launch codes for a missile system.

Anything outside of ls or cd I’m very careful, read the command a couple times before executing, etc.


For critical and overall... fiddly things, we've grown into a culture of writing down reviewable plans and possibly executing these plans in pairs.

We tend to go ahead and either use a runbook, or whatever experience we might have, to set up a pretty detailed plan of what to run on which systems and for which purpose. You can then throw these plans at someone else to review. Sure, it takes an hour or two more to set up a solid plan, and waiting for a review takes time as well. But this has turned into a great tool to build up experience in weird parts of the infrastructure.


I get that, but it also can turn into wiki checklists of things that could be automated.

one of my frustrations of bigco software is people taking basically maintenance roles where the computer tells them what to do, because the lava flow legacy code base is too scary to touch.

however, you can automate your daily clean up tasks. it's certainly shellacking more mud on the ball, but if you're not going to even try scripting your repetitive tasks, then i don't know why you're a programmer.


Our running gag is: Once such a runbook has been sufficiently refined and clarified to the point of being really comprehensive and easy to follow.... someone turns it into a jenkins job and we don't need it anymore.


In my opinion you weren't at fault here. Production systems should be designed so that one person can't inadvertently destroy things.


In almost every place I've worked at, the most difficult thing has been getting people out of ad-hoc JFDI style development and debugging, where everything in production is fair game, and into a process where you avoid touching production as much as humanly possible.

Takes a lot of effort to stop people opening up a shell in prod or grabbing a prod DB dump or even just connecting to the prod datastore directly from their local env.


This is the way.


Reminds me of the time I ran "killall" on SunOS, which didn't kill a process by name as it did under Linux, instead it killed all processes.

That's the kind of mistake you only make once!


I've had someone do this in production. Even worse, it turns out when each microservice needed a redis instance, sysops was just expanding the main redis instance and pointing the service at it instead of giving each microservice their own instance.


He learned a very important lesson. That's one guy you can pretty much guarantee (if he has any brains at all) will be very careful about doing anything on a live production system in the future. In this case Google probably got a good deal.


if only my immaculate record of never "rm -rf"ing myself or prod dbs resulted in me working at google...


I've found that your mileage will vary when using Redis in clustered mode, because even if there is an official Redis driver in your language of choice that supports it, that support might not be exposed by the libraries that depend on it. In those cases you'll just be connecting to a single specific instance in the cluster but will mistakenly believe that isn't the case.

I've noticed this particularly with Ruby where the official gem has cluster and sentinel support, but many other gems that depend on Redis expose their own abstraction for configuring it and it isn't compatible with the official package.

Of course, I think that running Redis in clustered mode is actually just another way to shoot yourself in the foot, especially if a standalone instance isn't causing you any trouble, as you can easily run into problems with resharding or poorly distributing the keyspace. Maybe just try out Sentinel for HA and failover support if you want some resilience.


It seems like you can run Envoy as a sidecar next to each application instance to allow non-cluster-aware libraries to use the cluster: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overv...


This is cool! Didn't know


That's interesting. I suppose your Redis driver is still unaware of cluster mode though, so errors are going to surface at the protocol level and not the application level.

Better than nothing though.


Has anyone seen max (p100) client latencies of 300 to 400ms but totally normal p99? We see this across almost all our redis clusters on elasticache and have no idea why. CPU usage is tiny. Slowlog shows nothing.


I would guess your problem is probably scheduler-based. The default(ish) Linux scheduler operates in 100ms increments, and the first use of a client takes 3-4 round-trips: TCP opens and blocks, the request is sent and the client blocks on write, then the client attempts to read and blocks on read. If CPU usage is momentarily high, each of these yields to another process and your client isn't scheduled again for another 100ms.


Hmm. We have super low CPU utilization- something like 9%. This is also across 10+ different clusters.


We also pool our clients heavily. Maybe we could reduce the new connections to zero to test.


Are you evicting or deleting large sets (or lists or sorted sets)? We use a Django ORM caching library that adds each resultset's cache key to a set of keys to invalidate when that table is updated – at which point it issues `DEL <set key>` and if that set has grown to hundreds of thousands – or millions! – of keys the main Redis process will block completely for as long as it takes to loop through and evict them.
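Two common ways to avoid that blocking, sketched in Python with redis-py (key name made up; UNLINK needs Redis 4+):

  import redis
  r = redis.Redis()

  # Option 1: UNLINK removes the key immediately and frees the memory on a background thread.
  r.unlink("queryset:books")

  # Option 2: shrink the set incrementally instead of issuing one giant DEL.
  while r.spop("queryset:books", 1000):
      pass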


nope!


Is the memory full and evicting? Or do you have a large db with lots of keys with TTLs? Redis does a bunch of maintenance stuff "in the background", but IIRC it runs on the same thread, so it's not really background work.


Memory is maybe 50% full. We are totally over provisioned. We actually just downsized and it didn’t impact anything.

We do expire but we don’t think we have a thundering herd problem with them all happening at the same time.


Is it doing backups?


My understanding is elasticache does not let you turn them off.


That would be surprising, have you tried with CONFIG SET xyz ?


Running from the client side says that the config command doesn’t exist. Not sure how to run from the server side on elasticache.


Redis is great, it's a great piece of software. One way I shot myself in the foot (kinda) with it: used it with a 1:N query fanout pattern. I.e., issued N queries to Redis for 1 incoming query to my service. My service by design needs to do N queries (it's a long story). But Redis is not really designed to be used like this and I was putting it under very high load. I swapped it out with an SQLite cache recently and got rid of the errors that would pop up from putting extreme stress on the Redis server.
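For what it's worth, if the N lookups have to stay in Redis, one common mitigation (a sketch in Python with redis-py, key names made up) is to collapse them into a single round-trip with MGET or a pipeline:

  import redis
  r = redis.Redis()

  keys = ["item:%d" % i for i in range(100)]   # the N keys behind one incoming query
  values = r.mget(keys)                        # one round-trip instead of 100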


> Crucially, it held onto the Redis connection for the duration of that fallback condition and allowed errors from SET to fail the entire operation.

What? Was this inside a MULTI (transaction) or something? This isn't a flaw of Redis being single-threaded. Honestly all of these "footguns" sound like amateur programmer mistakes and have zero to do with Redis.


No. As it explains at the beginning of the paragraph you're quoting:

> If you're particularly naive, like I was on one occasion, you'll exacerbate these failures with some poorly-implemented application logic.

Then a few paragraphs above that is this sentence:

> The gotchas that follow were all occasions when I didn't use it correctly.

I'm not sure how to make it more clear that I'm criticising myself, not Redis, in the post, but that's the intention. If you have suggestions how I could make it more obvious, please let me know.


The title comes across like these are faults of Redis, and that if you're not particularly careful about them you'll shoot yourself in the foot.

> I'm not sure how to make it more clear that I'm criticising myself

"Mistakes I made while building applications on Redis"


Thanks, I'll update the post and link to your comment for attribution.


"Footgun" unfortunately does have some "prior art" as blaming unexpected behavior of a technology.

PS Love the aesthetic of your blog!


Ah, that's a good point. I explicitly didn't call Redis a "footgun" at any point, because of what that word means. However the post is tagged with `footguns` and perhaps that contributed to the misunderstanding. I'll remove the tag.

(I also wonder if "shoot yourself in the foot" is an idiom that doesn't translate well; at least in the UK I think it's fairly well understood to put blame squarely on the person doing the shooting, rather than the firearm they happen to be holding at the time)


Interesting. I always assumed it to be something that’s easy to use insecurely (or to cause a self-DoS) but thinking about it, I suppose a "footgun" is made for shooting oneself in the foot.


I have been using Redis for a long time and one of the things I love about it is how difficult it is to shoot yourself in the foot with it. From the first use, after briefly reading some basic tips on what not to do, it was ridiculously simple to just get to work with it. I've never once run into a security or performance issue with it.


Change the default `stop-writes-on-bgsave-error` to "no" or you're asking for trouble... a ticking time bomb.


Isn’t it another ticking time bomb to accept writes that will be lost if the server is shut down?


Expecting that any key in redis will be there next time you read it is a ticking timebomb. Redis is not a database. It's a cache.

Unless you're using AOF mode with fsync always, you can lose writes. If you're doing that, you should be using a real database instead.
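The relevant redis.conf directives, for reference:

  appendonly yes
  appendfsync everysec   # or "always" for per-write durability, at a large throughput cost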


The first line of redis.io:

> The open source, in-memory data store


I don't mean to knock redis. I love redis. I implore you to benchmark Redis in FSYNC_ALWAYS vs Postgres. I encourage you to benchmark it as well with FSYNC_EVERYSEC and understand the tradeoffs that makes - Postgres is still very competitive with EVERYSEC in most workloads, and with a lot less tradeoffs in data reliability.


Yeah, in-memory. That should tell you it's not a persistent data store.


Also, comment out all `save` directives to disable snapshotting so you can use the full machine's RAM. Otherwise, you have to limit Redis to 50% RAM usage, because Redis duplicates the dataset in memory when saving to disk, wasting half the machine's RAM. If you go over 50% RAM usage with snapshotting enabled you risk triggering a bgsave error.
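In redis.conf, an empty save directive disables RDB snapshots entirely:

  save ""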

Finally, check out Redis-compatible alternatives that don't require the data set to fit in RAM. [0]

0: https://github.com/ideawu/ssdb



