Scaling a High-Traffic Rate Limiting Stack with Redis Cluster (brandur.org)
213 points by momonga on April 27, 2018 | 40 comments



Redis IMHO is in the pantheon of excellent open-source projects, right up there with the likes of HAProxy in terms of code quality, speed, and downright reliability. 100% agree with the notion that more such building blocks need to be built.


Agreed. I'd throw nginx into that cohort as well.


Someone recently suggested that I read the nginx source code, as it was some of the most comprehensible and clear C he'd ever seen. I can definitely cosign that, having now done so. It's amazing!


Hey, can I ask you what prompted you to just go reading nginx code? As pointed out recently, most people - even people who advocate reading source code for its own sake - don't actually read code unless they have to. (http://akkartik.name/post/comprehension)

So it's interesting to see a counterexample. What led you to go spelunking in nginx?


An associate was writing Lua nginx stuff and needed to refer to the C source to find some event names or something. He noticed how nice it was and told me to take a look because the quality impressed him so much, so I read it for fun.

Also, I have seen so many problems arise from the fact that code is easier to write than to read, so I consciously make an effort to not avoid reading code, and I’m sure that leaked over here.


(Not parent.)

I used to read source-code specifically to look for security problems. Auditing of code as a small hobby. I found some fun issues over the years.

Reading code has become a habit these days. If I depend upon a library I try to glance at it. If I install a new application I try to have a read of some of it. It stops you from blindly depending on poor-quality software, and sometimes makes you choose an alternative.


We use Redis Cluster quite extensively. The one thing to be very cautious about, and to load test if you're running in a cloud environment, is failover of nodes holding a lot of keys. If your nodes are holding multiple GBs of data, then depending on your persistence and other configuration settings, Redis may need to hit the disk to recover. If you don't have enough IOPS provisioned, be prepared for a long recovery time. The other thing that used to be a problem but is getting much better now is the maturity of the different client libraries with respect to handling Redis Cluster-specific idiosyncrasies.


I just got back from RedisConf and antirez brought up the idea (or that it's already in-development... he was not clear) of releasing an official redis cluster proxy for use with older/less-featured clients.

I believe it was brought up in the keynote (which I missed unfortunately), and also as part of one of the Redis Clients talks.


Interesting. At what point is this recovery a problem? I'd assume it would only be recovering on the slave, since there will have been a newly promoted master after failover?


Excellent article! The use of Lua solves a lot of potential issues here with competing writes to the same rate-limiting keys, which could otherwise cause bizarre errors.

The one thing I would note that doesn't seem to be covered: if you are using a relatively large Lua script and running EVAL over and over, the full script body gets sent with every call. Instead, `SCRIPT LOAD ...` can be run once, which returns a SHA1 digest that can then be invoked with `EVALSHA (sha1) (keys) (args)`. This can speed things up as well as cut back on what goes over the wire.
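
For illustration, a minimal sketch of that pattern with the redis-py client (the tiny counter script and key name here are made up for the example):

```python
import redis

r = redis.Redis()

# A small Lua script: increment a counter and set a TTL on first use.
RATE_LIMIT_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""

# Load the script body once; Redis returns its SHA1 digest.
sha = r.script_load(RATE_LIMIT_LUA)

# Subsequent calls only send the 40-character digest, not the whole script.
count = r.evalsha(sha, 1, "ratelimit:user:42", 60)
```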


But requires extra logic and possibly tooling to do that correctly. The scripts aren’t persisted iirc, so if a node restarts the script won’t be loaded.


Client libraries can handle this automatically: you can send the EVALSHA command and it will either execute successfully or reply with "I don't know what that script is" - then the client can re-send with the full script.
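
A minimal sketch of that fallback with redis-py (script, keys, and args here are whatever you'd normally pass):

```python
import redis

r = redis.Redis()

def run_script(script, sha, keys, args):
    """Try EVALSHA first; fall back to EVAL if the node doesn't know the digest."""
    try:
        return r.evalsha(sha, len(keys), *keys, *args)
    except redis.exceptions.NoScriptError:
        # Node restarted or the script cache was flushed: re-send the full body,
        # which also re-caches it under the same SHA1 for next time.
        return r.eval(script, len(keys), *keys, *args)
```

redis-py's `register_script()` helper wraps exactly this EVALSHA-then-EVAL dance for you.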


True, but the performance difference can be huge. It's reasonable to illustrate things in a blog post just using EVAL, but you should be using EVALSHA for almost any production workload.


It's built in to most clients.


I couldn't agree more with "We need more building blocks like Redis that do what they’re supposed to, then get out of the way." Redis has become such a foundational piece of software for me and the projects I work on.

Plus, it's just plain fun to use.


Frankly this strikes me as really hacky. A million operations a second isn't even that much. Something like Chronicle [1] can do millions of atomic operations a second. A cluster of 10 nodes for what are basic in-memory counters? And the wackiness of Lua scripts to read from the cache?

It all seems a bit much. I've solved similar problems in the trading space (processing raw market data feeds) with much less.

It's interesting how different communities have their hammers and nails. Redis seems to have really taken over certain consumer-web-oriented communities. In other more enterprise communities I've seen people lean heavily on distributed cache products like Hazelcast etc. And in trading this sort of thing is so bread and butter and common that everybody has internal solutions.

[1] https://chronicle.software/


Sure, but Redis is also fast, versatile, and very easy to run, which is why so many gravitate towards it. It also has a much larger ecosystem compared to any of the other options, which makes it more productive.

The fintech industry is generally years ahead in performance but most companies will trade those solutions for easier dev/ops.


A lot of people also make the mistake of benchmarking Redis on multi-core machines as if it could use all the cores. Redis is single-threaded, so it won't get any faster just because more cores are available. To properly benchmark Redis on a multi-core machine, you have to run multiple instances of Redis on the same machine (1 instance/core).


Yes; a Redis cluster with 10 'nodes' can easily fit on a single machine. When comparing Redis performance, it's important to benchmark apples to apples on per-core performance.


Am I mistaken in that you need to pay for the Enterprise version of Chronicle to get a feature set comparable to Redis (which is free OSS)?


I don’t see what’s wrong with using a scripting language to extend a data store. Regardless, Redis now has modules, so you can implement these extensions in native code instead of (or as well as) Lua. In fact, it seems the author of the post has already done so with https://github.com/brandur/redis-cell
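
For instance, with the redis-cell module loaded, a rate limit check becomes a single native command; a rough sketch via redis-py (the key and limit parameters are illustrative):

```python
import redis

r = redis.Redis()

# CL.THROTTLE <key> <max_burst> <count_per_period> <period> [<quantity>]
# Reply: [limited (0/1), total limit, remaining, retry_after_secs, reset_after_secs]
limited, limit, remaining, retry_after, reset_after = r.execute_command(
    "CL.THROTTLE", "user:42", 15, 30, 60, 1
)
if limited:
    print(f"throttled; retry after {retry_after}s")
```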


I think you underestimate the importance of being open source to a lot of people. I don’t want to rely on something I can’t debug if we have an issue.


I wonder if this would also be a use case for foundationdb. All the "clustering" would be built-in and performance seems to be quite good (https://apple.github.io/foundationdb/performance.html), although probably not comparable to Redis with a configuration that accepts data loss. Does anyone have experience with that?


I've used it for similar things in the past. Best practice on FDB would be to use snapshot reads on the counters and the atomic ADD mutation, so you never have conflicts.
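
A rough sketch of that pattern with the FoundationDB Python bindings (the key name is made up, and the exact value-handling details may differ between binding versions):

```python
import struct
import fdb

fdb.api_version(520)
db = fdb.open()

@fdb.transactional
def bump_counter(tr, key):
    # Atomic add mutation: no read happens, so the transaction adds no
    # read conflict range on the counter key.
    tr.add(key, struct.pack("<q", 1))

@fdb.transactional
def read_counter(tr, key):
    # Snapshot read: observes the counter without adding it to the
    # transaction's conflict set.
    val = tr.snapshot[key]
    return struct.unpack("<q", bytes(val))[0] if val.present() else 0

bump_counter(db, b"ratelimit/user/42")
print(read_counter(db, b"ratelimit/user/42"))
```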


Thanks for your response. Interesting to know that this is indeed possible. foundationdb looks amazing from what I've seen and played with so far.


It's nice to hear a success story about Redis Cluster. When I worked at Eventbrite we used Redis heavily, both for the usual use cases (caching, ephemeral storage like sessions) and at the core of services like reserved seating. We did our own sharding client side as a layer on top of the redis-py library and relied on Sentinel to handle failover. After Redis Cluster was released, we had some interest in it, but we were nervous enough about the limitations in its capabilities and the additional complexity of operating it that we never experimented with it.
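
For what it's worth, a minimal sketch of that kind of client-side sharding over redis-py (the hashing scheme and host names are invented for illustration; a real setup would use consistent hashing and Sentinel-aware connections):

```python
import zlib
import redis

# Hypothetical shard topology; in practice each connection would come from
# redis.sentinel.Sentinel.master_for(...) so failover swaps in the promoted replica.
SHARDS = [
    redis.Redis(host="redis-shard-0", port=6379),
    redis.Redis(host="redis-shard-1", port=6379),
    redis.Redis(host="redis-shard-2", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    # Use a stable hash (not Python's randomized hash()) so every process
    # maps the same key to the same shard.
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

shard_for("ratelimit:user:42").incr("ratelimit:user:42")
```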


I fucking love Redis. We use it inside a large scale email sending platform to do all manner of rate limiting and real time analysis of streaming data to make routing decisions. Could not live without Redis.


Author has an enjoyable writing style. Thumbs up for quality writing.


His blog is one of my favorites - so much good stuff on there. A few recent highlights:

Touring a Fast, Safe, and Complete(ish) Web Service in Rust: https://brandur.org/rust-web

Scaling Postgres with Read Replicas & Using WAL to Counter Stale Reads: https://brandur.org/postgres-reads

Redis Streams and the Unified Log: https://brandur.org/redis-streams


Another approach to this problem is Twemproxy: https://github.com/twitter/twemproxy, which can act as a sidecar Redis load balancer.


Similarly, Envoy has redis support that looks promising.

https://www.envoyproxy.io/docs/envoy/v1.6.0/intro/arch_overv...


Twemproxy has memory and latency issues that caused us to write our own balancing code in our API. Just FYI.


Silly question but any idea what tools were used to create the diagrams in this post?


Hazarding a guess, it looks like it might have been Monodraw, or something similar.



More details on Stripe's rate limiter(s): https://stripe.com/blog/rate-limiters. An awesome gist is linked at the bottom too, which has implementations of the different rate limiters, and also the `EVAL` approach this post talks about.


In adtech, we average over 100 million operations per second and we don't even touch Redis.

We've been using Memcache all the while and have no desire to change that.


This would be an interesting post if you mentioned what you were doing 100 million times per second. How tangled are your writes? What are your consistency requirements?

100 million set operations per second is not the same as 100 million counter increments etc.


Which company?


Isn't this the exact use case that Kafka solves? It's great to see Redis being able to do it just as well as Kafka, probably.

I'm quite interested to see how they implemented a queueing solution without the new Redis Streams infrastructure.



