Redis is brilliant for simple job queues but it doesn’t have the structures for more advanced features. Things like scheduled jobs can be done through sorted sets and persistent jobs are possible by shifting jobs into backup queues, but it is all a bit fragile. Streams, available in Redis 5+, can handle a lot more use cases fluently, but you still can’t get scheduled jobs in the same queue.
After replicating most of Sidekiq’s pro and enterprise behavior using older data structures I attempted to migrate to streams. What I discovered is that all the features I really wanted were available in SQL (specifically PostgreSQL). I’m not the first person to discover this, but it was such a refreshing change.
All the goodies only possible by gluing Redis structures together through Lua scripts were much more straightforward in an RDBMS. Who knows, maybe the recent port of Disque to a plug-in will change things.
It seems like PG has been pigeonholed as only suitable for “small-medium” size queues, but without numbers to define what “small-medium” means. A few million jobs an hour is entirely reasonable for PG based on anecdata (and I’ve load tested up to 54 million jobs an hour).
PG is an amazing tool that can handle more than most people think (or at least more than I thought).
There has been some progress on that front. For index size there is a new rebuild command in PG12 (REINDEX CONCURRENTLY) which works concurrently. There is also pluggable storage now, which enables much more efficient row deletion. I can’t find a link to the new storage format, but here is the announcement for 12: https://www.postgresql.org/about/news/1943/
Just wanted to say that Oban is really fantastic. Been using it in production for a couple weeks to handle recurring scheduled jobs and it was a real joy to work with. Thank you!
> Things like scheduled jobs can be done through sorted sets and persistent jobs are possible by shifting jobs into backup queues, but it is all a bit fragile.
What exactly is fragile about this approach? We've supported scheduled tasks in TaskTiger (https://github.com/closeio/tasktiger) via Redis's sorted sets and haven't had any issues.
> What exactly is fragile about this approach? We've supported scheduled tasks in TaskTiger (https://github.com/closeio/tasktiger) via Redis's sorted sets and haven't had any issues.
Using sorted sets for scheduled tasks isn't the fragile part, gluing it all together to prevent losing jobs by shifting into backup lists (or hashes) is the fragile part in my experience.
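To make the fragile part concrete, here is a minimal sketch of promoting due jobs from a sorted set into a ready list without Lua. It is not how TaskTiger or Sidekiq do it; the key names and the `promote_due_jobs` helper are hypothetical, and `client` is assumed to be a redis-py style client exposing `zrangebyscore`, `zrem` and `rpush`.

```python
import time

SCHEDULED_KEY = "jobs:scheduled"  # hypothetical key: sorted set scored by run-at time
READY_KEY = "jobs:ready"          # hypothetical key: list that workers pop from


def promote_due_jobs(client, now=None):
    """Move jobs whose run-at time has passed from the sorted set to the ready list.

    `client` is assumed to expose redis-py style zrangebyscore/zrem/rpush.
    """
    now = time.time() if now is None else now
    promoted = []
    for job in client.zrangebyscore(SCHEDULED_KEY, "-inf", now):
        # zrem returns the number of members removed, so only the caller that
        # actually removed the job goes on to enqueue it.
        if client.zrem(SCHEDULED_KEY, job) == 1:
            # A crash between zrem and rpush loses the job. Closing this gap
            # is what the Lua scripts or backup lists are for.
            client.rpush(READY_KEY, job)
            promoted.append(job)
    return promoted
```

The two-step move is the weak spot: each command is atomic on its own, but the pair is not, so anything that needs stronger guarantees ends up wrapping the pair in a script or a transaction.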
Yep, TaskTiger also uses Lua quite extensively ([0]) to ensure atomicity when moving tasks between various stages of processing (queued, active, scheduled, errored).
Someone mentioned dramatiq-pg (https://gitlab.com/dalibo/dramatiq-pg), which also uses PG as the task queue backend and jsonb for message payload & result. How similar is it to Oban?
I don't know about the technical differences, but having used dramatiq in production for 3 years now I can say it's 100% awesome, it's been rock solid for us handling many thousands of tasks daily.
I love PG but would never use it for message queueing. Postgres needs periodic table maintenance and autovacuum tuning for tables with a high number of writes, and IMO can never be as good as RabbitMQ for distributed task queues. RabbitMQ is purpose-built around messages and it's extremely reliable and performant.
> Streams, available in Redis 5+, can handle a lot more use cases fluently
Streams do solve the backup queue issue, but not being able to filter messages from the consumer's pending list makes implementing a retry mechanism and exponential backoff hard (without reaching for a sorted set).
Though, I can understand the Redis team's decision to keep it simple.
Reposting from a while back in case it solved a problem for someone.
I use Postgres SKIP LOCKED as a queue. Postgres gives me everything I want. I can also do priority queueing and sorting.
All the other queueing mechanisms I investigated were dramatically more complex and heavyweight than Postgres SKIP LOCKED.
Here is a complete implementation - nothing needed but Postgres, Python and the psycopg2 driver:

    import random

    import psycopg2
    import psycopg2.extras

    db_params = {
        'database': 'jobs',
        'user': 'jobsuser',
        'password': 'superSecret',
        'host': '127.0.0.1',
        'port': '5432',
    }

    conn = psycopg2.connect(**db_params)
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)


    def do_some_work(job_data):
        # Stand-in for real work: fails at random to exercise both code paths.
        if random.choice([True, False]):
            print('do_some_work FAILED')
            raise Exception
        else:
            print('do_some_work SUCCESS')


    def process_job():
        # Claim and remove the oldest queue item that no other worker has locked.
        sql = """DELETE FROM message_queue
                 WHERE id = (
                   SELECT id
                   FROM message_queue
                   WHERE status = 'new'
                   ORDER BY created ASC
                   FOR UPDATE SKIP LOCKED
                   LIMIT 1
                 )
                 RETURNING *;
              """
        cur.execute(sql)
        queue_item = cur.fetchone()
        if queue_item is None:
            # Queue is empty, or every remaining row is locked by another worker.
            print('message_queue has nothing to process')
            conn.commit()
            return
        print('message_queue says to process job id: ', queue_item['target_id'])

        # Lock the job row itself for the duration of the work.
        sql = """SELECT * FROM jobs WHERE id =%s AND status='new_waiting' AND attempts <= 3 FOR UPDATE;"""
        cur.execute(sql, (queue_item['target_id'],))
        job_data = cur.fetchone()
        if job_data:
            try:
                do_some_work(job_data)
                sql = """UPDATE jobs SET status = 'complete' WHERE id =%s;"""
                cur.execute(sql, (queue_item['target_id'],))
            except Exception:
                sql = """UPDATE jobs SET status = 'failed', attempts = attempts + 1 WHERE id =%s;"""
                # if we want the job to run again, insert a new item to the message queue with this job id
                cur.execute(sql, (queue_item['target_id'],))
        else:
            print('no job found, did not get job id: ', queue_item['target_id'])
        # The queue delete and the job update commit together.
        conn.commit()


    process_job()
    cur.close()
    conn.close()
Once the client session holding the lock disconnects, closing the session, the lock is released. You can also set client options like idle_in_transaction_session_timeout to force transactions to close after a certain amount of idle time, which also releases the lock.
It means another process could grab it, yes. In the parent post's model, the "scheduler" operates in an "at least once" model: unless the transaction is committed, a premature failure means the message will be picked up again by a scheduler, which could result in some duplicate work.
The parent post's model also tracks "attempts", but this only captures known or acknowledged failures -- unacknowledged failures (i.e. crashes) will not be recorded, so a task which explodes the process would be re-run ad infinitum.
An alternative method could be to record an attempt in a separate transaction, so that the next scheduler execution can detect that the job/message was serviced before, even if the message itself appears fresh.
Side niggle - I used to notice a lot of Django projects would use complex job queues for absurdly low workloads. Beginners would get recommendations to use RabbitMQ and Redis for sites that probably were only going to see a few hundred concurrent users at most.
Seriously - don't add complex dependencies to your stack unless you need them. The database makes a great task queue and the filesystem makes a great cache. You really might not need anything more.
The problem with that is that, even though PostgreSQL has some great functionality for pub/sub and queues, most Django queue systems don't use them by default. Most of the ones I've seen resort to polling (which might be fine, depending on the use case), which is suboptimal.
Otherwise, I agree, I would love to be able to use Postgres for queues, and I think there is a Dramatiq Postgres backend that will do queueing properly (EDIT: Yes, Dramatiq-PG).
Postgres PubSub is great for triggering job dispatch, but it isn’t good enough on its own. There are a few caveats that make polling an additional requirement:
1. Message compaction: notifications are deduplicated, so if two jobs are inserted in the same queue the system may only get one notification.
2. Message saturation: with high levels of activity you’ll need to start discarding messages, essentially debouncing, otherwise the database gets throttled.
3. Dedicated connections: pubsub requires a dedicated connection for listening, which requires a single connection with custom dispatching or one connection per queue.
Relying on PG for everything is awesome regardless!
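A rough sketch of how notifications and polling can be combined, with no database involved: `WakeupSignal` is a hypothetical in-process stand-in for a LISTEN connection, and `fetch_jobs` stands in for the query that claims pending rows. It is only meant to show the shape of the loop, not any particular library's dispatcher.

```python
import threading


class WakeupSignal:
    """In-process stand-in for a LISTEN connection.

    Any number of notify() calls before the next wait() collapse into a single
    wakeup, which mimics notification deduplication.
    """

    def __init__(self):
        self._event = threading.Event()

    def notify(self):
        self._event.set()

    def wait(self, poll_interval):
        # Returns True if woken by a notification, False if the poll timer fired.
        fired = self._event.wait(timeout=poll_interval)
        self._event.clear()
        return fired


def run_dispatcher(wakeup, fetch_jobs, handle, poll_interval=1.0, max_cycles=None):
    """Wake on a notification or on the poll interval, then drain pending jobs."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        wakeup.wait(poll_interval)
        # Always ask the queue what is pending instead of trusting the
        # notification count: one wakeup may stand for many inserted jobs.
        for job in fetch_jobs():
            handle(job)
        cycles += 1
```

The notification only shortens latency; correctness comes from the fetch, so a missed or merged notification costs at most one poll interval.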
Is this a problem for low-traffic scenarios? My thinking is basically that I should save the extra dependency until I have more than a few tasks per second, at which point I will switch to something like RabbitMQ.
I don’t think it is much of a problem until you are pushing 2-3k jobs a second. Only the message saturation is a bottleneck anyhow, everything else I mentioned is handled by how a library is architected.
That makes sense, thanks. I generally switch long before I get anywhere near to where the database is the bottleneck, I use Postgres as a broker only for low-traffic sideprojects, where it's much more worthwhile to not bother with an extra broker.
Yeah, that was my issue. Seemed like a ridiculous amount of stuff just to process a few jobs once a day in the background. Eventually I found django-background-tasks and have been using that ever since. It uses whatever database you're already using, so very quick and easy.
At a glance, https://github.com/lilspikey/django-background-task indeed seems like a good and easy way to get started. You can later switch to something more specialized once you hit the right scale or when you need more sophisticated features.
Yep. That's the solution I've been recommending over Celery for most people.
Half the time people reach for Celery etc. when all they want to do is run a long task without blocking the response. That's a helluva lot of code to add to your project just to have that functionality.
When I wanted to add processing for background jobs to a Django project, the most obvious solution was Celery, and the most obvious job queue for Celery was Redis.
I looked for a solution that would use Postgres+Celery but there wasn't anything obvious. There is a way to just use the filesystem for Celery, but that doesn't seem ideal.
Exactly my complaint - it's become the "obvious" way to do it, and that involves several tens of thousands of lines of code (Celery+RabbitMQ+Redis) plus binary dependencies and associated background daemons.
You can roll your own background task in a few dozen lines of pure Python.
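As an illustration of how little is needed for the simplest case, here is a minimal sketch using only the standard library. The `BackgroundWorker` class is hypothetical, and the queue lives in memory, so pending tasks are lost if the process exits.

```python
import queue
import threading


class BackgroundWorker:
    """Run submitted callables one at a time on a daemon thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, func, *args, **kwargs):
        # Returns immediately; the caller (e.g. a request handler) is not blocked.
        self._tasks.put((func, args, kwargs))

    def _run(self):
        while True:
            func, args, kwargs = self._tasks.get()
            try:
                func(*args, **kwargs)
            except Exception as exc:
                # A failing task must not kill the worker thread.
                print(f"background task failed: {exc!r}")
            finally:
                self._tasks.task_done()

    def wait(self):
        # Block until every submitted task has finished (handy in tests).
        self._tasks.join()
```

Usage would look like `worker = BackgroundWorker()` at startup and `worker.submit(send_report, user_id)` inside a view, where `send_report` is whatever long-running function you have.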
For a long time there was no easy way to do background processing in Django. With the asyncio paradigm spreading throughout the community and Django starting to add support for it, that will quickly change.
> For a long time there was no easy way to do background processing in Django.
There was no easy way to do complex background tasks - but Django is just Python. Spawning a new process is easy as is scheduling via cron or similar. I've used both to implement simple background tasks.
I currently use RQ. Here is the logic that led me to choosing it, then wishing I had just used Postgres.
I need a queue to handle long running jobs. I looked around the Python ecosystem (bc Flask app) and found RQ. So now I add code to run RQ. Then I add Redis to act as the Queue. Then I realized I needed to track jobs in the queue, so I put them into the DB to track their state. Now I effectively have a circular dependency between my app, Redis, and my Postgres DB. If Redis goes down, I’m not really sure what happens. If the DB goes down I’m not really sure what’s going on in Redis. This added undue complexity to my small app. Since I’m trying to keep it simple, I recently found that you can use Postgres as a pub/sub queue which would have completely solved my needs while making the app much easier to reason about. Using Postgres will have plenty of room to grow and buy you time to figure out a more durable solution.
Can't you track state in a "stateless" way? For example, having the job write its progress in a file, so that if your app goes down, you can just re-read the file and know the state of your job?
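A minimal sketch of that idea, assuming a single machine and a JSON state file. The `write_progress` and `read_progress` helpers are hypothetical; the write goes through a temporary file and `os.replace` so a crash mid-write never leaves a half-written state file behind.

```python
import json
import os
import tempfile


def write_progress(path, state):
    """Atomically replace the progress file with the new state."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)


def read_progress(path):
    """Return the last recorded state, or None if the job never reported."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

After a restart the app would call `read_progress` for each known job and resume or re-queue based on what it finds.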
Relying on the filesystem has its own drawbacks. If you scale horizontally, the local filesystem might not have the file you created because it's running on a different server. Mounting storage across multiple servers will introduce bottlenecks, and now you have to worry about the storage going down. If you're running within a Docker container, you need to make sure the file doesn't live inside the container or it won't be persisted.
Say you have an object that loads some config from a settings module, which in turn reads from the environment. No matter how many times I restarted RQ, the config wouldn't change. After a code change, the old config caused the job to crash and keep retrying.
Eventually I got frustrated, went into Redis, and popped the job, and all the settings were in there. In other words, RQ serializes the whole object together with all of its properties.
RQ isn't that good IMHO. You will eventually have to add monitoring, health checks, node pings, a scheduler, retries, and middleware like Celery has, and it grows into a home-grown job queue system that makes it harder to onboard new devs.
Just use Celery. Celery isn't that bloated. It has many UIs and backends to support it and is very flexible. Celery beat scheduling is great as well.
I used this in a project at a previous job -- I have to say, while the API is simple and useful enough for small projects, it raises some issues with how it's designed.
Instead of relying on trusted exposed endpoints and just invoking them by URL, it does a bytecode dump of task functions and stores those in Redis before restoring them from bytecode at execution time.
This has a few drawbacks:
- Payloads in the queue are potentially a fair bit larger for complex jobs
- Serialization for stuff that has decorators (and especially stateful one, like `lru_cache`) is not really possible, even with `dill` instead of `pickle`
- It's not trivial, but this exposes a different set of security risks compared to the alternative
I don't want to say this is a bad piece of software, it's super easy to set up and way more lightweight than Celery for example, but it's not my tool of choice having worked with the alternatives.
I had to set up a task queue in an old, resource-constrained environment once. I ended up forking the client process into a worker when the queue wasn't already running. Any time we need an expensive method call to be throttled asynchronously we just have to add a decorator to it.
Celery has always been a massive resource drain, and for pretty much zero gain. I'd suggest just saying "stay away from Celery" as it's easy to get going with, but really hard to scale or work with when you need what it promises. There are better options.
So far I've mostly used Celery in Python, but at jobs past there was also Sidekiq (granted, the use cases for that were a bit different, but it's kind of in the same bucket).
Haven't had the opportunity to work with them, but I've heard good things about SQS and other cloud queues, but I think those are fundamentally message queues, so you'd be responsible for writing a job queue layer over it, which can be a huge hassle.
May I ask anyone posting stats on "millions per day" to please indicate how many nodes/CPUs the entire system uses. For example "8.6 million per day" is only "100 per second". If that takes 100 CPUs... folks are underestimating what a single-node modern CPU/network card is capable of.
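The conversion is easy to script. This small sketch applies it to throughput figures quoted in this thread; the `per_second` helper is just for illustration.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count, period_seconds):
    """Average rate per second for `count` jobs spread over `period_seconds`."""
    return count / period_seconds


# Figures quoted by commenters in this thread.
figures = [
    ("8.6 million per day", 8_600_000, SECONDS_PER_DAY),
    ("54 million per hour", 54_000_000, SECONDS_PER_HOUR),
    ("4 million per day", 4_000_000, SECONDS_PER_DAY),
    ("1 million per day", 1_000_000, SECONDS_PER_DAY),
]

for label, count, period in figures:
    print(f"{label}: {per_second(count, period):,.1f} jobs/sec")
```

54 million an hour works out to 15,000 a second, while 4 million a day is roughly 46 a second and 1 million a day roughly 12 a second. These are averages; peak rates can be much higher.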
Task throughput isn't about the queue library. Your queue library should be fast enough, after that the bottleneck is the workload your tasks are doing. That's why you need prioritized queues so your slow long-running tasks run low priority and don't block the high priority tasks that should be fast and usually affect UX. We have 9 worker servers, each server has 6 CPUs, we process over 1M tasks per day.
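A minimal in-memory sketch of the prioritization idea, using the standard library's `queue.PriorityQueue`. The `PriorityTaskQueue` class and the `HIGH`/`LOW` levels are hypothetical; real task queues usually implement this with separate named queues or a priority column.

```python
import itertools
import queue

HIGH, LOW = 0, 10  # hypothetical priority levels; lower numbers run first


class PriorityTaskQueue:
    """Hand out high-priority tasks before low-priority ones."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        # Monotonic counter: breaks ties so tasks with equal priority stay FIFO
        # and the task payloads themselves are never compared.
        self._counter = itertools.count()

    def put(self, priority, task):
        self._queue.put((priority, next(self._counter), task))

    def get(self):
        _priority, _seq, task = self._queue.get()
        return task

    def empty(self):
        return self._queue.empty()
```

With this, a slow report enqueued at `LOW` waits behind any user-facing task enqueued at `HIGH`, even if the report was submitted first.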
Not sure if lower-level API of RQ supports this, but I tend to prefer message-oriented jobs that don't couple the web app to the task handler like the example.
I don't want to import "count_words_at_url" just to make it a "job" because that couples my web app runtime to whatever the job module needs to import even though the web app runtime doesn't care how the job is handled.
I want to send a message "count-words" with the URL in the body of the message and let a worker pick that up off the queue and handle it however it decides without the web app needing any knowledge of the implementation. The web app and worker app can have completely different runtime environments that evolve/scale independently.
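A minimal sketch of that split, with the transport left out: the web app only builds a named JSON message, and the worker keeps its own registry of handlers. All names here (`make_message`, `handles`, `dispatch`, `fetch_text`) are hypothetical, and `fetch_text` is a stub so the example runs offline.

```python
import json


# --- web app side: knows only the message name and the payload shape ---
def make_message(name, **body):
    return json.dumps({"name": name, "body": body})


# --- worker side: owns the handlers and their imports ---
HANDLERS = {}


def handles(name):
    """Register a function as the handler for messages called `name`."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


def fetch_text(url):
    # Stub standing in for an HTTP fetch, so the sketch has no network dependency.
    return "lorem ipsum dolor"


@handles("count-words")
def count_words(body):
    return len(fetch_text(body["url"]).split())


def dispatch(raw_message):
    message = json.loads(raw_message)
    handler = HANDLERS.get(message["name"])
    if handler is None:
        raise ValueError(f"no handler for message {message['name']!r}")
    return handler(message["body"])
```

The web app would push `make_message("count-words", url=...)` onto whatever queue is in use, and the worker would call `dispatch` on each message it pops; neither side imports the other.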
Agreed - which is why I don't put my business logic in the webapp, or use models coupled to the web framework.
The web framework is a way to handle http, rest, or graphql - deserializing and serializing those protocols, not a way to handle my business logic.
Decoupling these things lets you write one-off scripts or have task queues that don't need to load the context of a large web framework - they can just be simple python.
I also liked Huey because of its simplicity, and tried using it in a commercial project, but it couldn't execute periodic tasks at sub-minute intervals, which puzzled me a bit. I needed to execute a task every 10 seconds to reindex database changes to ElasticSearch.
After I opened a GitHub issue asking if there were plans to implement sub-minute intervals, the library's author (coleifer on GitHub) had a dismissive/arrogant attitude; his reply was something along the lines of "too bad you cannot fork it and do it by yourself", and he deleted my thread.
This threw me off from using this library, and I went back to Celery.
* how did you settle on 10 seconds? Keeping two databases in sync is a complex process. I'd suggest that the difference between running 1x/min or 6x/min is negligible -- and if it's not then probably you need something more sophisticated than a simple cronjob.
* I am providing free software. You are not contracting with me to provide developer support, so in my book I'm under no obligation to be courteous and polite all the time. I try to be most of the time, but nobody is perfect. Luckily the source is available if you don't want to talk to me. That was my point and I'm surprised that is so triggering to some people.
Thanks for giving me advice on database syncing, but I think that you kinda missed the point.
It doesn't matter how I chose 10 seconds as a periodic task interval. The problem was that a simple question about feature support got a dismissive, borderline rude answer, which made me lose confidence in Huey as a project and in its long-term maintainability. And it's not like I asked for something exotic, just support for sub-minute periodic tasks, which almost every other task queue has.
Do you have a preference? I was asked recently about setting up Redis for use as a job processing queue, I thought RQ might be a good fit. Huey looks good too, but I haven't used it yet.
As always I'm happy to see one of my projects getting repped. Huey was something I started out of frustration with celery. It's more of a celery replacement than an rq replacement, though it covers both. If you haven't heard of it -- it does tasks, cron (periodic tasks), delayed tasks, result-storage, etc., and supports multiple worker models (process/thread/greenlet).
We tried this at WakaTime and it didn't scale well. We also tried https://github.com/closeio/tasktiger but the only ones that work for us are based on RabbitMQ:
Can you please elaborate? I just adopted RQ a few weeks ago and it is working fine for me (a few days before launch) with a couple thousand queries (1-2 jobs per 5 seconds) but a ton of connections (both making and taking jobs). Any advice?
I'm thinking that after we go to prod, I'll move to Celery and MQ over time.
This was multiple years ago so I don't remember details. I think something around task reliability, performance, and I think architecture flaws that meant we couldn't prioritize tasks or some other missing features we needed. We process around 50-1000 jobs per second.
I've not much to say other than that this is a great little library! I used RQ for a small project recently and found it to be pretty easy to use. As my project grew bigger I also found it contained extra features I didn't realise I'd need when I started.
Big fan of this library. I think Celery is overkill for most projects. RQ is nice for the majority of projects and it can scale right along with your project.
RQ scales surprisingly well and for certain types of projects is a nice lightweight alternative compared to more complicated job queues such as Celery.
Yeah when I was looking at queues I was put off by Celery since somewhere in its docs or a tutorial high up in the Google results it was suggested to use both redis and RabbitMQ as some sort of brokers/results stores rather than just one of those to handle both. I'm sure there were good reasons for that, but I was looking for something simple and not to learn multiple new technologies so in terms of simplicity RQ (which uses redis for everything) is pretty hard to beat.
Shameless plug: Our team at Close has been inspired by RQ when we created TaskTiger – https://github.com/closeio/tasktiger. We’ve been running it in production for a few years now, processing ~4M tasks a day and it’s been a wonderful tool to work with. Curious to hear what y’all think!
Curious. What's the ROI on developing a task queue system for 50 req/sec?
Please don't take my question negatively. I acknowledge that not everything is about ROI.
For example I often let my team scratch their own itches by developing tools just for the sake of having fun and learning new toys. Perhaps this is the case with TaskTiger?
It was definitely fun to work on it :) That said, I think we've also seen quite a big positive ROI as well.
It is true that we've spent quite some time working on TaskTiger where we could've used an off-the-shelf task queue system. However, since we've created it, it allowed us to develop new features a lot faster and spend a lot less time debugging issues.
We were using Celery before and it was a big pain with memory leaks, periodic tasks not working as expected, some tasks being lost for no apparent reason, and poor visibility into both aggregate statistics and individual tasks. We spent too much time debugging Celery's issues instead of working on our core business. Here's some more context: https://github.com/closeio/tasktiger/issues/141#issuecomment...
Advanced features like scheduled tasks, unique queues, and task locks have helped us move faster, too.
We tried Tasktiger but ran into zombie tasks and some small but annoying bugs while testing. Around that same time, we found Dramatiq and currently have Dramatiq + Celery<4.0 in use.
Interesting, we haven't had issues with zombie tasks at all (and we DID have issues with them when using Celery). Did you manage to find out what was causing them?
> and some small but annoying bugs while testing
Do you recall any details?
Appreciate you trying TaskTiger, even if you've moved on since!
* RabbitMQ scales messages without needing tons of RAM
* I don't have to decide between not persisting messages to disk with Redis vs only using half the machine's RAM [1]
* Queue and task visibility is better in RabbitMQ
* Support for purging all tasks in a queue
* Tasktiger had lower throughput than Celery and Dramatiq, maybe needs lazy-forking?
Things I did like about Tasktiger:
* no feature bloat and it's possible to actually read its source code[2].
[1]: To enable persisting to disk (Redis fork + save snapshotting) you must limit Redis to only using half the available RAM on a machine.
[2]: Celery is split up into multiple convoluted, bloated, difficult to read repos.
Both, but unfortunately still mostly using Celery 3.2. Never had any problems with zombie tasks, since that's related to how you're storing tasks and queues and RabbitMQ is excellent at that. Zombie tasks are never a problem when using RabbitMQ.
Especially never had problems with zombie workers. That wouldn't be a problem with your choice of queue library, more an infra problem.
Interesting. At the moment I'm noticing some zombie workers popping up here and there with RQ and was thinking of migrating soon to Celery or Dramatiq. I think I'll experiment a bit and decide between both (but Celery seems like the more popular option, so I'm a bit biased for going for that).
RQ is horrible. We used it at K Health and migrated to Celery, as configuration, optimisation and monitoring were all really hard with RQ. It takes ten seconds to get started but days to get to production. Not a good trade-off!
Having used Resque for production workloads, I wouldn't want to use Redis as a job queue. It works fine for small workloads but doesn't have many resiliency capabilities and doesn't scale well (cluster mode drops support for some set operations so you probably can't use that). Replication is async and it's largely single-threaded, so you get this SPOF bottleneck in the middle of a distributed work system.
You may want to check the Disque module I just (a few weeks ago) published. Synchronous replication and strong guarantees of delivery under failures. Also best effort algorithms to minimize re-delivery of at-least-once messages.
Different to RQ, but related: I recently released Lightbus [1], which is aimed at providing simple inter-process messaging for Python. Like RQ it is also Redis-backed (streams), but differs in that it is a communication bus rather than a queue for background jobs.
We were unsatisfied with Rq (I forget which parts, but I seem to recall there was lots of code that strictly wasn't needed) and wrote http://github.com/valohai/minique instead.
I’ve used RQ in production applications as recently as last week. It’s pretty basic so there are upsides and downsides, but so far it simply works!
I may opt for Celery down the line, but for the state of my project, RQ helped me ship quickly and iterate.
Just use Celery. I know several teams who made the, in hindsight, very poor choice of using RQ. By the time you realize it’s a bad decision it’s very hard to get out of.
Unless your use case is absurdly simple, and will always and forever be absurdly simple, Celery will fit better. Otherwise you find yourself adding more and more things on top of RQ until you have a shoddy version of Celery. You can also pick and choose any broker with Celery, which is fantastic for when you realize you need RabbitMQ.
Finding that all the Sidekiq Pro and Enterprise features I wanted were available in PostgreSQL led me to develop a Postgres-based job processor in Elixir: https://github.com/sorentwo/oban