Thanks for this. I had considered using Celery for a recent project but ultimately backed away because I got the feeling it was more trouble than it was worth. As a point of reference, would you say the learning curve for a Celery setup is similar to that of Django? Not that there's anything terribly hard about Django, but I'd agree that it's probably overkill if you're relatively new to Python and are just looking for a quick way to produce some HTML with no intent of developing it further.
I wouldn't say Celery's learning curve is steeper than Django's, but it definitely seems like overkill for your case. If you need to do some time-consuming action periodically (and making an HTTP request by hand each time is not an option), then you could just use cron to start with if your project is relatively simple. And if you literally just need to produce some HTML when asked for it, then why are you considering an async task processor such as Celery?
I just started using Celery this week for the first time, to handle parallel processing of thousands of tasks in a data pipeline.
Three days in, I can tell you that it does work, but it does take a lot of searching through the docs to optimize. It's very hard to run with class objects too, so we just created long-scripted functions for the worker.
Even now I'm trying to figure out why the worker is unable to refresh access tokens after 60 minutes, and tempted to just have it run as root.
Django has kind of a slow learning curve, but it's understandable. Celery for me was quicker, but more along the lines of: follow the instructions, and Google / Stack Overflow until it works. A lot less understanding involved.
I had the same feeling. Though I am using Celery for different projects, it still takes me time to figure out "what is going on there?". In particular, I have used it for a simple task queue system, for which it was overkill. And python-rq is definitely a good choice. It does one thing, the API is quite simple and short, and it does the task well.
Thanks! I didn't know about RQ, from a cursory glance it does look a lot simpler than Celery.
Sidenote: I decided to follow RQ's author on GitHub, and discovered Gitflow as well. So, double thanks!
This! Celery is well supported and powerful, but often it is just too much to manage. Every time we have an error crop up in our deployment it takes too much time to figure out what's going on.
Good, basic practices to follow. Here's a few more:
- If you're using AMQP/RabbitMQ as your result back end it will create a lot of dead queues to store results in. This can easily overwhelm your RabbitMQ server if you don't clear these out frequently. Newer releases of Celery will do this daily I think - but it's worth keeping in mind if your RMQ instance falls over in prod.
- Use chaining to build up "sequential" tasks that need doing, instead of calling one after another in the same task (or worse, doing one big mouthful of work in a single task), as Celery can prioritise many small tasks better than one "master" task that synchronously calls several tasks in a row.
- Try to keep a consistent module import pattern for Celery tasks, or explicitly name them, as Celery does a lot of magic in the background so that task spawning is seamless to the developer. This is very important: you should never mix relative and absolute imports when you are dealing with tasks. "from foo import mytask" may be picked up differently than "import foo" followed by "foo.mytask", resulting in some tasks not being picked up by Celery(!)
- Never pass database objects, as OP says, is true; but go one step further and don't pass complex objects at all if you can avoid it. I vaguely remember some of the urllib/httplib exceptions in Python not being serializable and causing very cryptic errors if you didn't capture the exception and sanitise it or re-raise your own.
- Use proper configuration management to set up and configure Celery plus what ever messaging broker/backend. There's nothing more frustrating than spending your time trying to replicate somebody's half-assed Celery/Rabbit configuration that they didn't nail down and test properly in a clean-room environment.
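To make the point about unserializable exceptions a bit more concrete, here is a minimal stdlib-only sketch. It does not use Celery at all; `FetchError` and `sanitise` are hypothetical names, and the thread lock simply stands in for whatever live resource (socket, response object) an HTTP exception might be holding on to. The idea is to reduce the exception to plain strings before it crosses a process boundary.

```python
import pickle
import threading


class FetchError(Exception):
    """Hypothetical error that carries a live, unpicklable resource."""

    def __init__(self, message, lock):
        super().__init__(message)
        self.lock = lock  # thread locks cannot be pickled


def sanitise(exc):
    """Return a built-in exception holding only the type name and message."""
    return RuntimeError(f"{type(exc).__name__}: {exc}")


def can_pickle(obj):
    """Report whether obj survives pickle.dumps without raising."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


original = FetchError("connection reset", threading.Lock())
clean = sanitise(original)

# Round-trip the sanitised exception the way a result backend would.
restored = pickle.loads(pickle.dumps(clean))
```

Inside a task you would catch the original exception and raise the sanitised one instead, so that whatever serializer is configured only ever sees strings.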
With regards to #1: What happens is that if task_B depends on a value that task_A returns, task_A will insert its value into the queue and task_B will consume it.
If task_C returns a value which no other task cares about, it will still insert the value into the queue, where it never gets consumed. This is why dead queues (also known as "tombstones") happen.
Always remember to set ignore_result=True for tasks which don't return any consumed value.
In general, using AMQP for result storage is somewhat of a bad idea, I think. But yes, I agree about ignoring results, seeing as most tasks I've seen in the wild don't return anything at all. Hence #6 in the post.
I've worked 4+ years with Celery on 3 different projects and found it incredibly difficult to manage, both from the sysadmin and the coder point of view.
I'll check this out. I recently started looking around for an alternative to celery. I literally just got over a celery-related bug that took way too long to diagnose and one that took even longer on my previous project.
I'm not very happy with the community either. What with the dispersed, incomplete documentation, multiple discussion forums, and snide responses, I'm really getting ready to wash my hands of it.
We are writing more docs now and putting a small website up, currently they are in the README. We provide support by email, there are already a few third-party users using it in production.
However it is a complete rewrite because we felt we couldn't add gevent support and other features to provide extreme visibility without major changes. If you don't need those 2 things, you may want to check out RQ instead for now, it's still a very good piece of software.
I see now that mrq supports concurrency while python-rq does not (at least in a stable fashion).
I'll try mrq for the gevent integration. It's great that you guys are actively working on improving it. Python-rq is great too, but it hasn't been updated in a while and I don't think concurrency is on the radar.
This. In my experience Celery's capabilities are greatly oversold, both by itself and by others. Most problems Celery purports to solve it tends to just overcomplicate, and often doesn't really solve at all. And most of the time I've dug into the code I'd rather I hadn't (discovering that the thing you'd assumed was implemented fairly bulletproofly really wasn't).
I asked about a progress bar, for a long running web request on Stack Overflow, and Celery seemed to be the accepted way to do that.
I managed to get it set up eventually. I realized a month or so back that it hadn't been running, and it has taken me about 3 times as long to get it up again as it did on the first try. I am sure there must be an easier way.
My major bone of contention is the frequency and stability of releases. There are too many releases and not enough testing before each release. I have frequently found myself trying out a new release because it included a patch I wanted, to only find it has broken something else.
I disagree with the characterization in #1 (although I can't speak to the Celery particulars). I feel like if you have a job that is critical to your business process, the job should be persisted to your database and created within the same database transaction as whatever is kicking off the job.
Consider how background jobs are typically managed with RabbitMQ, Redis, etc. They are usually created in an "after commit" hook from whatever gets persisted to your relational database. In this scenario, there is a gap between the database transaction being committed and the job being sent to and persisted by RabbitMQ or Redis; during this gap the only record of that task is being held in a process's memory.
If this process gets killed suddenly during this gap, that background job will be lost forever. It sounds unlikely, but if RabbitMQ or Redis is down and the process has to sit and retry, waiting for them to come back online, the gap can be sizable.
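To illustrate what "created within the same database transaction" could look like, here is a small sketch using sqlite3 from the standard library. The table and column names are made up for the example, and nothing here talks to RabbitMQ, Redis or Celery; a separate process (not shown) would be responsible for forwarding rows from the `jobs` table to the broker and marking them as sent.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, "
    "payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0)"
)
conn.commit()


def register_user(conn, email, fail_before_commit=False):
    """Insert the user and its follow-up job in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        payload = json.dumps({"user_id": cur.lastrowid})
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES (?, ?)",
            ("send_welcome_email", payload),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")


def count(conn, table):
    """Count rows in one of the two demo tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


register_user(conn, "a@example.com")
try:
    register_user(conn, "b@example.com", fail_before_commit=True)
except RuntimeError:
    pass  # both inserts for the second user were rolled back together
```

Either both rows exist or neither does, so there is no window in which the only record of the job lives in a process's memory.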
I think you're missing the point. The Celery (or any task queue, really) particulars are very important here, because you don't want background workers hammering your database if they don't need to. The workers want to work with an AMQP implementation, which the database is not. It's like using a fork instead of a hammer: sure, you might drive in a few nails, but it's not the right tool for the job.
The systems that use these kinds of tools are usually not structured in a way that they need to wait for something in the database to be stored. By nature they are async tasks and they should be able to run whenever and return sometime in the future, and they will most likely produce some kind of result in the database, so there is no reason to store the job information itself in the database.
Jobs are usually not created as hooks after a database commit, so persisting jobs within database transactions is not quite relevant, and Celery has failure mechanisms and ways to recover if it was not able to send a task to the broker (i.e. RabbitMQ was down).
Redis and RabbitMQ do have mechanisms for persisting jobs to disk as well, so they don't get lost when the process is restarted. So there is no way that a job gets lost forever, as you say, if you handle all these cases correctly.
One more thing: Python's database drivers don't work quite as you've described. Namely, they don't (by design) make use of the autocommit feature of the database engine; rather, they wrap every SQL statement in a transaction, so either way each statement gets executed separately in its own transaction. This would not guarantee, let's say, that a db record is added and the job is saved as well. You would have to use explicit atomic blocks (something akin to what Django >= 1.6 has) to get both things or neither persisted.
I'm coming at it from the Ruby angle, in which jobs are often triggered using ActiveRecord after_commit hooks. I admit to being ignorant of the Python/Celery way of doing things so perhaps I am missing the point. I'm talking about jobs being produced atomically with the data that necessitated the background job (I realize not all background jobs are spawned in this fashion).
I agree with your point about polling being bad, however as someone pointed out below it's not an issue with Postgres's LISTEN/NOTIFY (and I added a note to the queue_classic gem which makes this easy to take advantage of in MRI Ruby).
Obviously I'm aware that Redis and RabbitMQ persist jobs. That's not what I was talking about at all.
I think we're on different wavelengths here so I'll let it be. :-)
I'm not aware of how Ruby does it, other than hearing about the 2 most popular solutions that are used with Rails.
As for Postgres's pub/sub, I've replied to the comment that mentioned it: you can't really use that. It would be great if we could update the broker driver and see if we could get support, though.
My reply should be closer to this. I'm on painkillers and a little foggy, but this reply is correct. The db record being saved does not guarantee the job will be saved. Particularly with Django.
I disagree with this. In my experience it's almost always a really bad idea to use the DB as a queue. If RabbitMQ is down, the process should retry a finite number of times (usually 3 in our use case) and then set a status on the db record. Then you have audits running to pick up records in that state and retry the process once the system is back up and running. That way nothing is lost and you gain all of the benefits of RabbitMQ.
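A rough sketch of that retry-then-flag pattern, in plain Python. `FakeBroker`, the `status` values and the dict-shaped records are all hypothetical stand-ins for a real broker client and real database rows, and a real version would sleep or back off between attempts.

```python
def publish_with_retries(publish, record, max_attempts=3):
    """Try to enqueue a job; flag the record if the broker stays unreachable."""
    for _ in range(max_attempts):
        try:
            publish(record["id"])
            record["status"] = "queued"
            return True
        except ConnectionError:
            continue  # a real implementation would wait before retrying
    record["status"] = "enqueue_failed"
    return False


def audit(publish, records):
    """Re-send every flagged record; return the ids that made it this time."""
    return [
        r["id"]
        for r in records
        if r["status"] == "enqueue_failed" and publish_with_retries(publish, r)
    ]


class FakeBroker:
    """Stand-in for a broker client that can be switched on and off."""

    def __init__(self):
        self.up = False
        self.queue = []
        self.attempts = 0

    def publish(self, record_id):
        self.attempts += 1
        if not self.up:
            raise ConnectionError("broker unreachable")
        self.queue.append(record_id)


broker = FakeBroker()
records = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]

# Broker is down: each record is tried three times, then flagged.
for record in records:
    publish_with_retries(broker.publish, record)
statuses_while_down = [r["status"] for r in records]
attempts_while_down = broker.attempts

# Broker comes back: the audit picks up the flagged records.
broker.up = True
recovered = audit(broker.publish, records)
```

Note that the record only carries a status flag here; whether that is enough to rebuild the job depends on the job type.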
Thanks for the link. Nice read. There is a difference, though, between using something as "a queue" and using something as an AMQP implementation when it's clearly not that.
No, you are not storing the job itself, just a flag that indicates that the job was not done or not even stored. What's more, there really are various ways to handle a failure to store a job on a queue, but in most cases it's just the absence of the desired effect that gives it away. I.e. you know that the job either didn't finish or wasn't even sent to the queue if its effect (e.g. updating the user's friend list from Facebook) is missing.
Also, take a look at #5 for a great monitoring stats solution.
> No, you are not storing the job itself, just a flag that indicates that the job was not done/or even stored
You must be storing enough representation of the job to re-send it though, no? You couldn't do this with a simple flag for all job types I imagine.
> in most cases it's just the absence of the desired effect that can be a giveaway
Sure, but not all jobs create side effects you can easily look for (e.g. sending an email). And secondly, you'd have to create a "sweeper" process for every type of job you create then.
At the point where all the other "normal" mechanisms for monitoring whether the job was stored on the queue have failed, I would suppose I have a far greater problem on that system. Personally I would not use flags for job types; I would monitor them with Flower. Also, we need to keep in mind that retrying sending a job to the queue is one thing, and retrying a job already on the queue that failed for some reason is a whole other thing. In my experience so far there are mechanisms that deal with both of these issues effectively, and none involved using the web app's database for this.
Thanks, I will take a look at the Pythonic way of doing these things. Celery does look a lot more industrial-grade than most of the Ruby solutions so my curiosity is piqued.
2. Use statsd counters to keep track of basic statistics (counts + timers) for each task
3. Use supervisor + monit to restart workers after lack of activity (I have seen this happen a few times, but never been able to track down why it happens, but this is an easy fix)
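For the counters-and-timers tip, something along these lines is what I have in mind. This is a stdlib-only sketch: the two dictionaries stand in for a statsd client (which is not used here), and `instrumented` and `resize_image` are hypothetical names.

```python
import functools
import time
from collections import defaultdict

counters = defaultdict(int)   # stand-in for statsd counters
timings = defaultdict(list)   # stand-in for statsd timers, in seconds


def instrumented(func):
    """Count starts, successes and failures of func, and time every call."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        counters[f"{name}.started"] += 1
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            counters[f"{name}.failed"] += 1
            raise
        else:
            counters[f"{name}.succeeded"] += 1
            return result
        finally:
            timings[name].append(time.perf_counter() - start)

    return wrapper


@instrumented
def resize_image(width):
    """Hypothetical task body."""
    if width <= 0:
        raise ValueError("width must be positive")
    return width // 2


resize_image(100)
try:
    resize_image(0)
except ValueError:
    pass
```

A steady "started" count with no matching "succeeded" or "failed" is also a cheap way to notice the stuck-worker situation from point 3.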
Excellent resource, I remember wrestling with learning celery and how to do some simple things, loved finding Flower to monitor things.
I will say though Celery is probably overkill for a lot of tasks people think to use it for, in my case it was mandated to support scaling for a startup that never launched, partly because they kept looking at new technologies for problems they didn't have yet.
Points 1 and 2 are only valid because the Celery database backend implementation uses generic SQLAlchemy. Chances are, if you are using a relational database, it's PostgreSQL. And it does have an asynchronous notification system (LISTEN, NOTIFY), and this system allows you to specify which channel to listen/notify on.
With the psycopg2 module, you can use this mechanism together with select(), so your worker thread(s) don't have to poll at all. They even have an example in the documentation.
It is true that Postgres supports pub/sub, but unfortunately the Celery broker driver does not take advantage of this. It would be great if we could get support for it. Nevertheless, just because it has pub/sub doesn't mean it's a full AMQP implementation. Also, there's the fact that most AMQP solutions are in memory, whereas a database is on disk, which also has its costs.
Once you scale your worker pool up beyond a couple of machines you need some sort of config management with Celery. We use SaltStack to manage a large pool of celery workers and it does a pretty good job.
This is not a Celery-specific tip, but as Celery also likes to "tweak" your logging configuration you can use https://pypi.python.org/pypi/logging_tree to see what's going on under the hood.
I used both and ended up with RQ. Freedom of choice can be good, but only when you are able to make a decision. The variety of backends and storages forces you to understand how each component really works, and when you dig into the details you find that they are not all equivalent. But you just need something f--kng working, and you don't want to pay another guy to maintain a zoo of different products.
That is why I decided to use RQ: it is better to know the limitations of something simple than to know all the possibilities but not be able to make a choice.
There are many differences, but most notably rq spawns one process per task. Line count is a stupid metric, e.g recently our line count doubled because of our new coding style, also the majority of the source code is tests.
rq is like a luger pistol, light, simple, gets the job done.
celery is like a .50 caliber machine gun, industrial strength, lots of options, used for a variety of completely different use cases.
For simple stuff, use rq, but celery + rabbitmq work better if you have dozens and dozens plus worker nodes (ie: different servers), whereas with rq, you use redis, which could potentially be a SPOF, even with redis sentinel.
Passing objects to Celery and not querying for fresh objects is not always a bad practice. If you have millions of rows in your database, querying for them is going to slow you way down. In essence, the same reason you shouldn't use your database as the Celery backend is the same reason you might not want to query the database for fresh objects. It depends on your use case of course. Passing straight values/strings should be strongly considered too since serializing and passing whole objects when you only need a single value is not good either.
Oh absolutely values before objects. I said "serializing" more in the sense that pickle is always used for storing the arguments into the queue (or whatever the default serializer).
It always depends on your use-case but generally you want your application to behave correctly, which means it has to have correct/fresh data...you can't sacrifice correctness because of an inability to scale your database.
Yes. I think saying "you can't sacrifice correctness because of an inability to scale your database" is perhaps conveying the wrong message though. I mean, your very first point is about database scaling issues and the advantages of using something like RabbitMQ to avoid expensive SQL queries.
If you are processing a lot of data in Celery, you really want to try to avoid performing any database queries. This might mean re-architecting the system. You might for example have insert-only tables (immutable objects) to address this type of concern.
If you combine Celery with supervisord it's important to check the official config file[1]. At least two settings there are really important - `stopwaitsecs=600` and `killasgroup=true`. If you don't use them you might end up with a bunch of orphaned child Celery processes and your tasks might be executed more than once.
Wondering about something: if you need to have a long task (5s to 10s) in the background, or even longer, for an AJAX request, what should you rather do:
- use gevent + gunicorn, or Tornado, in order to keep a socket open while the worker is processing the task?
- use polling? (less efficient)
- use websockets (but then the implementation is perhaps a bit more complex)
If your AJAX request requires long task processing and requires you to wait for it, then this is not a background task any more: it's done in one of the web server threads, and even if the thread outsources the task to another process, it's still waiting on that process to finish before returning the AJAX response. This is bad.
I'm not entirely convinced about websocket solutions in Python yet, but I've been told flask-websockets is awesome. Nevertheless, this doesn't solve the problem for you, because the request is just keeping an open line and waiting for a response... blocking is bad.
The simplest advice I would have is to have the AJAX request trigger a background task and return immediately. The background task will then have some kind of side effect (e.g. write some result to a database somewhere) which the client can then look for with some kind of polling mechanism (on some other endpoint). Of course you can complicate this a lot, depending on your needs, but this seemed like the most straightforward solution.
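Here is a minimal sketch of that trigger-then-poll shape, with a thread pool standing in for the task queue. There is no web framework and no Celery here: `start_task` and `poll_task` are hypothetical functions representing what the two endpoints would do, and the in-process dict would be a database or result backend in a real deployment.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
tasks = {}  # task id -> Future; a real app would keep this in a shared store


def start_task(func, *args):
    """What the first endpoint would do: hand off the work, return an id at once."""
    task_id = uuid.uuid4().hex
    tasks[task_id] = executor.submit(func, *args)
    return task_id


def poll_task(task_id):
    """What the polling endpoint would do: report state without blocking."""
    future = tasks.get(task_id)
    if future is None:
        return {"state": "unknown"}
    if not future.done():
        return {"state": "pending"}
    if future.exception() is not None:
        return {"state": "failed", "error": str(future.exception())}
    return {"state": "done", "result": future.result()}


def slow_sum(numbers):
    """Hypothetical long-running job."""
    return sum(numbers)


task_id = start_task(slow_sum, [1, 2, 3])
# Only for this demo: wait for completion so the poll below is deterministic.
tasks[task_id].result(timeout=5)
status = poll_task(task_id)
executor.shutdown()
```

The browser would call the polling endpoint with the returned id every second or so until the state is no longer "pending".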
"I'm not entirely convinced about websocket solutions in Python yet, but I've been told flask-websockets is awesome. Nevertheless this doesn't solve the problem for you. Cause the request is just keeping an open line and waiting for a response....blocking is bad."
Tornado only blocks if you do something silly. It's event based, and can keep hundreds of connections open, each waiting for its async response event before responding on the open connection.
"The most simplest advise I would have is to have the ajax request trigger a background task and return immediately. The background task will then have some kind of side effect (ie. write some result to a database somewhere) which the ajax request can the look for with some kind of polling mechanism (on some other endpoint)."
Wow, overkill much? Polling is bad, and is exactly the kind of bad solution that a lot of these libraries are in place to prevent developers from needing to do.
Websockets were made to solve the long-polling and poll-spamming that was prevalent. Now all you have to do is keep a light, open web-socket connection to the server. And the server, being async/evented, will respond when the task is good and ready. Nice and clean.
"Tornado only blocks if you do something silly. It's event based, and can keep hundreds of connections open and waiting for it's async response event before actioning/responding the open connection." - Yes pure tornado based apps are probably fine if you know what you are doing.
"Wow, overkill much? Polling is bad, and is exactly the kind of bad solution that a lot of these libraries are in place to prevent developers from needing to do." - Polling is not bad if you have a good use case. You just cannot do non-blocking stuff with Django for instance, or it's very very hard and tricky. Websockets also limit you with the number of connections you can have open at once.
So you think polling is the most effective solution; perhaps that is the case.
I was thinking whether using something like gevent or Tornado, a bit like nodejs, would enable the webserver to keep the socket open without blocking while the computation is made in a worker, then return the result simply to the socket, thus avoiding having to write a more complex websocket-based or polling-based system, but rather using AJAX transparently :)
Doing non-blocking is tricky, and I'm not convinced that Python's solutions are where I'd like them to be on this topic. Also keep in mind that the number of open TCP connections is finite, so you can't really scale well with websockets that way, IMO. But again, it depends on your use case.
For what it's worth, I'm working on exactly this problem with Django+Celery.
Polling seems to be the best way to do it, as it doesn't leave sockets open, and doesn't require a websocket enabled browser.
The implementation I'm working on involves keeping the task metadata in the DB, and polling against that lookup (it makes it easier to do things like restrict task results to specific users as well).
I was also thinking that another way to do it could be to write the result in its final format to a /ajax_output/ directory with a randomly generated name. Then your polling would depend entirely on nginx, which could end up being much more efficient than running through your application framework. Just make sure you regularly clean unused files if you have privacy concerns.
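A small sketch of the write-to-a-directory idea, using only the standard library. The directory here is a temporary one created for the demo rather than a real `/ajax_output/` served by nginx, and `write_result` is a hypothetical helper. Writing to a temporary name and renaming means a poller never sees a half-written file.

```python
import json
import secrets
import tempfile
from pathlib import Path


def write_result(output_dir, result):
    """Write the finished result under an unguessable name and return that name."""
    name = secrets.token_urlsafe(16) + ".json"
    final_path = Path(output_dir) / name
    tmp_path = final_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(result))
    tmp_path.replace(final_path)  # rename into place once the file is complete
    return name


# Demo directory; in the setup described above this would be the nginx-served path.
output_dir = tempfile.mkdtemp(prefix="ajax_output_")
name = write_result(output_dir, {"total": 42})
loaded = json.loads((Path(output_dir) / name).read_text())
leftovers = list(Path(output_dir).glob("*.tmp"))
```

The client is given the name up front and simply polls that URL until it stops returning 404.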
I really like Tornado and websockets, but keep in mind it gets dicey to scale after you get to about 50 open connections at the same time on one box. You can do things to stretch that out, but it's not the easiest thing. You also still have browser requirement issues. So it really depends on your use case. Polling, which is my least favorite method, is the most versatile method. It's easy to use Flask for all of these issues. That said, I'm a big fan of Tornado.
What's the use case? Do you need to know exactly when the task is done? Does it vary in duration significantly? Can you split the call into two - one to start it, another to check the status given an ID?
It could be the user sending a computation to the server and wanting their interface to be updated as soon as the computation is done. That is feasible by regularly polling the backend after launching a worker process, but it adds complexity compared to simply opening a non-blocking socket a la nodejs, waiting for the worker to finish its job, and sending the result back to the browser.
I used Redis for celery in production with great success for a year but then we started running some long running jobs that needed the ACKS_LATE setting and the Redis delivery timeout kept hurting us by resending the task to another worker. It's configurable but in the end we just switched to RabbitMQ. I found it quite painless to setup and migrate to.
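For anyone hitting the same thing: as far as I know, the two settings involved look roughly like this in an old-style (upper-case) Celery settings module. I'm assuming the "delivery timeout" mentioned above is the Redis transport's visibility timeout, and the 12-hour value is only an illustration; it needs to be longer than your slowest job.

```python
# Acknowledge a message only after the task has finished, not when it starts.
CELERY_ACKS_LATE = True

# With the Redis transport, an unacknowledged task is redelivered to another
# worker once this many seconds have passed.
# Assumption: 12 hours comfortably exceeds the longest-running job.
BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 12 * 60 * 60}
```

With late acks, a job that runs longer than the timeout looks "lost" to Redis and gets handed out again, which matches the resending described above.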
Redis is still not an AMQP, but yes Redis's Pub/Sub works quite nicely. Out of all the brokers celery supports I'd recommend only RabbitMQ and Redis to people.
With container solutions like Docker and prebuilt images the setup part is kinda eliminated. Although I don't remember having any special configuration issues with RabbitMQ as well, it just works TM. That's always nice right? :)
Redis works great as a results backend, but I'd still use RabbitMQ for the queue. RabbitMQ is designed to be a message queue, and it does a great job at it.
As one of the authors of taskflow, I'd like to give a little shout-out for its usage (since it can do similar things to Celery, hopefully more elegantly and easily).
I've heard so much about Celery but still have no clue when it would be used. Could someone give some specific examples of when you have used it? I don't really even know what a distributed task is.
A background task is just something that's computed outside of the standard http request/response process. So it's asynchronous in the sense that the result will be computed sometime in the future, but you don't care when.
Distributed just means that you can have your task processing spread out across multiple machines.
A specific example would be, let's say, after your user registers on your website for the first time you want to get a list of all their Facebook/Twitter friends. This action will take a long time and is not vital to the whole registration/login process, so you set a task to do that later and let the user proceed to the site instead of making them look at a spinner the whole time. When the friend list becomes available it will show up on the website (on their profile or wherever). Makes sense?
I'd also add:
Be wary of context dependent actions (e.g. render_template, user.set_password, sign_url, base_url) as you aren't in the application/request context inside of a celery task.
This line `'my_taskA': {'queue': 'for_task_A', 'routing_key': 'for_task_B'},` "for_task_B" should be "for_task_A" to match the CELERY_QUEUES definition. Unless I'm misunderstanding what you're doing, of course.
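A typo like this is easy to catch with a small consistency check. The sketch below uses plain dicts for both settings purely for illustration; the article's actual `CELERY_QUEUES` definition may be shaped differently, and `my_taskB` and `mismatched_routes` are hypothetical names.

```python
# Hypothetical, simplified queue and route definitions in plain-dict form.
CELERY_QUEUES = {
    "for_task_A": {"routing_key": "for_task_A"},
    "for_task_B": {"routing_key": "for_task_B"},
}
CELERY_ROUTES = {
    "my_taskA": {"queue": "for_task_A", "routing_key": "for_task_A"},
    "my_taskB": {"queue": "for_task_B", "routing_key": "for_task_B"},
}


def mismatched_routes(queues, routes):
    """Return task names whose route does not agree with its queue definition."""
    bad = []
    for task, route in routes.items():
        queue = queues.get(route["queue"])
        if queue is None or queue["routing_key"] != route["routing_key"]:
            bad.append(task)
    return bad


# The same routes, but with the routing key typo described above.
typo_routes = dict(
    CELERY_ROUTES,
    my_taskA={"queue": "for_task_A", "routing_key": "for_task_B"},
)
```

Running `mismatched_routes(CELERY_QUEUES, typo_routes)` flags `my_taskA`, while the corrected routes come back clean.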
AMQP = Advanced Message Queuing Protocol, so it's wrong to say that a message broker is "an AMQP". Also, give Redis a try: it's much easier to set up and uses fewer resources.
We should probably talk about the elephant in the room when addressing newbies: the Celery daemon needs to be restarted each time new tasks are added or existing ones are modified. I got past that with the ugly hack of having only one generic task[1] but people new to Celery need to know what they're getting into.
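For readers wondering what a "generic task" could look like: the sketch below is one possible shape, not necessarily the hack referred to above. It is plain Python with no Celery involved; `run_by_name` is a hypothetical function that a single registered task would wrap, resolving the real work from a "module:function" string at call time.

```python
import importlib


def run_by_name(dotted_path, *args, **kwargs):
    """Resolve 'package.module:function' at call time and invoke it."""
    module_name, _, func_name = dotted_path.partition(":")
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    return func(*args, **kwargs)


# Standard library functions are used here only so the example is self-contained.
result = run_by_name("math:sqrt", 16)
encoded = run_by_name("json:dumps", [1, 2])
```

Two caveats: modules that a worker has already imported stay cached, so edits to existing code still need a reload or restart, and calling whatever name arrives on the queue is risky unless you check it against an allow-list.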
Noted, the wording is a bit contrived, I'll give you that. I like Redis as well; it was mentioned in the comments here a few times. Good thinking about pointing out reloading, btw. Thanks.