Author of Celery here. This is an interesting presentation!
I want to clarify something about Celery and RAM usage.
When writing web crawlers and other (mostly) I/O-bound tasks, you should use the eventlet/gevent execution pools instead of the multiprocessing one. This will drastically reduce memory use and perform better.
If you have four CPU cores, you can start four worker instances with 1000 green threads each (4000 in total):
`celery multi start 4 -A proj -P gevent -c 1000`
This will utilize all the CPU/cores in your system, working around the GIL.
One of the new features coming in Celery 4 is a message protocol with support for multiple languages; perhaps we could have an Elixir worker soon.
Hi, thanks for reading. I need to stress that this is just an example project I came up with after I decided to play with Elixir and Python integration. The project itself doesn't even make much sense - as you and others point out, there are many other, more Pythonic ways of handling this task. I chose web scraping because it was easy to split into two subtasks: one I/O-bound and one CPU-bound.
> The slides mention problems with Celery and RAM usage when writing crawlers, but since this is a mostly I/O-bound task you should be using the eventlet/gevent execution pools instead of the multiprocessing one.
Fetching pages is an I/O-bound task, so it's done by Elixir. There we have a pool (for rate limiting) of 10 processes (the Erlang ones - an important distinction) that do the downloading.
I think the closest analogy to what happens in this project is a Twisted (EDIT: or any other concurrent, but not parallel framework) app which uses a pool of processes for CPU-bound tasks. Here the Twisted part is replaced with Elixir.
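That pattern can be sketched in stdlib Python, with asyncio standing in for the concurrent-but-not-parallel framework and a process pool for the CPU-bound step; the `count_words` function is a made-up stand-in for real page processing:

```python
# Sketch: a concurrent (not parallel) event loop delegating CPU-bound
# work to a pool of OS processes, analogous to Twisted + a process pool.
import asyncio
from concurrent.futures import ProcessPoolExecutor

def count_words(text):
    # CPU-bound step: runs in a separate process, outside the GIL.
    return len(text.split())

async def main():
    loop = asyncio.get_running_loop()
    pages = ["one two", "three four five"]
    with ProcessPoolExecutor(max_workers=4) as pool:
        # The event loop stays free to handle I/O while workers crunch.
        counts = await asyncio.gather(
            *(loop.run_in_executor(pool, count_words, page)
              for page in pages))
    return counts

if __name__ == "__main__":
    print(asyncio.run(main()))  # [2, 3]
```

In the project under discussion, the event-loop half is played by the Elixir/Erlang VM rather than a Python framework.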
EDIT: Also, we use Celery at work extensively; it works great and there is no real need to replace it with anything! Again, this project is just a tech demo and doesn't make (much) sense on its own. But there are other possible integration patterns where Elixir and Python have different roles, which actually do make sense. I think.
I found your project very interesting! I'm very familiar with Erlang/OTP, and have been meaning to play with Elixir for some time. Since you're separating downloading from processing, perhaps IPC is a better term for what you're doing, as for data locality you don't want these two steps separated over the network.
I only wanted to clarify the RAM situation described in your slides, as it's not widely known that you can use eventlet/gevent with Celery.