Hi, thanks for reading. I need to stress that this is just an example project I came up with after I decided to play with Elixir and Python integration. The project itself doesn't even make much sense: as you and others point out, there are many other, more Pythonic ways of handling this task. I chose web scraping because it was easy to split into two subtasks: one I/O-bound and one CPU-bound.
> The slides mention problems with Celery and RAM usage when writing crawlers, but since this is a mostly I/O-bound task you should be using the eventlet/gevent execution pools instead of the multiprocessing one.
Fetching pages is an I/O-bound task, so it's done by Elixir. There we have a pool (used for rate limiting) of 10 processes (Erlang processes - an important distinction) that do the downloading.
I think the closest analogy to what happens in this project is a Twisted app (EDIT: or an app in any other concurrent, but not parallel, framework) that uses a pool of processes for CPU-bound tasks. Here the Twisted part is replaced with Elixir.
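To make the analogy concrete, here is a minimal Python sketch of that shape - concurrent downloads capped at 10 in flight, with the CPU-bound step shipped to a pool of 4 worker processes. This is a hypothetical illustration, not the project's actual code; `fetch` and `cpu_bound_processing` are stand-ins for the real downloading and processing functions.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound_processing(page):
    # hypothetical stand-in for the CPU-bound parsing/processing step
    return len(page)

async def fetch(url):
    # hypothetical stand-in for an async HTTP GET
    await asyncio.sleep(0.01)
    return "<html>" + url + "</html>"

async def crawl(urls):
    loop = asyncio.get_running_loop()
    sem = asyncio.Semaphore(10)  # at most 10 downloads in flight (rate limiting)
    with ProcessPoolExecutor(max_workers=4) as pool:  # one worker per core
        async def handle(url):
            async with sem:
                page = await fetch(url)
            # hand the CPU-bound step to a separate OS process
            return await loop.run_in_executor(pool, cpu_bound_processing, page)
        return await asyncio.gather(*(handle(u) for u in urls))

if __name__ == "__main__":
    print(asyncio.run(crawl(["a", "bb", "ccc"])))
```

In the project, the `asyncio` half of this sketch is what Elixir replaces: lightweight Erlang processes handle the I/O concurrency, while Python keeps only the process-pool half.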
EDIT: Also, we use Celery at work extensively and it works great, and there is no real need to replace it with anything! Again, this project is just a tech demo; it doesn't make (much) sense on its own. But there are other possible integration patterns, where Elixir and Python have different roles, which actually do make sense. I think.
I found your project very interesting! I'm very familiar with Erlang/OTP and have been meaning to play with Elixir for some time. Since you're separating downloading from processing, maybe IPC is a better term for what you're doing, as you don't want these two steps separated over the network, for data-locality reasons.
I only wanted to clarify the RAM situation described in your slides, as it's not widely known that you can use eventlet/gevent with Celery.
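For reference, the pool is selected when launching the worker, not in the app code - something like this (where `proj` is a placeholder for your Celery app module):

```shell
# requires: pip install celery eventlet
# -P picks the execution pool; -c sets concurrency (cheap green threads here)
celery -A proj worker -P eventlet -c 100
```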
Here (https://klibert.pl/statics/python-and-elixir/#/5/6, line 18) you can see a `_do_some_real_processing_function`. The whole premise of the project is that this function is CPU-bound. My processor has four cores, so I create a pool of 4 Python processes (https://klibert.pl/statics/python-and-elixir/#/5/2, line 10).
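That Python-side pool looks roughly like this sketch - hypothetical code, with a fake stand-in for the slide's `_do_some_real_processing_function`:

```python
from multiprocessing import Pool

def do_some_real_processing(page):
    # hypothetical stand-in for the real CPU-bound processing;
    # fake CPU work: sum the character codes of the page
    return sum(ord(c) for c in page)

if __name__ == "__main__":
    pages = ["<html>a</html>", "<html>bb</html>"]
    with Pool(processes=4) as pool:  # one worker per core
        print(pool.map(do_some_real_processing, pages))
```

Eventlet/gevent wouldn't help with this half: green threads multiplex I/O waits, but a CPU-bound function still holds the GIL, so real parallelism needs separate OS processes.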