Replacing Celery with Elixir in a Python Project (klibert.pl)
163 points by klibertp on June 12, 2016 | 44 comments



Author of Celery here. This is an interesting presentation!

I want to clarify something about Celery and RAM usage.

When writing web crawlers, and other (mostly) I/O-bound tasks, you should be using the eventlet/gevent execution pools instead of the multiprocessing one. This will drastically reduce memory use, and perform better.

If you have four CPU cores you can start four worker instances with 1000 threads each (for a total of 4k threads): `celery multi start 4 -A proj -P gevent -c 1000`

This will utilize all the CPU/cores in your system, working around the GIL.

One of the new features coming in Celery 4 is a message protocol with support for multiple languages, maybe we could have an Elixir worker soon.


Hi, thanks for reading. I need to stress that this is just an example project I came up with after I decided to play with Elixir and Python integration. The project itself doesn't even make much sense - as you and others point out, there are many other, more pythonic ways of handling this task. I chose web scraping because it was easy to split into two subtasks: one IO and one CPU bound.

> The slides mention problems with Celery and RAM usage when writing crawlers, but since this is a mostly I/O-bound task you should be using the eventlet/gevent execution pools instead of the multiprocessing one.

Here: https://klibert.pl/statics/python-and-elixir/#/5/6 in line 18 you can see a `_do_some_real_processing_function`. The whole premise of the project is that this function is CPU-bound. My processor has four cores, so I create a pool of 4 Python processes (https://klibert.pl/statics/python-and-elixir/#/5/2 line 10).

Fetching pages is an IO-bound task, so it's done by Elixir. There we have a pool (for rate limiting) of 10 processes (the Erlang ones - important distinction) that do the downloading.

I think the closest analogy to what happens in this project is a Twisted (EDIT: or any other concurrent, but not parallel framework) app which uses a pool of processes for CPU-bound tasks. Here the Twisted part is replaced with Elixir.

EDIT: Also, we use Celery at work extensively and it works great; there is no real need to replace it with anything! Again, this project is just a tech demo; it doesn't make (much) sense on its own. But there are other possible integration patterns, where Elixir and Python have different roles, which actually do make sense. I think.


I found your project very interesting! I'm very familiar with Erlang/OTP, and have been meaning to play with Elixir for some time. Since you're separating downloading from processing, maybe IPC is a better term for what you're doing, as you don't want these two steps to be separated over the network for data locality.

I only wanted to clarify the RAM situation described in your slides, as it's not widely known that you can use eventlet/gevent with Celery.


Am I the only one who dislikes the slides having both vertical and horizontal navigation? Going through the whole presentation requires knowing the right direction (down, unless it's the last slide in a section, in which case it's right...)


You can use Space, which should take you to the correct next slide, without thinking about directions. I found it a neat format to structure a talk into segments. Also, you can press Esc and get a "map" of the slides.


Nope. I used space. After the Erlang slide it moves right to "tools used", which is the slide on the same level in the next column.

This along with back button hijacking makes for the single worst presentation format I've encountered so far.

EDIT: it appears that only happens when you have been in that column before. This is still pretty surprising behavior.


Sorry about that. Maybe some of this behavior is configurable in Reveal.js, I'd need to check.

It was fine for me because these are the slides for a talk and I was the one showing them to people.

I'd be happy to post the video of the talk (https://www.facebook.com/events/211449562541292/), but after three months I still haven't heard anything from the people behind the event...


No reason to apologize. The content was definitely still viewable.


I was more annoyed that each slide pushed another entry into the browser navigation history.


I also hate slide decks that do that. I literally went through this one left to right and thought "wow, that was a really high level overview". Only then I realized I had to hit down some of the time.

(And the "you idiot, you weren't using the correct navigation keys" is not a particularly satisfying answer, I navigated with left and right arrows the way I have always navigated slides for years.)


That strikes me as kinda neat, actually -- is that not the purpose of putting the slides in this order? It's a shortcut to just the topic slides, so you can skim through them even faster and only dig in once you want to.


the right/down/up/left keys are for the presenter to easily skip to preset positions.

as a viewer you can just press space for the traditional progression.


It is worth mentioning that reveal.js presentations also support PageUp/PageDown as keys to go forward or backward.


It's a useful feature during the Q&A portion of a presentation, since you might need to move backward and forward through the slides quickly to pull a relevant reference, diagram, or code example. I'd agree that the primary content should be linear.


Interesting article!

I would just like to point out that concurrent downloads can be handled much more efficiently in Python > 3.4 thanks to the asyncio library. For an example, look at Guido van Rossum's crawler [0].

[0]: https://github.com/aosabook/500lines/tree/master/crawler
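The core of that asyncio pattern, independent of aiohttp, is a semaphore-bounded `asyncio.gather`. Here's a minimal sketch; the `fetch` body is a stand-in (a real crawler would await an aiohttp request there), and the function names are mine:

```python
import asyncio

async def fetch(url):
    # Stand-in for a real request; a crawler would await an
    # aiohttp call here. The sleep simulates network latency.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def crawl(urls, max_concurrency=10):
    # A semaphore caps the number of in-flight requests,
    # which is how asyncio crawlers usually rate-limit themselves.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(crawl([f"http://example.com/{i}" for i in range(5)]))
```

All fetches run concurrently on a single thread; only the semaphore limits how many are in flight at once.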


Asyncio doesn't support HTTP yet, but that could be fixed with a library like aiohttp.


The linked crawler uses aiohttp.


The multiprocessing slide mentions the RAM usage issue with things like Celery (because you start many instances of Python and load in dependencies). Does this solve them?

If so, how does it get around the whole GIL thing and whatnot? Or maybe I'm misunderstanding at what level things are happening?

Is it that you still have one python process but the bottleneck/URL fetching is happening inside your elixir stuff?

Super interested in this; we have this problem with Celery workers and would love not to be bound by RAM for worker count.


Unfortunately, no, this doesn't solve the problem of RAM usage by worker Python processes: they are still separate OS processes, although they look like "normal"[1] Erlang/Elixir processes from that side.

By "replacing" I mean pretty literally getting the same effect as with Celery. [EDIT: BTW, in the recent post about Celery it was praised because it lets you compose background/async tasks. Of course, Erlang has something like that, too: https://chrisb.host.cs.st-andrews.ac.uk/skel-test-master/tut...] There really is no reliable way around the GIL in Python other than multiprocessing.

The main idea here is that we should keep IO-bound code on the Elixir side as it's simply more efficient and easier to write there, but to delegate CPU-bound code to a pool of external processes. But this is only one of many possible patterns of integration: I can imagine a Django project where most of the logic is on the Python side and Elixir only handles WebSockets/long polling connections. Thanks to ErlPort passing data and calling functions between Elixir and Python is effortless, you can call an Elixir function just as easily as you can make Elixir call a Python function in a different process.
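To give a sense of how thin that glue is: the Python side of an ErlPort integration is just an ordinary module whose functions Elixir calls by name. A minimal sketch, with the module and function names made up for illustration:

```python
# worker.py -- a hypothetical module loaded by ErlPort. On the Elixir
# side it would be called roughly like this (paths are placeholders):
#   {:ok, pid} = :python.start(python_path: 'path/to/this/dir')
#   :python.call(pid, :worker, :process_page, [html])
import re

def process_page(html):
    # Stand-in for the CPU-bound processing step: pull out link targets.
    if isinstance(html, bytes):
        # ErlPort delivers Elixir strings as binaries.
        html = html.decode("utf-8")
    return re.findall(r'href="([^"]+)"', html)
```

The return value is serialized back to an Elixir term automatically; no explicit protocol code is needed on either side.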

Moreover, ErlPort supports Ruby, so you can have workers written in it, too. Workers in other languages are no problem either, as long as you write the glue code yourself (or generate it using Elixir macros).

[1] It bears repeating: Erlang processes are not OS-level processes. The Erlang virtual machine, BEAM, runs in a single OS-level process. Erlang processes are closer to green threads, or tasklets as known from Stackless Python. They are extremely lightweight, implicitly scheduled, user-space tasks which share no memory. Erlang schedules its processes on a pool of OS-level threads for optimal utilization of CPU cores, but this is an implementation detail. What's important is that Erlang processes provide isolation in terms of memory and error handling, just like OS-level processes. Conceptually the two kinds of processes are very similar, but their implementations are nothing alike.


If you're having RAM problems, you don't want to start each processor independently as a "port". You'd want to write a Python server that offers a socket, start one instance that listens on that socket and then forks (since you control the system fully, probably let it fork once per connection rather than worrying about "preforking" or other complicated scenarios for when you don't control the concurrency), and then write the glue code to talk across that socket. [1]

That way, your Python processes load all their modules and do all their initialization once, the forking automatically gives you copy-on-write semantics in RAM, and you end up with more or less one copy of them in memory.

[1]: I don't know if there's a "perfect" off-the-shelf solution for you, but there's enough existing code in the world with all the pieces that it's pretty easy to do that nowadays. For instance, one very easy protocol that leaps to mind is:

    4 bytes to say how much JSON there is
    a JSON object containing the metadata for your connection
    4 or 8 bytes to say how long the incoming webpage is
    the contents of the webpage you're telling Python to process
It's not perfectly efficient, but it has great bang for the buck, and will carry you a long way before you need anything fancier.
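That framing can be sketched in a few lines with `struct` and `json`. The function names are mine, and a real server would read each field from the socket incrementally rather than from a single buffer:

```python
import json
import struct

def encode_frame(meta, payload):
    # 4-byte big-endian length of the JSON metadata, the JSON itself,
    # then an 8-byte payload length followed by the raw payload bytes.
    meta_bytes = json.dumps(meta).encode("utf-8")
    return (struct.pack(">I", len(meta_bytes)) + meta_bytes
            + struct.pack(">Q", len(payload)) + payload)

def decode_frame(buf):
    # Inverse of encode_frame, operating on a complete buffer.
    (meta_len,) = struct.unpack_from(">I", buf, 0)
    meta = json.loads(buf[4:4 + meta_len].decode("utf-8"))
    (payload_len,) = struct.unpack_from(">Q", buf, 4 + meta_len)
    start = 4 + meta_len + 8
    return meta, buf[start:start + payload_len]

frame = encode_frame({"url": "http://example.com"}, b"<html>...</html>")
meta, body = decode_frame(frame)
```

The fixed-size length prefixes are what let each side know exactly how many bytes to read next, which is the whole trick of length-prefixed protocols.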

You might also be able to bash ErlPort into compliance and get the client and server side going, but, well, this isn't very difficult code to write and it doesn't take much "screwing around with opaque library code that doesn't really want to be used separately" before you could have just written this.


seconded. also, celery turned out quite inflexible when you need multiple different workers; there seem to be some bugs with strange workarounds.


For something as straightforward as downloading files in parallel, why not just use Python threads? Since most of the IO will occur in C without holding the GIL, it seems silly to be forking off entire new processes for this sole purpose.

The processing of the file in Python would hold the GIL, but that could still be resolved with a multiprocessing pool of workers.

In other words, while academically interesting, mixing Elixir and Python for web crawling doesn't make much actual sense.


> In other words, while academically interesting, mixing Elixir and Python for web crawling doesn't make much actual sense.

That's right. It wasn't supposed to be a real project; it's just a tech demo of sorts. I will have some more real-world uses for the integration in my Raspberry Pi project, I think, but I'm not there yet.


I personally like to bridge Elixir and Python via nanomsg with MessagePack serialization.

Here are some useful libraries:

http://nanomsg.org/

https://hex.pm/packages/exns

https://github.com/walkr/nanoservice


Can we avoid the overhead of starting and shutting down processes by running a single Python process and communicating using something like grpc [0] (or even JSON-RPC for maximum simplicity)?

How do web frameworks like Flask handle multiple concurrent requests? Would performance increase if we started multiple instances of this Python web server on the same machine and load balanced them? The code would be much simpler if there was no need to handle process management.

[0]: http://www.grpc.io/


> Can we avoid the overhead of starting and shutting down processes by running a single Python process and communicating using something like grpc

Actually, that's how ErlPort works. You start a Python process and then you call some functions inside that process. How many functions you'd like to run using a single Python process is up to you - you can spawn a new worker for every call or spawn only one worker and stick to it unless it's killed somehow.

You can also register callbacks on either side (Elixir or Python), and with a bit more effort you can make the Python process accept normal Elixir messages and answer in the same way.

> How do web frameworks like Flask handle multiple concurrent requests?

In my experience, they mostly rely on uWSGI and a pool of processes...


The whole point here is that we need parallel processing, and Python cannot provide this in a single process due to the GIL. Flask applications depend on the workload being I/O-bound, so they can achieve concurrency where parallelism is not required. If you built a Flask endpoint that is CPU-bound in Python code, you'd find it could not achieve much concurrency at all.


There are plenty of multiprocess WSGI servers, Flask's built-in one included.


I think I misunderstood your presentation at first glance. Elixir "processes" are actually green threads. Thus, you actually have Python interpreters in separate OS processes, right?


While BEAM is indeed great, I'm wondering why you didn't use Scrapy? It handles concurrency well and is a battle-tested production scraper.


The same question was asked when I gave the talk. I'm afraid the only answer is: because I could :-) I thought up a project where it would make some sense to use Elixir and went with it. It's not a practical or "real world" project at all!

The point is that the Python<->Elixir integration can be very tight, and that there's little overhead when using each language for the things it's good at.


> because I could

Love that!

I will look it over and, because I favor LFE (Lisp Flavored Erlang) over Elixir, try to do something neat like this. Although Elixir is growing on me from my initial take on it a year ago.


Thanks for sharing.

First my complaint: the slides are really annoying :) A traditional left-right stack would've been nicer.

That said, this is something I've been looking at lately as I've got a bunch of python code that I want to parallelize, and a strong interest in Elixir. I found your code samples very helpful.

Edit: I see the other comments mention the slides, and how to navigate them w/ space. Disregard my complaint.


OP mentioned that Elixir is better at concurrency and parallelism. Python is better at processing (more libraries/toolsets available).

As someone who wants to use Elixir more and see its community flourish, what libraries does Elixir need so this project could be done without Python?

I'm guessing some sort of Beautiful Soup or Nokogiri equivalent?


No, actually, there is nothing lacking in the Elixir ecosystem if you want to write a web scraper! That's the point. I used HTTPoison for sane (and efficient) HTTP requests, and I could have used (and have used in some other projects) Floki (https://github.com/philss/floki) for HTML parsing and querying.

However, there are things like generating PDFs with graphs based on the tabular data on the pages, or running some more involved Pandas TimeSeries transformations, which are simply not available in Elixir. Nor should they be, I think: reportlab or Pandas are already written and do a good job at what they are meant to do. This is the idea: we write a crawler in Elixir and delegate processing to something else. Anything else, in fact: Python was chosen because of ErlPort and how easy it is to integrate, but in practice your workers can be written in anything that understands JSON.


Interesting use case but is it actually nice to mix up the technologies on your stack just for this? Also why re-invent a solution for something that already works? Just my 2 concerns :)


Fantastic. I recently used Elixir for scraping courses off Udemy. I've put the results of this into a site, http://www.coursenut.com

Also I've added an Elixir course promotion to:

http://www.coursenut.com/courses/3692


What is the purpose of the scrape?


Does this solve the state-sharing difficulties of the python solution?


This is a cool way to use both tools together.


I have been looking at exactly this type of solution, as I want to use Elixir for distribution of python(numpy)-based computations to clients. Just a quick question.... is ErlPort maintained? It looks like it's been in alpha for 3 years...


There's little development, it seems, but it's not dead. There are discussions under open issues, PRs are accepted, and an occasional commit lands in master.

It was stable when I used it. But it's also rather simple and short (less than ~1k LOC on the Python side, for example), so I think it wouldn't be a heavy burden to maintain, even if the current maintainers and users all dropped dead tomorrow :-)


enough for me. I'm going to use this, because the combo of Python (more accurately, numpy and pandas) plus Elixir is pure goodness.


Sorry if this is a stupid question, but why would you want to distribute computation to the client?



