
Or use a language where fully utilizing all CPU cores is transparent, like Elixir? There's zero complexity, you basically add 4-5 lines of code and that's it. Honestly, not exaggerating.

I've written several very amateur scrapers over the last several years, and I am never going back to languages with a global interpreter lock, ever.




I'm assuming you're talking about Python, which is also "4-5 lines" to use multithreading or multiprocessing. Can you explain what's wrong with GIL languages?

Now that I think about it, it's even less than 4 lines:

from multiprocessing.pool import Pool  # or ThreadPool

pool = Pool()

pool.map(scrape, urls)


When the pooled functions are I/O bound then the GIL is not a problem. Any GIL language will do.
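A minimal sketch of the I/O-bound case, assuming a hypothetical `scrape` function that just sleeps to stand in for a network request (the URLs and the `scrape_all` helper are made up for illustration). Blocking calls such as `time.sleep` or a socket read release the GIL, so a thread pool is usually enough here:

```python
import time
from multiprocessing.pool import ThreadPool


def scrape(url):
    """Hypothetical stand-in for a real download: sleep to simulate latency."""
    time.sleep(0.05)  # blocking waits like this release the GIL
    return (url, len(url))


def scrape_all(urls, workers=8):
    """Run scrape over all URLs on a pool of threads, preserving input order."""
    with ThreadPool(workers) as pool:
        return pool.map(scrape, urls)


if __name__ == "__main__":
    # Placeholder URLs, not real endpoints.
    urls = [f"https://example.com/page/{i}" for i in range(16)]
    start = time.perf_counter()
    results = scrape_all(urls)
    elapsed = time.perf_counter() - start
    print(f"{len(results)} pages in {elapsed:.2f}s")
```

With 16 simulated requests and 8 threads, the waits should overlap rather than add up, though the exact timing depends on the machine.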

However, when generating reports, for example, try using the same approach to serialize 4 pages of DB records into 4 pieces of a big CSV file, each on a single CPU core. That is where languages without a GIL truly shine, while languages like Python and Ruby struggle unless their GIL implementations compromise and yield without waiting for an I/O operation to complete.


I'm not sure you understand how the GIL works in Python. If you're using multiprocessing, there's no locking across the code executing on each core. Also, if you're writing to the same file from four processes, you're going to need locking.
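As a rough sketch of the report scenario above, assuming made-up record data and hypothetical helper names (`write_chunk`, `merge_parts`): each worker process serializes its page to its own part file, so no cross-process locking is needed, and only the parent assembles the final CSV.

```python
import csv
import os
import tempfile
from multiprocessing import Pool


def write_chunk(job):
    """Serialize one page of records to its own CSV part file."""
    index, rows, out_dir = job
    path = os.path.join(out_dir, f"report.part{index}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def merge_parts(paths, target):
    """Concatenate part files in order; only the parent writes the final file."""
    with open(target, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as part:
                out.write(part.read())


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    # Hypothetical data: 4 "pages" of 1000 fake DB records each.
    pages = [[(p * 1000 + i, f"name-{i}") for i in range(1000)] for p in range(4)]
    jobs = [(i, rows, out_dir) for i, rows in enumerate(pages)]
    # Each worker is a separate process with its own interpreter and GIL.
    with Pool(processes=4) as pool:
        parts = pool.map(write_chunk, jobs)
    merge_parts(parts, os.path.join(out_dir, "report.csv"))
    print("wrote", os.path.join(out_dir, "report.csv"))
```

Writing all four pieces into one shared file instead would need a lock or a single writer process, which is the caveat raised above.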


Last I knew, GIL languages work well in multicore scenarios as long as all N tasks have I/O calls that serve as yielding points for the interpreter, and they do not use preemptive scheduling like the BEAM VM (Erlang, Elixir, LFE, Alpaca) does.

Am I mistaken?


As far as Python goes, yes. Multicore implies multiple processes, which means that each process will have its own Python interpreter, each with its own GIL.

If you were to use multithreading instead, you would generally have a problem if you were doing non-I/O work.
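One way to see the difference, as a sketch only: run the same pure-Python CPU-bound function through a thread pool and a process pool and compare wall-clock time. `count_primes` and `run_with` are hypothetical helpers invented for this example; on a standard CPython build with the GIL one would expect the threaded run to take about as long as running serially, but the numbers depend on the machine and interpreter build.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """Pure-Python CPU-bound work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_with(executor_cls, limits, workers=4):
    """Run count_primes over limits using the given executor class."""
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(count_primes, limits))


if __name__ == "__main__":
    limits = [50_000] * 4
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        run_with(cls, limits)
        print(f"{cls.__name__}: {time.perf_counter() - start:.2f}s")
```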


Then I think we have a misunderstanding of terms. To me "multicore" == "single process, many threads". Apologies for the confusion.

It seems that now we are both on the same page. Single process & many threads are problematic for GIL languages and that's why I gave up using Ruby for scrapers. GIL languages can work very well for the URL downloading part though.


Any further information on this? Last I looked (which was a while ago), the infrastructure like HTML parsers seemed surprisingly tricky in Elixir.


The only complication is if you want to use Meeseeks (https://github.com/mischov/meeseeks), which requires the Rust compiler and runtime to be installed because it has native bindings. Meeseeks is useful because it's a bit faster than the default Floki (https://github.com/philss/floki) and because it can handle very malformed HTML.

As for Elixir itself, here's a quick example:

```

# Assume this contains 1000 URLs

urls = [....]

# This runs up to 100 concurrent tasks (lightweight BEAM processes, not OS threads); if the max_concurrency option is omitted, it defaults to the number of CPU cores. For I/O-bound tasks, however, it's pretty safe to use many more.

results = urls |> Task.async_stream(&YourScrapingModule.your_scraping_function/1, max_concurrency: 100) |> Enum.to_list()

```

It's honestly that simple in Elixir. For finer-grained control the line count is a little bigger, but only a little. Not hundreds of lines, for sure.


Meeseeks's speed difference with Floki is not that significant, and my initial findings are they've leveled out even more with OTP 21, sometimes even swinging in favor of Floki.

The better handling of malformed HTML by default is the much bigger deal.


Thank you man (I know you are the author of Meeseeks), I didn't know that. I had always understood that Meeseeks was faster than Floki, but it seems that OTP 21 largely eliminated that, as you said.

Valuable info, thanks!


It was pretty interesting to see Floki get a lot faster and Meeseeks actually get a little slower with OTP 21. I'll enjoy figuring out why. I hope to get a chance to work on the OTP 21 performance of Meeseeks before too long.

On the plus side there were some nice memory improvements for Meeseeks in OTP 21.


(off-topic alert)

Don't let this sound patronizing, because it's not, but have you looked at how many times the boundary between the BEAM and the Rust code is crossed? I haven't inspected Meeseeks' code so I can't really say, I'm just wildly guessing.

My ancient experience with Java <-> C++ bridges has taught me that if your higher-level language calls the lower-level language very often then the gains of using the lower-level language almost disappear due to the high overhead of constantly serializing data back and forth.

Anyhow, we should probably take this discussion to ElixirForum and not here. :)

(I am @dimitarvp there and almost everywhere else on the net; HN is one of the very few places where my username is different.)



