Lessons learned scraping 100B product pages (scrapinghub.com)
265 points by Ian_Kerins on July 10, 2018 | 95 comments



> As most companies need to extract product data on a daily basis, waiting a couple days for your engineering team to fix any broken spiders isn’t an option. When these situations arise, Scrapinghub uses a machine learning based data extraction tool that we’ve developed as a fallback until the spider has been repaired.

I once worked on a spider that crawled article content and I ran into the same problem. I always wanted to try the following solution to it but never had the chance.

Assume you have a database of URLs and the fields you've scraped from them in the past (title, author, date, etc). If you ever fail to scrape one of those values from a new URL, here's what you do:

- Go back to one of the old URLs where you already have the correct value (let's say it's the title).

- Walk through the whole DOM until you find that known title. At each node you will have to remove child nodes except for text, to deal with titles like "Foo <span>Bar</span>" which you want to match against "Foo Bar". So this is going to be an expensive search.

- Generate several possible selectors which match the node you walked to (maybe you have ".title", ".title h2", ".content .top h2", etc).

- Test each new selector on several other already-crawled pages. If any of the selectors work 100% of the time, there's your new selector.
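
To make it concrete, here's a rough sketch of the idea in Python, assuming lxml with the cssselect package installed; the helper names are made up and the selector generation is deliberately naive:

```python
from lxml import html

def find_candidate_selectors(page_source, known_value):
    """Walk the DOM, match each node's collapsed text against a value we
    already scraped (e.g. a known title), and emit naive CSS paths to it."""
    tree = html.fromstring(page_source)
    candidates = []
    for node in tree.iter():
        if not isinstance(node.tag, str):
            continue                                  # skip comments etc.
        if " ".join(node.text_content().split()) != known_value:
            continue
        parts, current = [], node
        while current is not None and current.tag != "html":
            cls = current.get("class")
            parts.append(("." + cls.split()[0]) if cls else current.tag)
            current = current.getparent()
        candidates.append(" ".join(reversed(parts)))
    return candidates

def pick_working_selector(candidates, known_pages):
    """known_pages: list of (page_source, expected_value) pairs."""
    for selector in candidates:
        if all(
            any(" ".join(n.text_content().split()) == expected
                for n in html.fromstring(source).cssselect(selector))
            for source, expected in known_pages
        ):
            return selector
    return None
```

You'd run find_candidate_selectors() against a page you already scraped successfully, then pick_working_selector() against a handful of other known pages before promoting the winner.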

Any thoughts on whether something like this would work?


That sounds like a very creative idea.

What I do is run a regression test every x minutes. If it fails, I set a flag to save/store the HTML every time we crawl a page. Then we can go back and process those saved pages once we fix our crawler.


I crawl a specific site somewhere up to 50 unique URLs a day. I store both the unparsed full html as a file and the json I'm looking for as another separate file. The idea is if something breaks instead of taking a hit to make the call again, I have the data and I should just process that. It's come in extremely handy when a site redesign changed the DOM and broke the parser.

I do the same at $dayJob, where I'm parsing results from an internal API. Instead of making a call later that may not return the same data, I store the JSON and just process that. I feel like treating network requests as an expensive operation, even though they're not really, helped me come up with some clever ideas I'd never had before. It's a premature optimization considering I've had something like a 0.000001% failure rate, but being able to replay that one breakage made debugging an esoteric problem waaaaaay simpler than it would've been otherwise.
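
A minimal sketch of that "store the raw response next to the parsed result" pattern, with made-up paths and helper names:

```python
import json
import pathlib

def save_snapshot(page_id, raw_html, parsed, root="snapshots"):
    base = pathlib.Path(root) / page_id
    base.mkdir(parents=True, exist_ok=True)
    (base / "page.html").write_text(raw_html, encoding="utf-8")
    (base / "parsed.json").write_text(json.dumps(parsed, indent=2), encoding="utf-8")

def replay(page_id, parser, root="snapshots"):
    """Re-run a (fixed) parser against the stored HTML without re-fetching."""
    raw = (pathlib.Path(root) / page_id / "page.html").read_text(encoding="utf-8")
    return parser(raw)
```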


Off-topic: I so wish I worked for a company where my work involves scraping and storing and analyzing data. :(


Now is a good time to work in this field, since data science is hot and companies need web scrapers to provide the data for these models. At least that has been my experience in finance. Try applying!


I have zero experience in data science though. I am a pretty solid and experienced programmer and can learn it all but... don't know. Maybe I should just try indeed.

Do you have any recommendations for places and/or interview practices?


Without going into too much detail ...

Creating a domain-specific model can be done a million ways, obviously, but the nice thing about HTML is that the markup gives hints all along the way: the tree, the class names, the tags, etc. Coupled with the content of each tag, it's absolutely possible to determine collections of items with metadata using ML.

But like most ML problems, the underlying data that feeds the model is the time-consuming part.

If you are working in a single domain or a few domains, I 100% recommend this approach. If you're scraping something far more generalized, first you need to have models that you care about, and then you need to create models to determine the content type of what your scraper is looking at.

1) What kind of content do I have?
2) Does this match a known domain with a model?
3) Apply the appropriate model to the domain and hopefully extract correct data.
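
As a small illustration of step 1, here's one way node-level features could be pulled out of the markup for a classifier; the feature set and names are just an assumption, not a recipe:

```python
from lxml import html

def node_features(node):
    """Cheap structural + content features for one DOM node."""
    text = " ".join(node.text_content().split())
    return {
        "tag": node.tag,
        "classes": " ".join(sorted((node.get("class") or "").split())),
        "depth": len(list(node.iterancestors())),
        "text_len": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
        "has_currency_symbol": any(sym in text for sym in "$€£"),
    }

def featurize(page_source):
    tree = html.fromstring(page_source)
    return [
        (node_features(node), node)
        for node in tree.iter()
        if isinstance(node.tag, str) and node.text_content().strip()
    ]
```

Each feature dict could then be vectorized (e.g. with scikit-learn's DictVectorizer) and fed to whatever per-domain classifier labels nodes as title, price, image, and so on.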

Another huge issue is, of course, validation, because you're going to be dealing with an inordinate amount of unknown and unpredictable data depending on what you're looking at.


> Any thoughts on whether something like this would work?

If it really works 100% of the time, then probably. A lot of sites, though, use multiple markup styles for seemingly no reason. E.g. if you created an account before a certain date, your profile keeps the old HTML, even though the old pages look identical to the new pages.


I encountered this on one website that recently rewrote their pages to be mobile-responsive. Depending on the width of the browser window, the right selectors would be totally different. Talk about confusing!


Let's say you are scraping for "Lawn Chair" and I change the listing to say "A chair for your lawn" in an attempt to improve SEO. How would your system cope with that?

Suppose alternatively that I have previously shown the status as "available" or "out of stock" and now I change it for some products to "no longer available". Can your system handle those edge cases?


Some of that will depend on how much you structure the page for bots (not just scrapers) - if the availability info is exposed via json-ld, I would expect everyone to consume changes just fine. Note that changing from "lawn chair" (phrase humans actually use and search for) to "chair for your lawn" (phrase which is unlikely to ever be used by anyone) is not an SEO boost...


At https://WrapAPI.com, we have a tool that lets you get a selector by clicking on a specific element and an interface that lets you test that selector on different sample pages.

We'll make a note of this! It does seem like a cool idea to put these 2 features together and automate the updating of API endpoints


I've had the same idea; it kind of sounds like the obvious thing to do...

That said, I haven't had a need to do it yet, so I can't verify the idea itself.

Maybe it's similar to what they do with the "ML approach" they mention?


Their ML approach is most likely based on their https://github.com/scrapy/scrapely project, which uses instance-based learning (as I understand it, lightweight ML) to scrape other pages from a few examples.


This approach is known as "wrapper induction"


Ah yes, Challenge 4 (anti-bot measures).

At Blekko I developed a number of ways to deal with people who tried to scrape the web site for web results. The three most effective were blackholing (your web site vanishes as far as these folks are concerned), hang holding (basically using a crafted TCP/IP stack that completes the syn/ack sequence but then never sends data, so the client hangs forever), and data poisoning (returning a web page in the same format as the one they requested but filled with incorrect data).

We had a couple of funny triggers of the anti-bot stuff while it was running. Once a presenter on stage showed an example query and enough people in the audience typed it into their phones/tablets/laptops, all coming from the same router, that it looked like a bot. Another time an entire country was behind a single router and a school had all of their students making the same sort of query at the same time (in both cases the trigger was a rapid query rate for the exact same search query from one address).

In Blekko's case, since bots either never clicked on ads or always clicked on the same ad (in both cases we got no revenue), keeping bot traffic off the site was measurable in terms of income.


The best buffer against scrapers/spammers seems to be lag. That is, progressively slow the rate at which data is returned.

Many bypass protections by limiting request rate and using a pool of lesser known proxies/IPs.
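
For illustration only, a toy version of that lag idea on the server side, assuming Flask; the in-memory state and thresholds are invented:

```python
import time
from flask import Flask, Response, request

app = Flask(__name__)
hits = {}                                 # ip -> rolling hit count (toy in-memory state)

@app.route("/listing")
def listing():
    ip = request.remote_addr
    hits[ip] = hits.get(ip, 0) + 1
    delay = min(0.01 * hits[ip], 2.0)     # lag grows with request volume, capped at 2s/chunk

    def trickle(body=b"...page body..."):
        for i in range(0, len(body), 64):
            time.sleep(delay)             # drip-feed the response
            yield body[i:i + 64]

    return Response(trickle(), mimetype="text/html")
```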


I thought the number one anti-bot measure was a cease and desist letter :) Seriously though, some of these websites clearly don't want to be scraped; what's stopping them from sending Scrapinghub a C&D letter and forcing them to comply?


Sure, if you can make a reasonable assumption it is them scraping you. As they point out in the article they invest in proxy networks to make their requests appear to come from a bunch of addresses that don't lead back to them.

One of the things we learned at Blekko was that people who run botnets often sell 'proxy service' as a thing; we identified several made out of users of the Time Warner "Road Runner" service. That put us, as a web site, in a bind: the proxy service running on an infected computer was violating our terms of service, but the user might be completely unaware. If they were also a customer and we blackholed their IP, it would also cut off legitimate traffic. Since we didn't keep logs that could identify these relations over time (privacy issues), we had to rely on other methods. We never got enough penetration into the search market for this to become a huge concern, however, so the problem remained largely theoretical. We started a program of exponential banning, where an IP would be banned and then unbanned an hour later, and if it resumed its bad behavior it would be banned for 2 hours, then 4, etc. Once you get to 1024 hours it is pretty safe to assume they are lawfully evil, as it were.
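
For illustration, a toy version of that exponential schedule (not Blekko's actual code; the names and in-memory state are made up):

```python
import time

BAN_CAP_HOURS = 1024
ban_state = {}                            # ip -> (current_ban_hours, banned_until_epoch)

def record_bad_behaviour(ip):
    hours, _ = ban_state.get(ip, (0, 0))
    hours = min(hours * 2, BAN_CAP_HOURS) if hours else 1     # 1, 2, 4, ..., 1024
    ban_state[ip] = (hours, time.time() + hours * 3600)

def is_banned(ip):
    _, banned_until = ban_state.get(ip, (0, 0))
    return time.time() < banned_until
```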

These guys fake their user agent, mask their IP addresses, and generally work hard to defeat anti-bot measures. They know they are over the line, but the law has yet to catch up to them.


In my experience, when it comes to scraping airline websites, the airline's legal department usually doesn't wait for proof that you were the one who actually scraped them. If you have their data on your website, they send you a C&D, and if they continue to find their data on your website, they will happily sue you. In other words, it doesn't matter how you got the data; you must have broken the law if you have it.

I'm thinking of Ryanair suing Expedia, United vs. wandr.me, Southwest suing SWMonkey.com; I'm sure there are countless others.


But if you have Amazon Mechanical Turk workers 'scraping' the data, is that illegal?


It’s the reproduction of the data which they are suing over, not the method.


> Multi-threading is a must, when scraping at scale. The more concurrent requests your spiders can make the better your performance - simple.

Intuitively I would think that this sort of problem would profit from asynchronous ingestion at the edge, pushing unprocessed content to a multi-threaded/multi-process backend. (Because I'd expect that network latencies mean you need lots of threads to saturate I/O, which would conflict with effectively using the available CPU power to do the actual document processing.)


That's been exactly my experience. Most time is spent connecting or waiting for the server response (TTFB). Using an async I/O event loop in combination with epoll/kqueue, you can handle thousands of concurrent connections. You then push the responses to your worker nodes, which process the data in a multi-threaded fashion. Stream processing frameworks like Apache Spark or Storm work great for that.
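
Roughly, the split looks like this in Python, assuming aiohttp for the async side and a local process pool standing in for a Spark/Storm style backend; the names are placeholders:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp

def parse_page(html: str) -> dict:
    # CPU-bound parsing/extraction would live here (placeholder).
    return {"length": len(html)}

async def fetch(session, url, sem):
    async with sem:                                  # cap open connections
        async with session.get(url) as resp:
            return await resp.text()

async def crawl(urls):
    sem = asyncio.Semaphore(1000)                    # thousands of sockets, one event loop
    loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u, sem) for u in urls))
    # Hand the raw responses off to worker processes for the CPU-bound part.
    with ProcessPoolExecutor() as pool:
        return await asyncio.gather(
            *(loop.run_in_executor(pool, parse_page, p) for p in pages)
        )

# results = asyncio.run(crawl(["https://example.com/product/1", ...]))
```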


I would expect that you would use both async requests & multithreading. I might restrict each thread to operate on an exclusive set of domains to throttle the request rate to any given domain at a point in time


Can I piggy-back off this submission to ask HN: if you're running a scraper, has the recent wave of GDPR splash-screens caused you issues? How are you dealing with them? https://news.ycombinator.com/item?id=17471599


You can just remove them from the DOM or hide them with CSS.


Except when the backend redirects you to consent.someco.example unless you're logged in or send a consent cookie along.


What about e.g. http://discourseontheotter.tumblr.com/ ?

Edit: In the UK I see this: https://imgur.com/a/zlWOByh


In that case I would just accept everything and forward the cookie.

Of course if I was actually trying to read the link I would have to give up, because there appears to be no way to navigate through and opt out.


I don't see a GDPR challenge on this page?


That's probably because it's being inconsistently applied and you're not in Europe. If that's the case it's all the more insidious! Or you accepted the Tumblr terms in the past.

This is what I get in the UK: https://imgur.com/a/zlWOByh


:s/Europe/EU/

I'm in Switzerland (Europe) and don't get the https://imgur.com/a/zlWOByh screen.


Just by chance we experienced a scraper bot on our site this past week, and we discovered some performance problems thanks to it. It literally fried our ancient caching system, and we finally took the step towards using a CDN for static delivery and Redis for API responses. I wonder if it was these guys, because it was some solid scraping.


Badly behaved scrapers should be blocked, not accommodated.


I'd agree. As somebody scraping content, what's so bad about increasing the delay between requests to something like 10 seconds? That way the servers can handle the traffic easily and you're not being a jerk. If you have one async thread for each domain you can still get lots of data quickly. Causing a denial of service is very avoidable.


And you're going to be hitting millions of hosts anyway, so all you have to do is rotate from one host to the next and randomize your worker queues. It might take a little longer, but you will not blow up someone's aging server. Being a good citizen of the net means taking into account that even if you have gigabits of bandwidth to burn, the counterparty may not (and could easily be on the sharp end of a bandwidth-capped contract).
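
A toy sketch of that host-rotation idea, with an arbitrary per-domain delay and made-up names:

```python
import random
import time
from collections import defaultdict
from urllib.parse import urlparse

PER_DOMAIN_DELAY = 10.0                   # arbitrary: seconds between hits to the same host
last_hit = defaultdict(float)

def next_polite_url(queue):
    """Pop a URL whose domain hasn't been hit recently, or None if all are too fresh."""
    random.shuffle(queue)                 # randomize so no single host dominates the queue
    now = time.time()
    for i, url in enumerate(queue):
        domain = urlparse(url).netloc
        if now - last_hit[domain] >= PER_DOMAIN_DELAY:
            last_hit[domain] = now
            return queue.pop(i)
    return None
```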


But, as the article says, they have proxies to get around that.

The thing is that most ecommerce websites of note generate shopping feeds in easily machine-readable formats (JSON, XML) for Google Shopping, Facebook and the like. These feeds also go down to SKU level. The URL might not be advertised, but it won't be blocked or protected with a user/password or API key.

If you're buying a T-shirt, the product page might list all sizes and all colours while only showing a master 'variant' SKU (which is not a real SKU); the backend then adds the actual size/colour-specific SKU to the basket, of which there could be twenty behind that 'variant' product page.

Meanwhile the product feed will list every SKU variant, complete with latest pricing and other pertinent information, e.g. barcode, product image etc.

I am sure that most retailers would prefer to just point the scraping party to the feed rather than have them grind the site to a halt with multi-threaded crawlers hiding behind proxies. So that is how these scrapers can be 'accommodated'.

The sitemaps that go with the ecommerce game are also pretty reliable, these are high up the SEO checklist and will say when products were last updated.

Then there are rich snippets, or whatever they are called now. The trend there is to attach some JSON-LD to the page in a format GoogleBot likes. Not hard to ingest.
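
For example, pulling Product JSON-LD out of a page is only a few lines, assuming requests and BeautifulSoup (the Product-only filter is a simplification):

```python
import json

import requests
from bs4 import BeautifulSoup

def product_json_ld(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    blocks = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue                      # plenty of sites ship slightly broken JSON-LD
    # Product snippets typically carry name, sku, image and offers (price, availability).
    return [b for b in blocks if isinstance(b, dict) and b.get("@type") == "Product"]
```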

Sites that don't have their act together for Google Shopping and SEO really are not worth scraping; they will never make it into the Google top 100 search results unless they are selling something that nobody else sells, e.g. 'Tibetan Monkey Stones', where you probably don't need to compete.

To me it sounds like these scraping outfits just need to pay a bit extra for ecommerce developers to show them how the 'puzzle was made', and to stop abusing people's business websites that are not built to be scraped on a daily basis by some random third party on the other side of the globe.

The plain old telephone helps too. If a brand-owning ecommerce team gets a call from an interested party saying they would really like a list of products for their comparison/whatever site, they just might say yes, here is the URL for the feed, oh, and here is the one for locale_en_xy. But people would rather hack away at some hacky spider than just pick up the phone and ask.


I would definitely pick up the phone and ask, but you have to admit there's always the danger of being told a very firm "NO!", and some people might actually investigate who you are and take active measures to block any traffic coming from you (which is not very hard if you give them your email during the call).

So I would think most scraping services assume they will be refused when doing such requests so they never bother.


If you are anticipating a 'no' then you definitely should ask for permission.

The only valid reason for starting a large scraping operation is because you can argue that permission would be granted anyway.


Well I am not arguing that point because I am not doing unethical scraping anyway. Just trying to explain why most scrapers go to hammering the servers directly.

Additionally, in my local market the owners of e-commerce websites are extremely narrow-minded and have zero tech education, so all they will ever hear from you is "I want to steal that guy's data", which is of course not true at all. But try arguing with a 50-year-old guy with the mindset of a feudal master who never truly worked in his life but wants to control how everybody around him works.

If the survival of my business was at stake, I would just scrape one page every 3 or so seconds as a reasonable compromise. In fact I have done so for my amateur scraping experiments, although there the timeout was even steeper -- 10 seconds per page.


I did a project in the past that involved scraping non-mainstream e-commerce sites and we encountered this mindset. Durr, what? Yer want to take all mah data?? It ended up easier to just write the scrapers than to explain what we were doing to Neanderthals.


> Durr, what? Yer want to take all mah data?? It ended up easier to just write the scrapers than to explain what we were doing to Neanderthals.

As demeaning and offensive as many people would find that statement to be, I still found it to be the sad reality most of the time.

Plus, my local community is much smaller and I would not want vengeful businessmen, who understand NOTHING of what I am trying to achieve, to actively sabotage me. They can easily call my ISP and deny me service, for example.

So I opted for ethical scraping without asking questions. Seems to be the best working compromise.

Thanks for sharing your experience. Let's bathe in the confirmation bias it dips us in. :D


I forgot about the managers! They get to have jobs this way.


Realistically, both should happen. That way you don't have any losses when it happens again from another scraper.


Well, we have never blocked any scraper. They are not a big deal for big systems. The problem here was that they were consuming a resource for which we had a very crappy cache system and they fried the website.


So you provide an API, but someone was hammering your website with a scraper? That's annoying. Seems like bad behaviour.

I did find it surprising that this article has a whole section on "Challenge 4: Anti-Bot Countermeasures" (and how to bypass them) but doesn't mention giving any consideration as to whether this is a reasonable way to behave.


Haha, sorry for my crappy explanation. We have a couple of front-end apps consuming an internal API, and the scraper was hitting the web frontend.


As a side note, I have had quite a bit of experience trying to block automated scraping services, and I found that the best way is to quietly detect the scraping and then serve up tainted data.

In our case, competitors were scraping pricing data in order to competitively price their products without having to do the work.

So we would randomly start giving them incorrect prices on every few products. Not only did it make the whole data set useless, they had no way of figuring out which data was correct without manually checking, and since we didn't do it to everything and started at random intervals, it was too difficult for them to figure out when their IP had actually been quietly blacklisted.


What's ironic is that most of the sites with anti scraping protection also do scraping of their own.

E.g. Amazon and Walmart both do a lot of their own scraping.


Really going to call for a [citation needed] on that "most"!


Maybe I should rephrase to “put the most effort into anti scraping”.

Every major ecommerce site scrapes, it would be a competitive disadvantage if they didn’t.


Something like that can only work if prices are scraped from a single source. If multiple sources are used, they could just compare prices and exclude the ones that fall way off. So my guess is your site is an edge case.


What do you do if you can't detect the scraping? And if you do detect scraping, how do you ensure the data you provide them is both invalid and consistent?


I used a multiplier calculated from the date, a static secret, and a seed hashed from the SKU. So it was consistent, but the offset differed from product to product; even if you manually went in and figured out the offset for a specific product, you couldn't just offset all the scraped prices.
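
Not the parent's actual formula, but a sketch of that shape of thing, with made-up names and ranges:

```python
import datetime
import hashlib

SECRET = "replace-me"                     # static secret, per the description above

def tainted_price(sku: str, real_price: float) -> float:
    seed = f"{datetime.date.today().isoformat()}:{SECRET}:{sku}".encode()
    digest = hashlib.sha256(seed).digest()
    # Map the hash to a multiplier in roughly the 0.85x-1.15x range: stable per
    # SKU per day, but different from product to product.
    multiplier = 0.85 + (digest[0] / 255) * 0.30
    return round(real_price * multiplier, 2)
```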

But once they lose a bunch of money the first time, they tend to stop trying. We tracked down one competitor that was mirroring our prices on an hourly basis. So we waited until late at night, tanked our prices on a few expensive items, then placed orders on the competitor's site.

The human touch tends to scare off scrapers faster than a technological fence anyway.


I hope you too got a chuckle from reading their anti-bot countermeasures section, only to see their own form protected by Google's "I'm not a robot" CAPTCHA plugin.


I've always wondered if it makes more sense to render the page as a JPEG and run some kind of machine learning to identify and read off the relevant details.


I'd go the other way and say that pages that no longer contain relevant information in a normally digestible format should be dropped from search engines and other automated indices.

After all, the web was built on accessibility of information, not on purposeful obfuscation.

If you go so far as to essentially flatten the webpage to the point where you might as well print it out and then do OCR on it, then you've thrown out the baby with the bathwater: you had all that information when you started. Or at least, you should have had it.

Otherwise we might as well kiss HTML goodbye and render the web as pdfs, with or without links.


The biggest search engine doesn't have your best interests at heart and has been trying to make HTML and the accessibility of information obsolete for years. Some pages now render only with JavaScript, or require solving a JavaScript challenge to even get to the rendering (hello Cloudflare), and have essentially kissed HTML goodbye.


Fair warning: I work at Diffbot.

Essentially that's what Diffbot (https://www.diffbot.com/) does, except we don't render the pages as an image or do OCR.

Diffbot renders the page in a headless browser, and uses computer vision to automatically identify the key page attributes and extract normalized data for specific page types (Articles, Products, Discussions, Profiles, Images, and Videos).

This approach enables us to work in any language and on sites that we've never come across before automatically with better than human level accuracy.


I want machine learning, but definitely not to identify information off of a gif. Websites are too dynamic for that to really be feasible. Say I want to scrape plane ticket prices. Getting the info off the page once you're there is the easy part. The hard part is figuring out how to navigate the website. I don't want to do the annoying manual work of programming page traversal. I want to be able to make a ML bot that does it for me. That way no matter how often they change the interface, all I have to do is re-run the ML.


Maybe someone can chime in, but I'm pretty sure Diffbot does something similar.


Hey Tegan,

Answered on the parent, but it's somewhat similar.


OCR tech is good but it's not there yet.


Dumping rendered text, as with lynx or w3m, is another option.

There's a lot of very bad HTML out there.


I always assumed web scraping wasn't particularly challenging because of how many libraries exist for the purpose.

This article made me realize I assumed wrong.


It gets especially difficult with dynamic content, or when trying to scrape sites built on very heavy frameworks like ASP.NET WebForms that require passing the view state with every request. I made a calendar aggregator for adult hockey times in my area that scrapes rink websites[0], and it was far more difficult than I had thought it would be because the rinks all used Telerik WebForms controls for their calendars. It turned a 30-minute job into a 2-hour job.

[0]: http://dpscschedule.azurewebsites.net/
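
For anyone curious, the WebForms dance is roughly: fetch the page, pull out the hidden __VIEWSTATE/__EVENTVALIDATION fields, and echo them back in the postback. A rough sketch assuming requests and BeautifulSoup (the control names are made up):

```python
import requests
from bs4 import BeautifulSoup

def webforms_postback(url, event_target, extra_fields=None):
    session = requests.Session()
    soup = BeautifulSoup(session.get(url, timeout=30).text, "html.parser")

    def hidden(name):
        tag = soup.find("input", {"name": name})
        return tag["value"] if tag and tag.has_attr("value") else ""

    data = {
        "__VIEWSTATE": hidden("__VIEWSTATE"),
        "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
        "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
        "__EVENTTARGET": event_target,    # e.g. the calendar control being paged (made up)
        "__EVENTARGUMENT": "",
    }
    data.update(extra_fields or {})
    return session.post(url, data=data, timeout=30).text
```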


Wouldn't such scraping fall under the TOS of the scraped sites?


A bot cannot consent to or understand a TOS...


What if the site requires you to create an account and log in? Can the bot create its own account and still claim not to have consented?


Is the bot aware of what it is doing? I don't think so.


I had to stop reading the otherwise interesting-sounding piece when, around the second paragraph, a "Get the Enterprise Web Scraping Guide" box popped up in the bottom-left corner. Maybe I'll give it a second chance later.


It was a pretty thin self promoting post. You didn’t miss much.


> However, our recommendation is to go with a proxy provider who can provide a single endpoint for proxy configuration and hide all the complexities of managing your proxies.

Can you provide an example of such a service?

Thanks!


I run https://www.scraperapi.com which does this!


Crawlera (https://scrapinghub.com/crawlera) is the one Scrapinghub developed


> A large proportion of these bot countermeasures use javascript to determine if the request is coming from a crawler or a human (Javascript engine checks, font enumeration, WebGL and Canvas, etc.).

How effective are scraping countermeasures anyway?


They work pretty well against any scraper that's not using an actual browser with a JavaScript engine. It keeps the riff-raff out.

A dedicated person will eventually work their way around all available countermeasures, though.


"Multi-threading is a must, when scraping at scale."

I disagree on this point. Starting with a single-threaded model allowed my team to scale quickly and with little additional overhead. What we lost in performance we gained in simplicity and developer productivity. That being said, tuning and porting portions of the app to a multi-threaded design is slotted to take place within the next year.

Start with single threaded and simple, move to multi-threaded scrapers when the juice is worth the squeeze.


Or use a language where fully utilizing all CPU cores is transparent, like Elixir? There's zero complexity, you basically add 4-5 lines of code and that's it. Honestly, not exaggerating.

I've done several very amateur scrapers in the last several years, I am never going back to languages with a global interpreter lock, ever.


I'm assuming you're talking about Python, which is also "4-5 lines" to use multithreading or multiprocessing. Can you explain what's wrong with GIL languages?

Now that I think about it, it's even less than 4 lines:

from multiprocessing.pool import Pool  # or ThreadPool

pool = Pool()

pool.map(scrape, urls)


When the pooled functions are I/O-bound, the GIL is not a problem; any GIL language will do.

However, when for example generating reports, try using the same instrument to serialize 4 batches of DB records into 4 pieces of a big CSV file, each working on its own CPU core. That's where languages without a GIL truly shine, while languages like Python and Ruby struggle unless their GIL implementations compromise and yield without waiting for an I/O operation to complete.


I'm not sure you understand how the GIL works in Python. If you're using multiprocessing, there's no locking across the code executing on each core. Also, if you're writing to the same file from four processes, you're going to need locking.


What I last knew is that GIL languages work well in multicore scenarios as long as all N tasks have I/O calls that serve as yielding points for the interpreter, since they do not use preemptive scheduling the way the BEAM VM (Erlang, Elixir, LFE, Alpaca) does.

Am I mistaken?


As far as Python goes, yes. Multicore implies multiple processes, which means each process will have its own Python interpreter, each with its own GIL.

If you were to use multithreading instead, you would generally have a problem if you were doing non-I/O work.


Then I think we have a misunderstanding of terms. To me "multicore" == "single process, many threads". Apologies for the confusion.

It seems we are both on the same page now. A single process with many threads is problematic for GIL languages, and that's why I gave up using Ruby for scrapers. GIL languages can work very well for the URL-downloading part though.


Any further information on this? Last I looked (which was a while ago), the infrastructure like HTML parsers seemed surprisingly tricky in Elixir.


The only complication is if you want to use Meeseeks (https://github.com/mischov/meeseeks), which requires the Rust compiler and runtime to be installed because it has native bindings. Meeseeks is useful because it's a bit faster than the default Floki (https://github.com/philss/floki) and because it can handle very malformed HTML.

As for Elixir itself, here's a quick example:

```
# Assume this contains 1000 URLs
urls = [....]

# This will run up to 100 concurrent tasks; if max_concurrency is omitted,
# it defaults to the number of CPU cores. For I/O-bound tasks it's pretty
# safe to go much higher.
results =
  urls
  |> Task.async_stream(&YourScrapingModule.your_scraping_function/1, max_concurrency: 100)
  |> Enum.to_list()
```

It's honestly that simple in Elixir. For finer-grained control the line count gets a little bigger -- but only a little. Not hundreds of lines, for sure.


Meeseeks's speed difference with Floki is not that significant, and my initial findings are that they've leveled out even more with OTP 21, sometimes even swinging in favor of Floki.

The better handling of malformed HTML by default is the much bigger deal.


Thank you, man (I know you are the author of Meeseeks), I didn't know that. I always understood that Meeseeks was faster than Floki, but it seems OTP 21 largely eliminated that difference, as you said.

Valuable info, thanks!


It was pretty interesting to see Floki get a lot faster and Meeseeks actually get a little slower with OTP 21. I'll enjoy figuring out why. I hope to get a chance to work on the OTP 21 performance of Meeseeks before too long.

On the plus side there were some nice memory improvements for Meeseeks in OTP 21.


(off-topic alert)

Don't take this as patronizing, because it's not -- but have you looked at how many times the boundary between the BEAM and the Rust code is crossed? I haven't inspected Meeseeks' code, so I can't say; I'm just wildly guessing.

My ancient experience with Java <-> C++ bridges has taught me that if your higher-level language calls the lower-level language very often then the gains of using the lower-level language almost disappear due to the high overhead of constantly serializing data back and forth.

Anyhow, we should probably take this discussion to ElixirForum and not here. :)

(I am @dimitarvp there and almost everywhere else on the net; HN is one of the very few places where my username is inconsistent.)


I tried several different queue systems; the best version I got uses an Erlang queue, Elixir, and Kafka on top for a highly concurrent crawler. The project was to develop a realtime Amazon product ASIN price monitoring system for our company as a challenger prototype. Our main problem was basically proxies: we stopped buying them, as managing thousands of proxies is a huge effort we did not want to take on, and the lack of data meant our Hadoop clusters got thirsty and the machines stopped learning properly. Currently we are using a third party, https://proxycrawl.com, on very high tiers (> 10B) with a great discount, and we are happy to have that part solved. Another lesson learned: sometimes things fail and logs help a lot, so you will need a highly available logging and monitoring system.


Back when I worked for a very large tech company, building their web crawler, I had good success with Golang. On four servers, with 10 GigE interconnect and SSD, and a very fast pipe to the Internet, I was able to push about 10K pages / second sustained. At any given time, there were probably several million connections open concurrently.

I've played with Elixir as well, and it's also great for this type of thing.

proxycrawl.com looks very cool, I'm actually looking for a proxy service for my current scraping project. Are they also a good choice if you're doing lower tiers (like thousands of requests a day)?


Golang is a good choice too, but in my experience it's nothing compared to what you can do with an Erlang queue and Elixir. Regarding your question about ProxyCrawl, I honestly don't know; I tested the service for a few days at a few million requests per day and it was great too. I would say they are good for very high volume. We are still using it, so that should be a good signal to try them.



