By default jsdom doesn't run JavaScript files unless they're specified within th...

MatthewPhillips · on March 8, 2012

But will the page's scripts respond to events? For example, on Twitter, after the page loads it looks for a tweet in the hashbang and, if there is one, does an XMLHttpRequest and then updates the DOM. Will jsdom be able to scrape sites like that?

baudehlo · on March 8, 2012

"curl" can scrape sites like that. Just watch the Chrome "Network" window, see what requests get made, and then make those same requests yourself. Scraping doesn't need a browser.
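
A minimal sketch of that approach in Node.js, for illustration only. It assumes the page's script turns the "#!" fragment into a request path; `apiBase` stands in for whatever endpoint shows up in the Network window and is hypothetical, as are the function names. `fetchImpl` is injectable so the code can be tried without touching the network.

```javascript
// Extract the path carried in a "#!" fragment, e.g.
// "https://twitter.com/#!/user/status/123" -> "/user/status/123".
// Returns null when the URL has no hashbang fragment.
function hashBangPath(pageUrl) {
  const hash = new URL(pageUrl).hash; // includes the leading "#"
  return hash.startsWith('#!') ? hash.slice(2) : null;
}

// Replay the request the page's script would have made.
// `apiBase` is a hypothetical endpoint observed in the Network window;
// `fetchImpl` defaults to the global fetch (available in recent Node.js).
async function scrape(pageUrl, apiBase, fetchImpl = fetch) {
  const path = hashBangPath(pageUrl);
  if (path === null) return null; // nothing to load for this URL
  const res = await fetchImpl(apiBase + path);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}
```

The point of the sketch is that only the one data request is replayed; no DOM, script engine, or rendering is involved.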

MatthewPhillips · on March 8, 2012

Of course curl can. And curl was the best option 15, even 10 years ago. But to scrape a complicated site with curl (or any other non-browser scraping utility) you wind up basically rewriting the application you're attempting to scrape.

A browser's sole purpose is to run web apps. It doesn't make sense to use anything else.

baudehlo · on March 8, 2012

At the low end of the scale I agree. But once you start scaling, it's very difficult to run more than around 40 concurrent sessions with a browser-based scraper.

With a pure Node.js scraper we can run over 1000 parallel sessions per CPU.
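
One way to keep a fixed number of sessions in flight in a single Node.js process is a small promise pool. This is a generic sketch, not the scraper described above; `runPool` and its parameters are hypothetical names, and each "task" is assumed to be a function that returns a promise for one scraping session.

```javascript
// Run `tasks` (functions returning promises) with at most `limit` in flight.
// Results are returned in the same order as `tasks`.
async function runPool(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task to start

  // Each worker pulls the next unstarted task until none remain.
  // Node.js runs this on one thread, so `next++` needs no locking.
  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers); // rejects if any task rejects
  return results;
}
```

Calling `runPool(tasks, 1000)` would hold up to 1000 sessions open at once; whether a given machine sustains that depends on the target site and on how much work each response needs.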