
By default, jsdom doesn't run JavaScript files unless they're specified within your program. However, if you call 'jsdom.jsdom', it will run both the scripts specified in the web page and those you specify in your program.



But will the page's scripts respond to events? For example, on Twitter, after the page loads it looks for a tweet in the hash bang and, if there is one, does an XMLHttpRequest and then updates the DOM. Will jsdom be able to scrape sites like that?


"curl" can scrape sites like that. Just watch the Chrome "Network" window and see what requests get made. Then replay those requests yourself. Scraping doesn't need a browser.
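
A rough sketch of that idea for the hash-bang case, in plain Node.js: pull the tweet id out of the fragment and build the URL the page's own XMLHttpRequest would have hit. The endpoint in `API_BASE` and the `#!/user/status/id` layout are assumptions for illustration only; use whatever the Network window actually shows for the site in question.

```javascript
// Hypothetical endpoint: substitute the request URL seen in the Network tab.
const API_BASE = "https://api.example.com/statuses/";

// Extract the tweet id from a hash-bang URL such as
// https://example.com/#!/someuser/status/12345 (assumed layout).
function tweetIdFromHashBang(pageUrl) {
  const hash = new URL(pageUrl).hash; // e.g. "#!/someuser/status/12345"
  const match = /^#!\/[^/]+\/status\/(\d+)$/.exec(hash);
  return match ? match[1] : null;
}

// Build the URL of the request the page would otherwise make itself.
// Returns null when the page URL carries no tweet in its fragment.
function apiUrlFor(pageUrl) {
  const id = tweetIdFromHashBang(pageUrl);
  return id === null ? null : API_BASE + id + ".json";
}
```

The resulting URL can then be handed to curl or any HTTP client, with no DOM involved.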


Of course curl can. And curl was the best option 15, even 10 years ago. But to scrape a complicated site with curl (or any other non-browser scraping utility), you wind up basically rewriting the application you're attempting to scrape.

A browser's sole purpose is to run web apps. It doesn't make sense to use anything else.


At the low end of the scale, I agree. But once you start scaling, it's very difficult to run more than around 40 concurrent sessions with a browser-based scraper.

With a pure Node.js scraper we can run over 1000 parallel sessions per CPU.
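
To give a feel for why that is cheap, here is a minimal sketch (not our actual scraper) of running many sessions on one event loop with a cap on how many are in flight. `runWithLimit` and `fakeSession` are hypothetical names made up for this example; a real session would issue HTTP requests where the stand-in just sleeps.

```javascript
// Run async tasks with at most `limit` in flight at once.
// `tasks` is an array of functions that each return a promise.
// Results are returned in the same order as the tasks.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps claiming the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workers = [];
  for (let i = 0; i < Math.min(limit, tasks.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Hypothetical stand-in for one scraping session: resolves after a short
// timer instead of doing real network I/O.
function fakeSession(id) {
  return () =>
    new Promise((resolve) => setTimeout(() => resolve("session " + id), 1));
}
```

Usage would look something like `runWithLimit(ids.map(fakeSession), 1000)`. Each waiting session is just a pending promise, which costs far less than a full browser instance.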



