
In the world of SPAs (single-page applications), a headless browser API is super helpful. Playwright[1] and Puppeteer[2] are both very good choices.

[1] https://github.com/microsoft/playwright

[2] https://github.com/puppeteer/puppeteer




Highly recommend playwright (if I'm not mistaken, most of the main developers from puppeteer were hired by MS to work on playwright). I run into significantly fewer async/await problems with playwright than I did with puppeteer, and the codegen tool is super helpful as a first-pass option.


Playwright supports several browser engines (Chromium, Firefox, and WebKit), whereas puppeteer mainly targets Chrome/Chromium.
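
To make the cross-browser point concrete, here is a rough sketch that runs the same check against several engines. The `PageLike`, `BrowserLike`, and `Launcher` interfaces and the `titlesAcrossBrowsers` helper are hypothetical names, trimmed down to the handful of Playwright calls used here (`name`, `launch`, `newPage`, `goto`, `title`, `close`); the URL in the usage comment is a placeholder.

```typescript
// Minimal structural types covering only the calls used below (hypothetical names).
// Playwright's `chromium`, `firefox`, and `webkit` exports should fit these shapes.
interface PageLike {
  goto(url: string): Promise<unknown>;
  title(): Promise<string>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

interface Launcher {
  name(): string;
  launch(): Promise<BrowserLike>;
}

// Open the same URL in each browser engine and collect the page title per engine.
async function titlesAcrossBrowsers(
  launchers: Launcher[],
  url: string
): Promise<Record<string, string>> {
  const titles: Record<string, string> = {};
  for (const launcher of launchers) {
    const browser = await launcher.launch();
    try {
      const page = await browser.newPage();
      await page.goto(url);
      titles[launcher.name()] = await page.title();
    } finally {
      // Always release the browser, even if navigation fails.
      await browser.close();
    }
  }
  return titles;
}

// Usage, assuming Playwright is installed:
//   import { chromium, firefox, webkit } from "playwright";
//   const titles = await titlesAcrossBrowsers([chromium, firefox, webkit], "https://example.com");
```

Passing the launchers in as arguments keeps the loop independent of which engines are installed, so dropping one browser is a one-line change at the call site.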


Another benefit is being able to open the Network panel, snoop on requests, and find the exact API call you need for your task, instead of having to pull in all of the HTML/JS/CSS crap. Since a lot of SPAs have essentially pushed everything behind JSON APIs, the information you want is usually one (authenticated) API call away.
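
A minimal sketch of what that looks like once you have spotted the call in the Network panel. Everything site-specific here is an assumption: the `Product` shape, the `buildRequest` and `fetchProduct` helpers, the bearer-token auth scheme, and the URL in the usage comment are all hypothetical. It only relies on the standard `fetch` API (available in Node 18+).

```typescript
// Hypothetical shape of the JSON payload; adjust to whatever the real endpoint returns.
interface Product {
  id: number;
  name: string;
  price: number;
}

// Build the request options separately so they can be inspected without touching the network.
// Assumes bearer-token auth; a real site may use cookies or another header instead.
function buildRequest(token: string): { method: string; headers: Record<string, string> } {
  return {
    method: "GET",
    headers: {
      Accept: "application/json",
      Authorization: `Bearer ${token}`,
    },
  };
}

// Call the JSON API directly instead of rendering the page in a browser.
async function fetchProduct(url: string, token: string): Promise<Product> {
  const response = await fetch(url, buildRequest(token));
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return (await response.json()) as Product;
}

// Usage (placeholder URL and token):
//   const product = await fetchProduct("https://example.com/api/products/42", token);
```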


Most content-heavy websites that tend to get scraped use server-side rendering for this exact reason, and put many obstacles in the way to make sure the data doesn't get scraped easily. See: product price, stock, delivery information.


If you're interested in running puppeteer in containers, take a look at chrome-aws-lambda[1] and the browserless Docker container[2].

Not affiliated with browserless, but they do have a free/paid cloud service. https://www.browserless.io

[1] https://github.com/alixaxel/chrome-aws-lambda

[2] https://github.com/browserless/chrome


https://chrome.browserless.io/ is perhaps the best technical demo I've ever seen, and shows off Browserless's capabilities amazingly. An incredibly high-quality service and codebase.



