
How does that help you cope when a site changes? What if you're fetching some value in a given <div> under a long XPath and they decide to change that path?



You don't use XPath or CSS selectors at all (unless you have no choice). You rely on more generic stuff, e.g., "the button that has 'Sign in' on it":

    await page.getByRole('button', { name: 'Sign in' }).click();

See Playwright locators: https://playwright.dev/docs/locators


I started putting data-testid attributes in my web app for automated testing using Playwright. It prevents me from breaking my own scripts, but it sure would make me more scrapable if anyone cared. Well... I guess I only do it on inputs, not the rendered page, which is what scrapers care most about.


Unless you start a war against scrapers, you don't need to worry about that, as I'll always find a way to scrape your site as long as it's valuable to 'me'. Even if it requires a real browser + OCR :)


Oh, I know I couldn't prevent it. But if you wanted to scrape me, you'd have to pay the monthly subscription because everything is behind a paywall/login. And then you'd only have access to data you entered yourself, because it's just that kind of app :-)


This is where you just train an LLM so you can write:

'get button named "sign in" and click'

Then on the back end, it generates your example code.


Adept is doing it.


Don't know about the poster, but I try to find divs and buttons in a fuzzy way. Usually via element text. Sometimes it mitigates changes, sometimes it doesn't. It's a guessing game. Especially when they start using shadow elements or iframes in the page. If I'm looking for something specific like a price or dimensions, I can sometimes get away with it by collecting dollar amounts or X x Y x Z from the raw text.
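To make the last point concrete, here is a rough sketch of pulling dollar amounts and "X x Y x Z" dimensions out of raw page text. The function names and the exact patterns are my own assumptions for illustration, not anything from a library, and real pages will need their own tweaks (other currencies, units, thousands separators, etc.):

```typescript
// Hypothetical helpers: names and regex patterns are illustrative assumptions.

// Collect dollar amounts such as "$5", "$12.50", or "$1,299.99".
function extractPrices(text: string): number[] {
  const re = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)(?:\.(\d{2}))?/g;
  const prices: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const whole = m[1].replace(/,/g, ""); // drop thousands separators
    prices.push(Number(m[2] ? `${whole}.${m[2]}` : whole));
  }
  return prices;
}

// Collect triples written as "X x Y x Z" (also accepts "X" or "×" as the separator).
function extractDimensions(text: string): number[][] {
  const re = /(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)/gi;
  const dims: number[][] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    dims.push([Number(m[1]), Number(m[2]), Number(m[3])]);
  }
  return dims;
}

// Example: raw text as you might get it from the page body.
const rawText = "Widget A, now $1,299.99 (was $1,500). Size: 10 x 20.5 x 3 in";
console.log(extractPrices(rawText));
console.log(extractDimensions(rawText));
```

This grabs every dollar amount on the page, so you still have to guess which one is the price you want. That's the "sometimes" part.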


iframes have been a pain in the butt to scrape. I see them more and more on websites now.



