How does that help you mitigate when a site changes? If you’re fetching some val...

sunshadow · on Nov 11, 2023

You don't use XPath or CSS selectors at all (unless you have no choice). You rely on more generic stuff, e.g., "the button that has 'Sign in' on it":

    await page.getByRole('button', { name: 'Sign in' }).click();

See Playwright locators: https://playwright.dev/docs/locators

8n4vidtmkvmk · on Nov 11, 2023

I started putting data-testid attributes in my web app for automated testing using Playwright. It prevents me from breaking my own script, but it sure would make me more scrapable if anyone cared. Well... I guess I only do it on inputs, not on the rendered page, which is what scrapers care most about.

sunshadow · on Nov 11, 2023

Unless you start a war against scrapers, you don't need to worry about that, as I'll always find a way to scrape your site as long as it's valuable to 'me'. Even if it requires a real browser + OCR :)

erhaetherth · on Nov 11, 2023

Oh, I know I couldn't prevent it. But if you wanted to scrape me, you'd have to pay the monthly subscription because everything is behind a paywall/login. And then you'd only have access to data you entered, because it's just that kind of app :-)

latchkey · on Nov 11, 2023

This is where you just train an LLM so you can write:

'get button named "sign in" and click'

Then on the back end, it generates your example code.

bluecrab · on Nov 12, 2023

Adept is doing it.

nurettin · on Nov 11, 2023

Don't know about the poster, but I try to find divs and buttons in a fuzzy way, usually via element text. Sometimes it mitigates changes, sometimes it doesn't. It's a guessing game, especially when they start using shadow elements or iframes in the page. If I'm looking for something specific like a price or dimensions, I can sometimes get away with collecting dollar amounts or X x Y x Z patterns from the raw text.
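
A rough sketch of that last idea, pulling dollar amounts and "X x Y x Z" dimensions out of raw page text with regular expressions. The function names `extractDollarAmounts` and `extractDimensions` and the sample string are made up for illustration, and the patterns assume US-style prices (`$1,299.99`) and three-part dimensions; real pages will need their own tweaks.

```typescript
// Hypothetical helper: collect dollar amounts such as "$5", "$1,299.99" or "$ 40.5".
// Assumes US-style formatting (comma thousands separator, dot decimal point).
function extractDollarAmounts(text: string): number[] {
  const pattern = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?/g;
  const amounts: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    const whole = (m[1] || "").replace(/,/g, ""); // drop thousands separators
    const fraction = m[2] || "";                  // ".99" or empty
    amounts.push(Number(whole + fraction));
  }
  return amounts;
}

// Hypothetical helper: collect "X x Y x Z" triples, accepting "x", "X" or "×".
function extractDimensions(text: string): Array<[number, number, number]> {
  const num = "(\\d+(?:\\.\\d+)?)";
  const sep = "\\s*[x×]\\s*";
  const pattern = new RegExp(num + sep + num + sep + num, "gi");
  const dims: Array<[number, number, number]> = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    dims.push([Number(m[1]), Number(m[2]), Number(m[3])]);
  }
  return dims;
}

// Made-up sample text standing in for a page's raw text.
const sampleText = "Now only $1,299.99 (was $1,499). Dimensions: 10 x 20.5 x 3 in.";
const samplePrices = extractDollarAmounts(sampleText);
const sampleDims = extractDimensions(sampleText);
console.log(samplePrices, sampleDims);
```

With Playwright, the raw text could come from something like `await page.innerText('body')`. This trades precision for resilience: it survives markup changes, but it will also pick up every other price or dimension on the page, so some filtering is usually still needed.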

aynyc · on Nov 12, 2023

iframes have been a pain in the butt to scrape against. I see them more and more in websites now.