Web Scraping via JavaScript Runtime Heap Snapshots (2022)

ricardo81 · on Aug 8, 2023

>In recent years, the web has gotten very hostile to the lowly web scraper. It's a result of the natural progression of web technologies away from statically rendered pages to dynamic apps built with frameworks like React and CSS-in-JS.

Dunno, a lot of the time it actually makes scraping easier because the content that's not in the original source tends to be served up as structured data via XHR- JSON usually- you just need to take a look at the data you're interested in and if it's not in 'view-source', it's coming from somewhere else.

Browser based scraping makes sense when that data is heavily mangled or obfuscated, laden with captchas and other anti-scraping methods. Or if you're interested in if text is hidden, what position it's on the page etc.

Raed667 · on Aug 8, 2023

Agreed! multiple times I wasted hours figuring out what selectors to use, but then remembered that I can just look at the network tab and have perfectly structured JSON data.

btown · on Aug 8, 2023

For those curious about how this can work in production, Puppeteer's setRequestInterception and page.on('response') are incredibly powerful. Platforms like Browserless can make this easy to orchestrate as well. Also, many full-stack JS frameworks will preload JSON payloads into the initial HTML for hydration. There are tons of possibilities beyond DOM scraping.

That said, it's surprising how many high-traffic sites still use "send an HTML snippet over an AJAX endpoint" - or worse yet, ASP.NET forms with stateful servers where you have to dance with __VIEWSTATE across multiple network hops. Part of the art of scraping is knowing when it's worthwhile to go down these rabbit holes, and when it's not!

ricardo81 · on Aug 8, 2023

The point was, you don't have to wait for JS to re-arrange the dom, sometimes it's a simple request to example.com/api/endpoint?productid=123 and you have all the data you need. No worries about HTML markup.

kbenson · on Aug 8, 2023

I think btown's point was sometimes what you're served is not just "the data you need" from that request, but a portion of the page that will be inserted, rather than built from raw data and inserted, so you need to parse the HTML in the response given since it's an HTML snippet.

It's still generally easier, because you don't have to worry about zeroing in on the right section of the page before you start pulling the data out of the HTML, but not quite as easy as getting a JSON structure.

ricardo81 · on Aug 8, 2023

I'd disagree on the easier. Instead of loading 100x assets per thing you want to scrape, 2x is way better for both parties.

But perhaps if there's xx endpoints for some inefficient reason, the browser DOM would be better.

kbenson · on Aug 8, 2023

I think you're still misunderstanding. Sometimes, sites haven't adopted a pure data-driven model, and when example.com/api/endpoint?productid=123 is requested it doesn't return JSON for the product with id 123, but instead returns a div or table row of HTML which has the data for that product already in it, which is then inserted directly where it's meant to be in the current page, rather then built into HTML from JSON and then inserted.

What I was saying is that method is not quite as easy as pure JSON to get data from, but still easier to parse and find the specific data for the specific item you're looking for, as it's a very small amount of markup all related to the entry in question.

My interpretation of btown's comment is along the same lines, that it's surprising how many sites still serve HTML snippets for dynamic pages.

btown · on Aug 8, 2023

So I have seen that indeed!

But also, some more modern sites with JSON API endpoints will have extremely bespoke session/auth/state management systems that make it difficult to create a request payload that will work without calculations done deep in the bowels of their client-side JS code. It can be much easier, if slower and more costly, to mimic a browser and listen to the equivalent of the Network tab, than to find out how to create valid payloads directly for the API endpoints.

josephg · on Aug 8, 2023

If you can see the request in the network tab, you can just right click - copy as curl and then replay the request from the command line and noodle with the request parameters that way. Works great!

ricardo81 · on Aug 8, 2023

Honestly, from prior experience any scraping requirements that require browser implementation tend to be due to captchas and anti-scraping measures, nothing to do with the data layout.

It's either in the DOM or in one or two other payloads.

Isn't this sort-of-why people hide themselves behind Cloudflare, to remove the lowest common denominators of scraping.

kbenson · on Aug 8, 2023

Yes. Sometimes you can see that there's a static (per session) header they add to each request, and all you have to do is find and record that header value (such as shimming addRequestHeader) and append it to your own requests from that context...

ricardo81 · on Aug 8, 2023

I get you, and agree, it's not as easy to scrape. And makes no sense to do it that way, for them, a scraper, a search engine, A.N other, or the user.

The old SSI includes of Apache would probably just be as efficient.

Reading the OP and comments it seems like a generational difference with a younger gen not appreciating server side generation in the same way.

commandlinefan · on Aug 8, 2023

> look at the network tab

The challenge there is automating it, though - usually the rest endpoints require some complex combination of temporary auth token headers that are (intentionally) difficult to generate outside the context of the app itself and expire pretty quickly.

Raed667 · on Aug 8, 2023

You can use the application context, while also automatically intercepting requests. Best of both worlds.

puppeteer: https://pptr.dev/api/puppeteer.page.setrequestinterception

playwright: https://playwright.dev/docs/network#network-events

halJordan · on Aug 11, 2023

And the article is about using puppeteer ...

ricardo81 · on Aug 8, 2023

In browsers 'copy as curl' is decent enough. Do the request through a command line.

If there's ephemeral cookies, they tend to follow a predictable pattern.

z3t4 · on Aug 8, 2023

Static files are much easier to scrape. Its even easier to scrape a static page then it is to use api's

1vuio0pswjnm7 · on Aug 9, 2023

Care to provide some examples. The majority of sites submitted to HN do not even require cookies let alone tokens in special headers. A site like Twitter is an exception not the general rule.

geysersam · on Aug 9, 2023

As a scraping target, Twitter is closer to the rule than the exception.

1vuio0pswjnm7 · on Aug 10, 2023

Not sure about "scraping targets". I'm referring to websites that can be read without using Javascript. Few websites submitted to HN try to discourage users with JS disabled from reading them by using tokens in special headers. Twitter is an exception. Twitter's efforts to annoy users into enabling Javascript are ineffective anyway.

https://github-wiki-see.page/m/zedeus/nitter/wiki/Instances

paulddraper · on Aug 8, 2023

That's true....but that was already true.

Whatever method you were using before SPAs to authenticate your scraper (HTTP requests, browser automation), you can use that same method now.

spaniard89277 · on Aug 8, 2023

But what if their backend blocks you? I'm trying to develop a instagram scraper and I find that I'll have to spend money on rotating proxies.

It doesn't matter if you scrape the DOM or get some Json.

I just need to scrape some public account posts and, I may be dumb, but I dunno how to do that with the official APIs (developers.facebook is hard to understand for me).

blister · on Aug 8, 2023

Hah, I literally just fought this for the past month. We run a large esports league that relies on player ranked data. They have the data, and as mentioned above, they send it down to the browser in beautiful JSON objects.

But they're sitting behind Cloudflare and aggressively blocking attempts to fetch data programmatically, which is a huge problem for us with 6000+ players worth of data to fetch multiple times every 3 months.

So... I built a Chrome Extension to grab the data at a speed that is usually under their detection rate. Basically created a distributed scraper and passed it out to as many people in the league as I could.

For big jobs when we want to do giant batches, it was a simple matter of doing the pulls and when we start getting 429 errors (rate limit blocking code they use), switch to a new IP on the VPN.

The only way they can block us now is if they stop having a website.

Give one of the commercial VPN providers a try. They're usually pretty cheap and have tons of IPs all over the place. Adding a "VPN Disconnect / Reconnect" step to the process only added about 10 seconds per request every so often.

kbenson · on Aug 8, 2023

It probably doesn't save you much, since you already built the chrome extension, but having done both I found that tampermonkey is often much easier to deal with in most cases and also much quicker to develop for (you can literally edit the script in the tampermonkey extension settings page and reload the page you want it to apply to for immediate testing).

sublinear · on Aug 8, 2023

I might be wrong, but some sites can block 'self' origin scripts by leaving it out of the Content Security Policy and only allowing scripts they control served by a CDN or specified subdomain to run on their page. Not sure when I last tried this and on what browser(s).

You'd have to disable CSP manually in your browser config to make it work, but that leaves you with an insecure browser and a lot of friction for casual users. Not sure if you can tie about:config options to a user profile for this use case. Distributing a working extension/script is getting harder all the time.

kbenson · on Aug 8, 2023

I don't recall if I've encountered that specific problem in tampermonkey (or if I did and it didn't cause a problem worth remembering), but you can run things in the extension's context as well to bypass certain restrictions, as well as use special extension provided functions (GM_* from the greasemoney standard) that allow for additional actions.

I do recall intercepting requests when I used a chrome extension to change CSP values though and not needing to when doing something similar later in tampermonkey, but it may not have been quite the same issue as you're describing, so I can't definitively say whether I had a problem with it or not.

throwawayadvsec · on Aug 8, 2023

a VPN won't do anything to help you with instagram

the best of the best are 4G rotating proxies

the fingerprint needs to change also

Raed667 · on Aug 8, 2023

Instagram is notoriously hard to scrape. I was talking more about random SPAs that you can find.

For Instagram, here is a link that maybe helps: https://apify.com/apify/instagram-profile-scraper

kbenson · on Aug 8, 2023

You can buy proxies to use, of varying quality, but they are somewhat expensive depending on what you need.

I'll just say that firefox still runs tampermonkey, and that includes firefox mobile, so depending on how often you need a different IP and how much data you're getting, you might be able to do away with the whole idea of proxies and just have a few mobile phones that can be configured as workers that take requests through a tampermonkey script. Or that a laptop tethers to that does the same, or that runs puppeteer itself. It depends on whether a worker needs a new IP every few minutes, hours or days as to whether a real mobile phone works (as some manual interaction is often required to actively change the IP).

throwawayadvsec · on Aug 8, 2023

Instagram data is really valuable and they don't like sharing it, you're not dumb!

rotating proxies are the way to go with insta, you can't do much about IP blocking besides using the right IPs

although in theory if you had an account(s) you could still scrape data from a datacenter IP(aws), even though the limits were lower than a 4G proxy

you can buy/create insta accounts for less than a $ using throwaway phone numbers

ricardo81 · on Aug 8, 2023

There are plenty sites that do that.

And if you're determined to scrape [a website], sometimes it needs proxies, rotating user agents and some rate limiting.

Using a full blown browser sometimes helps prevent you hitting those rate limits, but they're still there.

paulddraper · on Aug 8, 2023

> I dunno how to do that with the official APIs

Instagram deliberately nuked their own APIs.

IDK exactly why, I think it had to do with the 2016 U.S. election.

1024core · on Aug 8, 2023

These days people are hiding network activity too. For example, I couldn't find how to download the data powering the salary table here: https://sfstandard.com/2023/08/02/see-what-san-franciscos-to...

buzer · on Aug 8, 2023

It's in iframe. In Firefox right click table, This Frame - Show only this frame. That opens https://datawrapper.dwcdn.net/32i0b/1/. In there you can see it makes request to https://datawrapper.dwcdn.net/32i0b/1/dataset.csv

1024core · on Aug 8, 2023

Thank you! :-)

duderific · on Aug 8, 2023

Huh? In network tab in Chrome devtools, there are fetches for dataset.csv - the url is https://datawrapper.dwcdn.net/32i0b/1/dataset.csv which downloads the dataset.

slaymaker1907 · on Aug 8, 2023

It's the item called "dataset.csv" in the network tab using Edge devtools.

dunham · on Aug 8, 2023

Sometimes that stuff ends up in the cache too, and you can write a script to scrape the cache.

bambax · on Aug 8, 2023

> the content that's not in the original source tends to be served up as structured data via XHR- JSON usually-

Yes, you can overwrite fetch and log everything that comes in or out of the page you're looking at. I do that in Tampermonkey but one can probably inject the same kind of script in Puppeteer.

Tade0 · on Aug 8, 2023

I'm grateful that GraphQL proliferated, because I don't even have to scrape such resources - I just query.

A while ago, when I was looking for an apartment, I noticed that only the mobile app for a certain service allows for drawing the area of interest - the web version had only the option of looking in the area currently visible on the screen.

Or did it? Turns out it was the same GraphQL query with the area described as a GeoJSON object.

GeoJSON allows for disjointed areas, which was particularly useful in my case, because I had three of those.

holoduke · on Aug 8, 2023

I am still using Casperjs ameith phantomjs. Old tech. But works perfectly. Some scripts are running for 10 years on the same sites without ever made a change.

Levitating · on Aug 8, 2023

But modern websites often use templating engines and render everything server sided.

icedchai · on Aug 8, 2023

Only "modern" ones? It seems everything old is new again.

masfuerte · on Aug 8, 2023

Github is moving in the opposite direction. The project browser used to be server rendered but anything below the root requires js now.

hk1337 · on Aug 8, 2023

Are there really that many opportunities where you need to scrape it with a browser as opposed to just fetching from the same JSON endpoint the web site is getting it from?

There are some, not many, but when possible I would rather just use a simple request library to fetch it than have to spin up a browser.

ricardo81 · on Aug 8, 2023

If it's server side rendered then you don't need a browser.

recursive · on Aug 8, 2023

I was doing modern websites in the 90s with CGI and perl.

beardyw · on Aug 8, 2023

So do we think the heap is revealing the response from an API?

ricardo81 · on Aug 8, 2023

I don't care about the heap. Load up a page, the content is either in the DOM or an external asset. Just a case of seeing how a site works.

99% of the time if it's not in the DOM it's an XHR request to a standardised API with nice, clean data.

johnnyworker · on Aug 8, 2023

A bit of a tangent, but a long time ago I was kicked out of a Facebook group for what I considered to be completely made up reasons -- and what really got to me was that by being banned from it, I couldn't even point to the posts that had been actively misunderstood and distorted. I couldn't find anything in the cache files, so I saved a process dump of the still running Firefox and stitched the posts together from that. I stopped caring the as soon as I had my proof, but was still sheepishly proud that I managed to get it.

Alifatisk · on Aug 8, 2023

That's very clever

wmichelin · on Aug 8, 2023

When you figured it out, what was the cause?

johnnyworker · on Aug 9, 2023

It's been over 10 years, and not interesting, but because you asked: it was a political group, and I was replying to a group member who was talking about how love is all you need. I tried to make the point that love may be an essential ingredient, but is not all you need, certainly not for political change (I think I used some Chomsky quotes to that effect...), and that group member essentially straw manned that into advocating hate (which I really wasn't). Knowing how I was back then, I probably was being a Dwight Schrute about it, as in "bzzzt, false, love is NOT all you need, because A, B and C".

So we were both irritated and talked back back and forth, ceding no ground, and suddenly an admin banned me without warning. It was the suddenness more than anything that got to me; I felt misrepresented and censored, and I just wanted to be able to re-read it all to gain closure, if you will. If I could find the file today I would probably cringe at what I wrote, or at least how I wrote it, but back then I nodded and thought "yup, I'm correct", haha.

wmichelin · on Aug 11, 2023

Totally feel that, I got a life-time ban from a huge programming sub on Reddit for being rude to a person who was obviously just copying-and-pasting their software homework on the forum.

ricardo81 · on Aug 8, 2023

Great use case.

breatheoften · on Aug 8, 2023

I used a technique like this a few years back in a production product ... We had an integration partner (who we had permission to integrate with) that offered a different api for integration partners than was used for their website but which was horribly broken and regularly gave out the wrong data. The api was broken but the data displayed on their web page was fine so someone on the team wrote a browser automation (using ruby and selenium!) to drive the browser through the series of pages needed to retrieve all the information required. Needless to say, this broke all the time as the page/css changed etc.

At some point I got pulled in and ran screaming away from selenium to puppeteer -- and quickly discovered the joy that is scripting the browser via natively supported api's and the chrome debugger protocol.

The partners web page happened to be implemented with the apollo graphql client and I came across the puppeteer api for scanning the javascript heap -- I realized that if I could find the apollo client instance in memory (buried as a local variable inside some function closure referenced within the web app) -- I could just use it myself to get the data I needed ... coded it up in an hour or so and it just worked ... super fun and effective way to write a "scraper"!

OnDocumentReady -> scan the heap for the needed object -> use it directly to get the data you need

paulddraper · on Aug 8, 2023

> scripting the browser via natively supported api's

Every modern browser has native support for WebDriver (i.e. Selenium) APIs.

The advantage to Puppetteer is that the Chrome Dev Tools API is just a better API.

visarga · on Aug 8, 2023

Wondering if we can automate the finding part with a LLM. You just tell if what data you want scraped.

dontupvoteme · on Aug 9, 2023

I think the approach would be to classify the property by the values.

simonw · on Aug 8, 2023

Has anyone seen a version of this trick that works with Playwright instead of Puppeteer?

EDIT: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... is the code that captures the snapshot, and it uses createCDPSession() - it looks like Playwright has an equivalent for that Puppeteer API, documented here: https://playwright.dev/docs/api/class-cdpsession

None4U · on Aug 8, 2023

The main/only difference here is that Puppeteer only supports Chromium, while Playwright support multiple browsers. CDP is the Chrome DevTools Protocol. Otherwise, as long as you're using Chrome in both, you get the same base protocol with a different API.

jawerty · on Aug 8, 2023

This is very cool. I do a lot of puppeteer scraping and this library would help with a lot of the more complicated DOMs to work with.

I have a live coding stream I did the other day scraping Facebook for comments https://www.youtube.com/live/03oTYPm12y8?feature=share

If you're interested in seeing puppeteer in action I started doing streams last month where I talk through my method. I’ll be posting a lot more since it's been very fun.

Overall puppeteer is great because you get to easily inject js scripts in a nice API. Selenium is great too but not as developed of a web scraping interface imo. Also puppeteer is a very optimized headless browser which is a given. What really matters is implementing a VPN proxy and storing your cookies during auth routines which I can get into if you have any questions about that.

adriancooney · on Aug 8, 2023

Thanks for posting this again! It's a year later and I still haven't touched the web scraper in production which is great to reflect on. It seems running the Youtube command on the post is still producing the exact same data too.

  $ npx puppeteer-heap-snapshot query \
    --url https://www.youtube.com/watch\?v\=L_o_O7v1ews \
    --properties channelId,viewCount,keywords --no-headless

odysseus · on Aug 9, 2023

Did you ever make another blog post about how to choose properties working backward from the visible data on the web page to the data structure containing said data?

Searching the heap manually is not working very well. The data I want is in a (very) long list of irrelevant values within a "strings" key. It might have something to do with the data on the page that I want to scrape being rendered by JavaScript.

bnchrch · on Aug 8, 2023

I love this. Thank you.

As I understand it this only works for SPAs or other heavy js frontends and would not work on HTML.

I think that’s fine.

What I’m really excited is this combined with traditional mark up scanning plus (incoming buzz word) AI.

Scraping is slowly becoming unstoppable and that a good thing.

koromak · on Aug 8, 2023

Is that true? The DOM is represented somewhere too, probably in the same heap. I'd bet it works on static sites.

bnchrch · on Aug 8, 2023

Oh you may be right!

topherjaynes · on Aug 8, 2023

I'm not a legal scholar and this isn't my area of expertise, but the final note has links out to a TechCrunch article about LinkedIn vs hiQ Labs Inc, which alludes to web scraping being legal, but the case wasn't decided for a few more months, and the court sided with Linkedin. What's the final verdict on web scraping vs creating fake accounts to get user information (which the case focused on)

svdr · on Aug 8, 2023

Wouldn't web scraping be possible by taking screenshots of the rendered pages and then reading them with OCR?

simonw · on Aug 8, 2023

If you just want the text there are other ways to do that. You could dump out document.body.innerText for example - here's how to do that with https://shot-scraper.datasette.io/en/stable/javascript.html

    shot-scraper javascript youtube.com 'document.body.innerText' -r

Output: https://gist.github.com/simonw/f497c90ca717006d0ee286ab086fb...

Or access the accessibility tree of the page using https://shot-scraper.datasette.io/en/stable/accessibility.ht...

    shot-scraper accessibility youtube.com

Output here: https://gist.github.com/simonw/5174380dcd8c979af02e3dd74051a...

lelandfe · on Aug 8, 2023

Of course, if the document is using the outline in unexpected ways, you'll run into trouble. Consider Facebook infamously splitting "Advertisement" into multiple spans to avoid tripping ad blockers.

michaelt · on Aug 8, 2023

Although you'd imagine screenshots would be easy to OCR reliably, it's not guaranteed to get everything correct.

It's not like you can rely on a dictionary to confirm you've correctly OCRed a post by "@4EyedJediO" - who knows if that's an O or a 0 at the end?

And if you're OCRing the title and view count of a youtube video, for example, you've got to take the page layout into account because there's a recommendations sidebar full of other titles with different view counts.

plorntus · on Aug 8, 2023

I guess you'd get better results if you knew the font the site uses (which in many cases you could figure it out pretty quickly) or even just override every font with your own.

is_true · on Aug 8, 2023

Yes, it's possible. We do this for TV shows.

berkle4455 · on Aug 8, 2023

Much of the content worth scraping isn't rendered on the screen.

zffr · on Aug 8, 2023

Do you have any examples? I haven’t experienced this myself

throwawayadvsec · on Aug 8, 2023

URL, images, stuff shown after you click on a button...

ekianjo · on Aug 8, 2023

probably very inefficient as it would depend on layout a lot too

cush · on Aug 8, 2023

As inefficient as parsing heap snapshots?

brigadier132 · on Aug 8, 2023

Much more

spaniard89277 · on Aug 8, 2023

You'll be spending resources on LLMs like crazy. Possible but very messy IMO.

anamexis · on Aug 8, 2023

You don't need LLMs for OCR.

spaniard89277 · on Aug 8, 2023

No but maybe you want to do something with the ocr output.

ekianjo · on Aug 8, 2023

OCR does not get you the names of the classes in a DOM

c0balt · on Aug 8, 2023

Huh, I'm surprised this gets past the symbol mangling done by most js minifiers. Though maybe they don't mangle attribute fields everywhere.

However this is a nice hack around "modern" page structures and kudos to the author for making a proper tool out of it.

madeofpalk · on Aug 8, 2023

Most/all minifiers won't actually mangle object property names as those often have observable side effects. You want to grab all the keys for an object and do something different depending on the name of the key - you can no longer do that if the minifier has mangled all the object keys. Not to mention I imagine it would be significantly harder track all references to object keys across an application (as opposed to just local variables).

btown · on Aug 8, 2023

As I recall, the only minifier that went this far was Closure Compiler [0] - the caveats documented there still apply!

(Fun fact: I believe that Closure Library, and by extension the Closure Compiler, are still used for the Gmail rich text editor! [1])

[0] https://developers.google.com/closure/compiler/docs/api-tuto...

[1] https://google.github.io/closure-library/source/closure/goog...

nathell · on Aug 8, 2023

And ClojureScript.

bugsliker · on Aug 8, 2023

It's the JSON data payload that has unminified keys. Though YouTube is one of the few google sites that still use JSON, most use protocol buffers which generate JS interfaces which would indeed be mangled by minifiers.

tantalor · on Aug 8, 2023

> These properties were chosen manually by working backward from the visible data on the web page to the data structure containing said data (I'll dive into that process in another blog post)

That would seem to be the actually interesting/challenging part.

ricklamers · on Aug 8, 2023

Using introspection tools like queryObjects in Chrome Dev Tools should get you quite far:

https://developer.chrome.com/docs/devtools/console/utilities...

jackbeck · on Aug 8, 2023

I would assume that you could just search the heap by the value shown on the page to find out what the key is.

unixfox · on Aug 8, 2023

https://news.ycombinator.com/item?id=31205139

_boffin_ · on Aug 8, 2023

Anyone know of any research on generating HTML differentials against updated webpages and with automatic healing of wrappers / selectors or research on using LLMs with webscraping and how to reduce token usage while retaining context?

simonw · on Aug 8, 2023

I've been doing some pretty dumb tricks for reducing token usage and piping to LLMs which have worked really well.

I have a strip-tags CLI tool which I can pipe HTML through on its way to an LLM, described here: https://simonwillison.net/2023/May/18/cli-tools-for-llms/

I also do things like this:

    shot-scraper javascript news.ycombinator.com 'document.body.innerText' -r \
      | llm -s 'General themes, illustrated by emoji'

Output here: https://gist.github.com/simonw/3fbfa44f83e12f9451b58b5954514...

That's using https://shot-scraper.datasette.io/ to get just the document.body.innerText as a raw string, then piping that to gpt-3.5-turbo with a system prompt.

In terms of retaining context, I added a feature to my strip-tags tool where you can ask it to NOT strip specific tags - e.g.:

    curl -s https://www.theguardian.com/us | \
      strip-tags -m -t h1 -t h2 -t h3

That strips all HTML tags except for h1, h2 and h3 - output here: https://gist.github.com/simonw/fefb92c6aba79f247dd4f8d5ecd88...

Full documentation here: https://github.com/simonw/strip-tags/blob/main/README.md

_boffin_ · on Aug 8, 2023

First, love your work. Sadly, I don't think this path would work for me as for what I do, I need the selectors as part of the workflow.

Roughly, my end goal is to do a single or multi-shot with the following information HTML differential (could be selectors, xpaths, data regions, differentials of any of the above, etc...), code stacktrace, related code, and prompt.

For this example, let's consider that the flow involves the bot to login to a website. I have selectors for the `.username` and `.password` inputs and then a selector for the login button as `.login-btn`.

1. The site updates their page and changes up all their IDs, but keeps the same structure. 2. The site updates their page and changes up all their IDs, but changes the structure and the form is named something different and is somewhere else in the DOM. 3. many... many other examples.

Trying to figure out how to minimize the tokens, but keep the needed context to regenerate the selectors that are needed to maintain the workflow.

simonw · on Aug 8, 2023

Yeah, I've been thinking a bit about that kind of problem too.

My hunch is you could do it with a much more complex setup involving OpenAI functions - by trying different things (like "list just input elements with their names and associated labels") in a loop with the LLM where it gets to keep asking follow-up questions of the DOM until it finds the right combination.

_boffin_ · on Aug 8, 2023

Would love to connect with you and discuss this further.

zenyc · on Aug 8, 2023

I’m also working on this issue. Open to brainstorming some approaches together

_boffin_ · on Aug 8, 2023

Lets! Would love to hear your thoughts.

zenyc · on Aug 8, 2023

z (at) 6bow.com

ryanwaggoner · on Aug 9, 2023

I’d like to join the discussion if I can, I’m actively working in this area.

solanav · on Aug 8, 2023

Cool! I use selenium to do phishing detection at my company and I use javascript declared variables as a source of data to analyse. It’s specially useful for links that are obfuscated by concatenating two variables into another one.

enson110 · on Aug 9, 2023

Thanks for sharing.

I think one of the most challenging part of web scraping is dealing with the website's anti-scraping measures, such as it needs to sign in, encountering 403 forbidden error, and reCAPTCHA.

Does anyone have more experience in handling that?

elwell · on Aug 9, 2023

I'm tempted to scrape this blog page so I can read it without my eyes hurting.

elwell · on Aug 9, 2023

It's still a game of cat and mouse. Next step is for the website to store multiple instances of similarly structured data so you scrape the dummy unknowingly.

dontupvoteme · on Aug 9, 2023

Or generate garbage data to poison their scraper.

j0hnyl · on Aug 8, 2023

Does JavaScript state live in the heap too? Can we use this technique to pull objects from memory?

bonadrag · on Aug 9, 2023

How do we evade the captchas with this method? Asking for a friend...

1-6 · on Aug 8, 2023

Is there something which would allow me to do this with Jupyter Notebook?

_boffin_ · on Aug 8, 2023

What exactly are you trying to do?

xkcd1963 · on Aug 8, 2023

My dear fellows this is joyous news