I don't think that Mercury Prize table is a representative example because each column has an obviously unique structure that the LLM can key in on: (year) (Single Artist/Album pair) (List of Artist/Album pairs) (image) (citation link)
I think a much better test would be something like "List of elements by atomic properties" [1], which has a lot of adjacent numbers in a similar range and overlapping first/last column types. However, the danger with that table is that it might be easy for the LLM to infer the values just from the element names, since they're well-known physical constants. The table of countries by population density might be less predictable [2], or the list of largest cities [3].
The test should be repeated with every available sorting function too, to see if that causes any new errors.
Additionally, using any Wikipedia page is misleading, as LLMs have seen the format many times during training and can probably reproduce the original HTML from the stripped version fairly well.
Instead, using some random, messy, scattered-with-spam site would be a much more realistic test environment.
Good points. But I feel like even with the cities article it could still ‘cheat’ by recognising what the data is supposed to be and filling in the blanks. Does it even need to be real though? What about generating a fake article to use as a test so it can’t possibly recognise the contents? You could even get GPT to generate it, just give it the ‘Largest cities’ HTML and tell it to output identical HTML but with all the names and statistics changed randomly.
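Something like this quick sketch could do it — assuming the OpenAI Node SDK; the model name, prompt wording and file names are just placeholders:

    // Sketch: ask a model to scramble a real table into fake test data.
    // Model name, prompt wording and file names are placeholders.
    import OpenAI from "openai";
    import { readFileSync, writeFileSync } from "node:fs";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async function scrambleTable(inputPath: string, outputPath: string) {
      const html = readFileSync(inputPath, "utf8"); // e.g. the 'Largest cities' table HTML

      const response = await client.chat.completions.create({
        model: "gpt-4o", // placeholder
        messages: [
          {
            role: "system",
            content:
              "You rewrite HTML tables. Keep the markup and structure exactly as given, " +
              "but replace every name and number with plausible random values.",
          },
          { role: "user", content: html },
        ],
      });

      writeFileSync(outputPath, response.choices[0].message.content ?? "");
    }

    scrambleTable("largest_cities.html", "fake_cities.html").catch(console.error);

Then you'd run the same strip-and-reconstruct test against the scrambled copy, where memorised Wikipedia content can't help.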
You step back and realize: we are thinking about how best to remove some symbols from documents that, not a moment ago, we were deciding certainly needed to be in there, all to feed a certain kind of symbol machine which has seen all the symbols before anyway, all so we don't pay as many cents for the symbols we know or think we need.
If I were not a human but some other kind of being suspended above this situation, with no skin in the game so to speak, it would all seem so terribly inefficient... But as a fleshy mortal I do understand how we got here.
if you want the AI to be able to select stuff, give it cheerio or jQuery access to navigate through the HTML document (rough sketch below);
if you need to give tags, classes, and ids to the llm, I use an html-to-pug converter like https://www.npmjs.com/package/html2pug which strips a lot of text and cuts costs. I don't think LLMs are particularly trained on pug content though so take this with a grain of salt
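For the cheerio route, this is roughly the kind of helper I mean — the function shape and the result cap are just illustrative, not any fixed API:

    // Sketch: a selector helper the LLM can call (via function calling)
    // instead of being fed raw HTML. Helper shape is illustrative only.
    import * as cheerio from "cheerio";

    // pageHtml would normally come from a fetch/Playwright step
    const pageHtml = `<table class="wikitable"><tr><td>Tokyo</td><td>37,400,068</td></tr></table>`;
    const $ = cheerio.load(pageHtml);

    // Declared to the model as a tool, e.g. select(selector: string): string[]
    function select(selector: string): string[] {
      return $(selector)
        .map((_, el) => $(el).text().trim())
        .get()
        .slice(0, 50); // cap results so one broad selector can't blow the context
    }

    console.log(select("table.wikitable td")); // [ 'Tokyo', '37,400,068' ]

The model can then ask for "table.wikitable tr" first and refine its selectors from what comes back, instead of you deciding up front what to strip.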
Hmmm. That's interesting. I wish there was a Node-RED node for the first library (I can always import the library directly and build my own subflow, but since I have cheerio for Node-RED and use it for paring down input to LLMs already...)
But OP did an (admittedly flawed) test. Have you got anything to back up your claim here? We've all got our own hunches, but this post was an attempt to test those hypotheses.
ChatGPT is clearly trained on Wikipedia; is there any concern about its knowledge from there polluting the responses? It seems like it would be better to test against data it didn't potentially already know.
I roughly came to the same conclusion a few months back and wrote a simple, containerized, open source, general-purpose scraper using Playwright in C# and TypeScript that's fairly easy to deploy and use with GPT function calling[0]. My observation was that using `document.body.innerText` was sufficient for GPT to "understand" the page, and `document.body.innerText` preserves some whitespace in Firefox (and I think Chrome).
I use more or less this code as a starting point for a variety of use cases and it seems to work just fine for mine (scraping and processing travel blogs, which tend to have pretty consistent layouts/structures).
Some variations can improve this by adding logic to look for the `main` content, ignore `nav` and `footer` (or variants thereof, whether using semantic tags or CSS selectors), and take only the `innerText` from the main container.
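Roughly along these lines — a Playwright sketch in TypeScript; the selector choices are assumptions, not the linked project's actual code:

    // Sketch: grab innerText, preferring <main>/<article> and dropping
    // nav/footer chrome first. Selectors are assumptions, adjust per site.
    import { chromium } from "playwright";

    async function pageText(url: string): Promise<string> {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "domcontentloaded" });

      const text = await page.evaluate(() => {
        // drop obvious non-content containers
        document.querySelectorAll("nav, header, footer, aside").forEach((el) => el.remove());
        // prefer a semantic main container, fall back to the whole body
        const main = document.querySelector<HTMLElement>("main, article, #content");
        return (main ?? document.body).innerText; // innerText keeps some whitespace/layout
      });

      await browser.close();
      return text;
    }

    pageText("https://example.com").then(console.log);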
I've been building an AI chat client and I use this exact technique to develop the "Web Browsing" plugin. Basically I use Function Calling to extract content from a web page and then pass it to the LLM.
There are a few optimizations we can make:
- strip all content in <script/> and <style/>
- use Readability.js for articles
- extract structured content from oEmbed
It works surprisingly well for me, even with gpt-4o-mini
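For the first two items, a rough sketch of what I mean (jsdom plus @mozilla/readability in TypeScript; not the plugin's exact code):

    // Sketch of the first two optimizations: drop <script>/<style>, then let
    // Readability pull out the article body before handing it to the LLM.
    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";

    function extractArticle(html: string, url: string): string {
      const dom = new JSDOM(html, { url });

      // 1. strip all content in <script> and <style>
      dom.window.document
        .querySelectorAll("script, style, noscript")
        .forEach((el) => el.remove());

      // 2. Readability.js for article-like pages
      const article = new Readability(dom.window.document).parse();

      // fall back to plain body text when Readability finds no article
      return article?.textContent ?? dom.window.document.body.textContent ?? "";
    }

The oEmbed step is a separate fetch against the provider's endpoint, so it isn't shown here.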
Anecdotally, the same seems to apply to the output format as well. I’ve seen much better performance when instructing the model to output something like this:
I'd say how much is good enough highly depends on your use case. For something that still has to be reviewed by a human, I think even 0.7 is great; if you're planning to automate processes end-to-end, I'd aim for higher than 0.95.
Well, when "simply" extracting the core text of an article is a task where most solutions (rule-based, visual, traditional classifiers and LLMs) rarely score above 0.8 in precision on datasets with a variety of websites and / or multilingual pages, I would consider that not too bad.
Chain of thought or some similar strategies (I hate that they have their own name and like a paper and authors, lol) can help you push that 0.9 to a 0.95-0.99.
This. Images passed to LLMs are typically downsampled to something like 512x512 because that’s perfectly good for feature extraction. Getting text out would require much larger images so that the text stays readable.
Or you can ask it to keep specific tags if you think those might help provide extra context to the LLM:
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote
Add "-m" to minify the output (basically stripping most whitespace)
Running this command:
curl 'https://simonwillison.net/' | strip-tags .quote -t div -t blockquote -m
Gives me back output that starts like this:
<div class="quote segment"> <blockquote>history | tail -n
2000 | llm -s "Write aliases for my zshrc based on my
terminal history. Only do this for most common features.
Don't use any specific files or directories."</blockquote> —
anjor #
3:01 pm
/ ai, generative-ai, llms, llm </div>
<div class="quote segment"> <blockquote>Art is notoriously
hard to define, and so are the differences between good art
and bad art. But let me offer a generalization: art is
something that results from making a lot of choices. […] to
oversimplify, we can imagine that a ten-thousand-word short
story requires something on the order of ten thousand
choices. When you give a generative-A.I. program a prompt,
you are making very few choices; if you supply a hundred-word
prompt, you have made on the order of a hundred choices. If
an A.I. generates a ten-thousand-word story based on your
prompt, it has to fill in for all of the choices that you are
not making.</blockquote> — Ted Chiang #
10:09 pm
/ art, new-yorker, ai, generative-ai, ted-chiang </div>
An example of where this approach is problematic: many ecommerce product pages feature embedded JSON that is used to dynamically update sections of the page.
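If you actually need that data, one option is to pull the embedded JSON out of the script tags before stripping — a hedged sketch using cheerio and the common JSON-LD convention; site-specific inline state blobs would need their own selectors:

    // Sketch: salvage embedded product JSON before tag-stripping throws it away.
    // Targets the standard <script type="application/ld+json"> blocks; other
    // inline state objects would need site-specific selectors.
    import * as cheerio from "cheerio";

    function embeddedJson(html: string): unknown[] {
      const $ = cheerio.load(html);
      return $('script[type="application/ld+json"]')
        .map((_, el) => {
          try {
            return JSON.parse($(el).html() ?? "");
          } catch {
            return null; // skip malformed blocks
          }
        })
        .get()
        .filter((x) => x !== null);
    }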
[1] https://en.wikipedia.org/wiki/List_of_elements_by_atomic_pro...
[2] https://en.wikipedia.org/wiki/List_of_countries_and_dependen...
[3] https://en.wikipedia.org/wiki/List_of_largest_cities#List