Tell HN: We should snapshot a mostly AI output free version of the web
136 points by jacquesm 9 months ago | hide | past | favorite | 60 comments
While we can, and if it isn't too late already. The web is overrun with AI generated drivel, I've been searching for information on some widely varying subjects and I keep landing in recently auto-generated junk. Unfortunately most search engines associate 'recency' with 'quality' or 'relevance' and that is very much no longer true.

While there is still a chance I think we should snapshot a version of the web and make it publicly available. That can serve as something to calibrate various information sources against to get an idea of whether or not they are to be used or rather not. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low background steel'.

https://en.wikipedia.org/wiki/Low-background_steel




Sounds like you want Common Crawl - they have snapshots going back to 2013, take your pick: https://data.commoncrawl.org/crawl-data/index.html

(A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs)


> such data will rapidly become as precious as 'low background steel'.

I'm also totally not convinced by this argument.

Synthetic data as an input to a careful training regimen will result in better outputs, not worse, because you're still subjecting the model to optimization and new information. Over time you can pull out the worse-performing (original and synthetic) training data. That careful curation is the part that makes the difference.
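That curation step could look something like this minimal sketch. The scoring heuristic here is a crude hypothetical stand-in for whatever learned quality filter a real training pipeline would use:

```python
def quality_score(sample: str) -> float:
    """Hypothetical stand-in for a learned quality scorer.
    Crude proxy: penalize very short or highly repetitive text."""
    words = sample.split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)   # repetition penalty
    length_ok = min(len(words) / 20.0, 1.0)    # too-short penalty
    return diversity * length_ok

def curate(samples, threshold=0.5):
    """Keep only samples (human- or AI-generated) above a quality bar,
    discarding the worst performers regardless of origin."""
    return [s for s in samples if quality_score(s) >= threshold]

mixed = [
    "the " * 25,  # degenerate, repetitive output
    "A careful training regimen can improve on raw data by filtering "
    "out low-quality samples, whether they are original or synthetic.",
]
kept = curate(mixed)
```

The point of the sketch is just that the filter doesn't care whether a sample is human or synthetic, only whether it scores well.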

It's like DNA in the chemical soup: polymers have been replicating since the beginning, but in the end intelligence arises. It didn't need magical ingredients. When you climb a gradient, it typically takes you somewhere better.


> in the end intelligence arises. It didn't need magical ingredients.

That's the current prevailing hypothesis, but we don't yet fully understand the phenomenon of intelligence enough to definitively rule out any magical ingredients: unknown variables or characteristics of the system/inputs/data that made it possible for intelligence to emerge.

This proposed snapshot of the web, before it gets further "contaminated" by synthetic AI/LLM-generated data, might prove to be valuable or it might not. The premise could be wrong. Maybe we learn that there's nothing fundamentally special about human-generated data, compared to synthetic data derived from it.

It seems worthwhile to consider though, in case it turns out that there is some yet-unknown quality to the more or less "pure" human data. In the metaphor of low-background steel, we could be entering a period of unregulated nuclear testing without being fully aware of the consequences.


I don't buy this at all. AI data is a real part of the environment. The thing to modify is the loss function, not the training data. You need to be able to evaluate text on the internet, and so do models.
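One way to read "modify the loss function" is to down-weight examples a detector flags as likely synthetic rather than excluding them from training. A toy sketch, where the detector scores are a hypothetical input this snippet does not compute:

```python
import math

def weighted_nll(token_probs, synthetic_scores, synthetic_weight=0.3):
    """Toy per-example negative log-likelihood where examples flagged as
    likely AI-generated contribute less to the total loss.

    token_probs: model probability assigned to each target token.
    synthetic_scores: detector's P(synthetic) per example (hypothetical).
    """
    total = 0.0
    for p, s in zip(token_probs, synthetic_scores):
        weight = synthetic_weight if s > 0.5 else 1.0
        total += -weight * math.log(p)
    return total / len(token_probs)
```

Under this scheme a batch containing suspected-AI text simply produces a smaller gradient signal than an all-human batch, instead of being thrown away.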

This idea of contamination by AI vs pristine human data isn't persuasive to me at all. It feels like a continuation of the wrong idea that LLMs are parrots.


"Careful curation" is the part you lose when you use synthetic data. Subjecting models to "new information" isn't useful on its own, otherwise you could just feed them random bits and hope to carefully curate it later.

(also, how much time did the soup take? Can you wait that long?)


Training AI on AI-generated data produces some increasingly weird outputs. I'm sure we're already seeing the results of this in some models, but the level of hallucination is only going to increase unless some kind of checks and balances are implemented.


Hallucination^2


I'm convinced just cleaning existing datasets would be more effective.


2024 might already be too late, since this sentiment has been shared since at least 2021:

2021: https://twitter.com/jackclarkSF/status/1376304266667651078

2022: https://twitter.com/william_g_ray/status/1583574265513017344

2022: https://twitter.com/mtrc/status/1599725875280257024

Common Crawl and the Internet Archive crawls are probably the two most ready sources for this, you just have to define where you want to draw the line.

Common Crawl's first crawl of 2020 contains 3.1B pages, and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview

Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
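Drawing that line could be as simple as filtering crawl identifiers by date. Common Crawl's snapshot IDs encode the year and week (e.g. CC-MAIN-2020-05), so a cutoff filter is only a few lines; a sketch, with an illustrative ID list rather than the full set from their index page:

```python
import re

def crawls_before(crawl_ids, cutoff_year, cutoff_week):
    """Select Common Crawl snapshot IDs dated strictly before a cutoff.
    IDs look like 'CC-MAIN-YYYY-WW' where WW is the week of the year."""
    keep = []
    for cid in crawl_ids:
        m = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", cid)
        if m and (int(m.group(1)), int(m.group(2))) < (cutoff_year, cutoff_week):
            keep.append(cid)
    return keep

ids = ["CC-MAIN-2020-05", "CC-MAIN-2021-49", "CC-MAIN-2023-06"]
pre_llm = crawls_before(ids, 2022, 1)  # everything before 2022
```

Where exactly to put the cutoff (2020? 2022?) is the contentious part; the mechanics are trivial.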


Why is wide crawl 18 smaller than 17?


The Tumblr purge was worse than I expected…


> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on

They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.

Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.


> Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff.

Which studies show this? https://arxiv.org/abs/2305.17493 shows the exact opposite and my (layman's) understanding of statistics and epistemology lines up entirely with this finding.

Like, how could this even theoretically work? In the best case scenario wouldn't training on synthetic training data make LLMs overconfident / overfit the data once faced with new (human) input to respond to?


I don't have any exact references, but multiple finetuning datasets have used curated GPT-3/4 conversations as training data. It's less that they're overtly superior to human data, and more that they're less-bad and more abundantly available.

> Like, how could this even theoretically work?

I'm not really an expert on it either, but my understanding is that it works the same way curating human data works. You sift through the garbage, nonsense, impolite and incoherent AI responses and only include the exemplary conversations in your training set.

It feels kinda like the "monkeys on typewriters writing Shakespeare" parable. If you have enough well-trained AIs generate enough conversations, eventually enough of them will be indistinguishable enough from human data to be usable for training.


> They probably just use publicly-available resources like The Pile

I’d be very surprised if the big orgs don’t have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.


Yeah they absolutely do not use the pile.


GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.


To index the web, you generally do make a copy of it.

Google has a huge number of books scanned, too.


“Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

https://www.theatlantic.com/technology/archive/2017/04/the-t...



Yeah, I hadn't thought about their abandoned effort to scan every book and archived newspaper in the world in a while, but I bet they're regretting now that they didn't finish. A non-trivial amount of that physical media has been tossed or degraded by underfunded libraries since then. And it's more valuable to them now than it ever was.


I learned Rust, with great help from ChatGPT-4.

If I can learn from AI-generated content, then I totally believe that AI can too.


The problem with AI-generated content is not necessarily that it's bad, rather, it's not novel information. To learn something, you must not already know it. If it's AI-generated, the AI already knows it.


How much work do individual humans do that could be considered genuinely truly novel? I measure the answer to be "almost none."


That's true to some extent, but training on synthetic content is big these days:

https://importai.substack.com/p/import-ai-369-conscious-mach...


We might also say the same thing about spelling and grammar checkers. The difference will be in the quality of oversight of the tool. The "AI generated drivel" has minimum oversight.

Example: I have a huge number of perplexity.ai search/research threads, but the ones I share with my colleagues are a product of selection bias. Some of my threads are quite useless, much like a web search that was a dud. Those do not get shared.

Likewise, if I use an LLM to draft passages or even act as something like an overgrown thesaurus, I do find I have to make large changes. But some of the material stays intact. Is it AI, or not AI? It's a bit of both. Sometimes my editing is heavy-handed, other times less so, but in all cases I checked the output.


You are assuming that you and AI are the same sort of thing.

I do not think we are at that point yet. In the meantime, the idea that we might get to intelligence by feeding in more data might get choked out by poisoned data.

I have a suspicion that there's a bit more to it than just more data though.


AI does not 'learn' like a human.


I learned.. If I can… then I totally…


I've posted this recently on another post as well, but before AI-generated spam there was content farm spam. This has been increasing in search results and on social networking sites for years now.

The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.


I feel like archive.org and The Pile have this covered, no?


Until some lawyers force us to get rid of it.


This implies that the pre-AI internet wasn't already overrun with SEO optimized junk. Much of the internet is not worth preserving.


ROSE : We've always kept records of our lives. Through words, pictures, symbols... from tablets to books…

COLONEL : But not all the information was inherited by later generations. A small percentage of the whole was selected and processed, then passed on. Not unlike genes, really.

ROSE : That's what history is, Jack.

COLONEL : But in the current, digitized world, trivial information is accumulating every second, preserved in all its triteness. Never fading, always accessible.

ROSE : Rumors about petty issues, misinterpretations, slander…

COLONEL : All this junk data preserved in an unfiltered state, growing at an alarming rate.

ROSE : It will only slow down social progress, reduce the rate of evolution.

COLONEL : Raiden, you seem to think that our plan is one of censorship.


SEO content farms have been publishing for decades now.


Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming a collection of "hidden gems". Something like Marginalia, which does this for SEO noise, would be nice.


At least I think I can tell when I'm reading AI-generated content, and stop reading and go somewhere else. Eventually, though, it'll get better to the point where it'll be hard to tell, but maybe then it's also good enough to be worth reading?


I mean, that is one assumption you could make.


I don't really have this problem because I habitually use the Tools option on Google (or equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO-optimizers like this are generally pretty easy to screen out.

That said it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.
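The same date-cutoff trick works programmatically when you have result metadata in hand. A sketch filtering (url, published_date) pairs to a pre-cutoff set; the URLs and dates are made up for illustration:

```python
from datetime import date

def before_cutoff(results, cutoff):
    """Keep only results published before the cutoff date, mimicking the
    Tools > date-range filter (or the 'before:' operator) in search engines."""
    return [(url, published) for url, published in results if published < cutoff]

results = [
    ("https://example.com/old-recipe", date(2019, 6, 1)),
    ("https://example.com/fresh-seo-churn", date(2024, 2, 1)),
]
trusted = before_cutoff(results, date(2023, 1, 1))
```

As the comment notes, this breaks down when a site recycles a static URL with fresh content, since the visible date no longer matches what's on the page.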


Internet archive?


Not sure if they have a thorough snapshot, but it's a good idea for sure. IA is probably the only entity on earth that might share this dataset instead of hoarding it.


And, on that note: https://archive.org/donate/


Using "before:2023" in your Google query helps. For now.

A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.

https://udongein.xyz/notice/AcwmRcIzxOLmrSamum

There are some obvious problems with it, but I think I'd still like to see what that would look like.
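A first cut at such an index might just drop pages whose URLs carry affiliate-style tracking parameters. A sketch; the parameter list is a hypothetical heuristic, not an exhaustive catalog:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical heuristic: query parameters commonly used for
# affiliate tracking and campaign attribution.
AFFILIATE_PARAMS = {"tag", "ref", "affid", "aff_id", "utm_campaign"}

def looks_monetized(url: str) -> bool:
    """Crude check for affiliate/tracking parameters in a URL."""
    query = parse_qs(urlparse(url).query)
    return any(param in query for param in AFFILIATE_PARAMS)

def adfree_index(urls):
    """Keep only URLs with no obvious monetization markers."""
    return [u for u in urls if not looks_monetized(u)]
```

One obvious problem right away: plenty of monetized pages carry no URL markers at all, and plenty of honest pages use tracking parameters, so this would need on-page ad detection to be more than a toy.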


Good lord, that is a horribly designed website. The OP that person is linking to:

https://infosec.exchange/@bhawthorne/111601578642616056


How bad are the thousands of new stochastically-generated websites?

Last night I wanted to roast some hazelnuts, and I could not remember the temperature I used last time. So I searched on DuckDuckGo. Every website that I could find was machine-generated with different temps listed. One site had three separate methods listed that were essentially differently worded versions of the same thing. With different temperatures.

So I pulled my copy of Rodale’s Basic Natural Foods Cookbook off the shelf and looked it up there.

I think it may be time to download an archive copy of the 2022 Wikipedia before we lose all of our reference material. It was nice having all the world’s knowledge at my fingertips for a couple of decades, but that time seems to be past.


Sure, we can take a snapshot of our bot-filled web today before it goes true AI. Not sure what the real benefit would be.


I have a sliver of hope AI generated content will actually be good one day. Just like I believe automated cars will be better than humans. I have nothing against reading content that was written by AI, for some of my reading.


I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low background steel as an example.


That's what archive.org already does, but if you want to re-implement it, you would have to crawl the whole web and eventually save thumbnails of pages with something like ScreenshotOne (https://microlaunch.net/p/screenshotone)


> recently auto-generated junk

This would only apply to the pre-AGI era, though.


Is this really all that different from the procedurally generated drivel or the offshore freelance copy/paste generated drivel?

I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.


Reality is a mess in a lot of ways. Unfortunately in this case, it's a bit late.

Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it, and they also caused this issue?

I'm not holding my breath.


Internet Archive exists for webpages


The web has been overrun by drivel for over two decades now.


Since September at least


Isn’t this common crawl?


It's way too late.


r/Datahoarder probably already has you covered.


The same seems to have been happening on HN for the last several months.

I had actually posted a question about this around that time, but the only reply I got was from a guy saying it was not likely, because the HN hive mind would drive down such posts.

Not sure if he was right, because I still see evidence of such stuff.


yes


Embrace it. Stop living in the past, Gatsby. Just ask ChatGPT for the answers you seek. Hahaha! :)

What are you searching for anyway??



