Tell HN: We should snapshot a mostly AI output free version of the web
136 points by jacquesm 9 months ago | hide | past | favorite | 60 comments
While we can, and if it isn't too late already. The web is overrun with AI generated drivel, I've been searching for information on some widely varying subjects and I keep landing in recently auto-generated junk. Unfortunately most search engines associate 'recency' with 'quality' or 'relevance' and that is very much no longer true.

While there is still a chance I think we should snapshot a version of the web and make it publicly available. That can serve as something to calibrate various information sources against to get an idea of whether or not they are to be used or rather not. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low background steel'.

https://en.wikipedia.org/wiki/Low-background_steel




Sounds like you want Common Crawl - they have snapshots going back to 2013, take your pick: https://data.commoncrawl.org/crawl-data/index.html

(A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs)


> such data will rapidly become as precious as 'low background steel'.

I'm also totally not convinced by this argument.

Synthetic data as an input to a careful training regimen will result in better outputs, not worse, because you're still subjecting the model to optimization and new information. Over time you can pull out the worse-performing (original and synthetic) training data. That careful curation is the part that makes the difference.
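That curation step could look something like this minimal sketch. The scoring heuristic here is a crude hypothetical stand-in for whatever learned quality filter a real training pipeline would use:

```python
def quality_score(sample: str) -> float:
    """Hypothetical stand-in for a learned quality scorer.
    Crude proxy: penalize very short or highly repetitive text."""
    words = sample.split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)   # repetition penalty
    length_ok = min(len(words) / 20.0, 1.0)    # too-short penalty
    return diversity * length_ok

def curate(samples, threshold=0.5):
    """Keep only samples (human- or AI-generated) above a quality bar,
    discarding the worst performers regardless of origin."""
    return [s for s in samples if quality_score(s) >= threshold]

mixed = [
    "the " * 25,  # degenerate, repetitive output
    "A careful training regimen can improve on raw data by filtering "
    "out low-quality samples, whether they are original or synthetic.",
]
kept = curate(mixed)
```

The point of the sketch is just that the filter doesn't care whether a sample is human or synthetic, only whether it scores well.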

It's like DNA in the chemical soup: polymers have been replicating since the beginning, but in the end intelligence arises. It didn't need magical ingredients. When you climb a gradient, it typically takes you somewhere better.


> in the end intelligence arises. It didn't need magical ingredients.

That's the current prevailing hypothesis, but we don't yet fully understand the phenomenon of intelligence enough to definitively rule out any magical ingredients: unknown variables or characteristics of the system/inputs/data that made it possible for intelligence to emerge.

This proposed snapshot of the web, before it gets further "contaminated" by synthetic AI/LLM-generated data, might prove to be valuable or it might not. The premise could be wrong. Maybe we learn that there's nothing fundamentally special about human-generated data, compared to synthetic data derived from it.

It seems worthwhile to consider though, in case it turns out that there is some yet-unknown quality to the more or less "pure" human data. In the metaphor of low-background steel, we could be entering a period of unregulated nuclear testing without being fully aware of the consequences.


I don't buy this at all. AI data is a real part of the environment. The thing to modify is the loss function, not the training data. You need to be able to evaluate text on the internet, and so do models.
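One way to read "modify the loss function" is to down-weight examples a detector flags as likely synthetic rather than excluding them from training. A toy sketch, where the detector scores are a hypothetical input this snippet does not compute:

```python
import math

def weighted_nll(token_probs, synthetic_scores, synthetic_weight=0.3):
    """Toy per-example negative log-likelihood where examples flagged as
    likely AI-generated contribute less to the total loss.

    token_probs: model probability assigned to each target token.
    synthetic_scores: detector's P(synthetic) per example (hypothetical).
    """
    total = 0.0
    for p, s in zip(token_probs, synthetic_scores):
        weight = synthetic_weight if s > 0.5 else 1.0
        total += -weight * math.log(p)
    return total / len(token_probs)
```

Under this scheme a batch containing suspected-AI text simply produces a smaller gradient signal than an all-human batch, instead of being thrown away.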

This idea of contamination by AI vs pristine human data isn't persuasive to me at all. It feels like a continuation of the wrong idea that LLMs are parrots.


"Careful curation" is the part you lose when you use synthetic data. Subjecting models to "new information" isn't useful on its own, otherwise you could just feed them random bits and hope to carefully curate it later.

(also, how much time did the soup take? Can you wait that long?)


Training AI on AI-generated data produces some increasingly weird outputs. I'm sure we're already seeing the results of this in some models, but the level of hallucination is only going to increase unless some kind of checks and balances are implemented.


Hallucination^2


I'm convinced just cleaning existing datasets would be more effective.


2024 might already be too late, since this sentiment has been shared since at least 2021:

2021: https://twitter.com/jackclarkSF/status/1376304266667651078

2022: https://twitter.com/william_g_ray/status/1583574265513017344

2022: https://twitter.com/mtrc/status/1599725875280257024

Common Crawl and the Internet Archive crawls are probably the two most ready sources for this, you just have to define where you want to draw the line.

Common Crawl's first crawl of 2020 contains 3.1B pages, and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde... with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview

Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
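Drawing that line could be as simple as filtering crawl identifiers by date. Common Crawl's snapshot IDs encode the year and week (e.g. CC-MAIN-2020-05), so a cutoff filter is only a few lines; a sketch, with an illustrative ID list rather than the full set from their index page:

```python
import re

def crawls_before(crawl_ids, cutoff_year, cutoff_week):
    """Select Common Crawl snapshot IDs dated strictly before a cutoff.
    IDs look like 'CC-MAIN-YYYY-WW' where WW is the week of the year."""
    keep = []
    for cid in crawl_ids:
        m = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", cid)
        if m and (int(m.group(1)), int(m.group(2))) < (cutoff_year, cutoff_week):
            keep.append(cid)
    return keep

ids = ["CC-MAIN-2020-05", "CC-MAIN-2021-49", "CC-MAIN-2023-06"]
pre_llm = crawls_before(ids, 2022, 1)  # everything before 2022
```

Where exactly to put the cutoff (2020? 2022?) is the contentious part; the mechanics are trivial.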


Why is wide crawl 18 smaller than 17?


The Tumblr purge was worse than I expected…


> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on

They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.

Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.


> Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff.

Which studies show this? https://arxiv.org/abs/2305.17493 shows the exact opposite and my (layman's) understanding of statistics and epistemology lines up entirely with this finding.

Like, how could this even theoretically work? In the best case scenario wouldn't training on synthetic training data make LLMs overconfident / overfit the data once faced with new (human) input to respond to?


I don't have any exact references, but multiple finetuning datasets have used curated GPT-3/4 conversations as training data. It's less that they're overtly superior to human data, and more that they're less-bad and more abundantly available.

> Like, how could this even theoretically work?

I'm not really an expert on it either, but my understanding is that it works the same way curating human data works. You sift through the garbage, nonsense, impolite and incoherent AI responses and only include the exemplary conversations in your training set.

It feels kinda like the "monkeys on typewriters writing Shakespeare" parable. If you have enough well-trained AIs generate enough conversations, eventually enough of them will be indistinguishable enough from human data to be usable for training.


> They probably just use publicly-available resources like The Pile

I’d be very surprised if the big orgs don’t have in-house efforts that far exceed The Pile. Hell, we know Google paid Reddit a pile of money for data, and other orgs are also willing to pay.


Yeah they absolutely do not use the pile.


GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.


To index the web, you generally do make a copy of it.

Google has a huge number of books scanned, too.


“Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

https://www.theatlantic.com/technology/archive/2017/04/the-t...



Yeah, I hadn't thought about their abandoned effort to scan every book and archived newspaper in the world in a while, but I bet they're regretting now that they didn't finish. A non-trivial amount of that physical media has been tossed or degraded by underfunded libraries since then. And it's more valuable to them now than it ever was.


I learned Rust, with great help from ChatGPT-4.

If I can learn from AI-generated content, then I totally believe that AI can too.


The problem with AI-generated content is not necessarily that it's bad, rather, it's not novel information. To learn something, you must not already know it. If it's AI-generated, the AI already knows it.


How much work do individual humans do that could be considered genuinely truly novel? I measure the answer to be "almost none."


That's true to some extent, but training on synthetic content is big these days:

https://importai.substack.com/p/import-ai-369-conscious-mach...


We might also say the same thing about spelling and grammar checkers. The difference will be in the quality of oversight of the tool. The "AI generated drivel" has minimum oversight.

Example: I have a huge number of perplexity.ai search/research threads, but the ones I share with my colleagues are a product of selection bias. Some of my threads are quite useless, much like a web search that was a dud. Those do not get shared.

Likewise, if I use an LLM to draft passages or even act as something like an overgrown thesaurus, I do find I have to make large changes. But some of the material stays intact. Is it AI, or not AI? It's a bit of both. Sometimes my editing is heavy-handed, other times less so, but in all cases I checked the output.


You are assuming that you and AI are the same sort of thing.

I do not think we are at that point yet. In the meantime, the idea that we might get to intelligence by feeding in more data might get choked out by poisoned data.

I have a suspicion that there's a bit more to it than just more data though.


AI does not 'learn' like a human.


I learned.. If I can… then I totally…


I've posted this recently on another post as well, but before AI-generated spam there was content farm spam. This has been increasing in search results and on social networking sites for years now.

The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.


I feel like archive.org and The Pile have this covered, no?


Until some lawyers force us to get rid of it.


This implies that the pre-AI internet wasn't already overrun with SEO optimized junk. Much of the internet is not worth preserving.


ROSE : We've always kept records of our lives. Through words, pictures, symbols... from tablets to books…

COLONEL : But not all the information was inherited by later generations. A small percentage of the whole was selected and processed, then passed on. Not unlike genes, really.

ROSE : That's what history is, Jack.

COLONEL : But in the current, digitized world, trivial information is accumulating every second, preserved in all its triteness. Never fading, always accessible.

ROSE : Rumors about petty issues, misinterpretations, slander…

COLONEL : All this junk data preserved in an unfiltered state, growing at an alarming rate.

ROSE : It will only slow down social progress, reduce the rate of evolution.

COLONEL : Raiden, you seem to think that our plan is one of censorship.


SEO content farms have been publishing for decades now.


Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming a collection of "hidden gems". Something like Marginalia, which does this for SEO noise, would be nice.


At least I think I can tell when I'm reading AI-generated content, and stop reading and go somewhere else. Eventually, though, it'll get better to the point where it'll be hard to tell, but maybe then it's also good enough to be worth reading?


I mean, that is one assumption you could make.


I don't really have this problem because I habitually use the Tools option on Google (or equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO-optimizers like this are generally pretty easy to screen out.

That said it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.
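The same date-cutoff trick works programmatically when you have result metadata in hand. A sketch filtering (url, published_date) pairs to a pre-cutoff set; the URLs and dates are made up for illustration:

```python
from datetime import date

def before_cutoff(results, cutoff):
    """Keep only results published before the cutoff date, mimicking the
    Tools > date-range filter (or the 'before:' operator) in search engines."""
    return [(url, published) for url, published in results if published < cutoff]

results = [
    ("https://example.com/old-recipe", date(2019, 6, 1)),
    ("https://example.com/fresh-seo-churn", date(2024, 2, 1)),
]
trusted = before_cutoff(results, date(2023, 1, 1))
```

As the comment notes, this breaks down when a site recycles a static URL with fresh content, since the visible date no longer matches what's on the page.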


Internet archive?


Not sure if they have a thorough snapshot, but it's a good idea for sure. IA is probably the only entity on earth that might share this dataset instead of hoarding it.


And, on that note: https://archive.org/donate/


Using "before:2023" in your Google query helps. For now.

A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.

https://udongein.xyz/notice/AcwmRcIzxOLmrSamum

There are some obvious problems with it, but I think I'd still like to see what that would look like.
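A first cut at such an index might just drop pages whose URLs carry affiliate-style tracking parameters. A sketch; the parameter list is a hypothetical heuristic, not an exhaustive catalog:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical heuristic: query parameters commonly used for
# affiliate tracking and campaign attribution.
AFFILIATE_PARAMS = {"tag", "ref", "affid", "aff_id", "utm_campaign"}

def looks_monetized(url: str) -> bool:
    """Crude check for affiliate/tracking parameters in a URL."""
    query = parse_qs(urlparse(url).query)
    return any(param in query for param in AFFILIATE_PARAMS)

def adfree_index(urls):
    """Keep only URLs with no obvious monetization markers."""
    return [u for u in urls if not looks_monetized(u)]
```

One obvious problem right away: plenty of monetized pages carry no URL markers at all, and plenty of honest pages use tracking parameters, so this would need on-page ad detection to be more than a toy.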


Good lord, that is a horribly designed website. The OP that person is linking to:

https://infosec.exchange/@bhawthorne/111601578642616056


How bad are the thousands of new stochastically-generated websites?

Last night I wanted to roast some hazelnuts, and I could not remember the temperature I used last time. So I searched on DuckDuckGo. Every website that I could find was machine-generated with different temps listed. One site had three separate methods listed that were essentially differently worded versions of the same thing. With different temperatures.

So I pulled my copy of Rodale’s Basic Natural Foods Cookbook off the shelf and looked it up there.

I think it may be time to download an archive copy of the 2022 Wikipedia before we lose all of our reference material. It was nice having all the world’s knowledge at my fingertips for a couple of decades, but that time seems to be past.


Sure, we can take a snapshot of our bot-filled web today before it goes true AI. Not sure what the real benefit would be.


I have a sliver of hope AI generated content will actually be good one day. Just like I believe automated cars will be better than humans. I have nothing against reading content that was written by AI, for some of my reading.


I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low background steel as an example.


That's what archive.org already does, but if you want to re-implement it, you would have to crawl the whole web and eventually save thumbnails of pages with something like ScreenshotOne (https://microlaunch.net/p/screenshotone)


> recently auto-generated junk

This would only apply to the pre-AGI era, though.


Is this really all that different from the procedurally generated drivel or the offshore freelance copy/paste generated drivel?

I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.


Reality is a mess in a lot of ways. Unfortunately in this case, it's a bit late.

Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it, and they also caused this issue?

I'm not holding my breath.


Internet Archive exists for webpages


The web has been overrun by drivel for over two decades now.


Since September at least


Isn’t this common crawl?


It's way too late.


r/Datahoarder probably already has you covered.


The same seems to have been happening on HN for the last several months.

I had actually posted a question about this around that time, but the only reply I got was from a guy saying it was not likely, because the HN hive mind would drive down such posts.

Not sure if he was right, because I still see evidence of such stuff.


yes


Embrace it. Stop living in the past, Gatsby. Just ask ChatGPT for the answers you seek. Hahaha! :)

What are you searching for anyway??



