
"The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data"

I really want to know more about the training data. Which web documents, which books, code from where, conversational data from where?




My sweet summer child, this is a closely guarded secret. It will only be revealed if Europe demands it so that copyright holders can sue.


Metadata will show where it came from, should you choose to keep it. Or so they showed on the big screen at I/O today.


Maybe you're right, but I'd be skeptical. In a non-snarky way, this shows the data sources used in models up to GPT-3:

https://lifearchitect.ai/whats-in-my-ai/

OpenAI paid $2M/year for Twitter feeds until Elon cut them off, Sam Altman has mentioned they paid a lot for scientific journals, and Reddit says it will start charging. Given how central data quality and curation are, if these private data sources give a significant boost, they won't be available for Apache-2.0 models.


Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com), I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.
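
For what it's worth, most Reddit listings have long been fetchable as JSON just by appending .json to the URL. A minimal sketch of what "a motivated developer" would do; the subreddit, User-Agent, and sleep interval are placeholders, and Reddit can tighten the limits whenever it likes:

    import time
    import requests

    # Reddit serves any listing as JSON if you append ".json".
    # A descriptive User-Agent is required by their API rules.
    HEADERS = {"User-Agent": "research-scraper/0.1 (contact: you@example.com)"}

    def fetch_listing(subreddit, after=None):
        url = f"https://old.reddit.com/r/{subreddit}/new.json"
        params = {"limit": 100}
        if after:
            params["after"] = after
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        return resp.json()["data"]

    after = None
    while True:
        data = fetch_listing("MachineLearning", after)
        for child in data["children"]:
            print(child["data"]["title"])
        after = data["after"]  # pagination cursor; None when exhausted
        if after is None:
            break
        time.sleep(2)  # stay under the unauthenticated rate limit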


This is about the time I expect sites to begin returning intentionally corrupt, incorrect, or outright garbage data (subtle or not; probably better subtle, so they don't realize it until it's far too late) in order to poison the enemy's scraping well. Where "ethics" dissolve into the inherent raw cannibalistic laws of capitalist ventures.

Then you can sell them back the TBs they scraped at a 1000x markup for the real data, or attempt to watermark it so you can prove their illegal(?) use of your service in their training.
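
A toy sketch of the watermarking half of that, purely illustrative: the scraper heuristic, the threshold, the synonym table, and the Client object are all invented here, and real text watermarking is far subtler than deterministic synonym swaps:

    import random
    from collections import namedtuple

    # Hypothetical per-client state the web server would track.
    Client = namedtuple("Client", ["id", "requests_last_minute"])

    SYNONYMS = {"big": "sizeable", "fast": "rapid", "said": "stated"}
    REQUESTS_PER_MINUTE_LIMIT = 120  # made-up threshold

    def looks_like_scraper(client):
        return client.requests_last_minute > REQUESTS_PER_MINUTE_LIMIT

    def watermark(text, client_id):
        # Seed the RNG with the client id so the same client always
        # gets the same swaps -- that's what lets you re-identify
        # your text if it later surfaces in a model's output.
        rng = random.Random(client_id)
        out = []
        for word in text.split():
            if word in SYNONYMS and rng.random() < 0.3:
                out.append(SYNONYMS[word])
            else:
                out.append(word)
        return " ".join(out)

    def serve(text, client):
        if looks_like_scraper(client):
            return watermark(text, client.id)
        return text

    print(serve("the big dog said hi and ran fast", Client("bot-77", 500)))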


You might be right. What a dystopian future that will be. Make a few requests too many and the webserver might think you're scraping data so it gaslights you into reading bullshit.


Is this sarcasm? I can’t tell.


It's not. The internet will be crazy once compute is cheap enough to slightly modify all displayed content to suit your personal user profile.


So you think Reddit is going to replace their actual content… with very believable generated text? And that's going to fool people at scale? How does that help Reddit (or another org) combat bots? You can't just put garbage text that seems real but has nothing to do with today's news (or politics or science).

I’m really struggling to understand how you think this is going to work and result in harm.

This assumes both the site and the reader are really dumb.


Maybe they've been doing that for years and that's why all the advice subreddits turned into creative writing subreddits.


I fully expect Discord to be a data source, if not already, then for a future version. I also expect that the only way the general public would ever find this out is via whistle-blower.


It'd be pretty easy to tell; you could just ask it to generate Discord chats and see whether it works. Text models also like to memorize their inputs if they're big enough, so you could probably get specific ones back.
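
A rough version of that probe, assuming a Hugging Face model as a stand-in for whatever you're actually testing (gpt2 here only so the snippet runs; the prompt format is just a guess at Discord's timestamp/username layout):

    from transformers import pipeline

    gen = pipeline("text-generation", model="gpt2")

    # Prime the model with Discord-style chat structure and see if it
    # continues in the same format without being told what it is.
    prompt = (
        "Today at 4:20 PM\n"
        "user1234: anyone know how to fix this?\n"
        "moddude: "
    )
    out = gen(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
    print(out)
    # If completions reliably keep the username/timestamp structure,
    # chat logs in that format were plausibly in the training data.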


They don't specify, but if you're generally curious you should look into mC4, RedPajama, The Stack, etc., as they are the foundation of most open training sets.
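
If you want to poke at RedPajama without downloading a terabyte, something like this streams a slice from the Hugging Face Hub; the dataset and config names are the ones Together published and may have changed since (you may also need trust_remote_code=True on newer datasets versions):

    from datasets import load_dataset

    # Stream rather than download; the full corpus is ~1T tokens.
    ds = load_dataset(
        "togethercomputer/RedPajama-Data-1T",
        "common_crawl",  # other configs: arxiv, book, c4, github, ...
        split="train",
        streaming=True,
    )
    for i, record in enumerate(ds):
        print(record["text"][:200])
        if i >= 4:
            break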


I've spent quite a bit of time exploring RedPajama! https://simonwillison.net/2023/Apr/17/redpajama-data/




