
"The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data"

I really want to know more about the training data. Which web documents, which books, code from where, conversational data from where?




My sweet summer child, this is a closely guarded secret. It will only be revealed if Europe demands it so that copyright holders can sue.


Metadata will show where it came from, should you choose to keep it. Or so they showed on the big screen at I/O today.


Maybe you're right, but I'd be skeptical. In a non-snarky way, this shows the data sources used in models up to GPT-3:

https://lifearchitect.ai/whats-in-my-ai/

OpenAI paid $2M/year for Twitter feeds until Elon cut them off, Sam Altman has mentioned they paid a lot for scientific journals, and Reddit says it will start charging. Given how central data quality and curation are, if these private data sources give a significant boost, they won't be available for Apache-2.0 models.


Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com), I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.
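
For what it's worth, most Reddit listings have long been fetchable as JSON just by appending .json to the URL. A minimal sketch of what "a motivated developer" would do; the subreddit, User-Agent, and sleep interval are placeholders, and Reddit can tighten the limits whenever it likes:

    import time
    import requests

    # Reddit serves any listing as JSON if you append ".json".
    # A descriptive User-Agent is required by their API rules.
    HEADERS = {"User-Agent": "research-scraper/0.1 (contact: you@example.com)"}

    def fetch_listing(subreddit, after=None):
        url = f"https://old.reddit.com/r/{subreddit}/new.json"
        params = {"limit": 100}
        if after:
            params["after"] = after
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        return resp.json()["data"]

    after = None
    while True:
        data = fetch_listing("MachineLearning", after)
        for child in data["children"]:
            print(child["data"]["title"])
        after = data["after"]  # pagination cursor; None when exhausted
        if after is None:
            break
        time.sleep(2)  # stay under the unauthenticated rate limit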


This is about the time I expect sites to begin returning intentionally corrupt, incorrect, or outright garbage data (subtle or not; probably better subtle, so they don't realize it until it's far too late) in order to poison the enemy's scraping well. Where "ethics" dissolve into the inherent raw cannibalistic laws of capitalist ventures.

Then you can sell them back the TBs they scraped at a 1000x markup for the real data, or attempt to watermark it so you can prove their illegal(?) use of your service in their training.
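
A toy sketch of the watermarking half of that, purely illustrative: the scraper heuristic, the threshold, the synonym table, and the Client object are all invented here, and real text watermarking is far subtler than deterministic synonym swaps:

    import random
    from collections import namedtuple

    # Hypothetical per-client state the web server would track.
    Client = namedtuple("Client", ["id", "requests_last_minute"])

    SYNONYMS = {"big": "sizeable", "fast": "rapid", "said": "stated"}
    REQUESTS_PER_MINUTE_LIMIT = 120  # made-up threshold

    def looks_like_scraper(client):
        return client.requests_last_minute > REQUESTS_PER_MINUTE_LIMIT

    def watermark(text, client_id):
        # Seed the RNG with the client id so the same client always
        # gets the same swaps -- that's what lets you re-identify
        # your text if it later surfaces in a model's output.
        rng = random.Random(client_id)
        out = []
        for word in text.split():
            if word in SYNONYMS and rng.random() < 0.3:
                out.append(SYNONYMS[word])
            else:
                out.append(word)
        return " ".join(out)

    def serve(text, client):
        if looks_like_scraper(client):
            return watermark(text, client.id)
        return text

    print(serve("the big dog said hi and ran fast", Client("bot-77", 500)))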


You might be right. What a dystopian future that will be. Make a few requests too many and the webserver might think you're scraping data so it gaslights you into reading bullshit.


Is this sarcasm? I can’t tell.


It's not. The internet will be crazy once compute is cheap enough to slightly modify all displayed content to suit your personal user profile.


So you think Reddit is going to replace their actual content… with very believable generated text? And that's going to fool people at scale? How does that help Reddit (or another org) combat bots? You can't just put garbage text that seems real but has nothing to do with today's news (or politics or science).

I’m really struggling to understand how you think this is going to work and result in harm.

This assumes both the site and the reader are really dumb.


Maybe they've been doing that for years and that's why all the advice subreddits turned into creative writing subreddits.


I fully expect Discord to be a data source, if not already, then for a future version. I also expect that the only way the general public would ever find this out is via whistle-blower.


It'd be pretty easy to tell; you could just ask it to generate Discord chats and see whether it works. Text models also like to memorize their inputs if they're big enough, so you could probably get specific ones back.
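
A rough version of that probe, assuming a Hugging Face model as a stand-in for whatever you're actually testing (gpt2 here only so the snippet runs; the prompt format is just a guess at Discord's timestamp/username layout):

    from transformers import pipeline

    gen = pipeline("text-generation", model="gpt2")

    # Prime the model with Discord-style chat structure and see if it
    # continues in the same format without being told what it is.
    prompt = (
        "Today at 4:20 PM\n"
        "user1234: anyone know how to fix this?\n"
        "moddude: "
    )
    out = gen(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
    print(out)
    # If completions reliably keep the username/timestamp structure,
    # chat logs in that format were plausibly in the training data.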


They don't specify, but if you're generally curious you should look into mC4, RedPajama, The Stack, etc., as they are the foundation of most open training sets.
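
If you want to poke at RedPajama without downloading a terabyte, something like this streams a slice from the Hugging Face Hub; the dataset and config names are the ones Together published and may have changed since (you may also need trust_remote_code=True on newer datasets versions):

    from datasets import load_dataset

    # Stream rather than download; the full corpus is ~1T tokens.
    ds = load_dataset(
        "togethercomputer/RedPajama-Data-1T",
        "common_crawl",  # other configs: arxiv, book, c4, github, ...
        split="train",
        streaming=True,
    )
    for i, record in enumerate(ds):
        print(record["text"][:200])
        if i >= 4:
            break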


I've spent quite a bit of time exploring RedPajama! https://simonwillison.net/2023/Apr/17/redpajama-data/




