
Musk has been fixated on this idea that Twitter is a huge treasure trove of data for AI training, often complaining about AI companies using its data for training purposes. I had assumed that companies like OpenAI were simply crawling the web, which includes Twitter, rather than targeting Twitter in particular. Is Twitter's data really that valuable for training AIs? What qualities does it have that make it particularly useful compared to any other freely available data set?



(1) Twitter's data is accurately timestamped, and (2) there's new data constantly flowing in about recent events. There's no other source like that in English other than Reddit.

AFAIU neither of those is relevant to GPT-like architectures, but it's not inconceivable that some future model architecture could take advantage of them. Purely from an information-theoretic POV, there are non-zero bits of information in the timestamp and relative ordering of tweets.
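
As a rough illustration, that timestamp signal could already be exposed to a text-only model by rendering it into the training text during preprocessing. A minimal Python sketch, with a hypothetical record layout ("created_at", "text"), not Twitter's actual schema:

    # Sketch: expose tweet timestamps to a text-only model by rendering
    # them as a plain-text prefix during preprocessing. The record layout
    # ("created_at", "text") is hypothetical, not Twitter's actual schema.
    from datetime import datetime, timezone

    def to_training_example(tweet: dict) -> str:
        ts = datetime.fromtimestamp(tweet["created_at"], tz=timezone.utc)
        return f"[{ts:%Y-%m-%d %H:%M} UTC] {tweet['text']}"

    print(to_training_example({"created_at": 1700000000, "text": "hello"}))
    # [2023-11-14 22:13 UTC] hello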


> There's no other source like that in English other than Reddit

1) Facebook Posts/Comments, 2) Instagram Posts/Comments, 3) Youtube Comments, 4) Gmail content, 5) LinkedIn Comments, 6) TikTok contents / comments

X and Reddit are definitely valuable, but they're not unique. I think Meta and Google have inherent advantages because their data is not accessible to LLM competitors and they have the actual capability to build great LLMs.

Unless X decides to tap AI talent in China, they're going to have a REALLY hard time spinning up a competitive LLM team compared to OpenAI, Google, and Meta, which I think are the top three LLM companies in that order.


The discourse on Facebook & Instagram is significantly different from what you would find on X or Reddit, both in terms of quality & topics.


Facebook groups have, weirdly enough, had a bunch of quality discussions similar to Reddit's. Can't speak for Instagram, but FB groups are worth peeking into to follow your favorite software projects.


My local city/town Facebook groups are surprisingly good. Like on Reddit there's always the specter of sketchy weird things happening behind the scenes with the mods/admins, but the day-to-day experience is very much that of chatting and sharing with my neighbors.


Fandom/topic or hobby/writing groups on Facebook are better quality discussion venues than Reddit if you can accept seeing some very obvious instances of spam posts and spam comments.


I haven't noticed a significant difference in discussion quality between The Platform Formerly Known As Twitter and Youtube/Reddit for most topics. Maybe I'm cynical, but the vast majority of public communication on the internet seems to be roughly the same quality in my mind.


> the vast majority of public communication on the internet seems to be roughly the same quality in my mind.

I mean, could it be that the platforms you're familiar with are just of similar quality? There are major quality differences. Consider, for example, HN vs Instagram. Do you really see no difference in the quality of discourse, or do you just not use Instagram?


> vast majority

By bulk/raw data volume, I'd say that the vast majority of internet communication is the same quality, yeah-- I'll stick by that assertion. That's not at odds with acknowledging there exist locations where intelligent communication happens. My position is just that the signal to noise ratio is pretty bad in the majority of places.


Not sure which way that cuts


> 3) Youtube Comments

Is probably LLM poison.

> 4) Gmail content

Is huge but also has enormous privacy issues. Most people by default assume their emails are reasonably private, whereas most people wouldn't assume their comments on these platforms are private.


That totally depends on the YouTube channel. The tech and project channels I follow have excellent comments, often better than HN and proggit. E.g.:

https://www.youtube.com/@HyperspacePirate

https://www.youtube.com/@scottmanley

https://www.youtube.com/@3blue1brown

https://www.youtube.com/@reps

https://www.youtube.com/@AppliedScience


Hiring a lot of experts from China sounds extremely politically challenging.


[flagged]


Oh give it a rest, will you? Hiring AI "also-rans," or whatever term satisfies your sense of national chauvinism, then.

For kicks I tried searching for news about "Chinese AI experts" and found the CCP has apparently infiltrated august institutions such as the Financial Times and Harvard Business Review (here, for instance: https://hbr.org/2021/02/is-china-emerging-as-the-global-lead...). Maybe you can go pester them.


I am 100% certain that there is at least a single AI expert in China.

This is bait, right?


Of course, the issue is that they don't have an AI industry. The idea that they do is a CCP talking point they've been pushing lately. They're pumping out fake AI crap all over the internet.


That has to be the fastest I’ve seen someone concede a statement is true that they just called “CCP propaganda” a moment ago.


How is AI talent from China related?


I assume it's for access to, and expertise in, the "Sinosphere" of knowledge, the other side of that coin being the Anglosphere in the West.


Just take an aggregate of every news article written in the top 100 newspapers.

That's high quality content, timestamped and about current events.

There is very little content on Twitter that compares in quality to one well-written news article.
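
A minimal sketch of that kind of aggregation, using the feedparser library; the two feed URLs are illustrative stand-ins, not a vetted top-100 list:

    # Sketch: aggregate timestamped articles from newspaper RSS feeds.
    # The feed URLs are illustrative, not a vetted top-100 list.
    import feedparser  # pip install feedparser

    FEEDS = [
        "https://feeds.bbci.co.uk/news/rss.xml",
        "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    ]

    articles = []
    for url in FEEDS:
        for entry in feedparser.parse(url).entries:
            when = entry.get("published_parsed")  # struct_time, sortable
            if when:
                articles.append((when, entry.get("title", "")))

    # Newest first: timestamped, high-quality, about current events.
    for when, title in sorted(articles, reverse=True)[:5]:
        print(when.tm_year, when.tm_mon, when.tm_mday, "-", title)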


I would assume this is simply not enough data. You also have access to all public domain books and they usually make up a small fraction of the data used to train a model from scratch. For fine tuning a model, having a unique source of high quality data is probably valuable even if small.


There is information posted to X which the top 100 newspapers are not willing to / do not care to publish.


As we've learned in recent months, it goes both ways.


It could also be argued that the Twitter firehose requires substantial RLHF, de-biasing and moderation controls because of its colloquial nature.


Safety/Alignment researchers have been too fixated on making the one perfect LLM that has zero bias (or, arguably, one preferred bias) in my opinion.

I don't think Musk is the type of person to make the same mistake, so we'll either end up with a Twitter LLM that accurately represents the sum total of the Twitter firehose, or with many derivative LLMs, each having a set of possibly orthogonal biases. Honestly, I think the latter is preferable and would represent the diversity of opinions in reality more accurately.

Given the data source, I think it will be important to be able to switch between LLM personalities in the future to get the "crowd truth".
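
A rough sketch of what that switching could look like; `generate` here is a hypothetical stand-in for any LLM completion call, not a real API:

    # Sketch of "crowd truth" via switchable LLM personas. `generate`
    # is a hypothetical completion function, not a real API.
    PERSONAS = {
        "optimist": "Answer as an enthusiastic technologist.",
        "skeptic":  "Answer as a cautious critic.",
        "pedant":   "Answer with precise, well-sourced detail.",
    }

    def crowd_truth(question: str, generate) -> dict:
        # One answer per persona, instead of collapsing to a single bias.
        return {name: generate(f"{prompt}\n\nQ: {question}\nA:")
                for name, prompt in PERSONAS.items()}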


> Twitter LLM that accurately represents the sum total of the Twitter firehose,

We need an xkcd showing a conversation between twitter, reddit, and hacker news based LLMs. Political rage meets memes meets pedantry.


Wire news services come to mind as an alternative.


Perhaps, but the quantity of data is comparatively minuscule.


The quantity of information is probably higher though :)


Of course, but for training data current LLMs seem to need quantity above all else.


I’m sure it depends on the type of prediction task, right?

Current LLMs are trying to predict typical human prose from samples pulled from the internet. So it isn’t as if they are sacrificing quality for quantity. A bunch of text from the internet is a very good representation of typical human prose. Whether it is well written or the descriptions contained in the prose accurately represent, like, actual physical reality is another issue.

Maybe they want to predict something with, like, less dimensionality but more utility than a paragraph of fiction.


Agreed, not that anyone really seems to care, though.


> there's new data constantly flowing in talking about recent events.

In which the distinction between "data" and "information" is crucial. Especially now that the "floodgates" have been re-opened regarding misinformation, bots, impersonators and the likes.

Data is crucial when you need a training corpus. But information is crucial when the training must be tuned, limited, or simply verified.


What about HN?


Probably HN is already part of The Pile [1]?

I guess X is harder to scrape without permission.

[1] https://pile.eleuther.ai


It is indeed:

We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.

From https://arxiv.org/abs/2101.00027 (The Pile: An 800GB Dataset of Diverse Text for Language Modeling)
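
Given that, pulling the HackerNews slice back out of a local copy is straightforward. A sketch assuming The Pile's published JSONL layout, where each line carries a "pile_set_name" meta field (the file path is a placeholder):

    # Sketch: extract the HackerNews slice from a local copy of The Pile.
    # Assumes the published JSONL layout with a "pile_set_name" meta
    # field; the file path is a placeholder.
    import json

    def iter_hn_documents(path="pile/00.jsonl"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["meta"]["pile_set_name"] == "HackerNews":
                    yield record["text"]

    for doc in iter_hn_documents():
        print(doc[:80])
        break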


...HackerNews...

And there's my incentive to stop posting on HN.

It's been a blast, guys. I'm going back to lurker mode.


> Probably HN is already part of The Pile

Everybody smile for the camera, or we could just moon them, or both!


HN is a very specific demographic and probably orders of magnitude smaller in scale. Similar, but also not really comparable.


Smaller in scale, yeah, but probably with a high bias towards technical content, written by people who (mostly) care about how to write properly. There are at least 37,354,035 items that could be indexed, with much of it high quality; percentage-wise, probably higher quality per post than Twitter, Facebook, and other sources.
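
That item count is easy to check against HN's public Firebase API, where every story and comment is retrievable by numeric id. A quick sketch:

    # Check the item count against HN's public Firebase API; every
    # story and comment is addressable by a numeric id.
    import json
    from urllib.request import urlopen

    BASE = "https://hacker-news.firebaseio.com/v0"

    max_item = json.load(urlopen(f"{BASE}/maxitem.json"))
    print("items so far:", max_item)

    first = json.load(urlopen(f"{BASE}/item/1.json"))  # the very first item
    print(first.get("title"))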


I think Twitter is not a great place to dig out training data in general. Most of its data is not well structured and/or tagged. Its signal-to-noise ratio is relatively low. Its texts are generally very short and heavily dependent on context. Twitter has largely failed as a short-form video platform. There's some trace of algorithmic generation of interest/topic-based feeds on Twitter, but you know, its quality was never great. I guess it's just a hard problem given Twitter's environment.

Its strengths are freshness and volume, but I guess these can be achieved without Twitter if you have a strong web crawling infrastructure? Also, the current generation of LLMs is not really capable of exploiting minute-level freshness... at least for now.


Also, Twitter is not where people go to be nice. Twitter incentivizes snarky, disparaging, curt behavior because (a) it limits message length to an extent where nice speech doesn't have a place, and (b) saying nice things gets you likes while saying not-nice things gets you retweets, and retweets are more highly valued by the algorithm.


Twitter’s data would be very valuable for generating tweet-like content: short, self-contained snippets, images and video.

There’s not a lot of data in Twitter today resembling long-form content: essays, news articles, books, scientific papers, etc. That’s probably why Twitter/X expanded the tweet size limit, to be able to collect such data.


Yep! It would be great at generating hot takes and clickbait.


and racist, anti-Semitic content


If Twitter were as much of a treasure trove of user data as Elon thinks it is, then why is Twitter's ad targeting so much worse than Facebook's and Instagram's?


Twitter has the largest database of tweets. If you want an AI that writes tweets, there's nothing better. Why? Twitter could offer a service that bypasses the need for a community manager… just feed it a press release or product website, and it will provide a stream of well-crafted ad tweets… or astroturf, even.


Doesn't really matter what the content is as long as the sequences of tokens make sense. That's the goal: predict the next token given the previous N tokens. Higher level structures like tweets just fall out of that but I wouldn't be too surprised if a model trained only on tweets could also generalize to some other structures.
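
That objective, in a few lines of PyTorch, with toy shapes standing in for a real model's output:

    # The next-token objective described above, with toy shapes standing
    # in for a real model's output.
    import torch
    import torch.nn.functional as F

    vocab, seq_len, batch = 1000, 16, 4
    tokens = torch.randint(0, vocab, (batch, seq_len))
    logits = torch.randn(batch, seq_len, vocab)  # stand-in for model output

    # Position i is trained to predict token i+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab),
        tokens[:, 1:].reshape(-1),
    )
    print(loss.item())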


Twitter and Reddit are extremely valuable to LLMs, and the makers of both are really kicking themselves over missing the boat with open APIs.


Now they're kicking themselves into irrelevance by restricting access.


Comment datasets are valuable for conversational AI; it's the same reason Reddit locked down its API, I imagine.


Training anything on a network of bot traffic. What a time to be alive...



