Musk has been fixated on this idea that Twitter is a huge treasure trove of data for AI training, often complaining about AI companies using its data for training purposes. I had just assumed that companies like OpenAI were just crawling the web, which included Twitter, rather than targeting Twitter in particular. Is the Twitter data really that valuable for training AIs? What particular qualities does it have that make it particularly useful compared to any other freely available data set?
(1) Twitter's data is accurately timestamped, (2) there's new data constantly flowing in talking about recent events. There's no other source like that in English other than Reddit.
AFAIU neither of those is relevant to GPT-like architectures, but it's not inconceivable that a future model architecture might take advantage of them. Purely from an information-theoretic POV, there are non-zero bits of information in the timestamp and relative ordering of tweets.
X and Reddit are definitely valuable, but they're definitely not unique. I think Meta and Google have inherent advantages because their data is not accessible to LLM competitors and they have the actual capabilities to build great LLMs.
Unless X decides to tap AI talent in China, they're going to have a REALLY hard time spinning up a competitive LLM team compared to OpenAI, Google, and Meta, which I think are the top three LLM companies in that order.
Facebook groups have, weirdly enough, had a bunch of quality discussions similar to Reddit. Can't speak for Instagram, but FB groups are worth peeking into to follow your favorite software projects.
My local city/town Facebook groups are surprisingly good. Like on Reddit there's always the specter of sketchy weird things happening behind the scenes with the mods/admins, but the day-to-day experience is very much that of chatting and sharing with my neighbors.
Fandom/topic or hobby/writing groups on Facebook are better quality discussion venues than Reddit if you can accept seeing some very obvious instances of spam posts and spam comments.
I haven't noticed a significant difference in discussion quality between The Platform Formerly Known As Twitter and YouTube/Reddit for most topics. Maybe I'm cynical, but the vast majority of public communication on the internet seems to be roughly the same quality in my mind.
> the vast majority of public communication on the internet seems to be roughly the same quality in my mind.
I mean, could it be that it's just that the platforms you're familiar with are similar quality? There are major quality differences. Consider, for example, HN vs Instagram. Do you really see no difference in the quality of discourse, or do you just not use Instagram?
By bulk/raw data volume, I'd say that the vast majority of internet communication is the same quality, yeah-- I'll stick by that assertion. That's not at odds with acknowledging there exist locations where intelligent communication happens. My position is just that the signal to noise ratio is pretty bad in the majority of places.
Is huge but also has enormous privacy issues. Most people by default assume their emails are reasonably private, whereas most people wouldn't assume their comments on these platforms are private.
Oh give it a rest, will you? Hiring AI "also-rans" or whatever term satisfies your sense of national chauvinism then.
For kicks I tried searching for news about "Chinese AI experts" and found the CCP has apparently infiltrated august institutions such as the Financial Times and Harvard Business Review (here, for instance: https://hbr.org/2021/02/is-china-emerging-as-the-global-lead...). Maybe you can go pester them.
Of course, the issue is they don't have an AI industry. The idea that they do is a CCP talking point they've been pushing lately. They're pumping out fake AI crap all over the internet.
I would assume this is simply not enough data. You also have access to all public domain books and they usually make up a small fraction of the data used to train a model from scratch. For fine tuning a model, having a unique source of high quality data is probably valuable even if small.
Safety/Alignment researchers have been too fixated on making the one perfect LLM that has zero bias (or, arguably, one preferred bias) in my opinion.
I don't think Musk is the type of person to make the same mistake, so we'll either end up with a Twitter LLM that accurately represents the sum total of the Twitter firehose, and/or many derivative LLMs each having a set of, possibly orthogonal, biases. Honestly, I think the latter is preferable and would represent the diversity of opinions in reality more accurately.
Given the data source, I think it will be important to be able to switch between LLM personalities in the future to get the "crowd truth".
I’m sure it depends on the type of prediction task, right?
Current LLMs are trying to predict typical human prose from samples pulled from the internet. So it isn’t as if they are sacrificing quality for quantity. A bunch of text from the internet is a very good representation of typical human prose. Whether it is well written or the descriptions contained in the prose accurately represent, like, actual physical reality is another issue.
Maybe they want to predict something with, like, less dimensionality but more utility than a paragraph of fiction.
> there's new data constantly flowing in talking about recent events.
Here the distinction between "data" and "information" is crucial, especially now that the "floodgates" have been re-opened to misinformation, bots, impersonators and the like.
Data is crucial when you need a body of training material. But information is crucial when the training must be tuned, limited, or just verified.
We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
Smaller in scale, yeah, but probably heavily biased towards technical content and written by people who (mostly) care about how to write properly. There are at least 37,354,035 items that could be indexed, lots of them high quality; percentage-wise, probably higher quality per post than Twitter, Facebook and other sources.
I think Twitter is not a great place to dig out training data in general. Most of its data is not well structured and/or tagged. Its signal to noise ratio is relatively low. Its texts are generally very short and very dependent on context. Twitter has been largely failing as a short-form video platform. There's some trace of algorithmic generation of interest/topic-based feed on Twitter, but you know, its quality was never great. I guess it's just a hard problem given Twitter's environment.
Its strength is freshness and volume, but I guess these can be achieved without Twitter if you have a strong web crawling infrastructure? Also, the current generation of LLMs is not really capable of exploiting minute-level freshness... at least for now.
Also, Twitter is not where people go to be nice. Twitter incentivizes snarky, disparaging, curt behavior because (a) it limits message length to an extent where nice speech doesn't have a place (b) saying nice things gets you likes while saying not-nice things gets you retweets, and retweets are more highly valued by the algorithm.
Twitter’s data would be very valuable for generating tweet-like content: short, self-contained snippets, images and video.
There’s not a lot of data in Twitter today resembling long-form content: essays, news articles, books, scientific papers, etc. That’s probably why Twitter/X expanded the tweet size limit, to be able to collect such data.
If Twitter were as much a treasure trove of user data as Elon thinks it is, then why is Twitter's ad targeting so much worse than Facebook's and Instagram's?
Twitter has the largest database of tweets. If you want an AI that writes tweets, there’s nothing better. Why? Twitter could offer a service that bypasses the need for a community manager… just feed it a press release, or product website, and it will provide a stream of well-crafted ad tweets… or astroturf, even.
Doesn't really matter what the content is as long as the sequences of tokens make sense. That's the goal: predict the next token given the previous N tokens. Higher level structures like tweets just fall out of that but I wouldn't be too surprised if a model trained only on tweets could also generalize to some other structures.
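To make "predict the next token given the previous N tokens" concrete, here is a toy sketch in Python. It is a simple count-based n-gram model, not how GPT-like models work internally (those learn the prediction with a neural network), and the corpus, whitespace tokenization, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict


def train_ngram(texts, n=2):
    """Count which token follows each context of n previous tokens."""
    counts = defaultdict(Counter)
    for text in texts:
        tokens = text.split()  # naive whitespace tokenization (assumption)
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            counts[context][tokens[i + n]] += 1
    return counts


def predict_next(counts, context):
    """Return the most frequent next token for a context, or None if unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical tweet-like corpus, invented for this example.
corpus = [
    "just shipped a new release",
    "just shipped a new feature",
    "just shipped a new release today",
]
model = train_ngram(corpus, n=2)
print(predict_next(model, ["a", "new"]))
```

The model never sees a notion of "tweet" here, only token sequences; any tweet-like structure in its output comes from the statistics of the training text.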