Twitter changed their ToS to explicitly disallow distributing twitter dumps like...

_hfqa · on June 8, 2011

-- Ups, I forgot to scrape the TOS

sigil · on June 9, 2011

I have 500GB of tweets from the random sample stream, in raw json form, going back to August 2010. Aside from some very broad measures it hasn't been all that useful to us. I never considered distributing it until now, but the TOS clause forbidding redistribution wrankles me. Are we really all okay with contributing content under these terms?

tibbon · on June 8, 2011

Additionally, this data probably isn't as useful as many might think. We found that collecting random tweets probably isn't that useful for most research overall, partially because using any of the streaming APIs omit tweets. Even 'full' firehose seems to omit some tweets, so it can't be considered a complete set, nor verified as a completely random set.

_hfqa · on June 8, 2011

-- I disagree.

- You can cluster users based on tweet data, links relationships &/or even user-to-user relationships

- Understand how retweets work and how fast they propagate.

- Sentiment analysis based on a specific keyword.

- Trend analysis.

There are N number of ways this dataset can be helpful. You have 200MM tweets. Enough for a quick experiment using real data.

* Its true that is "random" data. Just unrandom it!

tibbon · on June 8, 2011

User-to-user relationships aren't that great with incomplete data of the tweets, but also of the social graph. Pulling a large social graph from Twitter is nearly impossible and getting deltas on anything more than a few hundred people is equally impossible.

Propagation of retweets really needs a near complete dataset of those tweets/retweets. A steaming sample of the dataset really isn't great for this.

Sentiment analysis can be done to determine the overall feeling on a topic, but I'd feel really incomplete doing it on this dataset. Again, pulling the stream for the term or keyboard you're looking to sample is much better. Most sentiment analysis on Twitter is pretty flawed anyway.

Trend analysis works on this dataset ok, but measuring the true magnitude of an event would be hard (like Osama being killed) since you don't know what portion of the tweets you've actually got.

I worked with Sethish on the Web Ecology Project. I wouldn't call your dataset useless, but it really would be more useful generally to have a question, then pull the best possible data that will help you answer that question. Otherwise there's going to be a lot more unknowns that make it a weaker piece of research.

_hfqa · on June 8, 2011

your points are valid.

I want to clarify that this dump is for learning purposes due the lack of "open data".

If people can play with real data from real people and get "real" inputs, that can encourage curious programmers to join the data-mining party. I know there are other dumps out there, np with that. This is just another dump, it may help people come with ideas without the need of coding a a multi-threaded scrapper.

achompas · on June 9, 2011

This is a gold mine for a budding programmer. Anyone interested in learning MapReduce frameworks or messing with sentiment analysis/classification should get this data.

With that said...the data is now unavailable?

a3_nm · on June 9, 2011

This is ridiculous. Are ToS of this kind really enforceable? (What about outside the US?)

(Anyway, I guess that in practice, if a torrent with a dump of tweets appears somewhere, it's pretty hard to find out who did it. Yes, Twitter could do some clever watermarking of the API results or correlate the dump contents with the server logs, but it would probably be a lot of work.)