Free SQL dump with 200 million tweets from 13 million users

sethish · on June 8, 2011

Twitter changed their ToS to explicitly disallow distributing twitter dumps like this: http://chronicle.com/blogs/profhacker/the-end-of-twapperkeep...

I was a part of the webecology project (and 140kit.com, both of which gave large twitter datasets to researchers.

_hfqa · on June 8, 2011

-- Ups, I forgot to scrape the TOS

sigil · on June 9, 2011

I have 500GB of tweets from the random sample stream, in raw json form, going back to August 2010. Aside from some very broad measures it hasn't been all that useful to us. I never considered distributing it until now, but the TOS clause forbidding redistribution wrankles me. Are we really all okay with contributing content under these terms?

tibbon · on June 8, 2011

Additionally, this data probably isn't as useful as many might think. We found that collecting random tweets probably isn't that useful for most research overall, partially because using any of the streaming APIs omit tweets. Even 'full' firehose seems to omit some tweets, so it can't be considered a complete set, nor verified as a completely random set.

_hfqa · on June 8, 2011

-- I disagree.

- You can cluster users based on tweet data, links relationships &/or even user-to-user relationships

- Understand how retweets work and how fast they propagate.

- Sentiment analysis based on a specific keyword.

- Trend analysis.

There are N number of ways this dataset can be helpful. You have 200MM tweets. Enough for a quick experiment using real data.

* Its true that is "random" data. Just unrandom it!

tibbon · on June 8, 2011

User-to-user relationships aren't that great with incomplete data of the tweets, but also of the social graph. Pulling a large social graph from Twitter is nearly impossible and getting deltas on anything more than a few hundred people is equally impossible.

Propagation of retweets really needs a near complete dataset of those tweets/retweets. A steaming sample of the dataset really isn't great for this.

Sentiment analysis can be done to determine the overall feeling on a topic, but I'd feel really incomplete doing it on this dataset. Again, pulling the stream for the term or keyboard you're looking to sample is much better. Most sentiment analysis on Twitter is pretty flawed anyway.

Trend analysis works on this dataset ok, but measuring the true magnitude of an event would be hard (like Osama being killed) since you don't know what portion of the tweets you've actually got.

I worked with Sethish on the Web Ecology Project. I wouldn't call your dataset useless, but it really would be more useful generally to have a question, then pull the best possible data that will help you answer that question. Otherwise there's going to be a lot more unknowns that make it a weaker piece of research.

_hfqa · on June 8, 2011

your points are valid.

I want to clarify that this dump is for learning purposes due the lack of "open data".

If people can play with real data from real people and get "real" inputs, that can encourage curious programmers to join the data-mining party. I know there are other dumps out there, np with that. This is just another dump, it may help people come with ideas without the need of coding a a multi-threaded scrapper.

achompas · on June 9, 2011

This is a gold mine for a budding programmer. Anyone interested in learning MapReduce frameworks or messing with sentiment analysis/classification should get this data.

With that said...the data is now unavailable?

a3_nm · on June 9, 2011

This is ridiculous. Are ToS of this kind really enforceable? (What about outside the US?)

(Anyway, I guess that in practice, if a torrent with a dump of tweets appears somewhere, it's pretty hard to find out who did it. Yes, Twitter could do some clever watermarking of the API results or correlate the dump contents with the server logs, but it would probably be a lot of work.)

jdvolz · on June 8, 2011

Calufa, next time you're in Vegas, send me a message and we'll get a beer. Thank you. You just made something I'm doing vastly more awesome.

stavros · on June 8, 2011

Torrent here, when done: http://burnbit.com/torrent/170493/twitter_sql_bz2

_hfqa · on June 8, 2011

import to mysql:

bunzip2 < my_database.sql.bz2 | mysql -h localhost -u root -p my_database

aonic · on June 8, 2011

Thanks! More interested in the scraper.. is it open-source? If yes, where can we download it? If not, can you write about your experience in building it?

tibbon · on June 8, 2011

Writing a Twitter scraper is pretty trivial and you can find several good examples on Github. I'd put mine online, but the commands I was using in 2009/2010 are changed/deprecated largely and my code wouldn't run.

In either case, as Sethish said, distributing dumps like this is against the new ToS.

_hfqa · on June 8, 2011

I will blog about how I did it in a few days...

meifun · on June 8, 2011

Where do you Blog so I can add to my RSS?

_hfqa · on June 8, 2011

I dont have a blog, sorry. I will open one soon...

Feel free to follow me http://twitter.com/calufa.

kodeshpa · on June 8, 2011

if you are interested in crawling FB, check this out http://www.zubha-labs.com/oauth-trick-for-facebook-desktop-a...

JeeyoungKim · on June 9, 2011

Hey guys, what would be the most sane way to work with this dataset? If it's 173GB, it's probably hard to load it up in a single machine.

ck2 · on June 8, 2011

Hmm, how many days back does it go?

Twitter search still only goes back 10 days in 2011, so how deep is this data?

_hfqa · on June 8, 2011

To be honest I have no idea. It crawled 13MM users, some accounts can be very old with very old tweets... You can look at the CD_data table and look for the tweet html code and parse the timestamp.

ck2 · on June 8, 2011

Apparently Twitter now has 100+ Million tweets per DAY.

So you caught about 2 days worth but randomly in time.

_hfqa · on June 8, 2011

@ck2 --- correct.

laprise · on June 9, 2011

Neat ! here some tips for creating a kick ass graph visualization: http://www.martinlaprise.info/2010/02/15/visualize-your-own-...

nametoremember · on June 11, 2011

Damn, I just saw this. I would have liked to use it. How can Twitter make you take it down when it is all public information anyway?

_hfqa · on June 8, 2011

A EMAIL FROM TWITTER KILLED THE DATASET --- :S

user24 · on June 8, 2011

Can you give more detail? The link is still up... What did they say?

edit reply via twitter: "they asked me to remove the dump due TOS" (http://twitter.com/#!/calufa/status/78556903772393474)

which I guess is what I expected.

But are scrapers subject to TOS?

JeeyoungKim · on June 14, 2011

Does anybody want to share MD5 hash of the file? I'm trying to decompress this file, and I'm keep getting an error.

JeeyoungKim · on June 14, 2011

wait, the torrent link has it. I do have the same md5hash, and yet, it's keep crashing whenever i'm trying to uncompress this shit... wtf is going on.

justadude · on June 19, 2011

Did you figure out how to get this working? I tried 7-zip as well as winrar and both errored out

juiceandjuice · on June 8, 2011

Wow, I just downloaded that whole archive in a minute.

_hfqa · on June 8, 2011

bz2 compression ;) --- 1147480:1 compression ratio

joelthelion · on June 9, 2011

Just shows how much real information is in tweets : not much :)

8maki · on June 14, 2011

Oh it's awesome dump. Are these mainly from US?

chrisjsmith · on June 8, 2011

All that is meaningless chatter between people and information about bathroom habits. Perhaps if we pooled that distributed effort into something constructive, the world would be a better place.

PostOnce · on June 12, 2011

http://twitter.com/#!/id_aa_carmack

"Msbuild seems to limit to 100 files on a cl command line, which introduces noticeable sync losses when parallel building on 24 threads."

It's not all meaningless, you just choose to follow meaningless users.