
User-to-user relationship analysis doesn't work well with incomplete data, and here both the tweets and the social graph are incomplete. Pulling a large social graph from Twitter is nearly impossible, and getting deltas on anything more than a few hundred people is just as hard.

Propagation of retweets really needs a near-complete dataset of those tweets/retweets. A streaming sample of the dataset really isn't great for this.

Sentiment analysis can be done to determine the overall feeling on a topic, but the results would feel really incomplete on this dataset. Again, pulling the stream for the term or keyword you're looking to sample is much better. Most sentiment analysis on Twitter is pretty flawed anyway.

Trend analysis works OK on this dataset, but measuring the true magnitude of an event (like Osama being killed) would be hard, since you don't know what portion of the tweets you've actually got.
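To make the magnitude point concrete, here's a rough sketch in Python. The function name and all the numbers are made up for illustration and are not taken from the dump; it also assumes the sample is uniformly random, which may not hold. The same observed count implies very different totals depending on the sampling fraction you assume:

```python
def estimate_total(observed_count, sampling_fraction):
    """Scale an observed count up to an estimated total.

    Assumes uniform random sampling, which is an assumption, not a
    known property of the dataset discussed here.
    """
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    return observed_count / sampling_fraction


# Hypothetical: 5,000 tweets about an event were seen in the sample.
observed = 5000
for fraction in (0.01, 0.05, 0.10):
    total = estimate_total(observed, fraction)
    print(f"assumed {fraction:.0%} sample -> ~{total:,.0f} tweets")
```

Without knowing the real fraction, any of those totals is equally defensible, which is why the magnitude is hard to pin down.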

I worked with Sethish on the Web Ecology Project. I wouldn't call your dataset useless, but it would generally be more useful to start with a question, then pull the best possible data to help you answer it. Otherwise there are going to be a lot more unknowns that make it a weaker piece of research.




Your points are valid.

I want to clarify that this dump is for learning purposes, due to the lack of "open data".

If people can play with real data from real people and get "real" inputs, that can encourage curious programmers to join the data-mining party. I know there are other dumps out there, no problem with that. This is just another dump; it may help people come up with ideas without needing to code a multi-threaded scraper.



