Official Google Research Blog: The Unreasonable Effectiveness of Data

presty · on March 26, 2009

This line of thought follows previous posts on the same subject.

It's very Norvig-esque (http://www.youtube.com/watch?v=LNjJTgXujno), but there's also http://anand.typepad.com/datawocky/2008/03/more-data-usual.h... and Chris Anderson and Wired's flame bait http://www.wired.com/science/discoveries/magazine/16-07/pb_t... (that month's Wired was dedicated to this subject).

And as someone in the previous discussions said, this is the basis of the scientific method, not its death.

bd · on March 26, 2009

Golden quote from Norvig's talk at Startup School:

Q: What's your opinion about semantic web?

A: Semantic web. Future of the web. And it always will be.

Also:

If I assigned engineers to (semantic web) formats based on the percentage of pages that had those formats, then the correct number of engineers for semantic web was zero.

ntoshev · on March 26, 2009

I would also add the "Theorizing from data" talk from Norvig:

http://www.youtube.com/watch?v=nU8DcBF-qo4

andreyf · on March 26, 2009

Norvig had a great response to that wired article: http://norvig.com/fact-check.html

Anon84 · on March 26, 2009

Unfortunately, this only shows one side of the equation. Namely, the internet behemoth side.

If you're Google, Yahoo! or one of their friends, you can get away with relying just on correlations extracted directly from data. After all, you have all the data you could possibly want, and if you don't have it, you can easily measure it in a straightforward way.

Everybody else, however, has to do a much better job of developing the right algorithms and insights to get the upper hand. The best way to do this, of course, is to use whatever data you manage to scrape together.

Luckily, they also seem to recognize that sometimes data just isn't enough and ask for help. You've seen this in the Netflix prize, the AOL search log debacle and more recently in Microsoft's release of search logs for WSCD09.

Retric · on March 26, 2009

The Netflix prize is a contest to discover how much you can do with pure data. I don't see how you can place it on the semantic web side of things when they don't do any tagging etc.

Anon84 · on March 26, 2009

I said nothing about the semantic web.

I just said that sometimes, all the data in the world isn't enough if you don't have the right algorithms or insights.

mikepellon · on March 26, 2009

I think an interesting related avenue of research would be analytically investigating the "emotional" content of the Internet. Jonathan Harris over at http://www.number27.org has made some great strides in this area, looking at blog posts and global news content (see http://www.wefeelfine.org and http://www.tenbyten.org). While Harris has some very impressive visualizations of massive amounts of data, I believe we are at the point where we can move beyond just looking at massive collections of data and begin to say something mathematically about the patterns and characteristics that emerge from those sources. With the advent of cheap cloud computing such as Amazon EC2, such detailed and massive undertakings are now possible for ordinary developers.

jacoblyles · on March 26, 2009

"Let large quantities of data solve your problems" might not be the best advice if you are hardware constrained. Not everyone has the petabytes of storage and the terabytes of RAM that Google has.

I guess cutting-edge natural language apps are going to be the playground of the big boys until PCs reach the scale necessary to do experiments.

ntoshev · on March 26, 2009

Practical natural language apps might require much less, though: see Norvig's spell checker, for example. You can probably fit the Google index from 1998 on a single modern machine.
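
For anyone who hasn't seen it, here is a rough sketch of the kind of frequency-based corrector that spell checker is known for. It is not Norvig's actual code. The tiny `CORPUS` string and the names `edits1`, `known` and `correct` are assumptions made up for illustration, and this version only looks one edit away. The point is just that word counts plus a simple candidate generator go a long way on modest hardware:

```python
import re
from collections import Counter

# Toy corpus: an assumption for illustration only. A real corrector
# would count words over a much larger body of text.
CORPUS = (
    "the quick brown fox jumps over the lazy dog "
    "the dog barks and the fox runs away quickly"
)

# Word frequencies observed in the corpus.
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the candidates that actually occur in the corpus."""
    return {w for w in words if w in WORDS}


def correct(word):
    """Pick the most frequent known candidate, else return the word unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])


if __name__ == "__main__":
    for typo in ["teh", "dgo", "fox", "xyzzy"]:
        print(typo, "->", correct(typo))
```

Swapping the toy string for word counts from a larger text file is the only change needed to make it more useful; the quality then depends almost entirely on how much data you feed it, which is rather the point of the article.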

jorgem · on March 26, 2009

Does anyone know of an API to access that trillion-word Google corpus mentioned in the article?

snprbob86 · on March 26, 2009

No API, but you can buy it on 6 DVDs for $150:

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-ar...

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...

nl · on March 26, 2009

Not big fans of the Semantic Web, then...

mikedouglas · on March 26, 2009

I wouldn't be so sure. It's just that they're proposing a very different method (statistical analysis) to uncover the inherent meaning in the text.

tokenadult · on March 26, 2009

I thought one of their main points in the paper was that there are always going to be more data sets in "natural" form than in conveniently marked-up form, so a researcher has to develop tools that cope with natural data as they are. Then the next new-and-improved scheme for semantic mark-up can learn from what is observed in vast data sets.

dschobel · on March 26, 2009

I remember watching one of Norvig's Tech Talks where he was asked specifically about the semantic web, and his response was: "the semantic web is the future of the web... and always will be."