What is data science? (oreilly.com)
55 points by helwr on Aug 22, 2010 | 26 comments



The misnaming of fields of study is so common as to lead to what might be general systems laws.

For example, Frank Harary once suggested the law that any field that had the word "science" in its name was guaranteed thereby not to be a science. He would cite as examples Military Science, Library Science, Political Science, Homemaking Science, Social Science, and Computer Science.

Discuss the generality of this law, and possible reasons for its predictive power.

-- Gerald Weinberg, "An Introduction to General Systems Thinking."


A Google Books link: http://bit.ly/cAF4Xz


Counterexample: Neuroscience. But that's the only one I can find.


Notice how Neuroscience can't be split into two words... This exception seems to support the rule.


Materials science is my counter example.


Materials science isn't any more (or less) science than computer science is. It is applied chemistry.


Hm, what is the scale of the chart at the end (Cassandra Jobs)? My guess is that it went from 0 to 3 jobs...

Not that I dislike Cassandra, just that interesting jobs tend to be rare in my experience.


I disagree with the whole premise that data science is the wave of the future. As soon as you collect a piece of information it is already out of date and is of very limited use. Weather scientists have been collecting data since the beginning of the century and they are still no better at predicting the weather than they were at the beginning of the century. Economists are in the same boat. They have tons of data on markets but they still can't figure out what makes markets tick. What we need is not more data or data-centric thinking. We need generative models that explain how the data is being generated and why it is being generated in a certain way. There is only so much you can squeeze out of raw data by computing numbers from it, and so far I don't think the results have been that impressive.


Weather scientists have been collecting data since the beginning of the century and they are still no better at predicting the weather than they were at the beginning of the century. Economists are in the same boat.

If you're going to disagree with a premise, it might be better to pick counter-examples that are actually, you know, true.


Huh? I'm pretty sure both economists and weather scientists have tons of data, but their theories to explain that data have been stagnating for a while now. Economic crashes and booms still happen and nobody knows why; even with a lot of historical market data, people still don't know what leads to them or how to stop them. When was the last time you got an accurate rain forecast? 40% chance of rain? What does that even mean? It either rains or it doesn't, I want to know exactly which, and nobody will figure out how to do so anytime soon.

While I'm at it I'll throw the human genome at you as well. Big pharma was promising untold breakthroughs once we figured out the stuff people are made from, but the results have been unspectacular: cancer incidence, and whatever other incidence you can think of, has been steady and even increasing. The only times big pharmaceutical companies made any breakthroughs were when they were trying to figure out one thing and accidentally figured out something else, or invented some condition (ADHD, anyone?). The best treatment, even after all the data collection and genome extraction, is still prevention, and your grandma knew that, and she lived during a time when the number 23 counted as big data. So I'm not holding my breath on any promises made with big data as a PowerPoint bullet.

The main problem with all big data sets is that nobody knows how to handle them globally. All algorithms and theories treat the data locally and then try to collate the local pieces to get a global picture. This is fine when your data set's underlying structure is a graph or can be treated as some kind of graph. So the future doesn't belong to data scientists or statisticians but to the people who figure out how to treat large non-graph-like data sets in a meaningful way.


On the contrary, it seems like data-centric models are utterly trouncing ab-initio methods. See http://www.scribd.com/doc/13863110/The-Unreasonable-Effectiv... .


The key difference between economics and weather is that experiments with prediction change the economic system's dynamics (e.g., large-scale automated trading), but not the weather system's dynamics.


     already out of date and is of very limited use
Depends on the type of information. News: yes. Information about biology, geo-information, or encyclopedic content: not so fast. On the other hand, we now have access to real-time data.

     Weather scientists have been collecting data since the beginning of the century and they are still no better at predicting the weather than they were at the beginning of the century.
Predicting the weather is near impossible in theory, forget the practice: it's a chaotic system. In data science we generally try to predict things which we know can be predicted, such as determining whether an article is relevant to a topic. Humans can do that very well, but for machines to do it you need more data.
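
A minimal sketch of that kind of topic-relevance classifier (invented here purely for illustration, using scikit-learn, which isn't mentioned in the article): bag-of-words counts plus naive Bayes, with a toy training set just to show the shape of the problem.

    # Sketch: is an article relevant to a topic? Bag-of-words + naive Bayes.
    # The training texts and labels below are made up for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = [
        "interest rates and central bank policy",       # relevant
        "quarterly earnings beat market expectations",  # relevant
        "the team won the championship last night",     # not relevant
        "new recipe for chocolate cake",                # not relevant
    ]
    train_labels = [1, 1, 0, 0]  # 1 = relevant to the topic, 0 = not

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)

    clf = MultinomialNB()
    clf.fit(X_train, train_labels)

    # Score a new article; with more labelled data the estimates improve.
    X_new = vectorizer.transform(["the central bank raised interest rates"])
    print(clf.predict(X_new))        # predicted label
    print(clf.predict_proba(X_new))  # class probabilities

The point is only that this is a learnable, well-posed prediction task, unlike forecasting a chaotic system; more data directly buys you a better classifier.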

     They have tons of data on markets but they still can't figure out what makes markets tick. 
Again, markets are chaotic and are affected by things like low-probability events. You can predict some patterns using information asymmetry; that's what all those traders at Goldman Sachs and other firms do.

    We need generative models that explain how the data is being generated and why it is being generated in a certain way.
The systems we are interested in have complex models, and generating them from first principles isn't always easy. Even if you do generate them, you need to test them against real-world data. Rather than using the normal hypothesis-experiment cycle, it makes more sense to look for predictable patterns in the data.

    I don't think the results have been that impressive.
That's because you are out of touch with the field. Look at Google's statistical translation results, for example: they beat every generative model around. Or the Netflix Prize for recommendation systems.

As someone interested in data science who has been quite involved with it for a few years as a student and an intern, let me try to explain why data science is now becoming an important area:

1. We now have a lot of data in a machine-readable format:

Unlike a few years ago, huge datasets are now publicly available, from topic-specific ones such as GeoNames and LinkedGeoData to much broader encyclopedic ones such as the Wikipedia data dump, Freebase, OpenCyc, and DBpedia.

We also have a huge amount of user-generated data. E.g. I have on my computer right now a huge chunk of Twitter's follower network consisting of 35 million users (I am writing an open-source whom-to-follow system). Additionally I also have 100 million tweets.

2. Not just data, we now have access to real-time data:

You can access the public Twitter timeline using their streaming API, and there are quite a few PubSubHubbub systems out there which combine information from disparate sources and provide you with a unified feed.

3. Moreover, we have tools to handle the deluge (well, sort of):

Thanks to Google, Apache, Yahoo, and Facebook we now have Hadoop, MapReduce, Pig, and other tools which make the job of parallelizing the processing of the data easier (a small word-count sketch in that style appears at the end of this comment).

4. We have scalable, on-demand infrastructure in place:

Using AWS we can buy processing power as needed. I recently rented a high-memory instance with 17GB of RAM for 50 cents an hour for 10 hours to run some jobs; that would have been impossible a few years ago. We can also deploy web apps very easily using Google App Engine without paying a penny, which lets us create nice interfaces for visualizing and querying the data at a low initial cost.

Finally, we are slowly building infrastructure to sell datasets or custom apps, e.g. Amazon DevPay or Infochimps.

The name "data science" can be misleading, though. If you are a student, it means taking courses in Machine Learning, Data Mining, Information Retrieval, Statistics, Distributed Computing, and Databases.
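
As promised in point 3, here is a minimal word-count job in the Hadoop Streaming style: the mapper and reducer are plain scripts that read stdin and write tab-separated key/value pairs. This is my own sketch; the file names and the cluster invocation are illustrative, not from the article.

    # mapper.py -- emit one "word<TAB>1" pair per word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

    # reducer.py -- sum counts per word; Hadoop Streaming sorts the mapper
    # output by key, so identical words arrive consecutively
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

You can test the same pipeline locally with "cat input.txt | python mapper.py | sort | python reducer.py"; on a cluster the two scripts are handed to the Hadoop Streaming jar, which runs the mapper in parallel over the input splits.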


Predicting the weather is near impossible in theory, forget the practice.

Are you really saying that everyone from pre-historic hunter-gatherer societies tracking the seasons to modern meteorologists with their supercomputers and satellites has been engaging in something theoretically and practically impossible? That would surely be staggering news to everyone involved.


The details of which days are going to be sunny a month out, which will be rainy, and when storms will arrive are chaotic and impossible, even in theory, to predict in detail more than a few weeks away. Practice is worse: there we manage no more than a few days, and this has been true for several decades.

That said, there are larger trends that can be predicted. For instance the seasons, which come from astronomical facts. Or the several-year El Niño/La Niña oscillation. Not to mention the relatively slow-moving Rossby waves in the jet stream (one of which is bringing hot weather to Russia and monsoons to Pakistan right now). These give useful information about what is likely to happen, and keep happening, over periods of weeks, months, and to some extent years.

But none of this brings us any closer to being able to give an exact weather forecast for a day 6 months in the future. That goal is impossible, and has been known to be impossible for several decades. Furthermore, I assure you that this fact is well known to every competent modern meteorologist. (The word "competent" does not necessarily cover people chosen primarily for their appearance to deliver the weather report for local TV stations.)


According to this ECMWF planning document (page 14, figure 4):

http://www.ecmwf.int/about/programmatic/strategy/strategy.pd...

the forecast skill for ECMWF and NOAA has been improving pretty steadily over the last 15 years. Basically, we're seeing two days farther into the future now than 20 years ago.

I agree that chaotic dynamics and various noise sources limit the time horizon for weather predictions to perhaps 2 weeks.


I know what 'chaotic' means, and where the seasons come from. The statement I was taking issue with is still complete nonsense. We can predict the weather, both in theory and in practice. He was saying that we can't, and this is obviously untrue.


http://en.wikipedia.org/wiki/Chaos_theory

     To his surprise the weather that the machine began to predict was completely different from the weather calculated before. Lorenz tracked this down to the computer printout. The computer worked with 6-digit precision, but the printout rounded variables off to a 3-digit number, so a value like 0.506127 was printed as 0.506. This difference is tiny and the consensus at the time would have been that it should have had practically no effect. However Lorenz had discovered that small changes in initial conditions produced large changes in the long-term outcome.[43] Lorenz's discovery, which gave its name to Lorenz attractors, showed that even detailed atmospheric modelling cannot in general make long-term weather predictions. Weather is usually predictable only about a week ahead.[25]
Note this is different from predicting global warming. Global warming is a long-term trend prediction, not a what-the-temperature-will-be-on-a-certain-day-at-a-certain-place kind of prediction.


You should also read the wikipedia page on 'Weather Forecasting' while you're there. Or consider if you can predict whether the upcoming December in the Northern hemisphere might be colder than this last July.


Read before you type. From my comment above:

     Weather is usually predictable only about a week ahead.
and about

     December in the Northern hemisphere might be colder than this last.
That's not exactly a prediction; it is a seasonal variation due to the tilt of the Earth's axis relative to the Sun. That's the same as saying some product will have higher sales before Christmas, or that there will be more traffic on the roads before Thanksgiving.


Well, I think you said two things.

(1) "Predicting Weather is near impossible in theory forget the practice."

and

(2) Quoting wikipedia, "Weather is usually predictable only about a week ahead."

These two remarks are not quite opposites, but almost.


You are getting confused by the semantics of the word "prediction".

The weather is, in theory, a chaotic system, so there are no simple laws such as F = ma to predict it.

Time frames of weeks, seconds, and years are meaningless. You can always argue that you can predict the weather for the next milli-, micro-, or femtosecond, so why stop at weeks? But that does not prove that chaos theory is wrong, or that there is some generative law which can be used to predict the weather without relying on data.

Also, from a utilitarian perspective, a long-term accurate weather forecast would be much more useful than just a week-long range (assuming even that is accurate right now). But you don't find an accurate prediction for the monsoon in January, or even of, say, a hurricane a fortnight beforehand.


I can barely understand what you're saying here, so I won't comment beyond this reply.

The time scale is critical. If you look at any review of numerical weather prediction accuracy, such as the one I linked to above, you will see it's a key parameter. It's why 3-day forecasts are excellent and 7-day forecasts are not very good. Good 3-day predictability does not imply good 7-day predictability.

This is no different than the pictures of the state trajectories of the Lorenz system (http://en.wikipedia.org/wiki/Lorenz_attractor) starting from two nearby states, which stay together for a time, and then diverge suddenly.
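
A short numerical illustration of that divergence (my own sketch, not taken from the thread or the ECMWF document): integrate the Lorenz system from two states that differ by one part in a million and watch the separation grow until it saturates at the size of the attractor.

    # Two Lorenz-system trajectories starting 1e-6 apart in z.
    # They track each other for a while, then diverge -- the sensitivity
    # to initial conditions described above. Classic parameters
    # sigma=10, rho=28, beta=8/3; fixed-step RK4, arbitrary step size.
    import math

    def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        x, y, z = state
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def rk4_step(state, dt):
        def shift(s, k, scale):
            return tuple(si + scale * ki for si, ki in zip(s, k))
        k1 = lorenz(state)
        k2 = lorenz(shift(state, k1, dt / 2))
        k3 = lorenz(shift(state, k2, dt / 2))
        k4 = lorenz(shift(state, k3, dt))
        return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                     for s, a, b, c, d in zip(state, k1, k2, k3, k4))

    dt = 0.01
    a = (1.0, 1.0, 1.0)
    b = (1.0, 1.0, 1.0 + 1e-6)   # tiny perturbation

    for step in range(1, 4001):
        a, b = rk4_step(a, dt), rk4_step(b, dt)
        if step % 500 == 0:
            sep = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
            print(f"t = {step * dt:5.1f}  separation = {sep:.6f}")

The printed separation stays tiny at first and then blows up, which is the weather-forecasting situation in miniature: short horizons are predictable, long ones are not.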

It's true, the relevant laws are the Navier-Stokes differential equations, not simple laws like F = m a. But, we can observe the boundary conditions and propagate the system state forward in time. These equations do constrain the future dynamics.


You can always argue that you can predict the weather for the next milli-, micro-, or femtosecond, so why stop at weeks?

The reason that I've heard for stopping at weeks is that that is the time frame for perturbations to work their way up from the quantum scale to the macro scale. We cannot, even in principle, measure what is happening everywhere on the quantum scale.

Of course as soon as you merge quantum mechanics and chaos theory, life gets very, very weird. See http://www.iqc.ca/publications/tutorials/chaos.pdf for more.


Sometimes these smaller perturbations are called "sub-grid phenomena". I think practitioners think of them as, say, pressure waves or flows which average to zero when observed on a 10km x 10km grid. But it is possible that their ultimate origin is on a much finer grid ;-)

Another problem, separate from these effects, is getting closure on the variables in the model. Things like evaporation from soils and vegetation, for example, which ECMWF is trying to include in their models. You put them in the weather model to improve accuracy, and all of a sudden you need a time-dependent model for soil and vegetation water content, and also sensors to supply the boundary conditions for your model (e.g., to estimate vegetation type).


so there are no simple laws such as F = ma

There are plenty of simple mechanical systems governed by mechanical laws that exhibit chaotic behaviour. You said a silly thing about weather prediction; it happens. There's really no need to dig yourself into some deeper hole of gibberish.



