50 years of Data Science [pdf] (dropboxusercontent.com)
115 points by revorad on Oct 22, 2015 | 11 comments



A great read.

Thesis of the article: "Insightful statisticians have for at least 50 years been laying the groundwork for constructing that would-be entity as an enlargement of traditional academic statistics. This would-be notion of Data Science is not the same as the Data Science being touted today, although there is significant overlap. The would-be notion responds to a different set of urgent trends - intellectual rather than commercial. Facing the intellectual trends needs many of the same skills as facing the commercial ones and seems just as likely to match future student training demand and future research funding trends"

"Data Science" is currently a field where corporations are winning at defining what activity universities should support. This article is a nice pull back in the other direction, focusing on the intellectual rather than commercial elements.


Not sure if I completely agree. The problem, as I see it, is that corporations are one of the most important data collectors out there, and sometimes the most accessible. Personally, I am doing research with data from supermarket chains, credit card companies and telecommunication operators. However, I always strive to make the paper mainly about an intellectual challenge (central place theory, modeling inclusive economic growth, studying effects of the dynamics of human mobility, ...) and just use the commercial application as a side note. So, even if the commercial intent is there and the corporation's interest is visible in the paper, the attempt is to use it just as a "price to pay" for increasing a "more general" knowledge.

Unfortunately, if I look at the (sometimes very commendable!) efforts of public data release from, e.g., governments (data.gov, data.gov.uk, etc.) there are always some shortfalls. They are either too flat -- a classic census table, which is of limited interest because you can deal with it without the need of innovative data analysis techniques -- or just not large or deep enough to get a big picture from them. For instance, in human mobility, a government will provide you with a mobility survey, a sample of self-reported trips recalled from memory for only one target day. But a phone call record can give you all people, for a long period of time, without relying on people's faulty memory of their movements. If you want to understand, e.g., how disease spreads in a city, or how people are cut off by public transportation and infrastructure shortcomings, the survey results are necessarily going to be worse than the ones you'd get with call metadata.


This is a very nice read. Updating my worldviews ... I had no idea that John Tukey (co-inventor of the Cooley-Tukey fast Fourier transform algorithm) was such a large figure in the history of data analysis. In fact, his paper on exploratory data analysis has more citations than his paper on the FFT according to Google Scholar. I was also surprised to see that he was a co-author on the projection pursuit algorithm. Wow!


According to a recent report by Glassdoor, Data Scientist is ranked #1 among the "25 Best Jobs For Work-Life Balance" [1].

[1] http://www.glassdoor.com/blog/25-jobs-worklife-balance-2015/


I saw David Donoho give this talk live in September at Princeton's Tukey Centennial conference -- fantastic, and well worth a read. IIRC, it gives a good history of data analysis, how to think about the different definitions of and roles for data science, and an introduction to Tukey's work.

For more on the history of data science, here are references from a similar talk by Chris Wiggins: http://bitly.com/icerm


Great summary, and much needed at this time to make sense of a number of trends. A minor nit: I think it would be better if he didn't overload the long-ago-claimed term "Data Modeling" and specifically called it Generative and Predictive Modeling.


This is a very good read, but I have to say that I managed to learn all of this 12 years ago before the term "Data Science" existed. It was quite easy as I was a statistics, comp sci and applied math "triple major."

I don't remember ever hearing the terms "Data Science" or "Big Data," but I do recall taking Department of Statistics courses with titles like "Data Mining," "Statistical & Machine Learning," and "Statistical Computing." We even sort of worked with what is now called "Big Data," by learning how to run large calculations in parallel using R's cluster computing packages.

As fancy and interesting as those courses were, I would only have a superficial understanding had I not also been exposed to more foundational/theoretical courses/topics like "Probability," "Measure Theory," "Mathematical Statistics," "Linear Regression," "Time Series Analysis," "Applied Stochastic Processes," etc.

When it comes to real-life practical implementation of all these ideas, it is necessary to have a pretty strong background in computer science. It's not enough to be able to do a couple of runs in R or Hadoop. What is really called for here is at least an undergraduate level of knowledge in all the traditional areas of computer science, like programming languages, databases, computer systems, algorithms & data structures, etc.

Finally, the fourth ingredient is experience. The only way to really learn data analysis is through practice. It takes hundreds of hours of staring at data and code, struggling hard to find the relevant patterns in your data and improve the predictive performance of models. I guess this is the main advantage of pursuing a modern "Data Science" graduate degree: presumably you will spend a lot of time practicing data analysis on "real" data.

What I think irks people about the Data Science trend is that there seem to be a lot of people out there saying that you don't really need to be educated in mathematics and statistics. It's like saying that a mechanical engineer just has to know how to use 3D modeling computer software and doesn't need to know physics or mathematics ... that would lead to disastrous outcomes.


>It's like saying that a mechanical engineer just has to know how to use 3D modeling computer software and doesn't need to know physics or mathematics ... that would lead to disastrous outcomes.

It's also like saying a Software Engineer just needs to know how to "code" and doesn't need to know math or physics. The cornerstone of any Engineering profession is deep knowledge of underlying science and mathematics and their applications to your specific discipline. A paper similar to this offering commentary on the current trend of conflating programming with Software Engineering is sorely needed.


I think you've nailed it precisely. What is irksome (scary?) isn't the sudden upsurge in interest in data analysis (great! let's go for science!), but the complete lack of acknowledgement of the importance of statistical theory and knowledge. As the paper mentions, almost all of the new "Data Science" academic programs being created have very little communication with, let alone integration with, real statisticians. Garbage in, garbage out, no matter how much data or how fancy a program.


I submit that computer science departments have enough expertise to operate a graduate data science program. Maybe not every CS department, but enough. It is interesting that the actual stats departments do not encompass all statistical knowledge and expertise. There are more specialized experts in econ, finance, and many fields of social science, not to speak of all the engineering fields, computer science, and applied math. Data science predates modern statistics; Laplace used least squares to ignore "the God hypothesis"; the function-approximation view taken in many engineering fields is just as powerful as the statistical approach.


Now it's suddenly 50 years? When I started reading HN in 2010 (or '09) Data Science didn't exist yet. We had statistics and IT back then. I saw your birth, dude! All the fluids and the screaming.



