The Open Source Data Science Masters

neilsharma · on Aug 19, 2016

Good collection, but as someone who has been slowly learning data science over the past few years, I think it needs far fewer lectures and waaaay more projects.

The biggest difficulty I have with learning data science is not how the algorithm or tools work, but the problem setup. Where is the data? How do I clean it? What insights can I draw from this? Which algorithms to use? What can I do with the algorithm assuming it works?

Most MOOC projects decide all this for you by giving you a set of tasks to do in order and skeleton code to work off of. Your job is simply to implement a small part of whatever algorithm you learned that week and press run. This approach leaves out the creative development, exploration, trial and error, and critical thinking you need out in the real world.

Also, I think there should be more emphasis on publishing, even if your attempts are inaccurate. Push out a Jupyter notebook to GitHub of how you tested out a rudimentary Monte Carlo simulation on stock data. Or write a blog post with your attempt at determining how much Silicon Valley home prices will drop if 10K more family units magically existed in SF. Or try to code a random forest algorithm from scratch in a language of your choice. You don't have to be right, but publishing forces you to at least take a critical look at your work and think about the material deeply. MOOCs, at least from my experience, just encourage you to move on to the next topic the moment your code works, without diving too deeply.
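For a sense of how small such a first attempt can be, here is a minimal sketch of a rudimentary Monte Carlo simulation of a stock price. Everything specific in it is an assumption for illustration: the geometric Brownian motion model, the hypothetical function name `simulate_final_prices`, the 7% drift, the 20% volatility, and the 252-day trading year. It uses no real stock data; a published notebook would estimate drift and volatility from actual prices.

```python
import math
import random
import statistics


def simulate_final_prices(start_price, annual_drift, annual_vol, days, n_paths, seed=0):
    """Simulate end-of-horizon prices under geometric Brownian motion.

    All parameters are illustrative assumptions, not fitted to any real stock.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    dt = 1.0 / 252  # one trading day, assuming 252 trading days per year
    step_drift = (annual_drift - 0.5 * annual_vol ** 2) * dt
    step_vol = annual_vol * math.sqrt(dt)

    finals = []
    for _ in range(n_paths):
        log_price = math.log(start_price)
        for _ in range(days):
            # Each day adds a normally distributed log-return.
            log_price += step_drift + step_vol * rng.gauss(0.0, 1.0)
        finals.append(math.exp(log_price))
    return finals


# Made-up inputs: a $100 stock, 7% drift, 20% volatility, one year ahead.
finals = simulate_final_prices(100.0, 0.07, 0.20, days=252, n_paths=2000)
ordered = sorted(finals)
print("mean final price:", round(statistics.mean(finals), 2))
print("5th percentile:  ", round(ordered[int(0.05 * len(ordered))], 2))
print("95th percentile: ", round(ordered[int(0.95 * len(ordered))], 2))
```

The interesting part of publishing something like this is defending the assumptions, for example whether daily returns are really normal, which is exactly the kind of critical look described above.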

shanusmagnus · on Aug 19, 2016

Agree. I'll add that while it's valuable to compile a giant list of topic areas, and references and resources for those topic areas, working through all the stuff on this list would take years to do in a non-trivial way.

What I really prize are curricula that err in the other direction; something like: here are the handful of foundational topic areas you _have_ to know about, and the pieces of those areas that will give you the absolute minimum. Taken collectively, these let you make a beginning and engage with other, more advanced resources as the need arises and as the sophistication of your projects requires.

It's hard (at least for me) to know which subset of knowledge is required to make a beginning. That's where I need help.

neilsharma · on Aug 19, 2016

That's true -- there's definitely a bit of knowledge needed to get started, but most of that can probably be taught in 2-3 classes. I think the first problem in learning data science is coming up with a "Foundations" curriculum:

- Learn a relevant programming language (R, Python) + tools (IPython, Anaconda, etc.)

- Basic linear algebra (nothing more complex than multiplying matrices) and calculus (what are derivatives and integrals)

- Intro to statistics (just to know the vocabulary -- covariance, correlation, standard error, etc.)

- Rough overview of Machine Learning / AI as a whole and where it's used in the world today
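As a rough illustration of how little the statistics vocabulary in that list demands at first, here is a minimal sketch that computes sample covariance, correlation, and the standard error of a mean by hand. The two small data lists and the helper names are made up for the example; only the Python standard library is used.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )


def standard_error(xs):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return math.sqrt(sample_covariance(xs, xs) / len(xs))


# Made-up data, e.g. hours studied vs. quiz score.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print("covariance:    ", sample_covariance(hours, score))
print("correlation:   ", round(correlation(hours, score), 4))
print("standard error:", round(standard_error(score), 4))
```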

After that comes the second problem: "What do I do next?"

- coming up with several interesting but manageably small projects

- getting data for these projects

- access to quality advanced resources that can be consumed as needed while working on the project. A 3-month long MOOC on Neural Networks is an impractical resource. A well-written blog post (with code) or YouTube video is far better.

MOOCs + textbooks seem to do better for the first problem if you can sift through all the noise, but fail at the second.

randcraw · on Aug 19, 2016

That's a nice overview of autodidact resources for DS.

But I suggest that you tweak the name a little, like "The Open DS Masters Program" or "Toward OSDS Mastery".

"OSDS Masters" sounds like a plural noun, like you're trying to say, "at this website you can find the great open source masters of data science" -- like Richard Stallman or the authors of Weka. It's a bit confusing.

cholantesh · on Aug 19, 2016

I presumed the 'Masters' was akin to the usage in "master's degree".

j_s · on Aug 19, 2016

If so then they should add the word 'Degree' everywhere.

dragonwriter · on Aug 19, 2016

Or, just eliminate the subtitle and merge the one important word in the subtitle that isn't in the title, and call it "The Open Source Data Science Master's Curriculum", which has the advantage that while it invokes the idea of a curriculum at the level of a Master's Degree, it doesn't falsely present itself as offering an actual degree, which it is not.

cholantesh · on Aug 19, 2016

I don't think that's the intent, but I can see how it would give such an impression. That alone should be reason to change the title, though.

jmde · on Aug 19, 2016

This seems like a nice compilation for introductory material in one place.

I still can't get over the term "data science", though. Not only is it ridiculously meaningless - what sort of science doesn't involve data, and how often would data be useful to something that isn't scientific at some level - its meaninglessness derives from the hyped buzzword trendiness that drove its upswing.

I say this as someone whose expertise is really sitting at the nexus of what would be considered data science. I feel as if I have been doing what might be considered data science for a long time, before there was a label for it, but watching its ascendance in demand and popularity has been troubling. I should be happy, but I feel like it's being driven by fashion rather than fundamentals, which makes me worried about the trajectory going forward, and disturbed by some communities being thrown under the bus.

dragonwriter · on Aug 19, 2016

> I still can't get over the term "data science", though. Not only is it ridiculously meaningless - what sort of science doesn't involve data, and how often would data be useful to something that isn't scientific at some level

All (empirical) science involves data, but not all of the work of science is the domain-neutral skill of analyzing data. I think "data science" is a bit of a misnomer -- or at least, uses an older and less specific definition of "science" than is now typical ("Data in science" would be more accurate under the narrower definition of science, and "Data analytics" probably more direct and clear) -- but it's not *that* bad (it's no worse than, e.g., "computer science").

westurner · on Aug 21, 2016

>I still can't get over the term "data science", though. Not only is it ridiculously meaningless - what sort of science doesn't involve data, and how often would data be useful to something that isn't scientific at some level - its meaninglessness derives from the hyped buzzword trendiness that drove its upswing.

I couldn't disagree more.

There are a number of terms for domain-independent data analysis:

- data analysis

- statistics

- statistical modeling

- machine learning

- big data

- data journalism

- data science

I think it makes perfect sense that the practice of collecting and analyzing data be qualified and identified as a specific field.

I know of no better resource than these Venn diagrams, which identify the 'danger zones' around data science:

- http://datascienceassn.org/content/fourth-bubble-data-scienc...

Is there such a thing as a statistical model which only applies to a certain domain?

Domain knowledge ("substantive expertise"/"social sciences" in the linked Venn diagrams) serves only to logically validate statistical models which may be statistically valid but otherwise illogical, in the context of currently-available field knowledge (bias).

Regardless of field, the math is the same.

Regardless of field, the model either fits or it doesn't.

Regardless of field, the controls were either sufficient or they weren't.

danso · on Aug 19, 2016

As a non-data-science practitioner, I think the term works. "Data science", from what I gather, focuses on the work of collecting, collating, maintaining, and analyzing data. All science may rely on data but not all scientists work with data well.

In contrast, I think the term "data journalism" is poor, because it isn't (typically) about the journalism of data, e.g. what's going on with the use of data. And so to talk about data journalism being a field (never mind a niche field) makes it seem as if other kinds of journalism don't use data. Even the reporters who rely on 3 anecdotes/interviews to make a story are using data; they're just using a very poor form of it (what with data being the plural of anecdote).

I think David Leonhardt, the former editor of the NYT Upshot, said it well:

http://www.nytimes.com/2015/06/20/upshot/death-to-data-journ...

> Data journalism, ultimately, has the same aim as “quote journalism” and “anecdote journalism.” They all aspire to be “fact journalism” or, more eloquently, journalism.

ende · on Aug 19, 2016

>I still can't get over the term "data science", though. Not only is it ridiculously meaningless - what sort of science doesn't involve data, and how often would data be useful to something that isn't scientific at some level - its meaninglessness derives from the hyped buzzword trendiness that drove its upswing.

Out of curiosity, how do you feel about the term 'computer science'?

jmde · on Aug 19, 2016

That's an interesting question - I agree it's an interesting parallel and one I hadn't thought of before.

I have always been puzzled by the term "computer science" a bit also, because so much of it isn't really science per se (more math or theory along with engineering). When I've thought about it, I usually come to some peace with it because there is a scientific aspect to the field via the hardware side of things, which is really the foundation, at least historically, and there is a historical emphasis on demonstrating results empirically. It's sort of a crude awkward label but I accept it. But then again I went to a school where/when comp sci and EE were the same department.

"Data science" has bothered me more, though, because it's so vague, "data" and "science" are so inextricably defined relative to one another, and because it's arguably misleading - it's not really the science of data, whatever that means, and to the extent it's science, it's just science, but it's not, it's really just statistics.

More appropriate terms to me would be "computational statistics" or "statistical computing", "informatics", or "quantitative computation" or something. Anything but "data science." It's like some stereotypically ignorant but buzzword-compliant management committee, being unfamiliar with data or science, somewhere commanded HR to "find us some of those... you know... data science people!"

... and now venerable universities have whole departments with that title.

cwyers · on Aug 19, 2016

> it's not really the science of data

How isn't it?

tchalla · on Aug 19, 2016

> what sort of science doesn't involve data

There are many like theoretical computer science which do not involve data.

> I say this as someone whose expertise is really sitting at the nexus of what would be considered data science. I feel as if I have been doing what might be considered data science for a long time, before there was a label for it, but watching its ascendance in demand and popularity has been troubling. I should be happy, but I feel like it's being driven by fashion rather than fundamentals, which makes me worried about the trajectory going forward, and disturbed by some communities being thrown under the bus.

There will be a time when things consolidate. When that happens, people who really do data science will stick around, while people who just have it as a title for the sake of it will face problems.

cholantesh · on Aug 19, 2016

>> what sort of science doesn't involve data

>There are many like theoretical computer science which do not involve data.

That computer science is a 'science' is also pretty contentious! :)

ende · on Aug 19, 2016

Much of computer science is a science. The contention seems to come from the fact that software engineers tend to come from computer science departments. Maybe more universities should create separate software engineering departments?

cholantesh · on Aug 19, 2016

I don't feel qualified enough to discuss whether or not CS is truly a science, but I do think there's a strong distinction to be made between it and soft eng, for sure.

dragonwriter · on Aug 19, 2016

> Much of computer science is a science.

AFAICT, it's mostly a subdomain of math, not science in the empirical sense.

tchalla · on Aug 20, 2016

Reminds me of this - https://xkcd.com/435/

tchalla · on Aug 19, 2016

Math? A lot of mathematics doesn't deal with data.

Here's one on whether CS is a science:

https://www.cs.mtu.edu/~john/jenning.pdf

behnamoh · on Aug 19, 2016

Welcome to the human world! That's just how things work in this world (esp. the industry). We're not logical (or rational) all the time to pick all the "right" words, and while I agree with you on the ambiguity of the term, I should also mention that "new" words are indeed required to describe new things. The way I see it, to name something new, you have three options:

1) Borrow a word from a foreign language; 2) Coin (invent) a new word yourself (e.g. "foo = bar"); 3) Give new meaning to old words.

DS is made using the 3rd method. It's vague, it's ambiguous, and it's just not "correct"! But that's exactly why people will "remember" it, as a puzzle and an anomaly. That's how the word sticks in the mind.

randcraw · on Aug 19, 2016

I can see both sides. DS is an awful lot like good-old-fashioned statistics, especially in describing the shape, patterns, and significance of events. But the rise of vast amounts of raw data of diverse kinds and origins, especially deeply contextual data like English text -- this is new, and I think it warrants a more meaningful label for the formal study and the practice of such analysis.

I also have no problem with the use of "science", since DS is one of the purest applications of the scientific method I know. You observe, you hypothesize, you explore and test, you use statistics to draw or reject a conclusion. Of course, it's almost impossible to eliminate all the confounding factors, but that's part of the fun...
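To make the "hypothesize, test, draw or reject a conclusion" loop concrete, here is a minimal sketch of one way that last step can look in code: a two-sided permutation test on the difference between two group means. The technique, the hypothetical function name `permutation_p_value`, and the made-up measurements are all assumptions chosen for illustration, not something the thread prescribes.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate how often a random relabeling of the pooled data produces a
    mean difference at least as large as the one actually observed."""
    rng = random.Random(seed)  # seeded so the result is reproducible

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels carry no information
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements for two hypothetical groups.
control = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]
print("estimated p-value:", permutation_p_value(control, treated))
```

A small p-value says the observed gap would be unusual if the labels were arbitrary; it says nothing about confounding factors, which still have to be argued separately.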

psyklic · on Aug 19, 2016

The cornerstone of the scientific method is "hypothesis testing via experiments". Data scientists typically skip the experiments and make models based on pre-existing data. So, I am skeptical of calling it a science.

mindcrime · on Aug 19, 2016

> I still can't get over the term "data science", though. Not only is it ridiculously meaningless - what sort of science doesn't involve data, and how often would data be useful to something that isn't scientific at some level - its meaninglessness derives from the hyped buzzword trendiness that drove its upswing.

I tend to rail against the needless creation of new buzzwords myself, but I can actually see some use for "data science". It is a little vague, but I see it as a slightly more concise way of saying "the confluence of applied statistics, machine learning, and analytics" or something roughly to that effect.

gaius · on Aug 19, 2016

The thing is tho', that what we call "machine learning" nowadays, statisticians were already doing for years. 25-year-old "data scientists" think that before they came along all there was was Excel... And don't realize that now most of what they do can be done in Excel...

Florin_Andrei · on Aug 19, 2016

> I still can't get over the term "data science", though. Not only is it ridiculously meaningless - what sort of science doesn't involve data

Yes, but data is usually an accessory, a means to an end.

Data science concerns itself specifically with how to better extract meaning from that data.

gaius · on Aug 19, 2016

> before there was a label for it

It used to be called data mining, or business intelligence, even plain ol' statistics. People have been doing it since the 80s if not earlier. But this is how the industry works: take an old concept, slap a new buzzword on it, PROFIT!

Notre1 · on Aug 19, 2016

Clare Corthell, the creator of the Open Source Data Science Masters project, is interviewed in the 2016-07-30 episode of This Week in Machine Learning & AI (TWiML):

https://twimlai.com/twiml-talk-1-clare-corthell-open-source-...

Rogerh91 · on Aug 19, 2016

I really like this collection of resources--it's perfect for people really trying to get into the basics of data science.