
Anecdotally, a lot of the people taking these classes totally lack the background or intuition needed and aren't getting any real training in machine learning. They're learning some very rudimentary bits of data cleaning and how to use basic machine learning libraries.

I recently interviewed someone taking a (reputable) online masters in machine learning, and they couldn't describe how or why any of the models they were using worked, nor could they answer most basic questions about the problems / data they were working on.




I've always wondered why data science is so dominated by CS people. CS concepts are the least important thing for a data scientist to know. Fundamentals in math, statistics, and especially linear algebra are far more important. I would hire a statistician who has learned a few CS concepts over a computer scientist who has learned a few statistics concepts any day. Obviously being an expert in both is ideal, but that's pretty rare to find.

I teach at a well-respected university in what is essentially a data science masters program, and most of my students come from CS. They are woefully unprepared, and even worse, few of them seem to care at all about learning the mathematics behind anything that is going on.

Personally, I think if you can't read linear algebra at the same proficiency as you read English then you have no business calling yourself a data scientist. Unfortunately, in my experience that would describe most people who label themselves data scientists.


More worryingly, they make glaring data errors, such as computing poor sample estimates, not understanding when a sample can be extrapolated to a larger population, or not even knowing what a confounding variable is.
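For anyone unsure what a confounding variable actually does to an analysis, here's a toy sketch (simulated data, made-up variable names): a third variable Z drives both X and Y, so X and Y correlate strongly even though neither has any effect on the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)              # the confounder
x = z + 0.5 * rng.normal(size=n)    # X depends only on Z
y = z + 0.5 * rng.normal(size=n)    # Y depends only on Z

raw_corr = np.corrcoef(x, y)[0, 1]

# Control for Z by regressing it out of both variables, then
# correlating the residuals (a simple partial correlation).
x_res = x - np.polyval(np.polyfit(z, x, 1), z)
y_res = y - np.polyval(np.polyfit(z, y, 1), z)
partial_corr = np.corrcoef(x_res, y_res)[0, 1]

print(f"raw corr:     {raw_corr:.2f}")      # large, looks like a real effect
print(f"partial corr: {partial_corr:.2f}")  # near zero once Z is controlled
```

Anyone who reports the raw correlation here as a finding has made exactly the kind of error being described.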


See I think that the points you've raised are far more important than linear algebra. Learning linear algebra is pretty easy compared to applying statistical concepts to real problems (source: I do this stuff for a living).


Agreed. It's one of those subjects that people love to be fascinated with, and they see how lucrative the field is, so it gets a ton of attention. Unfortunately, they all want to shortcut it, and there are plenty of organizations who will help them try. But machine learning requires proficiency in math, stats, and arguably CS at at least an undergraduate-minor level, and not many people have that.


You need to know the background, which is: statistics, discrete math, and algorithms.

Most books on the subject assume you already know what linear regression is; Naive Bayes gets only a brief theoretical explanation before the book goes right into the code in Spark, R, or Clojure, for example. UC Berkeley's course, however, is very theoretical: almost no code is shown, and what little there is is MATLAB-style code (don't remember the name off the top of my head). Their Spark course, though, is heavy on code, with IPython activities.
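For anyone who isn't sure they "already know what linear regression is": fitting one is just solving a least-squares system. A toy sketch with simulated data (not from any of the books mentioned), in plain numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=200)  # intercept of 3, small noise

# Add an intercept column, then solve the least-squares problem
A = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # close to [2, -1, 3]
```

The point of the theory-first books is that you understand what that `lstsq` call is doing (minimizing squared residuals) and when its assumptions break, rather than treating it as an incantation.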


This could be a real problem in their respective careers (or not). Here's the analogy: scientists who use statistical tools as blackboxes are the ones responsible for the whole problem with misusing p-values. Similarly, poor intuition and training in machine learning leads to the blackbox mentality and consequently, problems with building working systems.


That's not really an analogy, it's an assertion without a shred of evidence.

I'm torn about the blackbox thing. On one hand, it's important to understand the underpinnings of a model. On the other, we utilize a multitude of things in our daily lives of which we have no fundamental understanding; that's abstraction in a nutshell.


Machine learning gets kinda scary, though. For one example, discrimination with ML is super easy; check out fatml.org for instance. Also, with ML it's really easy for an amateur to overfit like crazy and draw spurious conclusions due to poor methods. People think they have an intelligence when they instead have very finicky tools.

Edit: another pointer here https://algorithmicfairness.wordpress.com


There is a crucial distinction between the multitude of things we utilize in our daily lives and machine learning/high-dimensional data analysis: we aren't equipped to intuit the workings of high-dimensional advanced-math statistical inference in the same way that we can intuit the workings of say, a water pump, or simple arithmetic on Excel, or simple database systems, unless we are appropriately trained in the relevant math and science.

Some examples: blackbox application of classifiers (e.g. the WEKA GUI as used by some for data exploration) can ignore parameter optimization, unbalanced sets, parsimony in features, dimensionality reduction, and so on.
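The unbalanced-sets point in particular bites people constantly. A toy sketch with made-up numbers: on a 95/5 class split, a "classifier" that never predicts the rare class reports 95% accuracy while catching nothing.

```python
import numpy as np

# 950 negatives, 50 positives -- think fraud or rare-disease detection
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros_like(y_true)  # degenerate model: always predict negative

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # fraction of positives actually caught
print(accuracy)  # 0.95 -- looks great on a dashboard
print(recall)    # 0.0  -- the model is useless
```

Someone driving a GUI blackbox sees the accuracy number, ships the model, and never learns it detects nothing.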


Scientists are at least domain experts, though; bootcamp-style practitioners won't be.


As somebody who tries to learn from fundamentals, can you suggest a book (or two) and an online course to accompany it, to get started in this area? I have been working as a programmer for a while (10 years).


Sure, a couple things.

(I'm assuming you're comfortable with multivariable calculus.)

Andrew Ng's coursera course is good.

PRML (Pattern Recognition and Machine Learning) by Bishop is good, and has a useful introduction to probability theory.

You also want a good grounding in linear algebra. Strang is basically the authority on linear algebra: http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-...

You want a strong grounding in probability theory and statistics. (This is the basic language and intuition of the entire field.) I don't have as many preferences here (although it's the most important); someone in this thread pointed to a course on statistical learning at Stanford that's good.

A good understanding of optimization is helpful. Here's a link that leads to a useful MOOC for that: http://stanford.edu/~boyd/cvxbook/

There's a lot of other useful stuff (Markov decision processes, Gaussian processes, and Monte Carlo methods come to mind) that I'm not pointing to, but if you've covered the material above then you'll probably be able to find those things on your own.

If you're into it, https://www.coursera.org/course/pgm is good but not vital.

You may want to know about reinforcement learning. This answer does better than I can: https://www.quora.com/What-are-the-best-books-about-reinforc...

Deep learning seems popular these days :) (http://www.deeplearningbook.org/)

Otherwise, it depends on the domain.

For NLP, there's a great stanford course on deep learning + NLP (http://cs224d.stanford.edu/syllabus.html), but there's a ton of domain knowledge for most NLP work (and a lot of it really centers around data preparation).

For speech, theoretical computer science matters (weighted finite state transducers, formal languages, etc.)

For vision, again, stanford: (http://cs231n.stanford.edu/syllabus.html)

For other applications, well, ask someone else? :)

Also:

arxiv.org/list/cs.CL/recent
arxiv.org/list/cs.NE/recent
arxiv.org/list/cs.LG/recent
arxiv.org/list/cs.AI/recent

EDIT: unfortunately, there's also a lot of practitioner's dark art; I picked a lot up as a research assistant, and then my first year in industry felt like being strapped to a rocket.


Oh no! I forgot about information theory! I don't have a specific recommendation, but it's very useful background.



