Learn the basics of data science with these books

jeroenjanssens · on Nov 22, 2016

I'm flattered to see Data Science at the Command Line next to these great titles, but I'm not sure if I would recommend it to learn the basics of data science.

DSATCL discusses the ideas behind various cleaning and visualization approaches and several machine learning algorithms, but only briefly. My personal recommendation would be to first gain some experience with these topics using Python and/or R. If you're curious afterwards to find out how the Unix command line can help you do data science, well, then there's only one book I can think of! ;)

dasboth · on Nov 22, 2016

I agree. I'd start with something like Joel Grus's Data Science From Scratch to get a handle on the basics in Python (or whatever the R equivalent is, I'm not familiar with R books).

I do however find myself more and more wishing I knew data science-specific Unix commands, and I think I know what book to get to solve that problem... :)

nthot · on Nov 22, 2016

R for Data Science by Hadley Wickham is a good R equivalent. It also acts as a high-level overview of the Hadleyverse/tidyverse (ggplot2, tidyr, dplyr, etc.). R4DS is free online [1].

[1] http://r4ds.had.co.nz/

samkone · on Nov 22, 2016

True. But I still really love your book. I've read it, and I keep going back to it to speed up many things. Are you planning another edition?

jeroenjanssens · on Nov 22, 2016

Thanks! I must admit that the thought has crossed my mind. However, these days all my time is spent consulting and giving training so I can't make any promises.

rcar · on Nov 22, 2016

Would just throw an extra plug for Python for Data Analysis. Though the title might sound a little bland, it's a good, practical summary of how to use pandas for the sorts of data analysis you often have to do in data science work.

ploika · on Nov 22, 2016

I'd add the disclaimer that while Python for Data Analysis is a great resource for learning pandas, which itself is invaluable for data science in Python, the book doesn't cover machine learning or statistical inference in any great detail. That's not a criticism, it's just (mostly) beyond the scope of the book.

rcar · on Nov 22, 2016

A fair point for sure, which is actually one of the reasons why I do tend to recommend the book.

ML and stats are generally the more flashy and well-known parts of data science, and so I've found that people new to the field often don't have major difficulties finding resources for learning them or finding the self motivation to dive into them. The data cleanup, on the other hand, is often the more important work to be done on projects while simultaneously being seen as the less enjoyable part. Learning how to do it well makes it a more interesting process, and pandas and this book lay a good foundation for that.

clumsysmurf · on Nov 22, 2016

2E is in the works http://shop.oreilly.com/product/0636920050896.do

kyleschiller · on Nov 22, 2016

Very strong third.

The appendix alone taught me most of what I know about Python, and it's a great departure from the mass of online materials that focus on ML without getting into the tools you'll need for cleaning and managing data.

Plus, it's free online: http://www3.canisius.edu/~yany/python/Python4DataAnalysis.pd...

zvikara · on Nov 22, 2016

Linked pdf looks like a pirated copy from it-ebooks.info

jonathanstrange · on Nov 22, 2016

IMHO, data science == applied statistics, but you'd better know a lot about the underlying mathematics before you come to any conclusions.

rm_dash_rf · on Nov 22, 2016

Where can I get #2?

2. Business value in the ocean of data — by Fajszi, Cser & Fehér

blahi · on Nov 22, 2016

bah.

Statistics in Plain English.

Data Analysis Using Regression by Gelman

An Introduction to Statistical Learning (James, Witten, Hastie & Tibshirani) and The Elements of Statistical Learning (Hastie, Tibshirani & Friedman). I recommend reading the Introduction and using the bigger book as reference material when tackling a problem.

Bayesian Data Analysis, 3rd edition by Gelman.

You need calc 1 & 2 and matrix algebra somewhere along the way.

Lots of papers, googling and doing. That's when you've got the basics covered. You start being "operational" after Data Analysis Using Regression.

When you start working on a problem, you need to go through the relevant literature first. Nobody is an expert, or even half-good, in more than 2 or 3 (small) areas of statistics. Read the literature, take notes and create a plan first.