I'm flattered to see Data Science at the Command Line next to these great titles, but I'm not sure if I would recommend it to learn the basics of data science.
DSATCL discusses the ideas behind various cleaning and visualization approaches and several machine learning algorithms, but only briefly. My personal recommendation would be to first gain some experience with these topics using Python and/or R. If you're afterwards curious to find out how the Unix command line can help to do data science, well, then there's only one book I can think of! ;)
I agree. I'd start with something like Joel Grus's Data Science From Scratch to get a handle on the basics in Python (or whatever the R equivalent is, I'm not familiar with R books).
I do however find myself more and more wishing I knew data science-specific Unix commands, and I think I know what book to get to solve that problem... :)
R for Data Science by Hadley Wickham is a good R equivalent. It also acts as a high-level overview of the Hadleyverse/tidyverse (ggplot2, tidyr, dplyr, etc.). R4DS is free online [1].
Thanks! I must admit that the thought has crossed my mind. However, these days all my time is spent consulting and giving training so I can't make any promises.
Would just throw an extra plug for Python for Data Analysis. Though the title might sound a little bland, it's a good, practical summary of how to use pandas for the sorts of data analysis you often have to do in data science work.
I'd add the disclaimer that while Python for Data Analysis is a great resource for learning pandas, which itself is invaluable for data science in Python, the book doesn't cover machine learning or statistical inference in any great detail. That's not a criticism, it's just (mostly) beyond the scope of the book.
A fair point for sure, which is actually one of the reasons why I do tend to recommend the book.
ML and stats are generally the flashier and better-known parts of data science, so I've found that people new to the field rarely have major difficulty finding resources for learning them or finding the self-motivation to dive into them. The data cleanup, on the other hand, is often the more important work to be done on projects while simultaneously being seen as the less enjoyable part. Learning how to do it well makes it a more interesting process, and pandas and this book lay a good foundation for that.
The appendix alone taught me most of what I know about Python, and it's a great departure from the mass of online materials that focus on ML without getting into the tools you'll need for cleaning and managing data.
An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani) and The Elements of Statistical Learning (Hastie, Tibshirani, Friedman). I recommend reading the Introduction and using the bigger book as reference material when tackling a problem.
Bayesian Data Analysis, 3rd edition by Gelman.
You need calc 1 & 2 and matrix algebra somewhere along the way.
Lots of papers, googling and doing. That's once you've got the basics covered. You start being "operational" after Data Analysis Using Regression.
When you start working on a problem, you need to go through the relevant literature first. Nobody is an expert, or even half-good, in more than 2 or 3 (small) areas of statistics. Read the literature, take notes and create a plan first.