Hacker News
Best Data Science Books According to the Experts (builtin.com)
284 points by mantshitla on July 24, 2020 | 60 comments



More free books:

* Vectors, Matrices, and Least Squares — IMO the best beginner-friendly and applications-focused intro to (or review of) linear algebra. Covers a ton of fundamental ground while keeping things consistent and concise. Lots of exercises and a Julia supplement book. (http://vmls-book.stanford.edu/)

* Mathematics for Machine Learning — good coverage of the most important math concepts relevant to ML (https://mml-book.github.io/)

* Forecasting: Principles and Practice — best overall resource on forecasting that I know of; R focus. One of a zillion great R/data science books, virtually all of which are open and well-written. (https://otexts.com/fpp2)

* Dive Into Deep Learning — can’t personally vouch for this one but it looks comprehensive; numpy/PyTorch/TF focus (https://d2l.ai/)

* Speech and Language Processing — clear introduction to all things NLP, nice flow (https://web.stanford.edu/~jurafsky/slp3/)


Have you already read them all?


I’ve read all of VMLS and Forecasting (and ISLR from the original list), and maybe half of SLP. MML I have skimmed through for review / used as a reference, and the deep learning book is high on my queue.

I tend to not be a cover-to-cover reader, so I usually deep dive into a single topic for a while (e.g. forecasting, information retrieval) and read papers/tutorials/chapters related to that topic and the math concepts related to it.

PS I feel like impostor syndrome is so common among data scientists because there is so much material that feels like “must know”. Don’t feel like you need to memorize thousands of textbook pages to be effective, and you could spend a lifetime mastering any of these individual subjects. JIT learning is a great skill to have.


The frequency with which Kevin Murphy's ML book gets left out of these lists is almost bewildering.

If I had to choose 1 book as the ML bible, then it would be Murphy's (contrasted against Bishop and ESL) for the following reasons:

1. It uses CS jargon. (Bishop's book, while great, uses math/physics notation and jargon, which adds a barrier to entry.)

2. It is more up to date and comprehensive (It covers everything from probabilistic models, traditional models, to neural networks all the way up to 2015 or so, unlike say ESL which is more introductory)

3. Everything in deep learning past 2015, is better learnt through papers/video lectures than any book. (A lot of it is intuition and not truths. There is a certain authority to books that belies our lack of understanding of NNs. Opinions on popular operations such as Dropout, Batch Norm, saliency maps have changed drastically over the last few years)

4. My ML professor used it for our upper grad level ML course and I came out very satisfied (nothing quite like personal validation). I have read ESL and found it to be better as reading for an intro to ML course. I tried reading Bishop, and didn't like it :|


The book in question is Kevin Murphy's Machine Learning: A Probabilistic Perspective [1]

Just googling Kevin Murphy ML gives me a lot of ads and pages about hair products.

[1] https://www.cs.ubc.ca/~murphyk/MLbook/


Personally I had a really poor experience with Murphy's book because I bought one of the first editions (don't recall whether it was first or second). The list of known errors is really long [1], and the author didn't even bother to organize it properly. The author even decided to rewrite a chapter because it contained too many errors [2].

1. https://www.cs.ubc.ca/~murphyk/MLbook/errata.html

2. https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf


Looks like the Murphy book is $80 in hardback, and they have a similarly priced Kindle version that doesn’t even have a cover image. Feels like this book is really for the college textbook market, which is likely why it hasn’t gotten wider coverage.


There is a new version coming out next year:

https://mitpress.mit.edu/books/machine-learning-second-editi...


I really am looking forward to it.

I will probably revisit it entirely after that.


You want to read either Elements of Statistical Learning (2nd ed.) OR Kevin Murphy's ML book for the theory.

Then you will want to consult a textbook in your work domain (e.g. introduction to speech & language OR statistical natural language processing for the domain of natural language processing).

And finally, you will want either a book, or free online Web resources/tutorial videos that show you how to do things in practice, given a particular programming language and tool-set (e.g. Python + TensorFlow, Java + Deeplearning4j).

This recipe of Theory + Application + Practice/Tools should get you there.


BTW, it is okay to read stuff in parallel, and to take a non-linear approach to learning. Also people have different preferences for textbook styles, and they vary regarding pre-existing background knowledge.


I have Deep Learning with Python (Chollet, first edition) and Hands-On Machine Learning (Géron, first edition). Both books are highly recommended.

Introduction to Statistical Learning is also available for free online:

http://faculty.marshall.usc.edu/gareth-james/ISL/

Although I only read a few chapters from that book, I really like it (but I would have preferred a python version of the book).

Personally, if you have to pick three books from the list, you can start with these three options.


You can find a couple of repos (google them) that show the exercises in Python. I wrote a post on my own blog some time ago: https://www.franzoni.eu/machine-learning-a-sound-primer/


Good to know, thanks for the link!


I wholeheartedly second "Deep Learning with Python" by François Chollet!

It's an excellent 'zero-to-hero' text for understanding deep neural networks, some common architectures, and the code (and theory) to get them to work.

One thing missing is how to prepare data for deep learning -- but that's just standard ETL you learn elsewhere.


You can check Géron's book to learn more about data preparation, specifically the second chapter. This chapter details an end-to-end machine learning project (price prediction). Here, the author describes scikit-learn's pipelines for automating preprocessing tasks for your dataset.
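For readers who have not seen one, a scikit-learn pipeline looks roughly like the sketch below. This is not the code from Géron's chapter: the toy data, the step names, and the choice of median imputation, standardization, and linear regression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two numeric features (one value missing) and a price target.
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 260.0], [4.0, 300.0]])
y = np.array([10.0, 12.0, 15.0, 18.0])

# Each step is a (name, estimator) pair; the names here are arbitrary labels.
price_model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # zero mean, unit variance
    ("regress", LinearRegression()),               # final estimator
])

# fit() runs every preprocessing step in order, then fits the regressor.
price_model.fit(X, y)

# predict() replays the same fitted preprocessing before predicting.
predictions = price_model.predict(X)
```

The point of the pattern is that preprocessing is fitted together with the model, so the same transformations are applied at prediction time without extra bookkeeping.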


I appreciate that on HN this article may provoke intelligent conversation, but it smells a lot like a "top N things you need right now" blog post where everything is an Amazon affiliate link.

(Edit: I was wrong about affiliate links. Not deleting to publicly self-shame.)

The real data science would be producing articles like this automatically, and with good SEO, to drive revenue.

The list is OK. I've studied ISL (and some of ESL). A friend really enjoyed Think Stats. Charles Wheelan's book is in the same vein as How To Lie With Statistics (Darrell Huff?) but in greater depth.

I started with R because that's what my team was mostly using when I got into this area. Hadley Wickham's free books are good too.


Pattern Recognition and Machine Learning (PRML) is absolutely recommended. Available for free as well! [1]

The Elements of Statistical Learning (ESL) [2] is also a very reputable source.

They also make a good pair: PRML has a Bayesian approach while ESL is traditionally frequentist.

[1] http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%...

[2] https://web.stanford.edu/~hastie/ElemStatLearn/


ESL is an excellent reference; if you are looking for an introduction instead, I can recommend “An Introduction to Statistical Learning” by (most of?) the same authors.


I like the Data Science from Scratch book. I’m working through the book and rewriting the examples in Swift:

https://github.com/melling/data-science-from-scratch-swift/b...

I haven’t read it yet but the Think Stats book is available for free online:

http://greenteapress.com/thinkstats2/html/index.html

PDF: https://greenteapress.com/thinkstats2/thinkstats2.pdf


Deep Learning with Python (Chollet) and Hands-On Machine Learning (Géron) may be a bit redundant currently. With the 2.x versions of TensorFlow, Keras (which is what Deep Learning with Python covers) has been completely integrated into the TensorFlow API (which is covered in Hands-On Machine Learning). Both books are good, but the newest edition of Hands-On Machine Learning is updated for TensorFlow 2.0, and so it is probably the more relevant of the two.


The article notes that "Designing Data-Intensive Applications" is perhaps not a typical data science book, but it is still very useful. I agree. It is a fantastic book - one of the best technical books I have read. I wrote why, and a summary of it here:

https://henrikwarne.com/2019/07/27/book-review-designing-dat...


There's a very real chance that it's mentioned in the article purely because it has the word "data" in the title.


I find it weird that this (free) book was omitted:

- Goodfellow et al. Deep learning. MIT press, 2016. https://www.deeplearningbook.org/

And there is the book about DL with Python published by Manning in 2017, but not the book about DL with PyTorch, published by Manning in 2020?

- Stevens et al. Deep Learning with PyTorch. Manning Publications, 2020. https://www.manning.com/books/deep-learning-with-pytorch


"Deep Learning with PyTorch" has a freely downloadable PDF here:

https://pytorch.org/assets/deep-learning/Deep-Learning-with-...


Goodfellow's book is a poor learning experience. It only makes sense if you already know the material in the book.


Data Science From Scratch by Joel Grus is my favorite book, and it is the best for people who don't know about data science.


If someone finds it helpful, here is a list of 25 free Machine Learning books available online : https://blog.paralleldots.com/data-science/24-best-and-free-... and another list of over 50 free Data Science books : https://blog.paralleldots.com/data-science/50-must-read-free... we compiled. Pick your poison.


While there are some good picks (especially "An Introduction to Statistical Learning with Applications in R" and "Deep Learning with Python" by Chollet, the Keras author), I am surprised it is missing "Information Theory, Inference, and Learning Algorithms" by David MacKay (http://www.inference.org.uk/itila/book.html).

The "Bayesian Inference and Machine Learning" track gives a nice foundation to anything with "log loss". After that even k-means won't be an ad-hoc algorithm.
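One way to see the link between "log loss" and probabilistic inference: for binary labels, log loss is just the average negative log-likelihood of the data under the predicted probabilities. The sketch below is illustrative only; the function names and the toy labels and probabilities are made up, and it is not taken from MacKay's book.

```python
import math

def log_loss(y_true, p_pred):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

def bernoulli_likelihood(y_true, p_pred):
    """Probability of observing exactly these labels if the predictions were correct."""
    return math.prod(p if y == 1 else 1.0 - p for y, p in zip(y_true, p_pred))

# Made-up labels and predicted probabilities of class 1.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]

loss = log_loss(labels, probs)
likelihood = bernoulli_likelihood(labels, probs)

# Minimizing log loss is the same as maximizing the likelihood:
# loss == -log(likelihood) / n, up to floating-point error.
```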


It's definitely one of the least technical ones in the list, but "Weapons of Math Destruction" is a must read for anyone deploying models that directly affect users. Its general point is that opaque judgement systems, such as how social media platforms these days seem to arbitrarily remove content, are only going to get worse with black box models. AI ethics are seldom taught, and this one provides a good framework for thinking about how to build effective, yet fair systems.


I personally learn better through coherent code examples rather than math notation / jargon. Manning's excellent "Data Science Bookcamp: Five Python Projects" teaches data science basics using code, not greek symbols. Also, the projects themselves are pretty cool and challenging, but doable. https://www.manning.com/books/data-science-bookcamp


I would love to see a book like Alex Reinhart's Statistics Done Wrong make these lists more often.

It's not focused on data science specifically. But it's short, sweet, and just might leave you better prepared to tell whether the emperor is wearing actual clothes, or if it's just an artfully arranged assemblage of TensorFlow operators.


As a dumb marketer, I very much enjoyed John Foreman's Data Smart.

https://www.goodreads.com/book/show/17682206-data-smart


I skimmed the book a few days ago and thought it looked interesting, and put it in my reading list.

Got any other companion book recos?

I'm particularly interested in data sets that I could play with in Tableau, Google Data Studio, and Excel.


Link to previous HN post which has a list of all ebooks Springer made available for free:

https://news.ycombinator.com/item?id=23520545

Has quite a few of the books mentioned in the post and the comments here.


Thanks! Didn't know about this


In my experience, my professors always cited Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning as the reference for most of the tasks you would need to perform as a data scientist. For an AI/ML researcher, Murphy is probably more comprehensive.


My favorite book is "Python for Data Science For Dummies". It won't help you in a FAANG job interview but in my opinion is super helpful when you just want to get some insights out of data you have on hand.


A worthy mention IMO: Machine Learning Pocket Reference by Matt Harrison, a tiny book packed with a ton of fundamental concepts and great examples.


Was expecting more like the first few.

I have plenty of textbooks in my reading list already.


Grokking Deep Learning is an awesome book!


This is a completely, utterly worthless list: the only things that belong on it are Grus (good for python), maybe Bishop (maybe; it's woefully out of date and light on details) and Hastie (his other book is vastly better). The "General interest" books are all horse shit. If you're a pythonista, you should buy Wes McKinney's book. If you're not, you should buy John Mount and Nina Zumel's "Practical data science" which is comparable to Grus.

If you don't know linear algebra, the "done right" book is absolutely not done right for people that work with data in the real world: Strang is infinitely better, and Trefethen and Bau, or Golub, if you need to go deep.

As someone pointed out below: Kevin Murphy's ML book is actually very good for teaching you how the things work. What's more, you can download python or Matlab for every algorithm in his book (as you can for Bishop at this point).

Nobody needs the Deep Learning books listed. In spite of the hype, it's just not that important compared to naive bayes, LDA and GBM; most people don't have access to the hardware or data sets that make DL useful, and those who do probably studied DL in grad school.

Gads; what trash! Those experts: only one of them appears to actually be an expert; pedants and post-docs don't count.


Seconding this comment. Based on experience in hiring data scientists and comparing notes with many others that hire data scientists, the most frequent gaps in knowledge are (1) statistics specifically and scientific computing in general and (2) disciplined software engineering.

People good at (1) and bad at (2) write "PhD code" that may or may not be right but you can't tell because it's too disorganized. People good at (2) but bad at (1) get fine-ish looking numbers out of their good looking code but you can't tell whether it's right because they may have ignored or misunderstood fundamental assumptions and correctness of the underlying methods.

There are also seemingly tens of thousands of people on the market who have little experience in either but have adapted projects from examples online into their GitHub portfolio and put all of the relevant terms into their resume anyway.

I think most aspiring data scientists would be better served going with more introductory texts and really understanding them. Maybe Blitzstein and Hwang's "Introduction to Probability" and then McElreath's "Statistical Rethinking" or Wasserman's "All of Statistics" for people who need more stats.

I'm not even sure what to recommend for developing good software judgment and habits. There doesn't seem to be a shortcut for that. Maybe "Fluent Python" or "Effective Python" for Python people? No idea for the R ecosystem.


Perspective for (2): it's because no one in graduate training really cares about code quality. Your PI focuses more of your attention on scientific writing, and so there's little to no time to polish your work. The incentives just don't support this work at the graduate training level.


I agree. The only incentive to write neat code is to save yourself the pain and suffering of having to go back through it yourself to fix or add things. I am currently doing data analysis in MATLAB for my PhD, and I know nobody will ever use my code besides me.

I’d like to learn to do my due diligence, but without someone training me, it just takes so much time to learn things like git. I’d rather be recording more data and submitting my paper so I can get the hell out of here


Actually the PI doesn't even know about the concept of code quality. Or even your code at all.


Interesting. I've never seen any issue with (1) among data scientists with a Master's or PhD; more often I've seen it with software engineers who end up having to do data scientist work. Does that track with your experience?

As a software engineer, (2) is the worst part of working with data scientists. The number of times a 'professional' data scientist wants to launch a non-code-reviewed, non-source-controlled, works-on-my-notebook model or analysis is shocking.


Which books would you recommend to solve the first problem? I’m a CS student, but would love to work with data one day, so I’m looking for some probability/statistics/data science books to read through. I’ve heard great things about Elements of Statistical Learning, for example, but I’m not sure whether it’s too DS oriented, giving poor foundations in statistics and probability.


"R inferno" will help with the software engineering bits I suppose. R is sort of designed to do this sort of work, and it assumes the end user is more of a statistician than a programmer. Lots of foot-guns. On the other hand, Python has a lot of them as well, and it's NOT designed for this kind of work. It's a sort of mixed bag: R core is vastly better than Python for this sort of task. There's a subset of R packages which are as good as scikit learn (which is very good indeed), but there is also a pile of total shit. R's package manager is also better than anything in the python universe, but node bros manage to screw it up. I loathe python from long experience, so I polarized on R, but python is definitely winning.

I just assume anyone who calls themselves a data scientist is going to be a shit tier programmer who needs to improve over time at this point. The exceptions kind of prove the rule. Imposing test-driven discipline will cure some of the worst tendencies.

I don't have good references on stats and linear regression tier data science, but I'll take someone who understands the basics (I dunno, calculating useful moments from empirical distributions, feature selection in linreg) over some weenie who has some cribbed ipython file in his githubs who claims to understand Hastie.
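As a concrete reading of "calculating useful moments from empirical distributions", here is a minimal standard-library sketch. The helper names and the sample are made up, and it uses population (divide by n) moments; a sample-corrected version would differ slightly.

```python
import statistics

def central_moment(xs, k):
    """k-th central moment of a sample, using the population (1/n) convention."""
    mu = statistics.fmean(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def summarize(xs):
    """Mean, variance, skewness, and excess kurtosis of an empirical distribution."""
    var = central_moment(xs, 2)
    sd = var ** 0.5
    return {
        "mean": statistics.fmean(xs),
        "variance": var,
        # Standardized third moment: positive when the right tail is longer.
        "skewness": central_moment(xs, 3) / sd ** 3,
        # Standardized fourth moment minus 3, so a normal distribution scores 0.
        "excess_kurtosis": central_moment(xs, 4) / sd ** 4 - 3.0,
    }

# Made-up sample with one large value, so the skewness comes out positive.
sample = [1.0, 2.0, 2.0, 3.0, 9.0]
stats = summarize(sample)
```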


Honest question: what skills should a data scientist possess to graduate out of “shit tier”? Should we have all of the skills of statisticians, ML engineers, data engineers, software engineers, visualization designers, and domain/communication experts? Can it not be valuable to have some but not all of the above skill sets? Does it matter that software engineers are often “shit-tier statisticians” that understand just enough ML lingo to dismiss it as marketing hype?

I’ve gone out of my way over the years to make learning data science skills as approachable as possible for the uninitiated (giving trainings, providing customized learning paths based on someone’s background, offering encouragement), and yet this is almost never reciprocated by engineer types. It’s always just, “data scientists can’t write production quality code”, with no explanation of what production quality entails, or without consideration of the fact that notebook-based data science can have advantages over perfectly modularized code with a battery of tests. See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.” It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.

When curious and open-minded data scientists and software engineers work together, it can be magic. When people snipe at others for their “shitty” skills, it creates a petty and toxic environment.

This comment comes off as a bit of an admonition, but I would greatly appreciate a list like TFA for data scientists looking to shore up their fundamental CS and software development skills.

(PS — The first book I read when teaching myself R was R Inferno, so that ain’t it.)


> See the comment above: “I'm not even sure what to recommend for developing good software judgment and habits.“. It’s like a chess coach admonishing their subject to simply “think harder”. Not helpful.

Hey, it seems like you took this as gatekeeping or something. These skills can definitely be taught or self-learned, I've done it and seen it done many times.

My point was only that I don't know resources that can act as a shortcut (my actual word above), i.e. ways to skip over the longer path of gaining experience through long engagement with the topic. So maybe more like a chess coach saying they don't know any books that let a beginner jump ahead to being a more experienced player?

There are hundreds of past threads on HN about books to level up in software, so clearly some people have thoughts about this. I just don't know what to recommend a data scientist who needs these skills immediately.


What you said wasn’t egregious or anything, no worries. I’ve seen some incomprehensible code from data scientists with PhDs, stuff that has no excuse. I also don’t know of a single resource for essential coding skills specific to data scientists either.

Sometimes a rant on a topic brews in my head for weeks or months, and I will uncork it on a random passerby that brings up the subject—which happened to be you this time.

But I’ve had coworkers who like clockwork sneer at anything a data scientist wrote. “Why did you do it that way?” When asked for advice on how to improve it, they huffily say never mind. It’s grating as hell.


R should be killed.



I have put together a list per "expert" to make you feel better.

  Herman:
  - ISL
  - Hands on 
  - Chollet
  - spark+desk

  Miller:

  - Grus
  - Think Stats
  - la done right
  - bishop
  - data intensive


I personally found the Kleppmann book to be great. I wouldn't characterize it as essential for data science per se but anyone designing data-intensive applications ought to give it a read.


Whenever an article tries to appeal to authority (e.g. "according to the experts"), you know it is trash. It's part of my personal click-bait detection heuristic.


Thanks for reaffirming my habit of checking the comments before clicking through. Learned about a lot of useful resources from this and other comments!


I agree the title is annoyingly misleading.


My favorite is "Learning from Data". Builds from first principles and simple algorithms, so I really understand what makes machine learning work, and how to debug as well as build my own techniques vs use off the shelf black boxes.



