Show HN: ML From Scratch – free online textbook (dafriedman97.github.io)
259 points by dafrdman on Aug 31, 2020 | 54 comments



I'm linking to a free online book I just finished called Machine Learning from Scratch. The book aims to cover the complete, technical, "under the hood" details that other ML textbooks don't. To do that, it shows all the mathematical steps to derive common algorithms and demonstrates how to fit each one from scratch in Python (using just numpy).

You might like this book if you are interested in ML and like to really understand what's going on behind the algorithms. Personally, when learning about these algorithms I would understand them intuitively but not feel comfortable constructing them myself. Even if you totally grasp the algorithms, seeing them derived from scratch is a great way to better understand the comparative advantages between competing models.

The code and math are pretty simple, so there should be a low barrier to entry.

I'm hoping to make the book somewhat of a living document, so if you have any edits, questions, or suggestions, I'm all ears. You can reach me at dafrdman@gmail.com. Thanks a lot!


Nice work!

Can we flag edits on github somewhere?

I found a typo at line 60 of https://dafriedman97.github.io/mlbook/content/conventions_no..., where your second "is written as $$" is missing the empty line before the "$$".


Thanks for the catch! Ugh those pesky $$s. Changed it now.

Looks like you found the repository. Do you think it would be enough to raise an issue there (at https://github.com/dafriedman97/mlbook)? I'll look into linking directly from the book to the repo.

Thanks!


Added the issue.

Yes, I did find the repo. Hesitated adding the issue right away because the source files have ".md.txt" and ".html", but not the original ".md" files for me to fix them directly.


Yeah the book was built with JupyterBook. It's an awesome tool but I lose track of what it does to the .md files when creating the website.


You said in other comments that you used JupyterBook to make it. Would you consider open sourcing that so that others can contribute?


Excellent! Very kind of you to do this. Are you accepting comments? If not, ignore the following. :)

As someone who learned how to program from trial and error via tutorials on the internet (some of the people who are going to read your book are people like me), I just have one comment:

Try not leaning on libraries in your tutorials.

I know it sounds insane to suggest you not use numpy in any kind of ML tutorial, but libraries like numpy used to make my new programmer's eyes glaze over. The fact that those imports are black boxes and can do literally anything used to make my noob mind overload. And I'd develop a blind spot for them. Before I could read code that exploited numpy's power, I had to work with numpy for a while to gain an intuition for it. New people don't have that. This applies not just to numpy but to all libraries in tutorials. If it's at all possible (I know it's not always possible in a reasonable way), make a super simple example of what you're going to use the library to do before you replace it with the library.

As mentioned elsewhere in this thread, Sentdex is actually really good at teaching new developers. I think this is because he often starts without numpy. For example in NNFS videos, he starts with just lists. He gets to numpy eventually, but knowing what he replaced with numpy helps make things more clear.
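To make that concrete, here is a minimal sketch of the "lists first, numpy second" idea. It is not taken from the book or from NNFS; the helper names `dot` and `matvec` are made up for illustration.

```python
import numpy as np

def dot(u, v):
    """Dot product of two equal-length lists, written out with a loop."""
    total = 0.0
    for a, b in zip(u, v):
        total += a * b
    return total

def matvec(rows, v):
    """Multiply a matrix (a list of rows) by a vector (a list)."""
    return [dot(row, v) for row in rows]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [0.5, -1.0]

by_hand = matvec(X, w)                    # plain Python lists
with_numpy = np.array(X) @ np.array(w)    # the same product in one numpy expression
```

Once the loop version is on the page, the `@` line reads as shorthand for something the reader has already seen.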

Just a thought.


Thanks so much for your feedback. Definitely open to comments!

I agree 100% that any use of packages can be intimidating for newbies. I experimented at first with creating the models without using numpy and I thought that it actually made it less clear rather than more clear. It's obviously a tradeoff--you see where everything comes from (rather than np.mysterious_function()) but you take 5 lines of code to do the same thing that a single numpy command could accomplish. I felt in the end that it distracted from the real purpose of the code, which is to demonstrate how the model works.

Do you think a compromise would be to add a section to the appendix introducing numpy? Introducing arrays, random instantiation, stuff like that? Otherwise I might consider adding a no-numpy version in the future.

Thanks so much for your feedback!


Perhaps you could reduce the set of numpy functions used in your code to a minimal set (exp, sum, max, min, etc.) and then build fancier functions up from there. This affords you the speed and conciseness of using numpy arrays while limiting the abstractions that could obfuscate the inner workings of some of the fancier functions you might use (e.g. softmax).
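As a rough sketch of what that could look like (an illustration, not the book's code), softmax can be built from just `np.max`, `np.exp`, and `np.sum`:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax built only from np.max, np.exp, and np.sum."""
    z = np.asarray(z, dtype=float)
    # Subtracting each row's max leaves the result unchanged but avoids overflow in exp.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

probs = softmax([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
# Each row of probs sums to 1; the second row is uniform despite the large inputs.
```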


I think there is a balance to be struck. You should totally use numpy for the arrays and basic math applications. But say on the first example you use `self.X.T` what does `.T` even do? Not asking you to go into all the details, just more comments saying this transposes the array, see numpy docs <link>. It will ease people into the library if they are unfamiliar with it. You do have some good ones like `column of ones` already, but more of those kinds of things.
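Something like the following is the level of commenting I have in mind. This is a made-up snippet, not the book's code; the array `X` and the variable names are placeholders.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])            # 3 observations, 2 features

# .T returns the transpose: rows become columns, so shape (3, 2) becomes (2, 3).
X_T = X.T

# Prepend a column of ones so a linear model can learn an intercept term.
ones = np.ones((X.shape[0], 1))
X_with_ones = np.hstack([ones, X])    # shape (3, 3)
```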

I would also avoid using pandas if at all possible. Its just another thing people have to learn if they are unfamiliar.


I definitely agree. I should add more comments explaining what things like .T do--it's not that it's hard to grasp, but it might turn away newbies. Thanks for the suggestion!

Pandas is only used in the "code" sections, which use packages like scikit-learn anyway


That's a good point. If you had to explain how everything works without the libraries, you'd probably end up writing a book on how pandas and numpy work, not how ML works.

An appendix is a great idea!


Good ideas. I think I'll try to add an appendix, minimize the number of numpy functions used, and explain any of the weird ones that are real time savers. Thanks for all your thought.


I'd love to include the book in our company's internal learning resources. Can you include an official license by any chance? Thank you


I hadn't even considered licensing it. Want to email me and we can talk? My email is dafrdman@gmail.com. That said, you're welcome to use it (though my lawyer father suggests I say that this "verbal contract" is revocable and non-exclusive).


I have a different view :)

I feel this book is the ideal companion for intermediate to advanced Python developers/data scientists. People who know pandas pretty well, and so are comfortable with numpy array operations too. They have probably used scikit-learn's fit/predict API in a black-box way, but have never quite had time to look at the code underneath.

I love "ML from Scratch" because it leverages my math and Python/numpy knowledge (even if some of it is rusty and half-forgotten) to show me high quality, well-commented, mathematically rigorous explanations of how I might have implemented the types of algorithms I use from sklearn. While the maths is formal, it is totally straightforward in its presentation.

It will definitely be my bedtime reading for the next few nights.


I disagree slightly. I think numpy (and scipy) is so central to doing anything mathematical in Python (actually, you could say Python is just the glue for these libraries) that maybe one should spend a chapter or two on numpy instead of not using it.


I agree with you in theory. I really hate that the fast.ai course uses their own fast.ai library.

Numpy is a bit different though as it “fixes” inefficiencies in Python’s typing system. A list of Python integers is actually a list of pointers to separately allocated C structs, so the values themselves are not sequential in memory. A numpy array of integers is sequential in memory and so the performance gains are massive.
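A quick way to see the difference, as a small sketch: the per-object size reported by `sys.getsizeof` is specific to CPython and varies by version, so treat the exact number as an assumption.

```python
import sys
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values, dtype=np.int64)

# Each Python int is a full object with its own header, so even a small
# integer takes far more than 8 bytes (the exact size is CPython-specific).
bytes_per_python_int = sys.getsizeof(values[0])

# The numpy array stores raw 8-byte integers back to back in one buffer.
bytes_per_numpy_int = arr.itemsize          # 8 for int64
is_contiguous = arr.flags["C_CONTIGUOUS"]   # True for a freshly created array
stride = arr.strides[0]                     # bytes between consecutive elements
```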

It’s not so much a library as it is a way to access efficient data types/structures. I think any mathematical programming in Python should start with numpy.

If the author had used Pandas extensively I would agree with you completely.


This was exactly what my professor used to say while we were learning programming. Our focus should be on building the program: constructing a step-by-step chain of logic that is robust enough to avoid basic errors. Using packages defeats that purpose. Once we have a minimal working solution, we can then worry about efficiency, scalability, and other benchmarks to judge how far the first try is from the best one.

While developing something though, it could be appropriate to use existing and better solutions.


I wrote this article you might like: https://github.com/jeremyong/cpp_nn_in_a_weekend/blob/master...

No dependencies. Just straight C++. The code is heavily annotated, doesn't use any matrix libraries or anything, and gets 92% accuracy or something without any fancy techniques on the MNIST handwritten digits database.


I wholeheartedly agree with this. As a software developer without a strong math education, any time I open a "from scratch" ml learning resource, I immediately close it because I don't have an understanding of the math happening in the black box of numpy.

Thank you for the Sentdex recommendation.


Can you elaborate? As an experienced numpy user, I don't really have an intuitive feel for what you describe. Numpy isn't really math; it's mostly an array library for splitting, merging, and reshaping arrays, swapping dimensions, etc., and applying operators on all elements at once.

Outside of the things under np.linalg or np.fft, I cannot think of much really difficult math in numpy.

To grok numpy what you really need is a good mental model of the multidimensional array, reshapes, reductions, transpose and so on. But this isn't really math, it's more data wrangling and kinda boring and tedious "data chores" we need to do before we get to the interesting parts of the job.
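For anyone following along, that mental model fits in a few lines (a toy sketch, with arbitrary variable names):

```python
import numpy as np

a = np.arange(6)              # [0, 1, 2, 3, 4, 5]
m = a.reshape(2, 3)           # [[0, 1, 2], [3, 4, 5]]

col_sums = m.sum(axis=0)      # reduce over rows    -> [3, 5, 7]
row_sums = m.sum(axis=1)      # reduce over columns -> [3, 12]
m_t = m.T                     # swap the two axes   -> shape (3, 2)
doubled = m * 2               # operator applied to every element at once
```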


Skimming the sections on linear models, I was surprised not to see a discussion of model fit, beyond just plotting predicted values against observed ones. Basic predictive models like linear and logistic regression are simple to construct mechanically. It's a substantially more involved task to quantify their degree of fit and, better yet, prove that the methods are optimal and unbiased (in a statistical sense).
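Even the simplest fit statistic would be a start. As a sketch (my own helper, not something from the book), the coefficient of determination takes only a few lines of numpy:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # variation left unexplained by the model
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                             # predictions equal to y
baseline = r_squared(y, np.full_like(y, y.mean()))    # always predicting the mean
```

A perfect fit scores 1 and the predict-the-mean baseline scores 0. The harder questions about optimality and unbiasedness need actual derivations rather than code.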


I agree though I saw that as outside the scope of this book. I tried to be clear in the introduction that the book is a "user manual" of sorts that simply shows how to construct models, rather than how to decide between them, what the benefits of each are, etc. That information is certainly important but I felt it had been covered more than adequately by books like Elements of Statistical Learning


That's a really cool initiative, but I think we disagree on the term "from scratch". Taking a look at the source code, I see you're using sklearn - which is a great tool - but, from scratch, at least for me, implies writing your own code (logistic and linear regression, adaline, perceptron, mlp, knn, kmeans...) I mean, that's how I learned it. But again, congratulations on the initiative.


Perhaps I should have been clearer, but the "code" section within each chapter is not "from scratch". The "construction" section is "from scratch" in that it only uses numpy (not scikit learn). The scikit-learn part is just so new users can see how these could be fit in practice.
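For a flavor of what "from scratch" means here, the following is a simplified sketch of a numpy-only linear regression. It is not the book's actual code: the class name `LinearRegressionSketch` and the use of `np.linalg.lstsq` are choices made for this illustration.

```python
import numpy as np

class LinearRegressionSketch:
    """Ordinary least squares using numpy only (hypothetical class, not the book's)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        ones = np.ones((X.shape[0], 1))
        X1 = np.hstack([ones, X])          # add a column of ones for the intercept
        # Least-squares solution of X1 @ beta ~= y
        self.beta_, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        ones = np.ones((X.shape[0], 1))
        return np.hstack([ones, X]) @ self.beta_

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]        # noiseless line: intercept 1, slope 2
model = LinearRegressionSketch().fit(X, y)
```

The fit/predict shape mirrors the scikit-learn convention, which is what lets the two versions be compared side by side.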


I went straight to the "code" section. Didn't know there was stuff in the "construction" section too. I would definitely consider not using sklearn for anything other than data sets. You already defined them, why not use them? Or rename the "code" section. I expected that to be the final code as you build it. Maybe show the usage of both side by side, as a way to ease people into sklearn. But the "code" section should totally be focused on what you made.


That's sensible. Maybe change construction to code and code to application? Or keep construction but rename code? I'll have to brainstorm. I definitely don't want people missing the construction section so this is great feedback. Thanks!


I like the sound of Application at least. Or 'In practice'? And Construction does make sense when I think about it more. Not sure I can think of a better name at least.


My hesitance with "Application" is that it sounds like I'm going to use some interesting dataset or do some cool project (and this is essentially using iris to build basic models). How about "code" becomes "implementation"?


Feedback for the authors: I had a really hard time finding the "graph_boundaries" code (I eventually saw the "Click to show" button on the right)

I love the source code included, very handy for the future!


Thanks for the helpful feedback. I wanted to put emphasis on the graphs so I chose to hide the code but maybe it's not worth the cuteness of the "click to show". Changing that now.


Great resource! I’ve been looking for a while for something with complete derivations all in one place but haven't been able to find anything like this. Thanks for sharing


First thing I thought of when I read the title was sentdex's nnfs.io (neural networks from scratch). I don't have time to look at this in full right now; are there any major differences between these? From my quick glance, this looks like it's heavy on the math notation, and by the time it gets to the code it's already getting too complex (for someone who doesn't know Python) to read.


The approach to this book is very similar to nnfs.io (a similar focus on deriving models from the bare bones). The biggest difference is that his focuses on deep learning while mine covers a) a wider range of models and b) more introductory methods (linear regression, logistic regression, naive bayes, etc.)

It is definitely heavy on the math notation. Since the purpose of the book is to provide mathematical derivations, it's impossible to do that without lots of notation. That's why I added the notation and conventions page (right after the table of contents). If you do get a chance to look through it and see any notation in particular that is difficult to follow, please let me know and I'll update that page.

As for the code, knowing Python is a loose pre-requisite, but familiarity with object oriented programming in general should be enough to at least follow along. Most of the code does one of two things: 1) manipulates data represented as vectors/matrices or 2) creates/manipulates the objects themselves. But I should work on making the commenting clearer so it's apparent what each step does, even for someone less comfortable with Python.

Thanks so much for your feedback!


Thanks for the summary. The object-oriented stuff is fine in my mind. The trouble with the code comes from things like the numpy functions. I have experience with matrices as used in computer graphics, but some of numpy's functions conflict with what I understand those constructs to do. I don't know, maybe it's the lack of any type information that makes this hard for me: I can't just look at something I know and figure out what is going on from there; I have to go through each function one by one to make sure I know what it actually returns.
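One example of the kind of surprise I mean, as a toy sketch (the arrays are made up, and this may not be the only mismatch): `*` on numpy arrays is elementwise, not the matrix product I'd expect coming from graphics.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

elementwise = A * B     # multiplies matching entries: [[0, 2], [3, 0]]
matmul = A @ B          # true matrix product:         [[2, 1], [4, 3]]

# With no static types, shape and dtype are the quickest way to check a result.
col_means = A.mean(axis=0)
info = (col_means.shape, col_means.dtype)
```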

Either way, good luck with the book. I've considered doing knowledge dumps myself for game/game engine oriented information but I don't even know where I would start with that.


Thanks for the feedback. Sounds like you're not alone in your thoughts on numpy. I'll brainstorm better solutions--maybe explaining each numpy function in a side note or adding a numpy overview to the appendix. It just makes things so much easier so I am hesitant to go numpy-less.

Getting started was definitely the hardest part. I put the book together with JupyterBook (https://jupyterbook.org/intro.html), which I would highly recommend! Really neat tool and a helpful community.


Looks like this one is free and also covers a wider breadth of ML topics (though perhaps doesn't cover NNs in as much detail -- I don't have a paid copy of nnfs).


Nice work. Seems like it focuses on machine learning stuff that's not deep learning. Maybe you could add a section on collaborative filtering.


It's definitely not deep learning focused. I wanted to start by introducing the models machine learning practitioners should all know. But collaborative filtering and stuff along those lines would be a good addition! Thanks for the feedback.


I read the title and was somehow expecting a guide on compiler design implementing some flavor of the ML language, such as Standard ML.


Yeah, I was excited too. Of all the naming conflicts in our industry, this is one that frequently bites me.


Isn't one of Appel's books like that? I vaguely remember his CPS book being about implementing ML (His Intro to compiler in x books are not)


I'm sad it's not!


This looks fantastic! As a bit of a selfish question, have you considered also offering a downloadable epub of this book? I've been trying to keep my long-form digital reading to my eReader, for the sake of not looking at LCDs all day, but that makes web ebooks a bit of a pain due to e-ink not liking scrolling very much.


You can now find a pdf version of the book at https://github.com/dafriedman97/mlbook/blob/master/book.pdf. JupyterBook is still working on the PDF creation, so this doesn't have any of the images unfortunately. That said, most of them aren't too important (until the neural net chapter, where they get a little more important)


Good question. I definitely prefer downloadable books myself. I made it in JupyterBook because that was easiest with the executable ipynb files. I'll look into whether I can make it downloadable and update you if so.


Just did a quick scan and looks pretty good! Would it be possible to add an option for exporting the raw code (maybe broken down by chapter/section) separate from the notebooks? Ideally this would run self-contained without relying on state buildup in the notebooks.


Good call! I'll work on that ASAP


This book is awesome! How does this compare to Introduction to Statistical Learning or Elements of Statistical Learning? Other than the addition of code?


Thanks so much! I would say two major differences: 1, as you mention, it codes each method up from scratch in Python so readers can really see each step the method uses. 2, it is focused on the derivations of these methods, rather than their intuition, applications, etc.


Looking through the topics, a useful addition would be SVMs.


Agreed. That's #1 on my list right now.


Unfortunate this appears next to the post about deep learning jobs collapsing on the front page



