Matrix Calculus for Deep Learning

madenine · on Nov 29, 2019

I’m in favor of there being more and better resources to learn anything out there, but every time I see a deep learning 101 type material all I can think is “who is this for?”.

In ~July 2016 I was at a presentation by NVidia at GW in DC. They showed off how easy it was to build out and train a model using some of their tooling (Digits maybe?). After the demo they opened it up for questions and a grad student ‘asked’ “You just did in 10 mins with 30 lines of code what I worked on for an entire semester”.

That’s been the trajectory of the tools and increasing abstraction in this space. It’s just getting easier and easier to build models that work (which is great), and it gets easier and easier to do so without knowing more than an extremely high level overview of the math behind it all.

So while this looks like a great resource - who’s it for?

For jobs/problems that need you to have a thorough understanding of the math and theory behind the networks this isn’t going to cut it.

For jobs/problems that need you to get something working math or not - this likely isn’t necessary to get started.

So it’s for people that have been getting into DL but also haven’t bothered or needed to look up the math concepts?

bonoboTP · on Nov 29, 2019

While this page looks nice, there is certainly a proliferation of beginner articles in this area. I think it's driven by demand, the same way gyms get tons of new members in January. As someone who sees all the hype and salaries involving ML, you dream of doing it too, so you look for articles to start out. That's the demand. After reading it, you don't get it and have no patience so you raise demand for titles like "ML for humans" or ML made easy or "Gentle intro to ML for the rest of us". Or ditch articles and watch Siraj ramble about making money with an AI startup today! I'd wager that only a small percentage of readers actually works their way through to advanced topics.

On the supply side (while TFA looks legit) people who are a few lessons ahead want to increase their visibility, start a blog/brand, make their CV stand out by showing community engagement and writing from a position of authority. This is mostly seen on Medium.

How to avoid the trap of being an eternal beginner? Accept that it will take time, be clear on your goals, try gathering a group of peers and expert guidance. Reddit and forums can be crap for this as you the beginner will gravitate towards the self proclaimed experts who may be full of shit and just play social games well, creating a blind leading the blind situation and cargo culting around terms that nobody really understands. There is a value in universities: they lay out a path, give guidance and let you work/learn together with peers. Ok, enough with this rant.

nabla9 · on Nov 29, 2019

> So while this looks like a great resource - who’s it for?

I give you an analogy. Electricity. Who needs to know complex numbers and differential equations to understand electricity? Technician, civil engineer, scientist or research engineer?

Technician who just wires the house don't need math. They just read the wiring instructions and follow standard practices. Nvidia boasts about the tools it builds for 'ML technicians' in this analogy.

You need to know math if you are building new architectures and applying complex models for something nontrivial. It's not going to work first time and you need to know what's going on. Even if you are the 'civil engineer' in this analogy you should be able to read the math and understand it even if you don't do the math by yourself. You won't be able to do literary research and learn new stuff if you can't read math fluently.

If you are programmer who is given ML tools to implement something someone else designed and understand you don't need this or use existing models, you don't need this. Your career might benefit from knowing it but you can manage without.

p1esk · on Nov 29, 2019

I believe the OP's point was that the math described in the article is too simple and not enough to do any serious research. Anyone who attempts to do NN research already knows this material (and a lot more). This tutorial could be useful to someone who wanted to implement simple backprop from scratch, but all DL libraries already do it automatically. Someone who just wants to learn a bit about NNs to classify images or generate text does not need to know this, and someone who wants to make a breakthrough in NN theory already knows it. So yes it's not very clear who is the target audience here. I'm guessing it's for a bright highschooler who just learned calculus and who is interested in how NNs work. For such students I'd recommend reading http://neuralnetworksanddeeplearning.com instead.

hgoel · on Nov 30, 2019

But someone who wants to contribute to the research doesn't just have this knowledge pop into their mind out of nowhere. They're going to learn it from somewhere and what's wrong with one more resource to help out with that.

sjg007 · on Nov 29, 2019

I don't think ML and NNs are at the point yet where you don't need to understand the math.

olooney · on Nov 29, 2019

> So while this looks like a great resource - Who's it for?

Undergraduates, or graduate students who didn't happen to take the right prerequisites. Most STEM degrees require vector calculus, but few require matrix calculus. A physics undergrad might see matrix calculus if they studied general relativity, or math undergrad interested in optimization or differential geometry. A statistics major might have seen it when working with multivariate distributions and regression. But it would be easy to miss.

Nevertheless, matrix calculus, which is not in fact a large subject, but only some new notation and a handful of theorems, is the key to understanding back-propagation. It's not the only way to approach it - you could just keep track of all those subscripts and indices - but it's one of the best. The differential form[1] is particularly good to learn because it maps almost 1-1 onto the error terms in a gradient descent implementation.

[1]: https://en.wikipedia.org/wiki/Matrix_calculus#Identities_in_...

> So it’s for people that have been getting into DL but also haven’t bothered or needed to look up the math concepts?

Everyone has to start somewhere. The usual pedagogical technique to teach a subject twice: once at an "undergraduate" level, omitting the technical details of proofs, with the goal of providing a big picture intuitive understanding of the subject and some practical symbol pushing ability; then again at the "graduate" level, with more formal definitions and detailed proofs. Your own education presumably used this structure, no? Even if you've already graduated, this "two pass" approach to learning new material is still a good idea. Few of us are von Neumann, able to dive immediately into the deepest depths of theory in a new field: we can all benefit from taking the time to develop some good intuitions first.

This is where all textbooks come from - a lecturer presents the material the way that seems clearest to them. They prepare notes to keep everything straight in their own head. Sometimes they find that their presentation resonates with students and is superior to what's currently available, so they start to develop their notes into something publishable. Most such projects get abandoned before too long, but many end up in some form on the internet, and a few go on to be developed into standard texts. As long as you can find even one new way to explain things that helps students, the exercise is not in vain.

_revy · on Nov 29, 2019

> Most STEM degrees require vector calculus, but few require matrix calculus

Gradients, Jacobians, etc are typically covered in a multivariable calculus class along with vector calculus (line integrals, Green's theorem, Stoke's theorem, etc). This is required for engineering and physics degrees.

mjburgess · on Nov 29, 2019

This is a very unempahtetic, almost anti-educational, comment.

Pick any journey to any destination. This article occurs at many points along them.

Need to have a thorough understanding of math? Then this is a starting point.

Don't? then this is an endpoint.

Akababa · on Nov 29, 2019

I'd guess it's for people in the second camp (practical) who are just trying to satisfy their intellectual curiosity or want an intro to the math behind it all.

I agree with your assessment that's it doesn't have much practical use on its own, nor is it an efficient means to any particular end.

unityByFreedom · on Nov 29, 2019

Well one of the authors (Jeremy Howard) strongly advocates for learning by writing. I think it's a good mission: there is always value in finding better ways to convey information. If you already know this material then it's easily ignored. Other people may be excited about something you already know, and that's fine. Unfortunately that makes it pop up in your news feed and sorry I have no solution for that =).

Edit: Just noticed the other author is the creator of ANTLR, which I recall using in school to write our own languages. Cool!

starpilot · on Nov 29, 2019

Wouldn't be a proper DL post without some good ol' gatekeeping.

madenine · on Nov 29, 2019

Honestly, I was trying to convey the opposite - the gates are wide open and it’s never been easier to drive through.

rosstex · on Nov 29, 2019

I don't even know what course you'd learn matrix calculus in, but it was a necessity for my upper level ML courses. This website would have been a godsend, and would have spared TAs many hours figuring out what knowledge we were missing. We got by with the Wikipedia page...

joshvm · on Nov 29, 2019

High level computer science courses straddle several disciplines and you end up with weird stuff like computer vision being the purview of the electrical engineering department. EEs tend to know some of this stuff because a lot of matrix algebra comes up in control/optimisation theory.

In physics we did matrix calculus primarily for electromagnetism and fluid dynamics. Maxwell's equations are the first time most students see the div/curl operator and it's also used in e.g. Navier-Stokes. But even though we were taught it, I don't think we really bothered to remember what a "Jacobian" is.

A lot of this stuff also comes up in physical rendering.

barbecue_sauce · on Nov 29, 2019

Does this not count as looking up the math concepts?

finebalance · on Nov 29, 2019

It matters, in certain contexts.

For example: a large number of clustering methods boil down to matrix factorization, with variations in the constraints. If you have both domain understanding and a general understanding of what kind of output these variations are likely to result in, you can often narrow down the list of methods you need to try.

Also, interviews.

JamesBarney · on Nov 29, 2019

My guess is it's for people who have gotten something working in the past and are now looking to go a little deeper.

stinky · on Nov 29, 2019

> So it’s for people that have been getting into DL but also haven’t bothered or needed to look up the math concepts?

Yes.

olooney · on Nov 29, 2019

I am very impressed with the clarity of presentation here. I usually link to The Matrix Cookbook[1] when I need to cite a reference for matrix calculus theorems but I might reference this instead in the future. I particularly like the section on the vector chain rule (which is very clear) and the section on element-wise operations (which uses novel notation to present many results in a compact form.)

[1]: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

fg6hr · on Nov 29, 2019

A genuine question: is there any math behind ML at all? For example, is there any solid theory, with proven theorems, that would tell us what happens when we add another conv layer here or use a 3x3 conv kernel instead of a 2x2 one over there, or replace that tanh with a relu? From my limited understanding, ML roughly works like this: we shuffle around the ML graph, using some intuition, off-load it to a cluster of GPUs that costs $10k/hour, feed it a dataset with 1 billion images and see what happens; but nobody can predict the behavior of training or convergence or accuracy based on the ML graph and data alone.

psv1 · on Nov 29, 2019

Just want to point out that you're asking about "ML" but your questions are about neural networks/deep learning which is only a subset of machine learning.

> is there any math behind ML at all?

Yes. There is a lot of research in this area (some people argue it's excessive at the moment). You have the correct intuition that the answers aren't black and white all the time. For example, there are solid reasons to choose relu activations over tanh. Or to build certain types of network architectures for certain tasks. That doesn't mean that you can immediately calculate what would happen if you switch from one activation to another without running your network.

jeeceebees · on Nov 29, 2019

Of course there is. All the building blocks that people are mix and matching in networks nowadays were introduced at some point.

The paper that introduced batch norm, adaptive instance norm, attention heads, or any module used in a network have an extensive discussion of the motivation for their existance, some derivation or proof that they do what you want, and an empirical test to show it helps in practice. The reason some losses allow GANs to converge in certain situations while others don't isn't a complete mystery, there is theory that supports this.

Researchers designing new models are considering weak points in old approaches, identifying why they aren't working correctly, and proposing something new that solves a part of the problem. All of this is done by looking at the math behind all the operations in the network (or at least the parts relevant to a certain question).

That nobody really knows how AI works is one of those myths told by the media. Just because the model weights aren't interprable doesn't mean we don't know why that model works well. It just takes quite a bit of maths knowledge to really understand state of the art models. All that knowledge is also easily packaged into modern frameworks that make it easy to use without a deep knowledge of why it works. All of this contributes to the feeling that nobody really knows what's going on, while in reality it's onky the majority of people that don't know what's going on ;)

p1esk · on Nov 29, 2019

nobody really knows how AI works is one of those myths told by the media

It's not a myth. No one really understands how neural networks work. We don't know why a particular model works well. Or why any model works well. For example no one can answer why NNs generalize so well even when they have enough learning capacity to memorize all training examples. We can guess, but we don't know for sure. Most of the proofs you see in papers are there as fillers, so that papers seem more convincing. We rarely can prove anything mathematically about NNs that has any practical value or leads to any breakthroughs in understanding.

If we did really understand how NNs work, then we wouldn't need to do expensive hyperparameter searches - we would have a way to determine the optimal ones given a particular architecture and training data. And we wouldn't need to do expensive architecture searches, yet the best of the latest convnets have been found through NAS (e.g. EfficientNet), and there's very little math involved in the process - it's pretty much just random search.

Funny you mentioned the batchnorm paper - we still don't know why batchnorm is so effective - the paper gave an explanation (covariate shift reduction) which later was shown to be wrong (batchnorm does not reduce it), then several other explanations were suggested (smoother loss surface, easier gradient flow, etc), but we still don't know for sure. Pretty much every good idea in NN field is a result of lots of experimentation, good intuition developed in the process, looking at how a brain does it, and practical constraints. And yes, sometimes we're looking at the equations, and thinking hard, and sometimes we see a better way to do stuff. But usually it starts with empirical tests, and if successful, some math is used in the attempt to explain things. Not the other way around.

NNs are currently at a similar point as where physics was before Newton and before calculus.

chestervonwinch · on Nov 29, 2019

> NNs are currently at a similar point as where physics was before Newton and before calculus.

I'm more inclined to compare with the era after Newton and Leibniz, but prior to the development of rigorous analysis. If you look at this time period, the analogy fits a bit better IMO -- you have a proliferation of people using calculus techniques to great advantage for solving practical problems, but no real foundations propping the whole thing up (e.g., no definition of a limit, continuity, notions of how to deal with infinite series, etc.).

p1esk · on Nov 29, 2019

Maybe. On the other hand, maybe a rigorous mathematical analysis of NNs is as useful as a rigorous mathematical analysis of computer architectures - not very useful. Maybe all you need is just to keep scaling it up, adding some clever optimizations in the process (none of the great CPU ideas like caches, pipelining, out of order execution, branch prediction, etc came from rigorous mathematical analysis).

Or maybe it's as useful as a rigorous mathematical analysis of a brain - again, not very useful, because for us (people who develop AI systems), it would be far more valuable to understand a brain on a circuit level, or an architecture level, rather than on a mathematical theory level. The latter would be interesting, but probably too complex to be useful, while the former would most likely lead to dramatic breakthroughs in terms of performance and capabilities of the AI systems.

So maybe we just need to keep doing what we have been doing in DL field in the last 10 years - trying/revisiting various ideas, scaling them up, and evolving the architectures the same way we've been evolving our computers for the last 100 years, with the hope there will be more clues from neuroscience. I think we just need more ideas like transformers, capsules, or neural Turing machines, and computers that are getting ~20% faster every year.

fg6hr · on Nov 29, 2019

Actually, I read the batch norm paper and maybe I forgot important details, but it roughly went like this: "here we add a term `b` to make sure the mean values of Ax+b are zero and that will help us with convergence; ah, and here is a covariance matrix!", but no quantitative proofs about how much that convergence was helped. No, I intuitively agree that shifting the mean value to zero should help, but math taught me that there is a huge difference between a seemingly correct statement and its proof. The ML papers seem to just state these seemingly correct ideas without real, proof-backed understanding, why this works. In other words, ML is entirely about empirical results, peppered with math-like terminology. But don't take my blunt writing style personally.

Let's take the simplest example: recognizing the grayscale 30x80 pictures with 0-9 digits. IIRC, this is called the MNIST example and can be done by my cat in 1 hour without prior knowledge. Let's choose the probably simplest model: 2400 inputs are fully connected with a 1024 vector that's fully connected with a 10 vector. And let's use relu at both steps. We know that this kinda works and converges quickly. In particular, after T steps we get error E(T) and E(1e6) < 0.03 (a random guess). Can you tell me how T and E will change if we add another layer: 2400->1024->1024->10, using the same relu? Same question, but now we replace relu with tanh: 2400->1024->10.

samvher · on Nov 29, 2019

I think you and the person you're responding to might have slightly different expectations behind what level of rigor counts as "math", just like how physicists and theoretical mathematicians often have somewhat different ideas about rigor.

My impression is that obviously ML is guided by math and people want to have an understanding of why some things converge and others don't. But "in the field" many people just mess around with different set-ups and see what works (especially in deep learning). Maybe theory follows to explain why it worked. I think you're right that a lot of progress in the field is based on intuition and some reasoning (e.g. trying something like an inception network) more than derivations that show that a particular set-up should be successful. I get the impression that most low-level components are pretty well understood, but when they are stacked and combined it gets more complicated.

I would be very curious to see a video of your cat solving MNIST in 1 hour!

tel · on Nov 29, 2019

First up, what you seem to be talking about is Deep Learning, not Machine Learning in general. In more general ML there are many theorems, some also apply to DL.

Also, the step of "shuffle around the ML graph using some intuition" involves gathering that intuition, which usually arises from a great deal of mathematical competence. A 3x3 conv kernel versus a 2x2 one can, for instance, be discussed in terms of Fourier theory and mathematical image processing, but areas with huge built-in theory.

Things like replacing the activation function were initially studied anecdotally. People realized that in some settings one activation function or another would lead. Eventually, there was also theory showing that in large nets of stable configurations, there was serious interaction between the initialization method and the activation function and problems like poor backprop signal propagation were tackled theoretically and practically.

Generally, the mystery comes from the vast parameterization of these DL models. They operate in a space that's very hard to generalize—large, finite spaces. Small finite spaces get treated exhaustively. Infinite spaces get treated asymptotically. Large finite spaces get bounded on either side by those methods.

So yes, there might feel like there's a dearth of theory in DL when it comes to the large scale behavior of a general network. That can be super frustrating. At the same time, people are trying to push through and create more theory every day.

orlp · on Nov 29, 2019

There is a subfield that does serious mathematics, but their results are usually far removed from the state of the art stuff. Their results usually look like "a neural network with 1 hidden layer is a universal approximator" or "exponential expected convergence speed on linear relations for <X>".

SpaceManNabs · on Nov 29, 2019

Not necessarily true. The polynomial time for the escape of saddle points by using stochastic noise is pretty practical.

p1esk · on Nov 29, 2019

How in the world is this "pretty practical"? It would be if people used this theorem to come up with an idea of SGD, but that's not what happened. SGD appeared as a way to overcome the practical constraint of computing full GD. Not to mention that "polynomial time" is meaningless to any practitioner.

SpaceManNabs · on Dec 4, 2019

it did come with a new online update for orthogonal tensor decomposition using higher order moments and with comments on NP-hardness for 4th order and higher.

In addition, it came with tricks with how much noise to inject in certain situations. "How much noise do you need is enough to escape?" which is pretty practical

orlp · on Nov 29, 2019

Sorry, my statement was too strong. I edited it to "usually".

BiasRegularizer · on Nov 29, 2019

I'd like to chime in that Variational Autoencoder's theoretical framework is also quite useful in unsupervised learning tasks. It bridges variational inference methods (a powerful classical ML technique) with neural networks, thus allow very efficient representation, generalization and disentanglement.

m0zg · on Nov 29, 2019

There's plenty of math still. In order to not waste GPU-hours, you gotta have some intution as to what _not_ to do, because otherwise you'll be trying for a very long time and relying entirely on luck, which is combinatorially unfavorable and frustrating. That intuition is often grounded in math, mostly differential calculus and statistics. That said, there are a good number of techniques which nobody is sure why they work. I.e. batchnorm, which is used all the time. People have some theories and intuitions as to why it works, but nobody is sure why, nor is there any rigorous way to pick hyperparameters for it.

SpaceManNabs · on Nov 29, 2019

Yes.

https://joanbruna.github.io/MathsDL-spring18/

On Tanh and ReLus, the paper on SeLUs is pretty intense.

ReLUs help with vanishing gradients to some degree on a larger degree of loss and activation functions, btw.

edit: On your question about a billion data points and predicting behavior, we are getting there.

http://www.offconvex.org/2017/03/30/GANs2/

playing_colours · on Nov 29, 2019

There are maths and proves behind Neural Networks, but Machine Learning is an umbrella term. There are areas within it, maybe not that mainstream, that are also based on solid maths foundations.

One of the most interesting and elegant examples , Topological Data Analysis that is based on Topology and utilises tools like persistent homology. They can be applied to image processing, classification, etc.

publicrootkey · on Nov 29, 2019

If I understand your question correctly, this is exactly what Bayes' Theorem deals with. Namely that B must be true within some probability given that B is true represented as P(A | B).

fg6hr · on Nov 29, 2019

Let's consider a simple image classifier: 64x64 grayscale input, 10x1 output that detects 10 classes of images. The model: y=tanh(Ax+b). You'd probably say that there is no way this will work because this model is too simple. But can you explain why this won't work? Can you tell what the maximum accuracy this model can reach? What kind of datasets this model would work better on?

[Edit] Whether this model works is question. Sure, it can't recognize dogs or cats, but what if the dataset is 0-9 digits? Now it suddenly works, right? And works really well. But what's changed? Can we describe in mathematical terms what makes the 0-9 dataset so special? It'll probably work with A-J letters too, but what about hieroglyphs?

The math I'm looking for would tell that E(T) is the error and on such model and such dataset, E(T)=exp(-T^2)+O(exp(-T^3)), according to such and such theorem; and according to another theorem, if the dataset is isomorphic to that manifold, E(T) can be improved to O(exp(-3T^2)).

improbable22 · on Nov 29, 2019

I think you're asking for the holy grail here. Everyone would love to have such a thing but nobody thinks it's likely to be possible.

So they settle for much smaller targets. Either of understanding how much simpler systems work. Or of trying to understand a little bit the effect of tweaking something in some more complicated model.

Perhaps you should think of these two approaches as analogous to doing simple chemistry (what shape is a sugar molecule? A DNA molecule?) vs trying out drugs (if you eat the bark of this tree, you don't get malaria! Let's refine that stuff). Both can be useful, but they are very far from a unified theory of how your body works.

fg6hr · on Nov 29, 2019

ML is more like alchemy, I'd say: mixing components using intuition and experience, but without understanding what these components really are and why they work. In this analogy, AI is the recipe to make gold and the ML alchemists haven't invented nuclear physics yet.

improbable22 · on Dec 1, 2019

But now we know there was physics at the bottom of alchemy. Whereas demonology at best leads you to psychiatry, and we still don't have simple models of what works there. Nor much hope of finding them. Thinking is a messy business.

fg6hr · on Nov 30, 2019

My guess is that what I'm asking for isn't that complex and could be done by a few serious mathematicians in a few years. The dynamics of tanh(Ax+b) is hardly more complex than Naiver-Stokes equation or the modern topology theory.

improbable22 · on Nov 30, 2019

tanh(Ax+b) is simple, but the dataset it's supposed to work on is not easily summarised. I think that's the huge difference. The guys doing "sugar molecule" studies make progress by taking much simpler datasets, like random points.

Naiver-Stokes is much simpler because it operates by itself. Of course turbulence is hard but even there we usually care about its coarse features, we'd be content to throw away almost all the information provided the calculation of the wing's lift works out OK.

master_yoda_1 · on Nov 29, 2019

If anybody want to read the real stuff then here is the reference: Matrix Computations https://www.cs.cornell.edu/cv/Books/GVL/index.htm

FabHK · on Nov 29, 2019

The book by Golub (RIP) and Van Loan goes way beyond what's discussed here (SVD, QR decomposition, eigenvalue computations, iterative solvers, error analysis, etc.), and it's focused more on (numeric) linear algebra rather than calculus.