Neural Networks as Ordinary Differential Equations

rmbeard · on Dec 18, 2018

Early work on this approach is contained in Jean-Pierre Aubin's book on neural networks written back in'96 where he used differential equation/inclusions and control theoretic approaches to study neural networks, see here https://www.amazon.com/Neural-Networks-Qualitative-Physics-V...

pilooch · on Dec 18, 2018

Wow, Jean-Pierre Aubin mentioned on first comment on HN, that makes me so happy. I have had him as a teacher 20 years ago, an incredible teacher, always grounding the maths into their history, and its social production, with descriptions of Banach life and others along with their motivations.

Thanks to Aubin, I did my PhD in AI a long time ago and am still working in the field now!

itin · on Dec 17, 2018

Is it just me or is the math here incorrect?

https://imgur.com/a/QXUZerV

itin · on Dec 17, 2018

specifically, I don't see where the last step G(h_t) = etc. follows from

rkevingibson · on Dec 17, 2018

Yup, that's a mistake. Shouldn't have the +h term at the end.

jeromebaek · on Dec 17, 2018

Stared at this for a while also. Yes this is wrong.

_fullpint · on Dec 18, 2018

Thank you for bringing this up. Legit, was bothering me to the extent I didn’t believe I could comprehend the article.

vanderZwan · on Dec 18, 2018

So what is the correct equation? Or has the article already been fixed?

mlevental · on Dec 18, 2018

the mistake is simply that G(h(t)) = F(h(t))/dt, rather than G(h(t)) = F(h(t))/dt + h(t), i.e. it's only in the text following the inset equations. you can verify this by looking at the "Evaluating ODEs" section.

edit: looks to have been fixed now.

sparkcc · on Dec 18, 2018

Thanks a lot. reading

joe_the_user · on Dec 18, 2018

So the time term in these "neural" differential equations is the depth of the network. I see the connection between resnet and the differential equation solver. Basically, you're saying your final result is the end-point of a curve governed by a differential equation whose initial conditions are the input set. You train the equation's parameters in a fashion akin to how you train a standard neural net.

The thing is, with ordinary neural nets you have two arguments for how the thing works.

One view is you are taking inputs and projecting those into a high dimensional feature space and then drawing a separating hyperplanes between points in two or more categories. If you have a proper shaped hyperplanes, the results will tend to generalize.

Another view is you're doing something like logistic regression. You have a function and you're trying to reduce the distance between it and the "true classification function" that takes points to their proper category. You use gradient descent to minimize the distance between your function and this "real function" that you get through data.

So anyway, I am not sure how these intuitions apply to differential equation version. I'm not sure what other intuitions there are for why these equations should work.

tnecniv · on Dec 18, 2018

> Another view is you're doing something like logistic regression. You have a function and you're trying to reduce the distance between it and the "true classification function" that takes points to their proper category. You use gradient descent to minimize the distance between your function and this "real function" that you get through data.

This viewpoint doesn't change when you switch to ODEs. The evolution of the system T seconds into the future is a (piecewise) continuous function of the initial state.

joe_the_user · on Dec 18, 2018

Me: Another view is you're doing something like logistic regression....

tnecniv: This viewpoint doesn't change when you switch to ODEs. The evolution of the system T seconds into the future is a (piecewise) continuous function of the initial state.

Sure, it doesn't. But it moves to ground where it seems hard to tell what's happening. Approximations of one function by a sum of set functions with variable parameters is a standard arrow in the mathematical "quiver" - Taylor series, linear regression, etc. A neural net naturally is a complex ensemble of such approximations. You gain tremendous flexibility of representation but lose any idea of the shape of the curve that's involved.

With the ODE theory, you have a continuous function that is used instead of the parameters. It is hard for me to conceptualize exactly what's happening beyond that.

I mean, neural nets seem to able to approximate just about any complex classification function already.

So what I believe we need in recasting/representing/etc neural nets is:

1) Ways to speed training. Ways to update a trained system for small-ish changes in the system logic without needing a full retraining.

2) Ways to explain how a neural net reaches its decisions. Ways to tell a neural net to not use a given reasoning branch if it's using "garbage reasoning".

3) Ways to merge several neural nets that operate on related but not identical problems.

Etc.

I'm not sure if this model brings us closer to this.

Muximize · on Dec 17, 2018

Isn't this also what the Julia team is trying to do? https://julialang.org/blog/2018/12/ml-language-compiler

ChrisRackauckas · on Dec 18, 2018

We are building tooling which will allow you to put together differential equations and neural networks in any way you please. This is just one way of putting the two together. The applications we were looking at are different, but it utilize similar tooling. We have forward-mode, reverse-mode, forward sensitivity analysis, and adjoint sensitivity analysis (the method mentioned here, which is actually from at least the 90's) implemented and tested with the DifferentialEquations.jl integrators. We had a recent paper exploring the timing differences between the different sensitivity analysis (gradient calculation) methods:

https://arxiv.org/abs/1812.01892

The neural ODE falls into the case where the number of parameters is large, and so the traditional adjoint methods seem to do best there when optimized. So in some sense, we aren't "trying" to do it, but have had library implementations for these gradient calculations for a few years now. The documentation is here: http://docs.juliadiffeq.org/latest/analysis/sensitivity.html

But we are heading in new directions with this as well. The traditional adjoints seem to do best here because of inefficiencies with tracing-based reverse-mode autodiffs like the kind found in Flux.jl or Autograd. Basically, those kinds of tape-based autodiffs rely on each operation being expensive, which isn't true for the internals of an ODE solver (but tends to be true for applications like neural nets where large matrix multiplications dominate). This means that the overhead of building the tape is really noticable and thus it doesn't do well in this application. Additionally, the tape is dependent on the input because the trace of the operations is dependent on what branches are taken. This means that, even if the ODE solver is fully optimized and every piece is JIT compiled, the backprop tape itself cannot be compiled and saved. Again, this isn't an issue for large matmuls, but this is another factor that comes into play when the operations are not sufficiently costly.

However, the blog post you link to discusses source-to-source AD as an alternative. Source-to-source builds the backprop code directly from the full source, doing both branches at the same time. This means you can AD the source and then compile it. This methodology solves the issues we found with tracing-based autodifferentiation for this application, so I'll be happy when we get to testing it. Zygote.jl isn't quite ready for source-to-source AD on the full DifferentialEquations.jl code, but it's getting close.

leggomylibro · on Dec 18, 2018

Very interesting - does this mean that you could also rephrase arbitrary differential equations as neural networks, then apply parallel computing to solve them?

Sorry if that question doesn't make sense; I'm not really clear on how ordinary differential equations are usually solved these days.

Skunkleton · on Dec 18, 2018

> I'm not really clear on how ordinary differential equations are usually solved these days.

Its been a while, but I remember that despair and sadness are involved in some way.

sdenton4 · on Dec 18, 2018

Oh, that's what the paper meant when they said the solver was a black box...

mabbo · on Dec 18, 2018

"Black box" is the name of their liquor cabinet

gigatexal · on Dec 18, 2018

Oh good at least another person here that feels the same about ODEs as I do. Math in college was rough for me.

tomc1985 · on Dec 18, 2018

That and some crazy variable voodoo

Gladdyu · on Dec 18, 2018

You could express an arbitrary (well-behaved) ODE as a neural network by discretizing it in timesteps, but you would not extract any additional parallelism.

In normal ODEs you can already compute each X_{i}(t)/dt in parallel, but you will still have to evaluate layer/time t completely before evaluating layer/time t+Dt as it feeds forward in both the NN and ODE case.

ChrisRackauckas · on Dec 18, 2018

You could, but it doesn't work very well. It's very very slow. We built a package for it in Julia to test it out, but it wasn't competitive against any of our other methods. Our writeup is here:

https://julialang.org/blog/2017/10/gsoc-NeuralNetDiffEq

shoyer · on Dec 18, 2018

I think it's premature to write off using neural networks to accelerate solving ODEs based on this study. There are quite a few ways to formulate the problem and there's been some promising work in this area recently, e.g., using neural networks to approximate subgrid processes in climate models: https://www.pnas.org/content/115/39/9684

ChrisRackauckas · on Dec 18, 2018

What you show is something completely different though. Using neural networks to automatically learn some way to formulate a model? That sounds reasonable. Using neural networks to solve a given ODE? That doesn't seem to work well in the cases we tried.

shoyer · on Dec 18, 2018

How would you define “using neural networks to solve a given ODE”?

I’ll certainly agree that it doesn’t make sense to use a single neural net like function to model the full solution of an ODE. But the entire power of deep learning is that it doesn’t force you to use a single approach — you can compose neural nets with any function you like as long as it’s differentiable. So I think hybrid models that blend deep learning with traditional numerical methods are entirely fair game.

ChrisRackauckas · on Dec 18, 2018

>I’ll certainly agree that it doesn’t make sense to use a single neural net like function to model the full solution of an ODE.

That's exactly how I'm defining using neural networks to solve a given ODE, and yes our studies show that it's not a practical method. It was just a good Google Summer of Code where we coded up a version in TensorFlow, the student did the same thing in KNet.jl, and then we played around with a bunch of modifications (to the error function, allowing adaptive training, etc.) to end up convincing ourselves this wasn't a viable method. However, the next steps we are doing are blending deep learning with traditional numerical methods. A post earlier up shows that our differential equation software now blends with the deep learning software, and we have a few projects investigating different strategies for actually using these combinations. We should have results going onto Arxiv late January showing some promising mixed strategies.

wenc · on Dec 18, 2018

Multiple-shooting methods [1] for ODEs are amenable to parallelism in time.

[1] https://en.wikipedia.org/wiki/Direct_multiple_shooting_metho...

tanderson92 · on Dec 18, 2018

Yes and no: there is parallel time integration for ODEs which parallelizes using multigrid techniques.

ChrisRackauckas · on Dec 18, 2018

Yes, but tests tend to show you need like >64 cores for things like parareal methods to do better than standard serial methods. So they exist, but aren't quite practical yet. Maybe a GPU-based one in specific cases can be interesting, but no one has been able to demonstrate an efficient enough code yet. It's definitely an interesting topic.

tanderson92 · on Dec 18, 2018

Thanks for the perspective on the actual performance of these algorithms. So yeah, as an algorithmic fact it isn't necessary to proceed sequentially, though perhaps not performant to proceed in parallel either.

Anyway, I heard AMD has 64 cores on a single chip so it may not be too long..

TheGrassyKnoll · on Dec 18, 2018

Neuromorphic Supercomputer With 1 Million Cores Mimics the Human Brain

  https://www.tomshardware.com/news/human-brain-neuromorphic-supercomputer-manchester,38027.html

AlexCoventry · on Dec 18, 2018

How do they cope with the interactions at the boundaries of the grid?

marcosdumay · on Dec 18, 2018

All A are B does not lead to all B are A.

And arbitrary differential equations are a hell of a class.

gsyhbhsybs · on Dec 18, 2018

Is it too much to hope that eventually it will be established that all the other neural nets are just extremely hard non-linear partial differential equations?

gnulinux · on Dec 18, 2018

The language of PDEs is Turing complete so is the language of neural networks (with a non-linearity and arbitrarily many layers) so this is trivially true in general (they have equal computational expressiveness). But "compiling" neural networks to PDEs might be a hard problem that we can never solve, in general.

turingspiritfly · on Dec 18, 2018

https://link.springer.com/chapter/10.1007%2F11494645_21 seemingly states ODEs can simulate turing machines. Is the simulation enough proof ofit's turing completeness, and in what sense can PDEs be called a language?

gnulinux · on Dec 18, 2018

Being able to simulate a Turing machine is the definition of being Turing complete. So that proof is sufficient.

In mathematics (and computer science) a language is just a set of strings, and a machine is just a language attached to a semantics. I.e. whatever symbols you need to represent PDEs, inductively generated by the axioms and inference rules of mathematics and theoretical physics (like for lambda-calculus: lambda, x0,x1,... '(', ')' and reduction and construction rules)

SubiculumCode · on Dec 18, 2018

Simulation has a number meanings, depending on the field. I tend to loosely think of a simulation as a model with an acceptable level of error, but I am not a mathmatician.

gnulinux · on Dec 18, 2018

This is true, thanks. As the other answer said, in this field we say given any machine M, "M2 simulates M" if M2 computes the same function as M i.e. given any input i, iff M(i) halts then M2(i) halts and M(i) == M2(i) otherwise they both loop. This can also be called "compiling" informaly: given any Turing machine, you can "compile" it to C (i.e. constructing a C program that simulates M) which means C is Turing-complete.

Note that this is a very handwavy definition. The "types" of input, output, and what it means to "compute" will depend on your language, semantics and encoding. (E.g. for lambda calcus "compute" means appying reduction rules until the machine halts, if ever)

SubiculumCode · on Dec 18, 2018

I appreciate the explanation. Clearly in this context simulation does not equal approximation !!

0-_-0 · on Dec 18, 2018

To simplify, in algorithm theory _simulating a Turing machine_ means the ability to execute any program "written" for a Turing machine.

Turing machines, neural networks, cellular automata, computer programs and (apparently) ODEs can execute each other's "programs" (among any others).

tiborsaas · on Dec 18, 2018

Neural nets are universal function approximations, so this sounds like a safe bet :) When I first read about them with this categorisation it made the subject click.

Gladdyu · on Dec 17, 2018

How would this handle systems in which the dimensionality of the hidden layers is not equal to the input dimension, as is often the case? For solving the ODE you'd need to get at least 1 sample point, which might restrict the amount of information you can capture in such a network.

mlthoughts2018 · on Dec 18, 2018

This paper is explicit that it only applies to sequences of ResNet layers, which happen to have the property that the input and output dimensionalities are equal.

It’s a property of the functional form of stacked ResNet layers that allows the ODE layer to be used instead.

You are right that many networks will require at least some (and usually many) layers that change dimensionality. So ODE layers are not going to wholesale replace anything, apart from possibly submodules of a bigger network where the submodule is made up just of sequences of ResNet layers.

ODE layers will also have new applications, such as for irregularly spaced sequential inputs or outputs.

nicwilson · on Dec 18, 2018

I imagine that if you have one discrete layer from input to hidden that you could then treat the hidden layer(s) dynamics as continuous from there on until the next change of dimension. You should still be able to back propagate through a continuous-discrete or discrete-continuous boundary. There should be a lot of literature on this in control systems simulation.

The other thing you can do is to "pad" the input up to be the size of the hidden layers.

sampo · on Dec 18, 2018

> How would this handle systems in which the dimensionality of the hidden layers is not equal to the input dimension

You could use the ODE solver to integrate the system from t1 to t2a. Then you need one normal neural net layer (one forward Euler time step) that takes the system from t2a to t2b and changes dimensions. Then you can again use the ODE solver to integrate from t2b to t3.

foxes · on Dec 17, 2018

So this has become a problem in optimal control, and we have to boundary conditions (input/output data)?

Specifically, find a control function (so the weights of the activation function) which maps the system to your desired output?

Then there are also much better ways to solve two point boundary problems rather than just shooting forward. [0] Is a good place to start, (although its old).

[0] Numerical Methods for Two-Point Boundary-Value Problems. Herbert Keller.

duckworthd · on Dec 18, 2018

A couple of questions,

1) In the motivation for the adjoint method, is u given along with A and C? I don't see how to compute u^T B without u.

2) How do a(t), da(t)/dt, and dG(..)/dh relate to A, B, C, u, and v from the motivation?

aqsalose · on Dec 18, 2018

Also was wondering the same thing. Also, the linked adjoint method tutorial discusses more general class of problems and it is not immediately clear to me how to map them to this linear equation version: here B is the unknown we want to evaluate and the constraint is set on B; in the pdf tutorial one wants to compute gradient_p f, which sounds similar than the formulation in the blog post, because we usually want the gradient (B?) to evaluate d_p f(x,p) at some point p (u?), but the constraints are given for g(x,p) = 0.

tabtab · on Dec 18, 2018

There may be different ways to more or less model the same thing. The trade-offs may be machine efficiency, human grokkability, prove-ability (mathematically analyzable), and accuracy (sometimes approximations are good enough, and gain on other features).

Perhaps related: Factor tables for AI: https://github.com/RowColz/AI

8bitsrule · on Dec 17, 2018

Can someone tell me what I need to add to Firefox so that this:

\( \mathbf{h}_{n+1} = F(\mathbf{h}_{n}) \

is displayed as the author obviously intended?

jimrandomh · on Dec 17, 2018

You may have Javascript disabled or be blocking the script that does the rendering, KaTeX. It works for me in both Firefox and Chrome.

8bitsrule · on Dec 17, 2018

Yes, it was a blocked 'cloud' script. Thanks to you both.

etimberg · on Dec 17, 2018

I've used https://github.com/mathjax/MathJax in the past

Edit: if you're asking why you don't see the author's page as intended, you might have JS disabled

CyberDildonics · on Dec 18, 2018

This is from extension like uMatrix not getting the external markup links. If you enable some of the external sites it should start to work.

andrewtbham · on Dec 17, 2018

repeat?

https://news.ycombinator.com/item?id=18676986

traek · on Dec 17, 2018

The first sentence links to that paper. This is presented as a high-level introduction to that and related papers/ideas.

glitchc · on Dec 17, 2018

Feels like aggressive advertising. Not the best way to promote your work. Let it stand for itself, if others see value in it, they will cite it.

mbauman · on Dec 17, 2018

Are you aware the author of the blog post is doing exactly what you say? Kevin is not an author of the original paper. He's seeing value in it, citing it, and synthesizing it.