Deep neural networks as computational graphs (2018) (medium.com/tebs-lab)
103 points by softwaredoug on May 22, 2023 | 44 comments



I cannot stress enough how foundational of an idea it is to store mathematical expressions as graphs! Analyzing and manipulating such graphs enables one, among other things, to automatically obtain derivatives of a function with respect to its parameters, which is one of the cornerstones of most AI methods.

Before this was invented, people had to manually find expressions for these derivatives and implement the code to compute them. Needless to say, this took far too much time and tended to result in sub-optimal code; nowadays nobody has to deal with it anymore (except on their homework)!

For those curious about how this works, I present a simple implementation on my blog [1].

[1] https://e-dorigatti.github.io/math/deep%20learning/2020/04/0...
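
A toy sketch of the idea (made-up Var/Add/Mul classes, not the code from the post): each node evaluates itself and applies the chain rule recursively.

  class Var:
      def __init__(self, value): self.value = value
      def eval(self): return self.value
      def d(self, wrt): return 1.0 if self is wrt else 0.0

  class Add:
      def __init__(self, a, b): self.a, self.b = a, b
      def eval(self): return self.a.eval() + self.b.eval()
      def d(self, wrt): return self.a.d(wrt) + self.b.d(wrt)

  class Mul:
      def __init__(self, a, b): self.a, self.b = a, b
      def eval(self): return self.a.eval() * self.b.eval()
      def d(self, wrt):  # product rule
          return self.a.d(wrt) * self.b.eval() + self.a.eval() * self.b.d(wrt)

  x, y = Var(3.0), Var(2.0)
  f = Add(Mul(x, x), Mul(x, y))    # f = x^2 + x*y
  print(f.eval(), f.d(x), f.d(y))  # 15.0 8.0 3.0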


> I cannot stress enough how foundational of an idea it is to store mathematical expressions as graphs!..

Understanding a function's computational graph is certainly useful, but storing a function as a computation graph is, in fact, quite expensive. Deep learning systems don't store their computational graphs; the graphs are implicit in their computation process. Deep learning systems instead store their functions as tensors: generalized arrays/matrices. This allows both efficient storage and efficient computation. It's quite possible to do automatic differentiation on these structures as well - that is the basis of "backpropagation".

It's important to distinguish useful conceptual structures like computation graphs (or symbolic representations) from the structures that are needed for efficient computation. Automatic differentiation itself is important because the traditional symbolic differentiation one learns in calculus "blows up" (it can have O(m^expression-length) cost) when attempted on a large expression, whereas automatic differentiation has a cost that is not much higher than the base cost of computing the given function (though if that base cost is high, you lose your advantage too).

Just sayin'


> Deep learning systems don't store their computational graphs, the graphs are implicit in their computation process

I'm not sure about other frameworks, but in PyTorch, I think it's fair to say that while graphs are implicit in the sense that they are dynamically constructed at runtime (as opposed to statically defined at compilation time as with TF), they are not implicit in the sense that they are not directly represented by PyTorch's data structures: PyTorch does store the computational graph. After a backward pass, you can ask each Tensor in the graph, for example, for its associated Node: https://pytorch.org/docs/stable/autograd.html#autograd-graph
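
For example (a minimal sketch; the toy function is arbitrary), you can reach the recorded graph from the output tensor:

  import torch

  x = torch.tensor(2.0, requires_grad=True)
  y = x * x + 3 * x

  # The recorded graph is reachable from the output tensor:
  print(y.grad_fn)                 # e.g. <AddBackward0 object ...>
  print(y.grad_fn.next_functions)  # the parent nodes of that graph node

  y.backward()
  print(x.grad)                    # dy/dx = 2x + 3 = 7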


You can't store a function as a tensor; the tensors are the inputs/outputs that flow along the edges of the graph. TF stores things directly as a graph: https://github.com/tensorflow/tensorflow/blob/master/tensorf... while Jax represents computations in an intermediate representation: https://jax.readthedocs.io/en/latest/jaxpr.html
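
For instance, in Jax you can print that intermediate representation directly (small sketch, arbitrary function):

  import jax
  import jax.numpy as jnp

  def f(x):
      return jnp.sin(x) * x + 2.0

  print(jax.make_jaxpr(f)(3.0))  # the traced computation as a jaxpr
  print(jax.grad(f)(3.0))        # gradients come from transforming that IR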


Another major application is in finance! Computing risk metrics for hedging purposes used to be based on a "bump and reprice" method, i.e. you bump an input by a small amount and see how that affects the price (essentially the definition of a derivative). But now, in modern quant libraries, implementing the pricing as a graph allows you to get all of the sensitivities (derivatives) for free!
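
A rough sketch of the difference, using a toy one-step Monte Carlo call pricer in PyTorch (purely illustrative; real quant libraries have their own AAD machinery):

  import torch

  def price(spot, vol, rate, strike=100.0, t=1.0, n=100_000):
      torch.manual_seed(0)  # common random numbers so base and bumped runs match
      z = torch.randn(n)
      st = spot * torch.exp((rate - 0.5 * vol ** 2) * t + vol * (t ** 0.5) * z)
      payoff = torch.clamp(st - strike, min=0.0)
      return torch.exp(-rate * t) * payoff.mean()

  spot = torch.tensor(100.0, requires_grad=True)
  vol  = torch.tensor(0.20, requires_grad=True)
  rate = torch.tensor(0.01, requires_grad=True)

  p = price(spot, vol, rate)
  p.backward()                           # one reverse pass over the pricing graph
  print(spot.grad, vol.grad, rate.grad)  # delta, vega, rho from a single pricing

  # bump-and-reprice needs one extra full pricing per sensitivity:
  eps = 0.01
  delta_bump = (price(spot + eps, vol, rate) - p) / eps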


Not quite free. Last I checked, adding automatic differentiation to GPU computation for pricing exotic derivatives roughly doubled the computation cost. (Though traditional bump-and-reprice needs a full repricing per bumped input, so its cost grows with the number of sensitivities you want.)


That's cool! But it relies on the model being an accurate representation of reality, right? What kind of models are we talking about anyways, and what inputs do they use? (Asking out of curiosity, not skepticism)


A model is a computation based on its initial assumptions. Sometimes that's fine.

The key thing is that the (good) practitioners know the finance and know the models. If a simple model obviously doesn't fit the market, that's a sign in itself: you might be about to lose some money, or you could take on some risk from other people and get rewarded for it (people are panicking).

Weirder derivatives and so on can get more dangerous, of course. However, even the really famous example from the '07 crash (pricing CDOs and the CDS against them) was arguably due more to deliberate ignorance of widespread fraud and fictitious numbers than to the models per se. Garbage in, garbage out (the model wasn't great either, but still).


Not quite. Some people want bump Greeks. For example, a company may have a standard bump.

Make of that what you will.


There's always a particular brand of woo on display in threads like this that irks me to my core - like someone arriving at show and tell but really they don't know enough to tell.

>I cannot stress enough how foundational of an idea it is to store mathematical expressions as graphs... one of the cornerstones of most AI methods.

Firstly, what are "AI methods"? That's pretty vague, huh? Secondly, you're just saying that expression trees are useful. OK? I mean, this has been known since, like, the days of... Church?

> Before this was invented, people had to manually find expressions for these derivatives, and implement the code to compute them.

There were (and still are) lots of AD systems, and not all of them use a Wengert tape.

https://www.autodiff.org/?module=Tools&language=ALL

Anyway it's just software (and not even particularly clever software), not the second coming lol.


Excellent post, thank you for writing it.


This says nothing new. The idea has been around since the 80s, if not earlier, and is standard textbook material in automatic differentiation.

But for an audience that seems to think it rides on the frontier of knowledge and that we have just discovered these things (when, in fact, it's our ignorance of earlier research that is the reason we are re-discovering them), such a Medium post might be like a drug :D


The interesting thing, to me, is how this task-graph representation of neural networks has taken off AND been optimized to hell. Just look at the tech behind TensorRT, JAX, OpenVINO, TVM, FINN, and all the passes these tools put the graph through to wring out performance - it's amazing. For once, we almost standardised on one (or two, or three, relatively transpilable) simpler languages/DSLs, and a whole industry jumped on optimizing them - something rarely seen outside of gcc and llvm.

That this "at last, a new reduced semantics that we can maybe optimize better" moment has had such huge effort (and success) behind it is the amazing part, for me.


Sometimes simple (and old) ideas start to click and ignite a tornado when the timing is right.

While many may argue that the notion of a computational graph has been associated with neural networks (NNs) from the very beginning, this post is still quite novel. It sheds light on how NNs and general computation theory are interconnected - they are basically the same thing.

For me, it was an instant blast: the computational graph reminds me of a typical intermediate representation (IR) tree from a textbook compiler. And because it is a formal graph, all the mathematical benefits of graph theory suddenly start to click. For instance, graph theory formalizes the distinction between cyclic and acyclic directed graphs. If we imagine for a moment that a graph vertex represents a unit of computation and an edge represents a data flow, it immediately starts to resonate with Lambda calculus. It quickly becomes evident that when the graph is cyclic, it represents recursion in terms of Lambda calculus and thus becomes Turing-complete.

If the computational graph is acyclic, Turing completeness is not achievable.

If you are going to say that this is an obvious and well-known observation, I will be surprised, because it is not. And all that enlightenment was possible thanks to this article combined with a bit of knowledge about graphs, compilers, and lambda calculus. (It just so happened that I'm relatively well-versed in those topics due to professional involvement.)

---

If we continue to formalize this observation further, we may soon find a formal proof that a program P and a neural network NN are equivalent:

  P ~= NN
Both have inputs (I) and outputs (O):

  O = P(I)
  O = NN(I)
Both perform a computation by calculating output O for a given input I. The Turing-completeness observation and the direct mapping to and from the Lambda calculus will give us a way to translate an arbitrary program P to an equivalent neural network NN:

  P -> NN
But the most intriguing part is that the inverse operation also becomes available:

  NN -> P
Congratulations. We have just found a way to translate a working neural network NN to a formal program P written in a programming language L.

What are the consequences of that? The most obvious one is that we can now create software that automatically translates a trained neural network NN into a working deterministic program P written in a programming language of choice, and vice versa. We have just made a small step towards an imaginary AI system that can work as well as a professional software developer. Basically, one day we will be able to create a software-based software developer, a skilled one. And that level of skill will 100% surpass human abilities one day, because Turing completeness guarantees unboundedness.

The steps of NN -> P and P -> NN translation may be performed by the neural network itself. It seems that we already started to see that with ChatGPT 3.

---

As you can see, one simple observation allowed us to precisely calculate the future for many years ahead. And this is why I love math so much.


I had an idea, perhaps crackpot, that will maybe catch fire for you too. What if the graph structure of the natural language parse tree were directly reflected in the literal biological neural network? What if the reason for natural language grammar is that it directly reflects the way it is processed under the hood?


You might be interested in the universal approximation theorem.


The article is a thought-provoking piece of art. A few personal observations from me:

    a) An acyclic computational graph is equivalent to a program without loops and thus not Turing-complete

    b) A cyclic computational graph is equivalent to a program with recursion. The closest analogy from the programming world is a program without loops but with recursive calls. This means that it has Turing completeness, as loops are fully interchangeable with recursion (just different facets of the same thing).
As easy as 2+2=4. This means that a neural network is essentially a program. The only difference is how the program gets written: neural networks do it themselves, by "learning".

Those simple observations explain everything. The brain is essentially a biological Turing-complete processing unit with full plasticity, as you can build any network (program) by changing the "weights" of the neurons. This also means that full AI is possible; it is just a matter of available computational power. Why is that? Because Turing completeness guarantees the absence of boundaries for achieving progressively more sophisticated system abilities.


Most neural networks these days aren't recursive, however.

GPT-3 and friends certainly aren't, unless you count the inference loop as recursion, which is bounded both by vocabulary size and maximum token length (hardcoded for now).


This only means that there is so much more space for AI to grow and improve in the future.

As far as I know, having recursion in neural networks poses a dilemma: on the one hand, recursive networks are more capable; on the other, they are harder to train and may have stability issues caused by occasional positive feedback loops.


Well, be careful with "more capable" - the universal approximation theorem shows that an NN with one hidden layer is able to approximate any (continuous) function, in the limit of how wide that layer is. So there's nothing fundamental about "more capable" or "less capable"; it's only about which representations are easy to train and which aren't.


> Brain is essentially a biological Turing-complete processing unit

How do you know that?

Some would answer, "well, everything in the physical universe can be simulated by a Turing machine, so also the brain" but that would be begging the question.


I just presume that a biological brain is a network of neurons with connections between them, including feedback connections; at least, this is what we see under the microscope. In terms of math, this structure corresponds to a cyclic directed computational graph, and that fact alone makes it Turing-complete.

Currently it is just a highly plausible conjecture, not a formal proof. However, I think I can prove it. But should I?


We don't know how the brain works. Neuron activity is one aspect, and clearly an important one, but it's not everything.

And even if the connectome can be modeled as a graph, the dynamics on the graph could be totally alien to us. It might be computable by a Turing machine. But it also might involve some non-computable process we can't imagine. Perhaps it involves fundamentally unknown physics. (I consider the last option to be quite likely. I think rather few people actually entertain the idea that physics as we understand it today could give rise to consciousness. But here we are.)


I may be misremembering but aren't neurons one-way connections with no feedback?


Indeed, the self-attention mechanism within the transformer architecture is often described as a graph, where tokens are the nodes and the key and query vectors are combined to establish the affinity that each token has for each other token.
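
A bare-bones sketch of that token-affinity graph (single head, random example values, scaling only):

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  tokens, d = 4, 8
  Q = np.random.randn(tokens, d)  # one query vector per token
  K = np.random.randn(tokens, d)  # one key vector per token

  # affinity[i, j] = weight of the edge from token i to token j
  affinity = softmax(Q @ K.T / np.sqrt(d))
  print(affinity.shape)           # (4, 4): a dense weighted graph over the tokens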


These are two different things, actually. The graph you are talking about is constructed using the inputs of the attention layer (ie, nodes=tokens, edges=affinity), while the article talks about graphs that represent the actual operations performed in the attention layer (ie, nodes=weights and intermediate results, edges=mathematical operations that combine them).


I don't get the profundity. Computational directed acyclic graphs have been a mainstay of computer science for decades.


I highly recommend the video from Karpathy: https://www.youtube.com/watch?v=VMj-3S1tku0 - it explains the same idea spelled out in code. It really helped me understand the mathematical expression graph of neural networks. He uses graphviz to show the expressions, initially calculates gradients manually, and then implements backpropagation.


Wasn't that the paradigm behind Tensorflow?


As another commenter said, viewing a neural network as a computation graph is how all automatic differentiation engines work (particularly reverse-mode where one needs to traverse through all the previous computations to correctly apply the gradient), and there were several libraries predating Tensorflow following this idea. The initial contribution of Tensorflow and PyTorch was more about making the developer interface much cleaner and enabling training on a wider range of hardware by developing a bunch of useful kernels as part of the library.


I've always tended to see most computations of functions as computational graphs - at least, when I used MathLink to connect Mathematica to Python, it basically gave you a protocol to break any Mathematica function into its recursively expanded definition. Konrad Hinsen suggested using Python's built-in operator overloading, so if you said "1 + Symbol('x')" it would get converted to "Plus(1, Symbol('x'))" and then sent over MathLink to Mathematica, which would evaluate Plus(1, x) and return an expression, which I'd then convert back to a Python object representation.
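
Something along those lines (a crude sketch of the overloading trick, not the actual MathLink code):

  class Expr:
      def __add__(self, other): return Call('Plus', self, other)
      def __radd__(self, other): return Call('Plus', other, self)
      def __mul__(self, other): return Call('Times', self, other)
      def __rmul__(self, other): return Call('Times', other, self)

  class Symbol(Expr):
      def __init__(self, name): self.name = name
      def __repr__(self): return self.name

  class Call(Expr):
      def __init__(self, head, *args): self.head, self.args = head, args
      def __repr__(self):
          return f"{self.head}[{', '.join(map(repr, self.args))}]"

  x = Symbol('x')
  print(1 + x * x)   # Plus[1, Times[x, x]]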

I don't think we talked about doing any sort of automated diff (in my day we figured out our own derivatives!), but after I made a simple eigendecomposition of a matrix of floats, the Mathematica folks contributed an example that did eigendecomposition of a matrix with symbols (i.e., some of the terms weren't 5.7 but "1-x"). It still kind of blows my mind how much Mathematica can do with computation graphs.

IIUC this is the basis of LISP as well.


The one distinction I would add with neural networks is that it's not just a recursive tree traversal that one would get when evaluating an arithmetic statement, but an actual graph: a computation node can have gradients from multiple sources (e.g. if a skip connection is added), so each node needs to keep accumulated state around that can be updated by arbitrary callers.

Of course, optimized autograd / autodiff is more parallelized than node-based message passing, but it's a useful model to start with.
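
A stripped-down illustration of that accumulation (in the spirit of micrograd, heavily simplified; a real implementation would topologically sort the graph before the backward sweep):

  class Node:
      def __init__(self, value, parents=()):
          self.value, self.parents, self.grad = value, parents, 0.0
          self._backward = lambda: None

      def __add__(self, other):
          out = Node(self.value + other.value, (self, other))
          def _backward():
              self.grad += out.grad   # += because a node can feed several consumers
              other.grad += out.grad
          out._backward = _backward
          return out

      def __mul__(self, other):
          out = Node(self.value * other.value, (self, other))
          def _backward():
              self.grad += other.value * out.grad
              other.grad += self.value * out.grad
          out._backward = _backward
          return out

  # x is consumed twice (think "skip connection"): y = x*x + x
  x = Node(3.0)
  y = x * x + x

  y.grad = 1.0
  y._backward()             # sends gradient to the (x*x) node and to x
  y.parents[0]._backward()  # the (x*x) node adds its contribution to x
  print(x.grad)             # dy/dx = 2x + 1 = 7.0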


I'd have to think about this for a while, but I'm not sure I see that as a distinction. If you have a skip connection, that's just another node in the graph you have to execute topologically before your dependent node, and then you pass the data over the edge when the child node is ready to consume it.

What you're describing with node-based message passing sounds much more like a petri net, or other agent-based discrete event modelling system. Which is another powerful mental paradigm, but challenging to reason about.


"What you're describing with node-based message passing sounds much more like a petri net, or other agent-based discrete event modelling system."

It sounds like Smalltalk to me.


In terms of abstract algebra, there are no distinctions. What you call gradients are actually data flows. "From multiple sources" means that a function can take multiple parameters (= inbound gradients, inflows).


AFAIK it was first brought forward by PyTorch. I may be wrong though.


It certainly pre-dates PyTorch. Applying computational graphs to neural networks was the central purpose of the Theano framework in the early 2010s. TensorFlow heavily followed those ideas, and Theano shut down a year or two later because they were so close conceptually and a grad research lab can’t compete with the resources of Google.


It's been a core part of automatic differentiation for decades.


Oh! Can you give some pointers where to read about it more?


The wikipedia page for automatic differentiation and its references would be a good start. https://en.wikipedia.org/wiki/Automatic_differentiation

In particular, the section "Beyond forward and reverse accumulation" tells you how hard it is to deal with the graph in optimal ways, hence the popularity of simpler forwards and backwards traversals of the graph.


The original 2017 PyTorch paper is probably a good one.


Before PyTorch there was Torch (which was Lua-based, hence the Py- prefix of the follow-on PyTorch). With Torch, like the later TensorFlow 1.0, you first explicitly constructed the computational graph and then executed it. PyTorch's later innovation was define-by-run: you just ran the code corresponding to your computational graph and the framework traced your code and built the graph behind the scenes. This was so much more convenient that PyTorch quickly became more popular than TensorFlow, and by the time TF 2.0's "eager mode" copied the approach it was too late (although TF shot itself in the foot in many other ways, which also accelerated its demise).


PyTorch's insight was to be very dynamic and simple. The underlying model is basically the same, with the caveat that TensorFlow uses/used something called a "tape".
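
Roughly, the same toy gradient in both (a sketch; TF2's eager mode is what exposes the tape explicitly):

  import torch
  import tensorflow as tf

  # PyTorch: recording is implicit - ops on a requires_grad tensor are traced as they run
  x = torch.tensor(3.0, requires_grad=True)
  y = x ** 2 + 2 * x
  y.backward()
  print(x.grad)                 # 2x + 2 = 8

  # TensorFlow 2: you open the tape explicitly
  xt = tf.Variable(3.0)
  with tf.GradientTape() as tape:
      yt = xt ** 2 + 2 * xt
  print(tape.gradient(yt, xt))  # tf.Tensor(8.0, ...)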


Ironically, the very first publication on "backpropagation", by a Finnish master's student (Seppo Linnainmaa, 1970), already pointed out how to compute gradients in computational graphs. This, if anything, is the main thing that distinguishes backpropagation as an algorithm from simply applying the chain rule.



