Deep neural networks as computational graphs (2018) (medium.com/tebs-lab)
103 points by softwaredoug on May 22, 2023 | 44 comments



I cannot stress enough how foundational of an idea it is to store mathematical expressions as graphs! Analyzing and manipulating such graphs enables one, among other things, to automatically obtain derivatives of a function with respect to its parameters, which is one of the cornerstones of most AI methods.

Before this was invented, people had to manually find expressions for these derivatives and implement the code to compute them. Needless to say, this took far too much time and tended to result in sub-optimal code; nowadays nobody has to deal with it anymore (except on their homework)!

For those curious about how this works, I present a simple implementation on my blog [1].

[1] https://e-dorigatti.github.io/math/deep%20learning/2020/04/0...
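
A toy sketch of the idea (made-up Var/Add/Mul classes, not the code from the post): each node evaluates itself and applies the chain rule recursively.

  class Var:
      def __init__(self, value): self.value = value
      def eval(self): return self.value
      def d(self, wrt): return 1.0 if self is wrt else 0.0

  class Add:
      def __init__(self, a, b): self.a, self.b = a, b
      def eval(self): return self.a.eval() + self.b.eval()
      def d(self, wrt): return self.a.d(wrt) + self.b.d(wrt)

  class Mul:
      def __init__(self, a, b): self.a, self.b = a, b
      def eval(self): return self.a.eval() * self.b.eval()
      def d(self, wrt):  # product rule
          return self.a.d(wrt) * self.b.eval() + self.a.eval() * self.b.d(wrt)

  x, y = Var(3.0), Var(2.0)
  f = Add(Mul(x, x), Mul(x, y))    # f = x^2 + x*y
  print(f.eval(), f.d(x), f.d(y))  # 15.0 8.0 3.0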


> I cannot stress enough how foundational of an idea it is to store mathematical expressions as graphs!..

Understanding a function's computational graph is certainly useful, but storing a function as a computation graph is, in fact, quite expensive. Deep learning systems don't store their computational graphs; the graphs are implicit in their computation process. Deep learning systems instead store their functions as tensors: generalized arrays/matrices. This allows both efficient storage and efficient computation. It's quite possible to do automatic differentiation on these structures as well - that is the basis of "backpropagation".

It's important to distinguish useful conceptual structures like computation graphs (or symbolic representations) from the structures that are needed for efficient computation. Automatic differentiation itself is important because the traditional symbolic differentiation one learns in calculus "blows up" (it can have O(m^expression-length) cost) when attempted on a large expression, whereas automatic differentiation has a cost that is not much higher than the base cost of computing the given function (though if that base cost is high, you lose your advantage too).

Just sayin'


> Deep learning systems don't store their computational graphs, the graphs are implicit in their computation process

I'm not sure about other frameworks, but in PyTorch, I think it's fair to say that while graphs are implicit in the sense that they are dynamically constructed at runtime (as opposed to statically defined at compilation time as with TF), they are not implicit in the sense that they are not directly represented by PyTorch's data structures: PyTorch does store the computational graph. After a backward pass, you can ask each Tensor in the graph, for example, for its associated Node: https://pytorch.org/docs/stable/autograd.html#autograd-graph
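
For example (a minimal sketch; the toy function is arbitrary), you can reach the recorded graph from the output tensor:

  import torch

  x = torch.tensor(2.0, requires_grad=True)
  y = x * x + 3 * x

  # The recorded graph is reachable from the output tensor:
  print(y.grad_fn)                 # e.g. <AddBackward0 object ...>
  print(y.grad_fn.next_functions)  # the parent nodes of that graph node

  y.backward()
  print(x.grad)                    # dy/dx = 2x + 3 = 7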


You can't store a function as a tensor; the tensors are the inputs/outputs that flow along the edges of the graph. TF stores things directly as a graph: https://github.com/tensorflow/tensorflow/blob/master/tensorf... while Jax represents computations in an intermediate representation: https://jax.readthedocs.io/en/latest/jaxpr.html
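
For instance, in Jax you can print that intermediate representation directly (small sketch, arbitrary function):

  import jax
  import jax.numpy as jnp

  def f(x):
      return jnp.sin(x) * x + 2.0

  print(jax.make_jaxpr(f)(3.0))  # the traced computation as a jaxpr
  print(jax.grad(f)(3.0))        # gradients come from transforming that IR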


Another major application is in finance! Computing risk metrics for hedging purposes used to be based on a "bump and reprice" method, i.e. you bump an input by a small amount and see how that affects the price (essentially the definition of a derivative). But now, in modern quant libraries, implementing the pricing as a graph allows you to get all of the sensitivities (derivatives) for free!
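
A rough sketch of the difference, using a toy one-step Monte Carlo call pricer in PyTorch (purely illustrative; real quant libraries have their own AAD machinery):

  import torch

  def price(spot, vol, rate, strike=100.0, t=1.0, n=100_000):
      torch.manual_seed(0)  # common random numbers so base and bumped runs match
      z = torch.randn(n)
      st = spot * torch.exp((rate - 0.5 * vol ** 2) * t + vol * (t ** 0.5) * z)
      payoff = torch.clamp(st - strike, min=0.0)
      return torch.exp(-rate * t) * payoff.mean()

  spot = torch.tensor(100.0, requires_grad=True)
  vol  = torch.tensor(0.20, requires_grad=True)
  rate = torch.tensor(0.01, requires_grad=True)

  p = price(spot, vol, rate)
  p.backward()                           # one reverse pass over the pricing graph
  print(spot.grad, vol.grad, rate.grad)  # delta, vega, rho from a single pricing

  # bump-and-reprice needs one extra full pricing per sensitivity:
  eps = 0.01
  delta_bump = (price(spot + eps, vol, rate) - p) / eps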


Not quite free. Last I checked, adding automatic differentiation to GPU computation for pricing exotic derivatives roughly doubled the computation cost. (Though traditional bump-and-reprice needs a full repricing per bumped input, so its cost grows with the number of sensitivities you want.)


That's cool! But it relies on the model being an accurate representation of reality, right? What kind of models are we talking about anyways, and what inputs do they use? (Asking out of curiosity, not skepticism)


A model is a computation based on its initial assumptions. Sometimes that's fine.

The key thing is that the (good) practitioners know the finance and know the models. If a simple model obviously doesn't fit the market, that's a sign in itself: you might be about to lose some money, or you could take on some risk from other people and get rewarded for it (people are panicking).

Weirder derivatives and so on can get more dangerous, of course. However, even the really famous example from the '07 crash (pricing CDOs and the CDS against them) was arguably due more to deliberate ignorance of widespread fraud and fictitious numbers than to the models per se. Garbage in, garbage out (the model wasn't great either, but still).


Not quite. Some people want bump Greeks. For example, a company may have a standard bump.

Make of that what you will.


There's always a particular brand of woo on display in threads like this that irks me to my core - like someone arriving at show and tell but really they don't know enough to tell.

>I cannot stress enough how foundational of an idea it is to store mathematical expressions as graphs... one of the cornerstones of most AI methods.

Firstly, what are "AI methods"? That's pretty vague, huh? Secondly, you're just saying that expression trees are useful. OK? I mean, this has been known since, like, the days of... Church?

> Before this was invented, people had to manually find expressions for these derivatives, and implement the code to compute them.

There were (and still are) lots of AD systems, and not all of them use a Wengert tape.

https://www.autodiff.org/?module=Tools&language=ALL

Anyway it's just software (and not even particularly clever software), not the second coming lol.


Excellent post, thank you for writing it.


This says nothing new. The idea has been around since the 80s, if not earlier, and is standard textbook material in automatic differentiation.

But for an audience that seems to think it rides on the frontier of knowledge and that we have just discovered these things (when, in fact, it's our ignorance of earlier research that is the reason we are re-discovering them), such a Medium post might be like a drug :D


The interesting thing, to me, is how this task-graph representation of neural networks has taken off AND been optimized to hell. Just look at the tech behind TensorRT, JAX, OpenVINO, TVM, FINN, and all the passes these tools put the graph through to wring out performance - it's amazing. For once, we almost standardised on one (or two, or three, relatively transpilable) simpler languages/DSLs, and a whole industry jumped on optimizing them - something rarely seen outside of gcc and llvm.

That this "at last, a new reduced semantics that we can maybe optimize better" moment has had such huge effort (and success) behind it is the amazing part, for me.


Sometimes simple (and old) ideas start to click and ignite a tornado when the timing is right.

While many may argue that the notion of a computational graph has been associated with neural networks (NNs) from the very beginning, this post is still quite novel. It sheds light on how NNs and general computation theory are interconnected - they are basically the same thing.

For me, it was an instant blast: the computational graph reminds me of a typical intermediate representation (IR) tree from a textbook compiler. And because it is a formal graph, all the mathematical benefits of graph theory suddenly start to click. For instance, graph theory formalizes the distinction between cyclic and acyclic directed graphs. If we imagine for a moment that a graph vertex represents a unit of computation and an edge represents a data flow, it immediately starts to resonate with Lambda calculus. It quickly becomes evident that when the graph is cyclic, it represents recursion in terms of Lambda calculus and thus becomes Turing-complete.

If the computational graph is acyclic, Turing completeness is not achievable.

If you are going to say that this is an obvious and well-known observation, I will be surprised, because it is not. And all that enlightenment was possible thanks to this article combined with a bit of knowledge about graphs, compilers, and lambda calculus. (It just so happened that I'm relatively well-versed in those topics due to professional involvement.)

---

If we continue to formalize this observation further, we may soon find a formal proof that a program P and a neural network NN are equivalent:

  P ~= NN
Both have inputs (I) and outputs (O):

  O = P(I)
  O = NN(I)
Both perform a computation by calculating output O for a given input I. The Turing-completeness observation and the direct mapping to and from the Lambda calculus will give us a way to translate an arbitrary program P to an equivalent neural network NN:

  P -> NN
But the most intriguing part is that the inverse operation also becomes available:

  NN -> P
Congratulations. We have just found a way to translate a working neural network NN to a formal program P written in a programming language L.

What are the consequences of that? The most obvious one is that we can now create software that automatically translates a trained neural network NN into a working deterministic program P written in a programming language of choice, and vice versa. We have just made a small step towards an imaginary AI system that can work as well as a professional software developer. Basically, one day we will be able to create a software-based software developer, a skilled one. And that level of skill will 100% surpass human abilities one day, because Turing completeness guarantees unboundedness.

The steps of NN -> P and P -> NN translation may be performed by the neural network itself. It seems that we already started to see that with ChatGPT 3.

---

As you can see, one simple observation allowed us to precisely calculate the future for many years ahead. And this is why I love math so much.


I had an idea, perhaps crackpot, that will maybe catch fire for you too. What if the graph structure of the natural language parse tree were directly reflected in the literal biological neural network? What if the reason for natural language grammar is that it directly reflects the way it is processed under the hood?


You might be interested in the universal approximation theorem.


The article is a thought-provoking piece of art. A few personal observations from me:

    a) An acyclic computational graph is equivalent to a program without loops and thus not Turing-complete

    b) A cyclic computational graph is equivalent to a program with recursion. The closest analogy from the programming world is a program without loops but with recursive calls. This means that it has Turing completeness, as loops are fully interchangeable with recursion (just different facets of the same thing).
As easy as 2+2=4. This means that a neural network is essentially a program. The only difference is how the program gets written: neural networks do it themselves, by "learning".

Those simple observations explain everything. The brain is essentially a biological Turing-complete processing unit with full plasticity, as you can build any network (program) by changing the "weights" of the neurons. This also means that full AI is possible; it is just a matter of available computational power. Why is that? Because Turing completeness guarantees the absence of boundaries for achieving progressively more sophisticated system abilities.


Most neural networks these days aren't recursive, however.

GPT-3 and friends certainly aren't, unless you count the inference loop as recursion, which is bounded both by vocabulary size and maximum token length (hardcoded for now).


This only means that there is so much more space for AI to grow and improve in the future.

As far as I know, having recursion in neural networks poses a dilemma: on the one hand, recursive networks are more capable; on the other, they are harder to train and may have stability issues caused by occasional positive feedback loops.


Well, be careful with "more capable" - the universal approximation theorem shows that an NN with one hidden layer is able to approximate any (continuous) function, in the limit of how wide that layer is. So there's nothing fundamental about "more capable" or "less capable"; it's only about which representations are easy to train and which aren't.


> Brain is essentially a biological Turing-complete processing unit

How do you know that?

Some would answer, "well, everything in the physical universe can be simulated by a Turing machine, so also the brain" but that would be begging the question.


I just presume that a biological brain is a network of neurons with connections between them, including feedback connections; at least, this is what we see under the microscope. In terms of math, this structure corresponds to a cyclic directed computational graph, and that fact alone makes it Turing-complete.

Currently it is just a highly plausible conjecture, not a formal proof. However, I think I can prove it. But should I?


We don't know how the brain works. Neuron activity is one aspect, and clearly an important one, but it's not everything.

And even if the connectome can be modeled as a graph, the dynamics on the graph could be totally alien to us. It might be computable by a Turing machine. But it also might involve some non-computable process we can't imagine. Perhaps it involves fundamentally unknown physics. (I consider the last option to be quite likely. I think rather few people actually entertain the idea that physics as we understand it today could give rise to consciousness. But here we are.)


I may be misremembering but aren't neurons one-way connections with no feedback?


Indeed, the self-attention mechanism within the transformer architecture is often described as a graph, where tokens are the nodes and the key and query vectors are combined to establish the affinity that each token has for each other token.
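
A bare-bones sketch of that token-affinity graph (single head, random example values, scaling only):

  import numpy as np

  def softmax(x, axis=-1):
      e = np.exp(x - x.max(axis=axis, keepdims=True))
      return e / e.sum(axis=axis, keepdims=True)

  tokens, d = 4, 8
  Q = np.random.randn(tokens, d)  # one query vector per token
  K = np.random.randn(tokens, d)  # one key vector per token

  # affinity[i, j] = weight of the edge from token i to token j
  affinity = softmax(Q @ K.T / np.sqrt(d))
  print(affinity.shape)           # (4, 4): a dense weighted graph over the tokens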


These are two different things, actually. The graph you are talking about is constructed using the inputs of the attention layer (ie, nodes=tokens, edges=affinity), while the article talks about graphs that represent the actual operations performed in the attention layer (ie, nodes=weights and intermediate results, edges=mathematical operations that combine them).


I don't get the profundity. Computational directed acyclic graphs have been a mainstay of computer science for decades.


I highly recommend the video from Karpathy: https://www.youtube.com/watch?v=VMj-3S1tku0 - it explains the same idea spelled out in code. It really helped me understand the mathematical expression graph of neural networks. He uses graphviz to show the expressions, initially calculates gradients manually, and then implements backpropagation.


Wasn't that the paradigm behind Tensorflow?


As another commenter said, viewing a neural network as a computation graph is how all automatic differentiation engines work (particularly reverse-mode where one needs to traverse through all the previous computations to correctly apply the gradient), and there were several libraries predating Tensorflow following this idea. The initial contribution of Tensorflow and PyTorch was more about making the developer interface much cleaner and enabling training on a wider range of hardware by developing a bunch of useful kernels as part of the library.


I've always tended to see most computations of functions as computational graphs - at least, when I used MathLink to connect Mathematica to Python, it basically gave you a protocol to break any Mathematica function into its recursively expanded definition. Konrad Hinsen suggested using Python's built-in operator overloading, so if you said "1 + Symbol('x')" it would get converted to "Plus(1, Symbol('x'))" and then sent over MathLink to Mathematica, which would evaluate Plus(1, x) and return an expression, which I'd then convert back to a Python object representation.
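
Something along those lines (a crude sketch of the overloading trick, not the actual MathLink code):

  class Expr:
      def __add__(self, other): return Call('Plus', self, other)
      def __radd__(self, other): return Call('Plus', other, self)
      def __mul__(self, other): return Call('Times', self, other)
      def __rmul__(self, other): return Call('Times', other, self)

  class Symbol(Expr):
      def __init__(self, name): self.name = name
      def __repr__(self): return self.name

  class Call(Expr):
      def __init__(self, head, *args): self.head, self.args = head, args
      def __repr__(self):
          return f"{self.head}[{', '.join(map(repr, self.args))}]"

  x = Symbol('x')
  print(1 + x * x)   # Plus[1, Times[x, x]]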

I don't think we talked about doing any sort of automated diff (in my day we figured out our own derivatives!), but after I made a simple eigendecomposition of a matrix of floats, the Mathematica folks contributed an example that did eigendecomposition of a matrix with symbols (i.e., some of the terms weren't 5.7 but "1-x"). It still kind of blows my mind how much Mathematica can do with computation graphs.

IIUC this is the basis of LISP as well.


The one distinction I would add with neural networks is that it's not just a recursive tree traversal that one would get when evaluating an arithmetic statement, but an actual graph: a computation node can have gradients from multiple sources (e.g. if a skip connection is added), so each node needs to keep accumulated state around that can be updated by arbitrary callers.

Of course, optimized autograd / autodiff is more parallelized than node-based message passing, but it's a useful model to start with.
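
A stripped-down illustration of that accumulation (in the spirit of micrograd, heavily simplified; a real implementation would topologically sort the graph before the backward sweep):

  class Node:
      def __init__(self, value, parents=()):
          self.value, self.parents, self.grad = value, parents, 0.0
          self._backward = lambda: None

      def __add__(self, other):
          out = Node(self.value + other.value, (self, other))
          def _backward():
              self.grad += out.grad   # += because a node can feed several consumers
              other.grad += out.grad
          out._backward = _backward
          return out

      def __mul__(self, other):
          out = Node(self.value * other.value, (self, other))
          def _backward():
              self.grad += other.value * out.grad
              other.grad += self.value * out.grad
          out._backward = _backward
          return out

  # x is consumed twice (think "skip connection"): y = x*x + x
  x = Node(3.0)
  y = x * x + x

  y.grad = 1.0
  y._backward()             # sends gradient to the (x*x) node and to x
  y.parents[0]._backward()  # the (x*x) node adds its contribution to x
  print(x.grad)             # dy/dx = 2x + 1 = 7.0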


I'd have to think about this for a while, but I'm not sure I see that as a distinction. If you have a skip connection, that's just another node in the graph you have to execute topologically before your dependent node, and then you pass the data over the edge when the child node is ready to consume it.

What you're describing with node-based message passing sounds much more like a petri net, or other agent-based discrete event modelling system. Which is another powerful mental paradigm, but challenging to reason about.


"What you're describing with node-based message passing sounds much more like a petri net, or other agent-based discrete event modelling system."

It sounds like Smalltalk to me.


In terms of abstract algebra, there are no distinctions. What you call gradients are actually data flows. "From multiple sources" means that a function can take multiple parameters (= inbound gradients, inflows).


AFAIK it was first brought forward by PyTorch. I may be wrong though.


It certainly pre-dates PyTorch. Applying computational graphs to neural networks was the central purpose of the Theano framework in the early 2010s. TensorFlow heavily followed those ideas, and Theano shut down a year or two later because they were so close conceptually and a grad research lab can’t compete with the resources of Google.


It's been a core part of automatic differentiation for decades.


Oh! Can you give some pointers where to read about it more?


The wikipedia page for automatic differentiation and its references would be a good start. https://en.wikipedia.org/wiki/Automatic_differentiation

In particular, the section "Beyond forward and reverse accumulation" tells you how hard it is to deal with the graph in optimal ways, hence the popularity of simpler forwards and backwards traversals of the graph.


The original 2017 PyTorch paper is probably a good one.


Before PyTorch there was Torch (which was Lua-based, hence the Py- prefix of the follow-on PyTorch). With Torch, like the later TensorFlow 1.0, you first explicitly constructed the computational graph and then executed it. PyTorch's later innovation was define-by-run: you just ran the code corresponding to your computational graph and the framework traced your code and built the graph behind the scenes. This was so much more convenient that PyTorch quickly became more popular than TensorFlow, and by the time TF 2.0's "eager mode" copied the approach it was too late (although TF shot itself in the foot in many other ways, which also accelerated its demise).


PyTorch's insight was to be very dynamic and simple. The underlying model is basically the same, with the caveat that TensorFlow uses/used something called a "tape".
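
Roughly, the same toy gradient in both (a sketch; TF2's eager mode is what exposes the tape explicitly):

  import torch
  import tensorflow as tf

  # PyTorch: recording is implicit - ops on a requires_grad tensor are traced as they run
  x = torch.tensor(3.0, requires_grad=True)
  y = x ** 2 + 2 * x
  y.backward()
  print(x.grad)                 # 2x + 2 = 8

  # TensorFlow 2: you open the tape explicitly
  xt = tf.Variable(3.0)
  with tf.GradientTape() as tape:
      yt = xt ** 2 + 2 * xt
  print(tape.gradient(yt, xt))  # tf.Tensor(8.0, ...)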


Ironically, the very first publication on "backpropagation", by a Finnish master's student (Seppo Linnainmaa, 1970), already pointed out how to compute gradients in computational graphs. This, if anything, is the main thing that distinguishes backpropagation as an algorithm from simply applying the chain rule.



