Differentiable Neural Computers

rkaplan · on Oct 13, 2016

This paper builds off of DeepMind's previous work on differentiable computation: Neural Turing Machines. That paper generated a lot of enthusiasm when it came out in 2014, but not many researchers use NTMs today.

The feeling among researchers I've spoken to is not that NTMs aren't useful. DeepMind is simply operating on another level. Other researchers don't understand the intuitions behind the architecture well enough to make progress with it. But it seems like DeepMind, and specifically Alex Graves (first author on NTMs and now this), can.

modeless · on Oct 13, 2016

The reason other researchers haven't jumped on NTMs may be that, unlike commonly-researched types of neural nets such as CNNs or RNNs, NTMs are not currently the best way to solve any real-world problem. The problems they have solved so far are relatively trivial, and they are very inefficient, inaccurate, and complex relative to traditional CS methods (e.g. Dijkstra's algorithm coded in C).

That's not to say that NTMs are bad or uninteresting! They are super cool and I think have huge potential in natural language understanding, reasoning, and planning. However, I do think that DeepMind will have to prove that they can be used to solve some non-trivial task, one that can't be solved much more efficiently with traditional CS methods, before people will join in to their research.

Also, I think there's a possibility that solving non-trivial problems with NTMs may require more computing power than Moore's law has given us so far. In the same way that NNs didn't really take off until GPU implementations became available, we may have to wait for the next big hardware breakthrough for NTMs to come into their own.

empath75 · on Oct 13, 2016

The brain is not a single universal neural network that does everything well. It's a collection of different neural networks that specialize in different tasks, and probably use very different methods to achieve them.

It seems like the way forward would be networking together various kinds of neural networks to achieve complex goals. For example, an NTM specialized in formulating plans that has access to a CNN for image recognition, and so on.

vinay427 · on Oct 13, 2016

This is being done using various types of networks. See these slides on image captioning by Karpathy for an example using a CNN and RNN: http://cs.stanford.edu/people/karpathy/sfmltalk.pdf

Senji · on Oct 13, 2016

If we're going with a brain metaphor. What would be the those neural networks' version of synesthesia?

empath75 · on Oct 13, 2016

Feeding mp3s to an image recognition neural net. And as soon as I typed that, I want to try it.

modeless · on Oct 13, 2016

Actually, in the architecture you described, if there is a planning net that's connected to image net and an audio net, rather than feeding audio to the image net I think synesthesia would be better modeled by feeding the output of the audio net into the image net's input on the planning net. If that makes sense.

Senji · on Oct 13, 2016

Not the output. Making several single connections from intermediate layers from the different nets.

dharma1 · on Oct 13, 2016

CNNs can actually be used for audio tasks too, on spectrograms

Senji · on Oct 14, 2016

It's how some guys defeated the first iteration of recaptcha's audio mode. Then google replaced it with something very annoying to use even for humans.

svantana · on Oct 13, 2016

They sure put a lot of focus on "toy" problems such as sorting and path planning in their papers - perhaps because they are easy to understand and show a major improvement over other ML approaches. IMHO they should focus more on "real" problems - e.g. in Table 1 of this paper it seems to be state of the art on the bAbl tasks, which is amazing.

gradys · on Oct 13, 2016

At least some of the "toy" problems aren't chosen just for being easy to solve or understand. They're chosen for being qualitatively different than the kinds of problems other neural nets are capable of solving. Sorting, for example, is not something you can accomplish in practice with an LSTM.

Mainstream work on neural nets is focused on pattern recognition and generation of various forms. I don't mean to trivialize at all when I say this - this gives us a new way to solve problems with computers. It allows us to go beyond the paradigm of hand-built algorithms over bytes in memory.

What DeepMind is exploring with this line of research is whether neural nets can even subsume this older paradigm. Can they learn to induce the kinds of algorithms we're used to writing in our text editors? Given this goal, I think it's better to call problems like sorting "elementary" rather than "toy".

sherjilozair · on Oct 13, 2016

bAbI isn't really a "real" problem either, although somewhat better than sorting and the like. bAbI works with extremely restrictive worlds and grammar. In contrast, current speech recognition, language modeling, and object detection do quite well with actual audio, text, and pictures.

I think the strength of NTMs will be best demonstrated by putting it to work on a long-range language modeling task where you need to organize what you read so that you can use it to predict better a paragraph or two later. Current language models based on LSTM are not really able to do this.

ludoplex · on Oct 15, 2016

Any chance you could link a pdf of the paper for us?

TeeWEE · on Oct 13, 2016

Once you have a learning machine that can solve simple problems. You can scale it up to solve very complex problems. Its a first step to true AI imho. Al lot of small steps are needed to go towards this goal. Integrating Memory & Neural Nets is a big step imho.

chriswarbo · on Oct 13, 2016

> Once you have a learning machine that can solve simple problems. You can scale it up to solve very complex problems.

Nope. It's really easy to solve simple problems; it can sometimes even be done by brute-force.

That's what caused the initial optimism around AI, e.g. the 1950s notion that it would be an interesting summer project for a grad student.

Insights into computational complexity during the 1960s showed that scaling is actually the difficult part. After all, if brute-force were scalable then there'd be no reason to write any other software (even if a more efficient program were required, the brute-forcer could write it for us).

That's why the rapid progress on simple problems, e.g. using Eliza, SHRDLU, General Problem Solver, etc. hasn't been sustained, and why we can't just run those systems on a modern cluster and expect them to tackle realistic problems.

visarga · on Oct 13, 2016

Deep mind is breaking new ground in number of directions. For example, "Decoupled Neural Interfaces using Synthetic Gradients" is simply amazing - they can make training a net async and run individual layers on separate machines by approximating the gradients with a local net. It's the kind of thing that sounds crazy on paper, but they proved it works.

Another amazing thing they did was to generate audio by direct synthesis from a neural net, beating all previous benchmarks. If they can make it work in real time, it would be a huge upgrade in our TTS technology.

We're still waiting for the new and improved AlphaGo. I hope they don't bury that project.

shostack · on Oct 13, 2016

I'm not super knowledgeable about the space, but would the audio generation you mentioned be what is needed to let their Assistant communicate verbally in any language, any voice, add inflections, emotion, etc. without needing to pre-record all the chunks/combinations?

outsideline · on Oct 13, 2016

Decoupled Neural Interfaces using Synthetic Gradients is a fancy name for the electro-chemical gradient that lies outside the cell wall of neurons : https://en.wikipedia.org/wiki/Electrochemical_gradient

It's decoupled yet stores transient local information regarding previous neuron activity.

Another bio-inspired copy-pasta.

vintermann · on Oct 13, 2016

You should absolutely get a job doing it, if you think bio-inspired copy-pasta is all it takes. May I recommend Numenta?

xpe · on Oct 13, 2016

Please choose derogatory phrases like 'copy pasta' intentionally and carefully.

Many algorithms are bio-inspired -- good artists borrow, the best steal.

wiz21c · on Oct 13, 2016

>> DeepMind is simply operating on another level.

Would you be so kind as to to explain what you mean here ?

Thanks !

outsideline · on Oct 13, 2016

They're taking features that are present in the brain that aren't modeled and are making computational models for them. They're not a gold standard. You can create your own in under an hour. It's not another level. It's bio-inspired computing.

Here.. take the 'Axon Hillock' https://en.wikipedia.org/wiki/Axon_hillock code up a function for it, attach it to present day neuron models, make it do something fancy, write a white-paper and kazaam you're operating on another level..

Get it?

wiz21c · on Oct 13, 2016

ok I get it :) Nice little sarcasm, I'm loving it :-)

outsideline · on Oct 13, 2016

Alan Turing's tape machine + neuron model.

In the human brain, Neurons store an incredible amount of information. Neuron models in neural networks only did so with weights.

There is still a lack of understanding on how the human brain does it. Deep Mind grabbed a proven memory model from Alan Turing's work and applied it to the feature barren neuron models in use. Sprinkle magic ...

They are not operating on another level, they're bringing over features that are well documented in the human brain and in white papers from a past period when people actually thought deeply about this problem and applying it.

https://en.wikipedia.org/wiki/Bio-inspired_computing

There is no 'intuition' about the architecture. Study the human brain and copy pasta into the computing realm.

Others are doing this as well. If anyone bothered to read the white papers people publish, you'll see that many people have presented similar ideas over the years.

You can come up with your own neural Turing machine. Take a featureless neuron model, slap a memory module on it and you have a neural turing machine.

vintermann · on Oct 13, 2016

In order to use a turing machine in a neural network - or at least to train it, in any way that isn't impractical and/or cheating - you need to make it differentiable somehow.

Graves and co. have been really creative in overcoming problems in their ongoing program to differentiate ALL the things.

igravious · on Oct 13, 2016

In this context what does differentiable mean?

gugagore · on Oct 13, 2016

I think the easiest way to see this is by an example of a non-differentable architecture.

Let's suppose on the current training input, the network produces some output that is a little wrong. It produced this output by reading a value v at location x of memory.

In other words, output = v = mem[x]

It could be wrong because the value in memory should have been something else. In this case, you can propagate the gradient backwards. Whatever the error was at the output, is also the error at this memory location.

Or it could be wrong because it read from the wrong memory location. Now you're a bit dead in the water. You have some memory address x, and you want to take the derivative of v with respect to x. But x is this sort of thing that jumps discretely (just as an integer memory address does). You can't wiggle x to see what effect it has on v, which means that you don't know which direction x should move in in order to reduce the error.

So (at least in the 2014 paper, ignoring the content-addressed memory), memory accesses don't look like v = mem[x]. They look like v = sum_i(a_i * mem[i]). Any time you read from memory, you're actually reading all the memory, and taking a weighted sum of the memory values. And now you can take derivatives with respect to that weighting.

To me, the question this raises is, what right do we have to call this a Turing machine. This is a very strong departure from Turing machines and digital computers.

iandanforth · on Oct 13, 2016

Turing didn't specify how reads and writes happened on the tape. For the argument he was making it was clearer to assume there was no noise in the system.

As for "digital" computers remember they are built out of noisy physical systems. Any bit in the CPU is actually a range of voltages that we squash into the abstract concept of binary.

gugagore · on Oct 14, 2016

I don't think that is really relevant to the discussion. Regardless of how a digital computer is physically implemented, we use it according to specification. We concretize the concept of binary by designing the machine to withstand noise. The thing what we get when we choose the digital abstraction is that this is actually realistic. Digital computers pretty much operate digitally. Corruption happens, but we consider that an error, and we try to design so that a programmer designing all but the most critical of applications, should assume that memory does not get corrupted

We don't squash the range of voltages. The digital component that interprets that voltage does the squashing. And we design it that way purposefully. https://en.wikipedia.org/wiki/Static_discipline

Turing specified that the reads and the writes are done by heads, which touch a single tape position. You can have multiple (finitely many) tapes and heads, without leaving the class of "Turing machine". But nothing like blending symbols from adjacent locations on the tape, or requiring non-local access to the tape.

igravious · on Oct 13, 2016

No wonder Google built (is building) custom accelerators in hardware. This points to a completely different architecture from Von Neumann, or at least it points to MLPUs, Machine Learning Processing Units.

shostack · on Oct 13, 2016

Pardon my ignorance as I'm not super knowledgeable on this, but is what you described around reading all the memory and taking the weighted sum of values similar in a sense to creating a checksum to compare something against?

gugagore · on Oct 14, 2016

I suppose I can see the similarity, in that there's some accumulated value (the sum) from reading some segment of memory, but otherwise I don't think the comparison is helpful.

ebalit · on Oct 13, 2016

It means it can be trained by backpropagating the error gradient through the network.

To train a neural network, you want to know how much each component contributed to an error. We do that by propagating the error through each component in reverse, using the partial derivatives of the corresponding function.

TeeWEE · on Oct 13, 2016

Dont forget you first need to understand the mathemetical theory of how a brain does computation and pattern recognition. Off course they look into how a brain does it. But the mathematical underpinnings, and how the information flows is much more important than how an individual neuron works in real live. Abstraction and applying it to real data is what they are doing.

usgroup · on Oct 13, 2016

Any chance you could fix this statement:

Input = Data

Process = Optimisation to create an automata.

Output = Automata

Computer power means much larger variable spaces can be handled in optimisation problems. NN are a means to prune the variable space during optimisation in a domain unspecific way.

IanCal · on Oct 13, 2016

Does anyone have a readcube link/similar for the paper?

http://www.nature.com/nature/journal/vaop/ncurrent/full/natu...

ericjang · on Oct 13, 2016

here you go http://rdcu.be/kXdj

pmatev · on Oct 13, 2016

This is great, thanks. Anyone know how those charts / graphs have been generated?

hyperbovine · on Oct 13, 2016

By hand, I imagine. There is an author credited specifically with the graphics and nothing else.

IanCal · on Oct 13, 2016

Fantastic, thanks.

dharma1 · on Oct 13, 2016

Waiting for Schmidhuber to pipe up that he wrote about something similar in -93 and Alex Graves was his student anyway

singham · on Oct 13, 2016

Exactly. I saw his webpage and was overawed until I read about him on reddit. That guy is full of himself.

dharma1 · on Oct 13, 2016

He has done a lot of pioneering work, to be honest. I recommend seeing him talk (or watch a video), I think his humour comes across better that way

chriswarbo · on Oct 13, 2016

It's interesting to think that Schmidhuber's actually applying machine learning methods to the field of machine learning, e.g. see the opening of http://people.idsia.ch/~juergen/deep-learning-conspiracy.htm...

If AGI is the goal and machine learning research is the search algorithm, then Schmidhuber's attempting to perform backpropagation by pushing rewards back along the connections :)

Senji · on Oct 13, 2016

We should use back propagation with government.

tvural · on Oct 13, 2016

The idea of using neural networks to do what humans can already write code to do seems a bit wrong-headed. Why would you take a system that's human-readable, fast, and easy to edit, and make it slow, opaque, and very hard to edit? The big wins for ml have all been things that people couldn't write code to do, like image recognition.

JeffreyKaine · on Oct 13, 2016

I think they just want to teach the system to crawl before it can walk, run then eventually fly. Doing something that would be easy for a human to code makes it easy for a human to see what's going on and help train the system to think like a human.

AgentME · on Oct 13, 2016

Even if it only works for problems that are trivially solved by people, the fact that it can be done automatically is useful. A system could react to changes and some new problems automatically by continually retraining itself.

orthoganol · on Oct 13, 2016

It appears they are touting 'memory' as the key new feature, but I know at least in the deep learning NLP world there already exists models with 'memory', like LSTMs or RNNs with dynamic memory or 'attention.' I can't imagine this model is too radically different than the others.

Maybe I just feel a bit uneasy with a claim such as:

> We hope DNCs provide a new metaphor for cognitive science and neuroscience.

modeless · on Oct 13, 2016

The "memory" in a typical RNN is akin to a human's short term working memory. It only holds a few things and forgets old things quickly as new things come in. This new memory can hold a large number of things and stores them for an unlimited amount of time, more like a human's long term memory or a computer's RAM. It's a big difference, and the implementation is completely different too.

orthoganol · on Oct 13, 2016

I was not referring to typical RNNs, but LSTMs or RNNs with 'attention'. They are designed to overcome vanishing/ exploding gradient problems and hold arbitrary memory lengths.

vintermann · on Oct 13, 2016

They can technically be as long as you want them, but in practice there are still severe constraints. LSTMs alleviate the gradient problems, but you still get real trouble with long-term dependencies.

Alex Graves and some others in DeepMind have focused a lot in the past year or so on developing practical differentiable data structures, so that the LSTM can read and write to an external memory (and save its precious internal state for more immediate needs) yet still be trainable via backpropagation.

modeless · on Oct 13, 2016

The memory provided by an attention architecture is immutable and has only one addressing mode. The memory in this paper is mutable (hence "Neural Turing Machine") and has several different addressing modes.

empath75 · on Oct 13, 2016

Basically they're using differentiable memory allocation, which means that training can not just change 'what' is stored in memory but 'where' it's stored.

praccu · on Oct 13, 2016

LSTMs are very different. Think of it this way. LSTMs store information about the current problem you're solving. DNCs store information about the world.

LSTMs are designed to capture long range dependencies, e.g., "this word at the start of the sentence interacts with this word at the end of the sentence."

DNCs are designed to incorporate outside information, e.g., "i happen to know (from background knowledge) that these two people in this sentence are married"

tim333 · on Oct 13, 2016

I wonder how close these differentiable neural computers are functionally to cortical columns in the brain that are "are often thought of as the basic repeating functional units of the neocortex." (https://en.wikipedia.org/wiki/Neocortex#Cortical_columns)

carapace · on Oct 13, 2016

(What the hell with the thin grey sans-serif body text font? Seriously, do you hate your readers' eyes that much?)

partycoder · on Oct 13, 2016

I wonder if they will put this to use in their StarCraft bot.

saguppa · on Oct 13, 2016

Yeah, games like StarCraft will probably need a working memory component. The task that they solve here with RL is a simple puzzle game. It'll be interesting to see if this works for Atari games or StarCraft.

bytefactory · on Oct 13, 2016

Nerdgasm.

outsideline · on Oct 13, 2016

https://en.wikipedia.org/wiki/Bio-inspired_computing

Present day Neuron models lack an incredible number of functional features that are clearly present in the human brain.

NTMs = representing memory that is stored in neurons https://en.wikipedia.org/wiki/Neuronal_memory_allocation

Decoupled Neural Interfaces using Synthetic Gradients = https://en.wikipedia.org/wiki/Electrochemical_gradient

Differentiable Neural Computers = Won't specify what natural aspect of the brain this derives from.

Pick an aspect of a neuron or the brain that isn't modeled, write a model...

Bleeding edge + Operating on another level

The fact that someone is going out of there way to remove points from my posts so that this doesn't see tomorrow's foot traffic instead of replying and critiquing me just goes to show how truthful these statements are.

Anyone can create such models. No one has a monopoly or patent on how the brain functions. Thus, expect many models and approaches.. Some better than others.

You can down-vote all you want. The better model and architecture wins this game. It would help the community if people were honest about what's going on here but people instead want to believe in magic and subscribe to the idea that only a specific group of people are writing biologically inspired software and are capable authoring a model of what is clearly documented in the human brain. Interesting that this is the reception.

gjm11 · on Oct 13, 2016

You're not getting downvoted for being mean about DeepMind, you're getting downvoted for making overconfident pronouncements about things you don't understand.

"Neural Turing machines" are not the same thing as neuronal memory allocation: NTMs' memory is external and neuronal memory allocation is all about how memory is stored in neurons in the brain.

The "synthetic gradients" in that paper have nothing to do with the electrochemical gradients you mention other than the name.

No one is claiming that the DeepMind guys are "operating on another level" because they do bio-inspired things. They are claiming that because they are getting more impressive results than anyone else.

Now: Are they really? If so, is that enough justification for such a grand-sounding claim. I don't know. That would be an interesting discussion to have. But "Boooo, these people are just copying things present in the brain, there's nothing impressive about that" is not, especially when the parallels between the brain-things and the DeepMind-things are as feeble as in your examples.

outsideline · on Oct 13, 2016

Overconfident pronouncement by indicating that they are making computational models of natural processes that no one can confidently state are correct or are the most efficient?

Making statements that allow people to see behind the curtains and maybe go off and make their own competitive models... Yes, this is a disservice to the advancement of A.I and should be downvoted : Removing the prestigious veil and illusion from published works.

NTMs memory is external in what sense? Please detail what this means in a 'functional' sense. It's biologically inspired. Neurons maintain memory beyond synaptic weights. The neuron models of present day A.I were basic. Someone comes along and sees the obvious : There is no computational model for how neurons utilize memory and suddenly they're thinking on another level? Give me a break..

Synthetic gradients have everything to do w/ electro-chemical gradients : http://www.nature.com/articles/srep14527 http://www.pnas.org/content/110/30/12456.full.pdf So, where is your establishment that I am incorrect. It is nowhere to be found. Again, biologically inspired computational models.

Oh look, someone published a paper back in June that is an implementation of Differentiable Neural Computers: https://arxiv.org/abs/1607.00036

It's hype and that is a disservice to the community of people completing similar work and taking similar approaches.

It would be an interesting discussion to have. That discussion was terminated in favor of downvoting me.

They're feeble to someone who isn't well informed on neuroscience. Thus, you'd rather be wow'd and believe in the fantasy that only a small segment of people can write computational models of biology.

Continue believing the hype. Rarely will someone be truthful and honest about where they got their ideas when hype follows. An interesting conversation could have transpired. Enjoy the feels from the downvotes.

TeeWEE · on Oct 13, 2016

An "Electro chemical gradient" is an ion. That works on small scale within cells. The gradient here is the "electrochemical potential" of the ion.

An Synthetic Gradient is a way to allow learning Forward Propagated Neural Nets in a parallel way. The gradient here is referring to the 'error' backpropagation that is part of the training process of an neural net (im talking about computer science neural nets).

They have nothing todo with each other. The papers that you are referring to have nothing todo with the process of training a neural net.

TeeWEE · on Oct 13, 2016

Even if they would do copy-pasta from nature.. Even if they copy everything..

They are the first who have a machine learn to solve problems that require memory. They are the first. These are the stepping stones to artificial Intelligence.

Note: The whole point of the Synthetic gradients, is to learn a network in parallel. This allows Google to make computers learn recognize things in images even better. To recognize human speech even beter... To make self driving cars even better.....

I don't know if they are copied or not from nature (doesnt look like). The point is that they are improving mankind.

outsideline · on Oct 13, 2016

> They are the first who have a machine learn to solve problems that require memory.

Incorrect. It was named a Neural (Turing) machine for a reason. Maybe people should go back and dust off the white papers from the 70s like those who are borrowing from that era and respectfully giving credit where credit is due.

They do great work and they are making great progress in Artificial Intelligence. Many people are. Everything is a stepping stone. It serves no good to over-hype one person's stones over another's or ignore/downplay where they were inspired from. Notable visionaries of a past time were visionaries because they detailed the depths of their thinking and centered on the hows/whys. It seems it is fashionable now-a-days to do the exact opposite. This is to a disservice to learning and progress.

The whole point of the human brain is parallel processing. Extra-cellular chemical Gradients function the same way in the human brain and serve the same purposes. Take a look at the papers I linked.

> I don't know if they are copied or not from nature (doesnt look like).

Extra-cellular chemical Gradients. I linked to white papers that explain how memory is stored in them and shared across neurons. This is how it works in nature and biology.

They named their approach 'Synthetic Gradients'. An artificial form of the biological Gradient that is decoupled and lies outside of a neuron. They are clearly giving credit to nature.

They and many other people are improving mankind. Many others can improve mankind if there was less hype and more of a focus on where the ideas originated.

That was my point..

The behavior of people regarding selective 'hype' is one of the big reasons why a tremendous amount of deeply functional work that centers on hard intuitions and ideas for this area will remain closed source when a real break is made.

Enjoy the hype train I guess... They're operating on another level than anyone else.

znah · on Oct 13, 2016

Ideas are cheap, making them work is hard.

outsideline · on Oct 13, 2016

The brain's architecture is laid bare for anyone to see. Making models of features is cheap. Anyone can do it. There are loads of white papers and benchmarks. Some approaches beat others depending on the benchmark. Claiming it 'works' by tweaking it until it fits a canned benchmark is not hard work. Not having any explanation as to why it works is not hard work.

Coming up with a functional systems architecture that ties the bits and pieces together is hard work. Understanding what is really happening in the human brain, how/why it is performing various functions, and how this provides for an intelligent architecture is hard work. Creating an 'aware' platform is hard and elusive work which is why people chase the low hanging fruit of optimization algorithms.

*Cheers

akssri · on Oct 13, 2016

Do you know if CompNeuro models have been trained to do things interesting to CS folk ?