They simulate neurogenesis, I guess, but they do not incorporate the most interesting part of that neurogenesis: the new neurons are born into the dentate gyrus, a region thought to have a particular capacity to orthogonalize feature representations that are similar (i.e. to pattern-separate), allowing distinct memories to be formed for similar events. The dentate gyrus outputs to a region called Cornu Ammonis 3 (CA3), which is heavily recurrent and thought to be able to pattern-complete a full representation from partial inputs. That is, CA3 can encode and retrieve the relations between two or more features or objects.
For a mathematical model and review one might read:
Rolls (2013)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3812781/
but many others exist. I'd write more, but typing on my phone is driving me to distraction.
This is really interesting; where/how did you learn this?
I'd like to learn more about these things - brain regions, connections, functions - and what they might imply about the kinds of computations that are going on, but my background is mainly on the AI/math side of things.
I'd like to add that our knowledge of the details of hippocampal neuroanatomy is probably the most advanced of any brain region, which allows the somewhat informed construction of computational models. I wish I had more time to learn modeling methods; I have specific developmental hypotheses I'd like to test in such a model. In the end, I'd probably need to find a knowledgeable collaborator, though.
My dissertation research was on the development of the subfields of the hippocampus in childhood, so these papers from the rat literature were relevant and often inspiring.
There is also this one, https://arxiv.org/abs/1506.02515, which takes pruning a step further to reduce the sparsity. And this one, https://arxiv.org/abs/1608.04493, which makes sure not to kill any neurons that prove to be useful at a later stage in the pruning process.
Modern applications of small networks regularly derive them from larger state-of-the-art networks using distillation. Distillation compacts neural networks while affecting accuracy only minimally.
Instead of pruning directly from the large network, you just learn how it generalizes. That takes fewer nodes and fewer overall operations (multiplications/additions).
Certain companies use these methods to make state-of-the-art neural nets work on your phone :)
Also "combine" might not be the right word, since it's really transfer learning. "Distill" is really a descriptive verb.
Maybe my original wording was confusing; I shouldn't have said "distillation compacts" -- distillation is a process by which you can create a more compact version of a complex neural net.
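For anyone wanting to see what that looks like concretely, here's a minimal sketch of the usual soft-target distillation loss in PyTorch; the temperature, mixing weight, and model names are my own illustrative choices, not details from any particular system:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
        """Blend the usual cross-entropy on hard labels with a KL term that
        pushes the student's softened outputs toward the teacher's."""
        # Softened distributions; T > 1 spreads probability mass over classes.
        soft_teacher = F.softmax(teacher_logits / T, dim=1)
        log_soft_student = F.log_softmax(student_logits / T, dim=1)
        # KL divergence between softened student and teacher, scaled by T^2
        # (as is conventional) so gradients keep a comparable magnitude.
        kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
        # Standard cross-entropy on the true labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    # Usage inside a training loop (teacher frozen, student being trained):
    # with torch.no_grad():
    #     teacher_logits = teacher(x)
    # loss = distillation_loss(student(x), teacher_logits, y)
    # loss.backward()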
This idea is at least partially in use with regularisation and dropout. The difference, at least with dropout, is that the "killed" neurons are then massaged back into the network in order to become useful again.
Agreed that this is another way of framing the problem of regularizing a network. Rather than starting with a big network and penalizing complexity, they are starting with a simple network and adding complexity. To that end, I'd've liked to see a comparison to dropout or L1/L2 regularization.
Biological neurons themselves are stochastic, so they have an internal "dropout" that doesn't seem to hurt; on the contrary, these perturbations and imperfect communication increase learning ability.
Looks like these researchers are trying to make a network more adaptive. I think that deleting nodes would only make it worse at the current task it's being trained on, as well as worse on the tasks it's being adapted to.
You could train a model using neurogenesis to increase its accuracy, and then use distillation to train a smaller network to comparable accuracy.
These are two very different, but complementary, problems.
I'm not assuming that, I'm giving the model more options and letting it decide what is functionally important/non-spurious. It might take it longer, but I don't assume that.
More parameters also means that the likelihood of overfitting (the training set) increases. Currently (and rather unintuitively, considering that ML is an applied optimization field, and optimization is usually concerned with underfitting), the bane of ML is overfitting. It's easy to supply a model with high representational capacity, but then it's very hard to learn anything interesting in a reasonable amount of time. You'll learn how to fit your training set perfectly, because your model has enough degrees of freedom to fit a million points arbitrarily well, but that doesn't mean the resulting fit describes the data in a meaningful way. This is why a core tenet of ML is to prune parameters whenever possible. Neurogenesis increases representational capacity whenever it detects that your underlying model does not have sufficient capacity to fit the data; from this perspective, you start small (under capacity) and then gradually increase capacity until you hit the optimal model. In other words, neurogenesis is also a way for you to minimize the number of options.
On the other hand, giving the model more options than it strictly needs and letting it decide what is important will usually backfire. Rather than learning a few meaningful/functional features, it can just go ahead and completely fit the training data from the very beginning. It will therefore decide that everything is important, because all those extraneous parameters let it squeeze that last 0.5% out of your training set.
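As a toy illustration of that start-small-and-grow idea (polynomial degree standing in for network capacity; nothing here is from the paper), validation error tells you when added capacity stops paying for itself:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 60)
    y = np.sin(3 * x) + rng.normal(0, 0.1, x.size)           # noisy target
    x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]  # train/validation split

    best_deg, best_err = None, np.inf
    for degree in range(1, 15):                   # "grow capacity" one step at a time
        coeffs = np.polyfit(x_tr, y_tr, degree)   # fit the training data
        val_err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
        if val_err < best_err - 1e-4:             # keep growing only while validation improves
            best_deg, best_err = degree, val_err
        else:
            break                                 # extra parameters now just fit noise
    print(best_deg, best_err)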
As others have mentioned, there are approaches like regularisation and dropout which try to do similar things. What I find interesting is the fact there are two reasons to do this: to generalise/avoid-overfitting and to reduce resource usage.
It seems like almost all effort is spent on the former, since everyone's aiming for higher accuracy numbers. Are there any widely-used methods to tackle the latter?
For example, I'm imagining a system which is either given measurements of its resource usage (time, memory, etc.) or uses some simple predictive model (e.g. time ~ number of layers * some constant), and works within some resource bound (a rough sketch follows below):
- If we're below the bound, expand the model (add neurons, etc.) to allow accuracy increases (note "allow": it's ok to ignore/regularise-to-zero the extra parameters to avoid overfitting)
- If we're above the bound, prune the model (in a way which tries to preserve accuracy)
- Allocate resources to optimise some objective, e.g. reduce variance by pruning the parameters of the best-performing class/predictor/etc. and using those resources to expand the worst performer.
The closest thing I know of are artificial economies, but they seem to be more like a selection mechanism (akin to genetic programming) than a direct optimisation procedure (like gradient descent on an ANN).
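To make the controller idea I described above a bit more concrete, here's a toy sketch in Python; the budget, growth step, and the `val_gain` estimate of per-layer benefit are all placeholders I'm inventing for illustration, not an existing method:

    def param_count(widths, n_in=784, n_out=10):
        """Rough parameter count of a fully-connected net with the given hidden widths."""
        sizes = [n_in] + list(widths) + [n_out]
        return sum((a + 1) * b for a, b in zip(sizes, sizes[1:]))

    def adapt(widths, budget, val_gain):
        """One control step over a hypothetical architecture. `val_gain(widths)` is
        assumed to estimate how much accuracy extra width buys each layer
        (a stand-in for real measurements on held-out data)."""
        usage = param_count(widths)
        if usage > budget:
            # Over budget: shrink the layer whose extra width is estimated to help least.
            worst = min(range(len(widths)), key=lambda i: val_gain(widths)[i])
            widths[worst] = max(1, widths[worst] - 16)
        elif usage < 0.9 * budget:
            # Headroom left: grow the layer expected to benefit most; regularisation
            # can still push the new parameters toward zero if they turn out unneeded.
            best = max(range(len(widths)), key=lambda i: val_gain(widths)[i])
            widths[best] += 16
        return widths

    # Toy run: pretend layer 0 benefits most from extra width.
    widths = [64, 64]
    for _ in range(10):
        widths = adapt(widths, budget=200_000, val_gain=lambda w: [1.0, 0.3])
    print(widths, param_count(widths))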
There are many ways to compress networks - by pruning neurons, by enforcing sparsity, by representing activations and gradients with one bit (or a few bits), and by transfer learning, where a large net is distilled into a smaller one.
Yes, my question was more about meta-level algorithms for balancing size against performance. Especially adaptive methods such that we're not just growing up to a limit and stopping, but selectively allocating resources to those parts which need them. Adapting over time would be nice too: "thinking harder" when there are idle resources, but shrinking the results back down under load.
This paper http://dl.acm.org/citation.cfm?id=2830854 kind of has a solution for being more efficient. It has two networks and uses the smaller (more efficient) one for inference first. If the result is accurate with high probability (the probability of one class is much larger than the probability of any other class), then there is no need to run the big (expensive) network.
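Roughly, that cascade looks like this; a minimal sketch with stand-in models and a hand-picked confidence margin, not anything taken from the paper:

    import numpy as np

    def cascade_predict(x, small_model, big_model, margin=0.3):
        """Run the cheap model first; only fall back to the expensive one when
        the cheap model's top two class probabilities are too close to call."""
        probs = small_model(x)                     # assumed to return class probabilities
        top_two = np.sort(probs)[-2:]
        if top_two[1] - top_two[0] >= margin:      # confident: accept the cheap answer
            return int(np.argmax(probs))
        return int(np.argmax(big_model(x)))        # uncertain: pay for the big network

    # Toy usage with stand-in "models":
    small = lambda x: np.array([0.05, 0.90, 0.05])   # confident cheap model
    big = lambda x: np.array([0.40, 0.50, 0.10])
    print(cascade_predict(None, small, big))         # -> 1, and the big model never runs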
For that, check out our OpenReview ICLR submission, "Neurogenesis-Inspired Dictionary Learning: Online Model Adaption in a Changing World", by Sahil Garg, Irina Rish, Guillermo Cecchi, and Aurelie Lozano:
https://openreview.net/revisions?id=HyecJGP5ge
To add on to this: they "specifically consider the case of adding new nodes to pre-train a stacked deep autoencoder", by basically keeping track of when certain layers cannot reproduce their input and then adding more nodes and retraining with both new (not reproduced) and old data. It is quite intuitive, basically the most naive and obvious first attempt at the problem (not meant in a condescending way; I just want to point out that it's not that generalizable and is pretty ad hoc).
Sorry if I'm being snobbish, but I do wonder why this paper is only being submitted to IJCNN, a 2nd tier machine learning conference. I know students who publish undergrad research at workshops with lower acceptance rates than IJCNN. I can't think of any important machine learning papers published in IJCNN in the recent past.
It depends on what conclusions you're trying to draw from that information. What conference a paper was accepted to is a second-order signal of its noteworthiness. It's probably easier for someone versed in the field to just read the paper to determine if it's interesting. If you're using the conference as a quick pass/fail as you skim through the abstracts of hundreds of papers, ok, but you probably wouldn't make time to comment on HN about it in that case.
This paper looks like it builds on pretty well-known techniques like stacked autoencoders, so let's see what first-order noteworthiness data we can gather from a quick skim of the paper. If I had to guess why it wasn't accepted into a better conference:
- It uses stacked autoencoders, which are pretty out of fashion
- It bothers to report results on MNIST
- (more subjectively) It pulls the unfortunately common move of saying "here's something the brain does" and then hand-waving that this is a deep reason why the technique they've come up with is useful, when in fact the relationship is just "inspired by the general idea of", not "performs the same function as", the biological mechanism. In this case, I think the connection of their technique to research on neurogenesis is pretty flimsy. Clearly neurogenesis is not how an adult human brain forms new memories or gains proficiency in new skills (which they acknowledge in the conclusion).
> It's probably easier for someone versed in the field to just read the paper to determine if it's interesting.
> If you're using the conference as a quick pass/fail as you skim through the abstracts of hundreds of papers, ok
You answered your own statement, I think. Most researchers will skip a paper in a second tier conference. In fact, most I know won't read an entire paper - they'll only read some of it and skip stuff.
You're correct that I am not an active researcher (otherwise I would not have time to be commenting). I merely did some research back in college. But honestly that little experience gives me a huge leg up on most HN commenters in understanding research. It's unfortunate that the only reason this paper is #1 on HN is because it has a cool title.
That being said, MNIST is not really a disqualifier. (Unfortunately) MNIST is the most popular dataset referenced in NIPS 2016 papers (https://twitter.com/benhamner/status/805864969065689088). The handwaving is also forgivable; many NIPS papers handwave a lot too.
See my sibling comment. It matters because it's a very strong (to academic/industry researchers) sign of quality and whether the paper is worth reading. If you wanted to just put something out there you could just put on arxiv. The authors are academics (?) so they clearly want to publish in the best possible venue.
A good point was made that a model of neurogenesis must also incorporate neuronal death besides neuronal birth (since the hippocampus, and the brain as a whole, have physical constraints, you can't keep growing your network infinitely :). That's why any model of neurogenesis must incorporate the interplay between the birth and death of new (and old) neurons; that was the main idea of the paper I mentioned in an earlier post (this year's ICLR submission https://openreview.net/forum?id=HyecJGP5ge).
Note that just adding nodes to networks has been proposed before, e.g. the classical work on cascade correlation.
I think what we might see is a kind of autonomous corporation that is nominally under the control of shareholders, a CEO or a board, but which makes decisions without very much or any human input, and which gains some amount of legal rights through corporate personhood.
It won't be a 'general AI', though. More like a set of loosely connected systems that operate 'in the best interests of the shareholders', however that's defined.
It's pretty much the end state of the trend of pushing decision making to algorithms to remove moral and legal culpability from individuals.
Well, an interesting property of the brain is that any I/O relation happens within X milliseconds, which puts a limit on the depth of the network (if the speed of a neuron is limited). It would be nice to have some hard numbers on this.
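For a rough back-of-envelope (my own numbers, not a measurement from this thread): core visual object recognition is usually put at around 100-150 ms, and each synaptic stage (integration plus conduction) costs on the order of 10 ms, so 150 ms / ~10 ms per stage gives only about 10-15 serial processing stages for the feedforward part. That's a remarkably shallow "network" by deep-learning standards, though each stage is massively parallel.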
I'll say... 5-7 years. This is all based on pure speculation, from someone who is little more than an ML hobbyist. One thing is for sure, though: when we do reach that point, everything changes forever.
There's no reason to think it does, and plenty of reason to think it shouldn't. We have a fairly good idea of why humans have empathy; it has to do with our evolution as a social species.
I completely blame my own community, rather than you, for writing this, but as an AI researcher, your comment is terribly painful to read. We have little to no idea how actual neurons (let alone entire brains) really work. The things that are often called "(artificial) neural networks" really shouldn't be called that. I strongly prefer terms like "computational networks" or (where applicable) "recurrent/convolutional networks".
Actually we really know a lot about how neurons work. We've got the biophysical properties down, and we understand neurotransmission at the cellular/molecular level for a lot of different types of neurons. We understand signal processing where we transduce sound, smell, sight, touch, taste into neurochemical signals. We even know a decent amount about the early phases of the processing of these "raw data" signals into higher levels of abstraction (e.g. edge detection for vision). What we don't understand is the later phases of processing (advanced layers of abstraction) all the way up to conscious sensation.
> We have little to no idea how actual neurons (let alone entire brains) really work.
I think that slights neuroscience, which has devoted the past 60 years to answering this question, to a fair degree. But I agree that the biomimetic motivations offered up for various flavors of neural net feel pretty bogus. It seems to me like, among the major old-school researchers in the field, only Geoff Hinton still does this.
Fair. I was definitely unnecessarily harsh on neuroscience; my quibble is only with my own community's claims that what we're doing is anything like how the brain works. Thanks to you and sxg for correcting the record.
In a very hand-wavy sense, yes. The same can be said of paths to food by ant colonies. The way that ANNs have been drawn as circles with arrows between them looks like a cartoon version of neurons and synapses, which is the origin of the "neural network" part. The timing of data from hidden node to hidden node, the activation functions, and the hidden node outputs have very little to do with biological neurons. ANNs have more in common with a CPU than a brain.
Wasn't the attempt to model a collection of neurons, their synapses, and the way that some connections are reinforced the genesis of artificial neural networks?
That's how the first person brought the concept to life, no? He didn't even have a theoretical explanation of how or why it would work, right?
> ANNs have more in common with a CPU than a brain
Sure, they still run on CPUs, but ANNs are modeled to do what NNs do, at least at the levels where experiments have shown that it works.
Sure, some of the properties of NNs do not carry over well to ANNs, as someone pointed out in a comment here with an article showing that applying the same kind of signal doesn't work.
But the fact remains: we are making more progress on AI by trying to emulate parts of our brain than we did with other techniques.
We didn't know that this would happen when it all started, but it did.
No one could look at a model of a yet-to-be-implemented ANN and say, out of the blue, whether it would work and why. It has all been experimentation, taking the brain as a raw blueprint.
And although many other phenomena from the brain didn't work well with ANNs, neurogenesis apparently did.
It's impressive, IMO, and quite humbling that we are getting so many achievements out of mimicking nature, while not being 100% sure why it worked in the first place.
I just don't want you to get the wrong impression. This is a single paper about a technique for adding neurons to ANNs over time, and it is only one of many over the last few decades. The paper does not have the evidence to indicate that this is a major breakthrough. The industry as a whole generally does not add neurons to an existing model when updating it. The vast majority of applications also use backpropagation for training, which is not what our brains use. So even if we ignore the implementation on CPUs, ANNs are still far from behaving similarly to brains, even in a conceptual way. I must disagree that "we are getting so many achievements out of mimicking nature".
It adds an additional parameter that influences RNN behavior over time, so I could see it possibly being useful. I would speculate that this could have value for providing slowly-updating subsystem information to real-time control systems.
Oh yes. If we want to put a bigger one on the list, then there's the whole matter of the vast quantities of circumstantial data being left out by ANNs (sight, sound, past memories, emotions, arousal, touch sensations, etc.). But lists like this can go on for a very long time.
Spiking Neural Networks [0] attempt to be more accurate representations of human neurons, but haven't really caught on because they aren't really much better than our perceptron model of neurons, at least for the things we are trying to do with them.
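For context, the canonical spiking unit people usually start from is the leaky integrate-and-fire neuron; here's a tiny sketch of it (the constants are arbitrary illustrative values, not parameters from any particular SNN):

    import numpy as np

    def lif_neuron(input_current, dt=1.0, tau=20.0, v_rest=0.0, v_reset=0.0, v_thresh=1.0):
        """Leaky integrate-and-fire: the membrane potential decays toward rest,
        integrates input current, and emits a spike (then resets) at threshold.
        Contrast with a perceptron unit, which just outputs f(w.x + b) once."""
        v = v_rest
        spikes = []
        for i_t in input_current:
            v += dt * (-(v - v_rest) + i_t) / tau   # Euler step of the membrane equation
            if v >= v_thresh:
                spikes.append(1)
                v = v_reset                          # fire and reset
            else:
                spikes.append(0)
        return np.array(spikes)

    # A constant drive above threshold produces a regular spike train.
    print(lif_neuron(np.full(100, 1.5)).sum())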
Well, neurons have many properties... their information processing capabilities are one aspect, but they also deal with the physical level of communication and staying healthy.
Neurons are also a family of cells and are very diverse in shape and function. We tend to oversimplify our representation of neurons: there are simple neurons, and then there are neurons like the Purkinje cell, which are massive.
Neurons also rely on their counterparts, the glial cells, that are much less often mentioned.
I think because of this, it will be a while until we fully understand the role of each one of them.
If I'm not mistaken (I took an introductory class on neural networks quite a few years ago), this all started as an attempt to simulate neurons, in a manner of speaking.
So much so that one of the pioneers of this work was discredited by other scientists who, for some reason, simply could not accept that these networks would work. That same pioneer saw his funding dry up, and he came to believe his opposition so thoroughly that he sailed away to his death, some argue intentionally (as his life's work had, even in his own eyes, come to be seen as useless).
Now, I'm having a hard time remembering the names of the people in that story; if someone knows who I'm talking about, please remind me.
- real neurons are stochastic and communicate through spikes; artificial neurons can communicate real values efficiently
- real neurons are more like automata: they have dynamics in time, and learning happens as a continuous interaction with only their neighbors; artificial neurons are "static" (they use discrete time), are implemented by a forward and backward pass, and can also use nonlocal information
- real neurons can't backpropagate, because backprop requires transmitting gradients back along the same connections, but in reverse, and brain connections don't support that kind of bidirectional data flow; artificial neurons work best with backprop
- real neurons can't implement convolutions (it would require a neuron to slide over a field); real neurons also can't implement RNNs as they are, and don't use backpropagation through time (BPTT)
So artificial neurons are much less hampered and can do many things that real neurons can't do, or can only do by some less efficient method. That means brain neurons still have some tricks up their sleeve. Artificial neurons are quite different from brain neurons, and rightly so, because they can be more efficient that way.
Why attribute the idea of introducing new nodes to a graph to biological concepts? It seems like a simple step in exploration, similar to how one might think to vary the weights of the nodes randomly over some range. (Unless there is some technique biology uses to pre-configure the nodes upon their introduction to the network; that might be rather interesting.)
Because they tried to model neurogenesis, the same way that artificial neural networks were invented while trying to mimic some parts of how neurons work, I guess.
Slightly off topic, but I hate how publications are written. It seems like authors purposely use big words and sentences that often run 5-6 lines long in order to make the work seem more clever.
I find myself often having to reread a sentence in order to understand it.
These algorithms are often very simple and can be easily explained. Don't overcomplicate them.
A lot of the time the verbosity isn't so much to sound more clever as it is to be very specific and explicit about what the author is trying to convey. There's a lot of changing assumed knowledge and jargon in various fields and our use of language changes over time. The publication writing style is an attempt to factor that out.
That's what an appendix is for. Letters to Nature are 1500 words. You'd be surprised how effective that forcing function is. Not only do you preserve meaning, but you can convey it better, because it respects the reader's cache-size limitations.
Here's my attempt; not a huge number of changes because it was not too bad to begin with, but with slightly less self-indulgent language, and a lot of the jargon has to stay (partly because I don't know the field):
Neural machine learning methods, such as deep neural networks (DNN), have achieved remarkable success in a number of complex data processing tasks. These methods have arguably had their strongest impact on tasks such as image and audio processing - areas where humans have always performed better than conventional algorithms. In contrast to biological neural systems, which are capable of learning continuously, deep artificial networks have a limited ability for incorporating new information after a network has been trained. As a result, continuous learning methods could be very helpful in allowing deep networks to handle data sets which change over time. Here, inspired by the process of adult neurogenesis in the hippocampus, we investigate how adding new neurons to artificial neural networks can allow them to acquire new information, while preserving what they have already learned. Our results on the MNIST handwritten digit dataset and the NIST SD 19 dataset, which includes lower and upper case letters and digits, show that neurogenesis looks like a good approach for tackling the "stability-plasticity dilemma" that has been a problem for adaptive machine learning algorithms for some time.
As an academic, I tend to agree that we frequently feel compelled to apply more verbosity than is strictly required in order to communicate the intended semantic constructs.
Problem: current methods for training neural networks cannot learn continuously. Once they have been trained, they cannot easily learn new information. We propose a method to fix this problem. Our algorithm adds new neurons to the neural network which allows it to learn new information, while maintaining everything it already "knows".
I showed this to my father, a surgeon, and he said he understood it. But not the original abstract.
Although I just glanced at a few parts of this, I did not find it to be poorly written. Can you give an example where you thought it was too verbose or unnecessarily complex?
For example in the abstract: "adding new neurons to deep layers of artificial neural networks in order to facilitate their acquisition of novel information while preserving previously trained data representations"
Nobody talks like this. In my head I read this sentence and have to translate it to "we add extra neurons to existing networks so they can learn new information while remembering everything they already know".
But you have to distinguish active and passive vocabulary. Words you commonly use and words you understand when others use them. And you also have to distinguish between written and spoken language. E.g. nobody would actually say "exempli gratia" but it's commonly used in writing.
English is not my mother tongue and for me it mostly falls into passive vocabulary, but it is perfectly understandable. I don't have to mentally translate the whole sentence into simpler words before being able to grasp its meaning.
And it's not like those words are obscure; they just rank second or third in frequency of use among their respective sets of synonyms.
However, your statement is actually vague and ambiguous.
"Extra neurons"? Input layer? Output layer? Just before a final, fully-connected layer? Somewhere in between?
"Everything it already knows"? What does it know? Character probabilities, like a charnn? Image categories like in a CNN? Input distributions, like a GAN?
From the abstract, I can immediately tell that this paper is about modifying deep auto-encoders in the hidden layers. From that, I can immediately understand that the paper is not about adapting to new input formats or output formats, but instead about inputs from a new distribution but in the same format.
The author's intended audience, researchers and academics, do talk like this. They do so because it is quickly understandable and actually information dense, as indicated in my above paragraph.
Anytime someone uses the phrase "in order to facilitate" they are being more verbose than is necessary in order to signify their greater erudition. There is no meaningful semantic or technical distinction between "in order to facilitate" and "to help."
- "We specifically consider the case of...a stacked deep autoencoder (AE), which is a type of neural network designed to encode a set of data samples such that they can be decoded to produce data sample reconstructions with minimal error
- "The first step of the NDL algorithm occurs when a set of new data points fail to be appropriately reconstructed by the trained network...When a data sample’s RE is too high, the assumption is that the AE level under examination does not contain a rich enough set of features to accurately reconstruct the sample.
- "The second step of the NDL algorithm is adding and training a new node, which occurs when a critical number of input data samples (outliers) fail to achieve adequate representation at some level of the network.
- "The final step of the NDL algorithm is intended to stabilize the network’s previous representations in the presence of newly added nodes. It involves training all the nodes in a level with both new data and replayed samples from previously seen classes on which the network has been trained.