Maybe I'm missing something, but from the paper https://www.cs.toronto.edu/~hinton/FFA13.pdf, they use non-conv nets on CIFAR-10 for back prop, resulting in 63% accuracy. And FF achieves 59% accuracy (at best).
Those are relatively close figures, but good accuracy on CIFAR-10 is 99%+ and getting ~94% is trivial.
So, if an improper architecture for a problem is used and the accuracy is poor, how compelling is using another optimization approach and achieving similar accuracy?
It's a unique and interesting approach, but the article specifically mentions it gets accuracy similar to backprop, but if this is the experiment that claim is based on, it loses some credibility in my eyes.
I think you have to set expectations based on how much of the ground you're ripping up. If you're adding some layers or some little tweak to an existing architecture, then yeah, going backwards on cifar-10 is a failure.
If, however, you are ripping out backpropagation like this paper is, then you get a big pass. This is not the new paradigm yet, but it's promising that it doesn't just completely fail.
This seems to be Hinton's MO though. A few years back he ripped out convolutions for capsules and while he claims it's better and some people might claim it "has potential", no one really uses it for much because, as with this, the actual numerical performance is worse on the tests people care about (e.g. imagenet accuracy).
Specifically, tenure is to remove the pressure that "you'd better be right" so professors are free to take meandering tangents through the solution space that don't seem like they'll pay off immediately.
The failure mode of tenure is that the professor just rests on their past accomplishments and doesn't do anything. That's a risk the system takes. In this case though, Geoff Hinton is doing everything right: he's not only not sitting around doing nothing, he's actively trying to obsolete the paradigm he helped usher in, just in case there is a better option out there. I think that's admirable
Not sure what you mean by >20 years to be right. I built and trained a 3-layer back-propagating neural net to do OCR on an Apple 2 in 1989 based on that paper. Admittedly, just the 26 upper case characters. But it clearly worked better than the alternatives.
Also Hinton doesn't have the best track record with his already forgotten/abandoned Capsule networks. I wonder what's the next thing he's going to come up with? He gets a pass because he is famous.
I think it can be simultaneously true that things like this should be tested with toy models we wouldn't expect to do great on CIFAR and also that we shouldn't expect exceptional results just because this person is already famous.
The best context in which to view the paper is as part of an algorithm search.
Until the brain's algorithm is "solved", half steps are important. We need as many alternate half steps as we can find until one or more lead to a better understanding of the brain. (And potentially, better-than-backprop efficiency or results.)
Achieving <80% on CIFAR10 in the year >2020 is an example of a failed toy model, not a successful toy model.
Almost any ML algorithm can be thrown at CIFAR10 and achieve ~60% accuracy; this ballpark of accuracy is really not sufficient to demonstrate viability, no matter how aesthetically interesting the approach might feel.
Hinton is doing basic science, not ML, here. Given who he is, trying to move the needle on traditional benchmarks would be a waste of his time and skills.
If he invents the new back propagation, an army of grad students can turn his ideas into the future. Like they've done for the last 15 years.
He's posting incremental work towards rethinking the field. It's pretty interesting stuff.
I haven't seen this to be the case, fwiw. There was a paper in 2016 that did this and most were in the ~40% range.
But "any ml algorithm" isn't the point. It's a new optimization technique and should be applied to models/architectures that make sense with the problems they are being used on.
For example, they could have used a pretrained featurizer and trained the two layer model on top of it, with both back prop and FF and compared.
> For example, they could have used a pretrained featurizer and trained the two layer model on top of it, with both back prop and FF and compared.
Making the assumption that weights/embeddings produced by a backprop-trained network are equally intelligible to a network also trained by backprop vs. one trained by this alternative method.
This is not a benchmark of some model on cifar-10, it's a benchmark of the training algorithm.
But model size and complexity also matter. An MLP with backprop gets about 63% on cifar-10, for various reasons. So achieving 59% accuracy means this algorithm is about 94% as good as backprop in this case (59/63 ≈ 0.94).
However, 63% accuracy on cifar-10 can be achieved with two (maybe three) layers IIRC. The output is a 10-way classifier, which is handled in one layer. If the output requires multi-layer transformations, then gradients need to be back-propagated.
As long as the batch activation vectors are trained to max separation (or orthogonality or whatever) at each layer, one output layer can match them to labels. But this is unlikely in problems where the output is more "transformed or complicated".
The article links to an old draft of the paper (it seems that the results in 4.1 couldn't be replicated). The arxiv has a more recent one: https://arxiv.org/abs/2212.13345
I skimmed through the paper and am a bit confused. There's only one equation and I feel like he rushed to publish a shower thought without even bothering to flesh it out mathematically.
So how do you optimize a layer? Do you still use gradient descent? So you have a per-layer loss with a positive and negative component and then do gradient descent?
So then what is the label for each layer? Do you use the same label for each layer?
And what does he mean by the forward pass not being fully known? I don't get this application of the blackbox between layers. Why would you want to do that?
Probably because the idea is trivial in hindsight (always is) so publishing fast is important. Afaict, the idea is to compute the gradients layer by layer and applying them immediately without bothering to back-propagate from the outputs. So in between layers would learn what orientation vectors previous layers emit for positive samples and themselves emit orientation vectors. Imagine a layer learning what regions of a sphere's (3d) surface are good and outputting what regions of a circle's (2d) perimeter are good. This is why he mentions the need for normalizing vectors otherwise layers would cheat and just look at the vector's magnitude.
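To make the "layers would cheat" point concrete, here is a minimal sketch in plain Python. It assumes goodness is the sum of squared activities (which is how I read the paper); the two vectors are made-up examples. After length normalization, a long and a short vector with the same orientation look identical to the next layer, so the next layer cannot read the previous layer's goodness straight off the magnitude.

```python
import math

def goodness(h):
    """Sum of squared activities (assumed goodness measure)."""
    return sum(v * v for v in h)

def normalize(h, eps=1e-8):
    """Rescale a vector to unit length so only its orientation is passed on."""
    norm = math.sqrt(goodness(h))
    return [v / (norm + eps) for v in h]

# Two hypothetical activity vectors: same orientation, different lengths.
positive_like = [3.0, 4.0, 0.0]   # long vector, high goodness
negative_like = [0.3, 0.4, 0.0]   # short vector, low goodness

print(goodness(positive_like), goodness(negative_like))
print(normalize(positive_like))
print(normalize(negative_like))
```

The two normalized vectors agree up to floating-point noise, so whatever the next layer learns has to come from orientation alone.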
The idea is imo similar to how random word embeddings are generated.
> because the idea is trivial in hindsight (always is) so publishing fast is important.
Unfortunately I've also seen papers get rejected because their idea was "trivial", yet no one had thought of it before. Hinton has an edge here though.
> Afaict, the idea is to compute the gradients layer by layer and applying them immediately without bothering to back-propagate from the outputs.
I'm not sure where you get that impression. Forward-Forward [1] seems to eschew gradients entirely:
The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself
It eschews back-propagation but not gradient calculation. You still have to nudge the activations' weights upward for positive examples and downward for negative ones. Positive examples should give a long vector and negative ones a small vector.
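A rough sketch of what "gradients but no back-propagation" could look like for a single ReLU layer, in plain Python. The assumptions are mine: goodness is the sum of squared activities, the probability of "positive" is a logistic function of goodness minus a threshold `theta`, and the weights, samples, and learning rate are invented for illustration. The gradient is computed by hand and stays inside the layer.

```python
import math

def forward(W, x):
    """One ReLU layer: h_j = max(0, sum_i W[j][i] * x[i])."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def goodness(h):
    return sum(v * v for v in h)

def local_step(W, x, positive, theta=1.0, lr=0.05):
    """Update W in place from this layer's own loss; nothing is sent to earlier layers."""
    h = forward(W, x)
    p = 1.0 / (1.0 + math.exp(theta - goodness(h)))  # p(sample is positive)
    # Derivative of the logistic loss w.r.t. goodness:
    # loss is -log(p) for positives and -log(1 - p) for negatives.
    dloss_dg = -(1.0 - p) if positive else p
    for j, row in enumerate(W):
        for i, xi in enumerate(x):
            # d goodness / d W[j][i] = 2 * h_j * x_i (zero when the ReLU is off)
            row[i] -= lr * dloss_dg * 2.0 * h[j] * xi

# Hypothetical 3x4 weight matrix and two toy samples.
W = [[0.3, 0.2, -0.1, -0.2],
     [0.1, -0.1, 0.3, -0.2],
     [-0.1, 0.1, 0.4, 0.1]]
x_pos = [1.0, 0.0, 1.0, 0.0]   # treated as positive (real) data
x_neg = [0.0, 1.0, 0.0, 1.0]   # treated as negative data

for _ in range(200):
    local_step(W, x_pos, positive=True)
    local_step(W, x_neg, positive=False)

print(goodness(forward(W, x_pos)), goodness(forward(W, x_neg)))
```

After training, the positive sample's goodness ends up above the threshold and the negative sample's below it, i.e. a long vector for positives and a short one for negatives.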
In my opinion, yes if and only if the update does not use a stateful optimiser, and the computation is easy / simple enough that the updated parameter value can be computed immediately.
In linear layers, it is possible. Once you have computed the gradient with respect to the output of the ith vector (a scalar), you scale the input by that value and add it to the parameters.
This is a simple FMA op: a=fma(eta*z, x, a), with z the gradient of the vector, x the input, a the parameters, and eta the learning rate. This computes a = a + eta*z*x in place.
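Spelled out as a sketch in plain Python, with made-up numbers. The function name `fma_update` is hypothetical, and a Python loop is of course not a true fused multiply-add (hardware FMA rounds once, this rounds twice); it only shows that the update needs nothing but the input, one scalar gradient, and the parameters themselves.

```python
def fma_update(a, x, z, eta):
    """In-place update a[i] = a[i] + (eta * z) * x[i] for one output unit."""
    scale = eta * z               # scalar: learning rate times the unit's gradient
    for i, xi in enumerate(x):
        a[i] = scale * xi + a[i]  # one multiply-add per parameter
    return a

a = [0.5, -0.25, 1.0]   # hypothetical parameter vector of output unit i
x = [2.0, 4.0, -1.0]    # layer input
z = 0.5                 # gradient w.r.t. this unit's scalar output
eta = 0.5               # learning rate (sign convention as in the formula above)

fma_update(a, x, z, eta)
print(a)  # [1.0, 0.75, 0.75]
```

No optimizer state is touched, which is why this only works as-is for a stateless update rule.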
Perhaps a better place to find algorithmic details is this related paper, also with Hinton as a co-author, which implements similar ideas in more standard networks:
Scaling Forward Gradient With Local Losses
Mengye Ren, Simon Kornblith, Renjie Liao, Geoffrey Hinton
https://arxiv.org/abs/2210.03310
The optimizers take parameters and their gradients as inputs and apply update rules to them, but the gradients you supply can come from anywhere. Backprop is the most common way to assign gradients to parameters, but other methods can work too. As long as the optimizer is getting both parameters and gradients, it doesn't care where they come from.
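A toy illustration of that separation, in plain Python. Everything here is invented for the example: the quadratic loss, the function names, and the use of finite differences as the "other method". The same `sgd_step` is fed once with exact gradients (what backprop would give) and once with a backprop-free estimate.

```python
def sgd_step(params, grads, lr=0.1):
    """Plain SGD: the optimizer only sees parameters and gradients."""
    return [p - lr * g for p, g in zip(params, grads)]

def loss(params):
    # Hypothetical toy objective with its minimum at (3, -1).
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2

def analytic_grads(params):
    # What backprop would produce for this loss.
    return [2.0 * (params[0] - 3.0), 2.0 * (params[1] + 1.0)]

def finite_difference_grads(params, eps=1e-5):
    # Backprop-free estimate: perturb each parameter and re-evaluate the loss.
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up) - loss(down)) / (2.0 * eps))
    return grads

p_backprop = [0.0, 0.0]
p_numeric = [0.0, 0.0]
for _ in range(100):
    p_backprop = sgd_step(p_backprop, analytic_grads(p_backprop))
    p_numeric = sgd_step(p_numeric, finite_difference_grads(p_numeric))

print(p_backprop, p_numeric)
```

Both runs land near (3, -1); the optimizer never knows which gradient source it was given.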
I found this paragraph from the paper very interesting:
> 7 The relevance of FF to analog hardware
> An energy efficient way to multiply an activity vector by a weight matrix is to implement activities as voltages and weights as conductances. Their products, per unit time, are charges which add themselves. This seems a lot more sensible than driving transistors at high power to model the individual bits in the digital representation of a number and then performing O(n^2) single bit operations to multiply two n-bit numbers together. Unfortunately, it is difficult to implement the backpropagation procedure in an equally efficient way, so people have resorted to using A-to-D converters and digital computations for computing gradients (Kendall et al., 2020). The use of two forward passes instead of a forward and a backward pass should make these A-to-D converters unnecessary.
It was my impression that it is difficult to properly isolate an electronic system to use voltages in this way (hence computers sort of "cut" voltages into bits 0/1 using a step function).
Have these limitations been overcome or do they not matter as much, as neural networks can work with more fuzzy data?
Photonic/optical neural networks are an interesting related area of research, using light interference to implement convolution and other operations without (I believe?) needing a bitwise representation of intensity.
The small deltas resulting from electrical noise generally aren't an issue for probabilistic computations. Interestingly, people have quantized many large DL models down to 8/16 bits, and accuracy reduction is often on the order of 2-5%. Additionally, adding random noise to weights during training tends to act as a form of regularization.
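For a feel of how small those deltas are at 8 bits, here is a sketch of symmetric linear quantization in plain Python. The scheme and the weight values are assumptions chosen for illustration; real toolchains use per-channel scales, zero points, calibration and so on.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.823, -0.412, 0.057, -1.27, 0.334]   # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(max_error, scale / 2)
```

The round-trip error per weight is bounded by half a quantization step, which behaves much like a small amount of additive noise on the weights.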
There's been unhappiness in some quarters that back propagation doesn't seem to be something that biology does. That may be part of the motivation here.
The paragraph about Mortal Computation is worth repeating:
If these FF networks can be proven to scale or made to scale similarly to BP networks, this would enable making hardware several orders of magnitude more efficient, for the price of losing the ability to make exact copies of models to other computers.
(The loss of reproducibility sits well with the tradition of scientific papers anyway/s;)
2.) How does this paper relate to Hinton's feedback alignment from 5 years ago? I remember it was feedback without derivatives. What are the key new ideas? To adjust the output of each individual layer to be big for positive cases and small for negative cases without any feedback? Have these approaches been combined?
Not a deep learning expert, but: it seems that without backpropagation for model updates, the communication costs should be lower. And that will enable models that are easier to parallelize?
Nvidia isn't creating new versions of its NVLink/NVSwitch products just for the sake of it, better communication must be a key enabler.
Can someone with deeper knowledge comment on this? Is communication a bottleneck, and will this algorithm uncover a new design space for NNs?
It’s more that without backpropagation, you no longer need to store your forward activations across many layers to compute the backward pass, which usually depends on a forward pass. When a network is hundreds of layers deep and batches are very large, the stored forward and backward values add up in terms of memory required.
Communication across GPUs doesn’t solve this, but instead allows you either to run many models in parallel on different GPUs to increase batch size or to spread layers across GPUs to increase model size. Quick communication is critical to maintain training speeds that aren’t astronomical.
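A back-of-the-envelope version of the memory argument, in plain Python. The sizes are hypothetical and the model is deliberately crude: one activation tensor per layer, fp32 values, nothing else counted.

```python
def activation_bytes(layers, batch_size, width, bytes_per_value=4):
    """Memory needed to keep every layer's activations for a backward pass."""
    return layers * batch_size * width * bytes_per_value

def local_bytes(batch_size, width, bytes_per_value=4):
    """Memory if each layer updates immediately and only current activations are kept."""
    return batch_size * width * bytes_per_value

# Hypothetical sizes, chosen only to show the scaling.
layers, batch_size, width = 200, 1024, 4096

stored = activation_bytes(layers, batch_size, width)
local = local_bytes(batch_size, width)
print(stored / 2**30, "GiB vs", local / 2**30, "GiB")
```

The stored-activation cost grows linearly with depth, while the layer-local cost does not depend on depth at all under these assumptions.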
> will this algorithm uncover a new design space for NNs?
No.
Hinton "discovered" stacking ensembles and gave it a new name, fancy analogies to biological brains and then made it worse.
The gist of this is that you can select a computational unit, be it a linear layer, or a collection of layers, compute the derivative of the output with respect to the parameters, and update them.
Each computational unit is independent, meaning that you don't calculate gradients going outside of it.
This is the same as training a bunch of networks, computing predictions, and then using another layer to combine the predictions. This is called stacking, and the networks are called an "ensemble". You can do this multiple times and have N levels of meta estimators.
Instead of fitting the ensemble and then the meta estimator, Hinton proposes training both simultaneously but without allowing gradients to flow through.
That is stupid because if you don't allow gradients to flow through, you will see a context drift as the data distribution changes. Hinton observed this context drift and, to deal with it, proposed normalizing the data.
On one extreme, you can use individual linear units as the models, and on the other extreme, you can combine all units into a single neural network and treat that as a module.
So no, this does not open any new design, it's an old idea, worsened, and wrapped in fancy words and post-facto reasoning.
If you are curious how a linear layer is an ensemble, observe that each vector is its own linear estimator, making the linear mapping an ensemble of estimators.
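The row-by-row view is easy to check numerically. A small sketch in plain Python with an invented 3x2 weight matrix: applying the whole layer gives exactly the same numbers as treating each row as a separate linear estimator and collecting the results.

```python
def linear_layer(W, x):
    """Full layer: one matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def single_estimator(row, x):
    """One row of W used on its own as a linear estimator."""
    return sum(w * xi for w, xi in zip(row, x))

W = [[1.0, 2.0], [0.5, -1.0], [-3.0, 0.25]]  # hypothetical 3x2 weight matrix
x = [4.0, 2.0]

layer_out = linear_layer(W, x)
ensemble_out = [single_estimator(row, x) for row in W]
print(layer_out, ensemble_out)  # [8.0, 0.0, -11.5] both times
```

This only shows the forward computation is identical; whether the training dynamics match those of a classical ensemble is the part being argued about.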
Yeah, no. Reading the paper I don't really see anything but a superficial resemblance to stacking. Hinton was active back when Wolpert introduced stacking and I'm fairly sure he is aware of it. If anything it much more closely resembles his own prior work in Boltzmann machines, unsurprisingly (and which he cites), or even his prior work on capsules. I don't know if this will really pan out into anything that different or useful for the field, but it's unfair and inaccurate to dismiss it as derivative of stacking.
A single linear layer is for all intents and purposes equivalent to running an ensemble of linear estimators. By disallowing gradients to flow between two layers A, B, computing (B . f . A)(x) with f being a non linearity, the second layer is an ensemble of linear estimators of the outputs of the first, and for all intents and purposes, the output of (f.A)(x) is just preprocessing for B.
Since gradients don't flow from B to A in (B.f.A)(x), A is trained independently of B, meaning that the training distribution of B changes without B influencing it, i.e. context drift. B doesn't know the difference, and B doesn't influence it.
For all intents and purposes, you can compute all the outputs as training of A happens, meaning training A to completion, and then feed them into B and B will still compute the same outputs and derivatives as it did before.
To deal with context drift, Hinton proposes normalizing the data, so the distribution does not change significantly.
Whatever he proposed is not "backprop-free" either. It still involves backprop, but the number of layers gradients flow through is 1, the layer itself.
The argument that you can still train through non-differentiable operations is not particularly convincing either; the reparameterization trick shows that it is trivial to pass gradients through non-differentiable operations if we are smart about it.
Given non differentiable operator Z: R^N -> R^N; let A, B, C be R^N -> R^N linear layer, B(Z(C(x)) * A(C(x))) allows gradients to flow through B and A all the way to C. The output of Z is for all intents and purposes a Hadamard product with (A . C)(x) that is runtime constructed and might as well be part of the input.
You can even run Z(C(x)) through a neural network and learn how to transform that and still provide useful and informative gradients back to C(x) via (A . C)
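A scalar sketch of that construction in plain Python, where A, B, C collapse to single weights a, b, c and Z is taken to be a hard sign function (my choice of a non-differentiable operator; all numbers are made up). The gradient with respect to c treats Z's output as a constant gate and flows only through the A branch; a finite-difference check away from the discontinuity agrees with it.

```python
def Z(u):
    """A non-differentiable operator: a hard sign function."""
    return 1.0 if u > 0 else -1.0

def model(a, b, c, x):
    """Scalar version of B(Z(C(x)) * A(C(x)))."""
    h = c * x
    return b * (Z(h) * (a * h))

def grad_c(a, b, c, x):
    """Gradient w.r.t. c, treating Z(h) as a constant gate and flowing through A."""
    return b * Z(c * x) * a * x

a, b, c, x = 0.5, 2.0, -1.5, 3.0   # hypothetical scalar weights and input
eps = 1e-6
numeric = (model(a, b, c + eps, x) - model(a, b, c - eps, x)) / (2 * eps)
print(grad_c(a, b, c, x), numeric)
```

Exactly at h = 0 the function has a kink and the two would disagree, so this is a statement about almost everywhere, not everywhere.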
I'm not sure what the main point is here. The paper is definitely sketchy on details, and the main idea is definitely simple enough to resemble a lot of other work. I wouldn't be surprised if someone (maybe a certain Swiss researcher) comes out and says, actually, this is the same as this other paper from the early 90s. If you squint hard enough a lot of ideas (especially simple ones) can be seen as being the same as other, older ideas. I'm not too interested in splitting those hairs, really. I'm more curious on whether this eventually leads to something that sets it apart from the SOTA in some interesting way.
My claim is that this work is simply worse ensembles wrapped in biologically inspired claims, and that the arguments the author makes in support of it over other approaches are simply not sound.
By looking at it through that perspective, the issues with the approach become evident, and are fundamental in my opinion.
Unfortunately, I've been here for a decade in one form or another. Every now and then someone writes something so pompous that I just can't help myself but post. Back to lurking now. Cheers!
If we can't publicly scrutinize people who have great sway in the industry, what does that say about us as a research community?
The fact that I argued why I found it bogus based on well-established principles, and I get shitted on by people who have provided nothing to this conversation except suppressing criticism or throwing ad hominems, should tell you all about the quality of discourse.
Dismissing criticism, not by arguments, but by the mere name of the person does a disservice to everyone.
If the research can't stand on its own, independent of the author, then it is not good research.
The hard dismissal with 'No' is likely why you got down voted. I am not able to do that.
With that kind of tonal promise, especially considering the source you are dismissing outright is important in their field, you have to show, not just tell.
If you just left that No out, and gave room for the chance that you are wrong, people wouldn't downvote, they'd upvote. People like to hear smart arguments. No one wants to hear outward dismissal. Especially of known experts.
That's a lot less cheaty, biologically speaking, than full backprop. This Hinton guy sounds like he knows what he's talking about.
"Context drift as data distribution changes" sounds a hell of a lot like real life to me.
Normalized = hedonic treadmill on long view
At the micro scale, data that overflows the normalization is stored in emotional state, creating an orthogonal source of truth that makes up for the lack of full connected learning.
It's not like I have been doing many things outside reading papers and books over the past few years... The post that person used as an ad-hominem even says so.
This person has made a similar remark and purposefully looked into my post history - which doesn't make any claims about my knowledge or skills - lied about it, and made a sarcastic remark, see "everything".
Hinton's networks become the neuron of novel networks. It is important to know that these types of weights don't learn features, they map a compressed representation of the learned info, which is the input. Classification through error correction. That is actually what labels do for supervised learning (IOW they learn many ways to represent the label, and that is what the weights are).
Modern AI do that plus learn features, but the weights are nevertheless a representation of what was learned, plus a fancy way to encode and decode into that domain.
What Hinton and Deepmind will do is use neural-network learned-data, or perhaps the weights, as input to this kind of network. In other words, the output of another NN is labeled a priori, ergo you can use it in "unsupervised" networks, which this research expounds. This will allow them to cook the input network into a specific dish, by labels even. Now give me my phd.
The divulgational title is almost an understatement: the Forward-Forward algorithm is an alternative to backpropagation.
Edit: sorry, the previous formulation of the above in this post, relative to the advantages, was due to a misreading. Hinton writes:
> The Forward-Forward algorithm (FF) is comparable in speed to backpropagation but has the advantage that it can be used when the precise details of the forward computation are unknown. It also has the advantage that it can learn while pipelining sequential data through a neural network without ever storing the neural activities or stopping to propagate error derivatives....The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning
I think it is scrambled nonsense but it's scrambled in a way that still makes it look like a plausible sample. I remember watching a video of Hinton saying that just using white noise or similarly randomized data does not work but I'm now forgetting why.
Meta question on HN implementation: why does submitting a previously submitted resource sometimes link automatically to the previous discussion, while other times it is considered a new submission?
As far as I know it's a simple string match on the url. If the url is different (for example a new anchor tag is added) then it's considered a new submission.
It’s incredible to think that dreams are just our brains generating training data, and lack of sleep causes us to overfit on our immediate surroundings.
We tend to start hallucinating when we don't have enough sleep. So generating training data is necessary, but way safer when our muscles are turned off.
This makes me cheerful because it suggests a way that studying systems which appear intelligent might be able to teach us more about how human intelligence works.
In addition: in pre-electricity times, humans woke up after 4 hours of sleep, stayed awake for some time, and then went back to sleep. My guess, this sleep pattern is better for learning.
It's called biphasic sleep for people that want to read up on it.
> My guess, this sleep pattern is better for learning.
That might be true. One of the techniques to induce lucid dreaming works similarly – sleep for 4-5 hours, wake up, stay awake for 15-60mins then go back to sleep. It's called "wake back to bed" technique. Many lucid dreamers report increased capacity for learning in dreams.
Unlikely. The human genome comprises only billions of bits, much of which is low-information repetition. The amount of information sensed over a lifetime is vastly greater. To sense less than a billion bits over a 30-year development period would imply roughly one bit per second or less. We clearly perceive more than one bit per second. For this reason, it seems likely that more information comes from learning post-birth than is pre-conditioned by evolution pre-birth. (Though of course post-birth learning cannot take place without the fantastic foundation set by evolution.)
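The arithmetic behind that rate, as a quick check in plain Python. The one-billion-bit budget and the 30-year window are the figures from the comment; the 365.25-day year is my assumption.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def bits_per_second(total_bits, years):
    """Average rate needed to take in `total_bits` over `years`."""
    return total_bits / (years * SECONDS_PER_YEAR)

rate = bits_per_second(total_bits=1e9, years=30)
print(rate)
```

Thirty years is a bit under a billion seconds, so a billion bits works out to just over one bit per second.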
> The human genome comprises only billions of bits, much of which is low-information repetition.
We constantly find out that certain things are actually really important even though we thought they were junk. Recall that our best ability to test genes is by knocking them out one by one and trying to observe the effect.
The brain is comprised of many extremely specialized sub systems and formulas for generating knowledge. We don’t know English at birth, sure, but we do have a language processing capability. The training baked into the brain is a level of abstraction higher, establishing frameworks to learn other things. It may not be as storage data heavy, but it’s much harder to arrive at and is the bulk of the learning process (learning to learn)
It seems that the point is that the objective function is applied layerwise and still computes a gradient to get the update direction; it's just that gradients don't propagate to previous layers (detached tensor).
As far as I can tell, this is almost the same as stacking multiple layers of ensembles, except worse as each ensemble is trained while previous ensembles are learning. This is causing context drift.
To deal with the context drift, Hinton proposes to normalise the output.
This isn't anything new or novel. Expressing "ThIs LoOkS sImIlAr To HoW cOgNiTiOn WoRkS" to make it sound impressive doesn't make it impressive or good by any stretch of the imagination.
Hinton just took something that existed for a long time, made it worse, gave it a different name and wrapped it in a paper under his name.
With every paper I am more convinced that the Laureates don't deserve the award.
Sorry, this "paper" smells from a mile away, and the fact that it is upvoted as much shows that people will upvote anything if they see a pretty name attached.
Edit:
Due to the apparent controversy of my criticism, I can't respond with a reply, so here is my response to the comment below asking what exactly makes this worse.
> As far as I can tell, this is almost the same as stacking multiple layers of ensembles
It isn't new. Ensembling is used and has been used for a long time. All kaggle competitions are won through ensembles and even ensembles of ensembles. It is a well studied field.
> except worse as each ensemble is trained while previous ensembles are learning.
Ensembles exhibit certain properties, but only if they are trained independently from each other. This is well studied; you can read more about it in Bishop's Pattern Recognition book.
> This is causing context drift.
Context drift occurs when a distribution changes over time. This changes the loss landscape which means the global minima change / move.
> To deal with the context drift, Hinton proposes to normalise the output.
So not only is what Hinton built a variation of something that already existed, he made it worse by training the models simultaneously and, to handle the fact that it is worse, added additional computations to deal with said issue.
You train the network to detect correlations between the values of the first ten pixels and the rest of the image. Imagine you have a bunch of images of digits. For images with digit three you set the third pixel to white, for images of the digit four, you set the fourth pixel to white, and so on (actually, zero-indexing so fourth and fifth pixel for digit three and four but whatever). The other nine pixels among the first ten you set to black. These are positive samples and training the network with them will make it output a big number when it encounters them. Then you swap the pixels so that the images with the digit three have the fourth pixel set to white and the images with the digit four have the third pixel set to white. These are negative samples and they cause the network to output a small number. Thus, the only difference between positive and negative samples is the location of the white pixel. So for an image you want to classify you run it through the network ten times, shifting the location of the white pixel each time. The location for which the network outputs the biggest number is the predicted class.
Obviously, this method is problematic if you have thousands of labels or if your network is not a classifier.
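The label-in-the-pixels scheme described above can be sketched in a few lines of plain Python. The helper names, the flattened 784-pixel image, and especially `toy_goodness` (a stand-in for a trained network) are all hypothetical; labels are zero-indexed, so digit three lights up pixel index 3.

```python
def embed_label(image, label, num_classes=10):
    """Overwrite the first `num_classes` pixels with a one-hot code for `label`."""
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    return one_hot + list(image[num_classes:])

def make_pair(image, true_label, wrong_label):
    """Positive sample carries the true label; negative sample carries a wrong one."""
    return embed_label(image, true_label), embed_label(image, wrong_label)

def classify(image, goodness_fn, num_classes=10):
    """Try every label and keep the one the network scores highest."""
    scores = [goodness_fn(embed_label(image, k, num_classes)) for k in range(num_classes)]
    return max(range(num_classes), key=lambda k: scores[k])

# Stand-in for a trained network: a hypothetical scorer that happens to like label 3.
def toy_goodness(pixels):
    return pixels[3] * 5.0 + sum(pixels[10:]) * 0.01

image = [0.5] * 784   # hypothetical flattened 28x28 image
positive, negative = make_pair(image, true_label=3, wrong_label=4)
print(positive[:10], negative[:10], classify(image, toy_goodness))
```

The cost of `classify` is one forward pass per candidate label, which is where the problem with thousands of labels comes from.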
A model is a function F that minimizes error in y_i = F(X_i) + error. Inserting a label simply means a function F(X_i,y_j). Then you optimize it in some way to separate true labels from false label, e.g. F(X_i, y_j) = (y_i==y_j)-(y_i!=y_j) + error.
So the initial “hey, this might be a good idea” implementation performs slightly worse than something that has had literally billions of dollars thrown at it?
The question was "If it's not new or novel, why aren't people using it?". For example, take a look at this paper: https://arxiv.org/abs/1905.11786. It was published 3 years ago, it also does parallel layer-wise optimization, it even talks about being inspired by biology, though the objective function is different from Hinton's. Why aren't people using it? Because it performs worse. It is that simple. Is it an interesting area to explore? Probably. There are millions of interesting areas to explore. It doesn't mean it is worth using, at least yet.
It is far less accurate compared to SOTA models. The paper says it should train faster, but it doesn't provide any metrics; so it's hard to make any "pound for pound" comparisons.
That's the same impression I had. I was afraid I am not getting something or missing a bigger picture. I am glad I am not the only one who feels that way.
It turns out there’s a whole subfield for quantum ML. I don’t know much about it, but it’s neat that there’s any applicability. It’s not obvious that there was any connection.