Geoffrey Hinton publishes new deep learning algorithm (infoq.com)
319 points by danboarder on Jan 12, 2023 | hide | past | favorite | 121 comments



Maybe I'm missing something, but from the paper https://www.cs.toronto.edu/~hinton/FFA13.pdf, they use non-conv nets on CIFAR-10 for back prop, resulting in 63% accuracy. And FF achieves 59% accuracy (at best).

Those are relatively close figures, but good accuracy on CIFAR-10 is 99%+ and getting ~94% is trivial.

So, if an improper architecture for a problem is used and the accuracy is poor, how compelling is using another optimization approach and achieving similar accuracy?

It's a unique and interesting approach, but the article specifically claims it gets accuracy similar to backprop, and if this is the experiment that claim is based on, it loses some credibility in my eyes.


I think you have to set expectations based on how much of the ground you're ripping up. If you're adding some layers or some little tweak to an existing architecture, then yeah, going backwards on cifar-10 is a failure.

If, however, you are ripping out backpropagation like this paper is, then you get a big pass. This is not the new paradigm yet, but it's promising that it doesn't just completely fail.


This seems to be Hinton's MO though. A few years back he ripped out convolutions for capsules and while he claims it's better and some people might claim it "has potential", no one really uses it for much because, as with this, the actual numerical performance is worse on the tests people care about (e.g. imagenet accuracy).

https://en.wikipedia.org/wiki/Capsule_neural_network


I mean yes, this should be the MO of a tenured professor: making large, speculative bets, not hyper-optimizing benchmarks.


But some of those bets should be right, or else he'd be better spending his time and accumulated knowledge writing a historical monograph.


Specifically, tenure is to remove the pressure that "you'd better be right" so professors are free to take meandering tangents through the solution space that don't seem like they'll pay off immediately.

The failure mode of tenure is that the professor just rests on their past accomplishments and doesn't do anything. That's a risk the system takes. In this case though, Geoff Hinton is doing everything right: he's not only not sitting around doing nothing, he's actively trying to obsolete the paradigm he helped usher in, just in case there is a better option out there. I think that's admirable


The backpropagation paper was published in 1986.

It took >20 years for it to be right.

Maybe we ought to give this one some time?


Not sure what you mean by >20 years to be right. I built and trained a 3-layer back-propagating neural net to do OCR on an Apple 2 in 1989 based on that paper. Admittedly, just the 26 upper case characters. But it clearly worked better than the alternatives.


There is no shortage of paradigms that rip out backprop and deliver worse results.


This is so true! But we should keep trying :)


Also Hinton doesn't have the best track record with his already forgotten/abandoned Capsule networks. I wonder what's the next thing he's going to come up with? He gets a pass because he is famous.


I think it can be simultaneously true that things like this should be tested with toy models we wouldn't expect to do great on CIFAR and also that we shouldn't expect exceptional results just because this person is already famous.


The best context in which to view the paper is as part of an algorithm search.

Until the brain's algorithm is "solved", half steps are important. We need as many alternate half steps as we can find until one or more lead to a better understanding of the brain. (And potentially, better-than-backprop efficiency or results.)


You have to start with toy models before scaling up.


Achieving <80% on CIFAR10 in the year >2020 is an example of a failed toy model, not a successful toy model.

Almost any ML algorithm can be thrown at CIFAR10 and achieve ~60% accuracy; this ballpark of accuracy is really not sufficient to demonstrate viability, no matter how aesthetically interesting the approach might feel.


Hinton is doing basic science, not ML, here. Given who he is, trying to move the needle on traditional benchmarks would be a waste of his time and skills.

If he invents the new back propagation, an army of grad students can turn his ideas into the future. Like they've done for the last 15 years.

He's posting incremental work towards rethinking the field. It's pretty interesting stuff.

Edit: grammar


Plain MLP acc. is 63% vs 59% with FF, not so bad? By the same logic, MLP is a failed toy model.


I haven't seen this to be the case, fwiw. There was a paper in 2016 that did this and most were in the ~40% range.

But "any ml algorithm" isn't the point. It's a new optimization technique and should be applied to models/architectures that make sense with the problems they are being used on.

For example, they could have used a pretrained featurizer and trained the two layer model on top of it, with both back prop and FF and compared.


> For example, they could have used a pretrained featurizer and trained the two layer model on top of it, with both back prop and FF and compared.

Making the assumption that weights/embeddings produced by a backprop-trained network are equally intelligible to a network also trained by backprop vs. one trained by this alternative method.


I have personally seen them used successfully with all kinds of classic ml algorithms (enets, tree-based, etc) that have nothing to do with back prop.


Any ML algorithm that already has tooling written, CUDA scripts, etc. to run it faster.

That said, I am also short-term bearish on backprop-free methods (although potentially long-term bullish).


This is not a benchmark of some model on cifar-10, it's a benchmark of the training algorithm.

But, model size and complexity also matters. MLP with backprop gets about 63% on cifar-10, for various reasons. So achieving 59% accuracy means this algorithm is about 93% as good as backprop in this case.

However, 63% accuracy on cifar-10 can be achieved with two (maybe three) layers IIRC. The output is a 10-way classifier, which is handled in one layer. If the output requires multi-layer transformations, then gradients need to be back-propagated.

As long as the batch activation vectors are trained to max separation (or orthogonality or whatever) at each layer, one output layer can match them to labels. But this is unlikely in problems where the output is more "transformed or complicated".


The article links to an old draft of the paper (it seems that the results in 4.1 couldn't be replicated). The arxiv has a more recent one: https://arxiv.org/abs/2212.13345


I skimmed through the paper and am a bit confused. There's only one equation and I feel like he rushed to publish a shower thought without even bothering to flesh it out mathematically.

So how do you optimize a layer? Do you still use gradient descent? Do you have a per-layer loss with a positive and a negative component and then do gradient descent on that?

So then what is the label for each layer? Do you use the same label for each layer?

And what does he mean by the forward pass not being fully known? I don't get this application of the blackbox between layers. Why would you want to do that?


Those details have to be omitted from manuscripts in order to avoid having to cite the works of Jürgen Schmidhuber.


Jürgen did it all before in the 80s, however it was never translated to English so Geoffrey could happily reinvent it.


Jürgen invented AGI in the early 90s but someone pressed the red button on his website and it committed suicide.


Just curious, why would one want to avoid citing Schmidhuber's work?


It is a bit of a meme in AI research, as Schmidhuber often claims that he hasn't received the citations that he thinks he deserves.

https://www.urbandictionary.com/define.php?term=schmidhuber


It's not just a meme in this case btw :D https://twitter.com/SchmidhuberAI/status/1605246688939364352


Probably because the idea is trivial in hindsight (always is) so publishing fast is important. Afaict, the idea is to compute the gradients layer by layer and apply them immediately, without bothering to back-propagate from the outputs. So in-between layers would learn what orientation vectors previous layers emit for positive samples and would themselves emit orientation vectors. Imagine a layer learning what regions of a sphere's (3d) surface are good and outputting what regions of a circle's (2d) perimeter are good. This is why he mentions the need for normalizing vectors: otherwise layers would cheat and just look at the vector's magnitude.

The idea is imo similar to how random word embeddings are generated.
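
A rough sketch of that per-layer reading in PyTorch; the layer size, threshold, and exact loss here are my guesses, not the paper's reference code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    layer = nn.Linear(784, 500)                  # hypothetical layer size
    opt = torch.optim.SGD(layer.parameters(), lr=0.03)
    theta = 2.0                                  # "goodness" threshold

    def local_step(x_pos, x_neg):
        # length-normalize inputs so a layer can't cheat by passing magnitude along
        x_pos = x_pos / (x_pos.norm(dim=1, keepdim=True) + 1e-8)
        x_neg = x_neg / (x_neg.norm(dim=1, keepdim=True) + 1e-8)
        g_pos = F.relu(layer(x_pos)).pow(2).sum(dim=1)   # goodness of positive data
        g_neg = F.relu(layer(x_neg)).pow(2).sum(dim=1)   # goodness of negative data
        # push positive goodness above theta and negative goodness below it
        loss = F.softplus(torch.cat([theta - g_pos, g_neg - theta])).mean()
        opt.zero_grad()
        loss.backward()                          # gradients stay inside this layer
        opt.step()
        return loss.item()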


> because the idea is trivial in hindsight (always is) so publishing fast is important.

Unfortunately I've also seen papers get rejected because their idea was "trivial", yet no one had thought of it before. Hinton has an edge here though.


> Afaict, the idea is to compute the gradients layer by layer and apply them immediately, without bothering to back-propagate from the outputs.

I'm not sure where you get that impression. Forward-Forward [1] seems to eschew gradients entirely:

    The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself
[1] https://www.cs.toronto.edu/~hinton/FFA13.pdf


It eschews back-propagation but not gradient calculation. You still have to nudge the activations' weights upward for positive examples and downward for negative ones. Positive examples should give a long vector and negative ones a small vector.


Let's suppose that you are correct: in which direction are the weights updated?

The implementations of this compute gradients locally.


It makes sense that all gradients are local. Does it make sense to say that gradient propagation through the layers is memoryless?


In my opinion, yes if and only if the update does not use a stateful optimiser, and the computation is easy / simple enough that the updated parameter value can be computed immediately.

In linear layers, it is possible. Once you have computed the gradient for the ith element of the output vector (a scalar), you scale the input by that value and add it to the corresponding parameters.

This is a simple FMA op: a=fma(eta*z, x, a), with z the gradient of the vector, x the input, a the parameters, and eta the learning rate. This computes a = a + eta*z*x in place.
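
A toy numpy version of that update, with made-up shapes; each row of W gets the rank-1 fused multiply-add the parent describes:

    import numpy as np

    eta = 0.01                    # learning rate
    W = np.random.randn(4, 8)     # parameters (hypothetical shape)
    x = np.random.randn(8)        # layer input
    z = np.random.randn(4)        # gradient for each output element

    # in-place rank-1 update: row i becomes W[i] + eta * z[i] * x,
    # i.e. a = a + eta * z * x applied across the whole layer
    W += eta * np.outer(z, x)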


Perhaps a better place to find algorithmic details is this related paper, also with Hinton as a co-author, which implements similar ideas in more standard networks:

Scaling Forward Gradient With Local Losses. Mengye Ren, Simon Kornblith, Renjie Liao, Geoffrey Hinton. https://arxiv.org/abs/2210.03310

and has code: https://github.com/google-research/google-research/tree/mast...


> There's only one equation

Not accurate for the version another commenter linked: https://www.cs.toronto.edu/~hinton/FFA13.pdf

I see four equations.


Deep dive tutorial for learning in a forward pass [1]

[1] https://amassivek.github.io/sigprop


> There are many choices for a loss L (e.g. gradient, Hebbian) and optimizer (e.g. SGD, Momentum, ADAM). The output(), y, is detailed in step 4 below.

I don't get it, don't all of those optimizers work via backprop?


The optimizers take parameters and their gradients as inputs and apply update rules to them, but the gradients you supply can come from anywhere. Backprop is the most common way to assign gradients to parameters, but other methods can work too; as long as the optimizer gets both parameters and gradients, it doesn't care where they came from.
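
For instance, in PyTorch you can hand an optimizer a gradient you computed yourself; a minimal sketch, not specific to forward-forward:

    import torch

    w = torch.zeros(3, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.1)

    # no backward() call: the gradient is assigned directly, e.g. from some
    # local or forward-mode estimate
    w.grad = torch.tensor([1.0, -2.0, 0.5])
    opt.step()                    # applies w <- w - lr * w.grad
    print(w)                      # tensor([-0.1000,  0.2000, -0.0500], requires_grad=True)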


I found this paragraph from the paper very interesting:

> 7 The relevance of FF to analog hardware

> An energy efficient way to multiply an activity vector by a weight matrix is to implement activities as voltages and weights as conductances. Their products, per unit time, are charges which add themselves. This seems a lot more sensible than driving transistors at high power to model the individual bits in the digital representation of a number and then performing O(n^2) single bit operations to multiply two n-bit numbers together. Unfortunately, it is difficult to implement the backpropagation procedure in an equally efficient way, so people have resorted to using A-to-D converters and digital computations for computing gradients (Kendall et al., 2020). The use of two forward passes instead of a forward and a backward pass should make these A-to-D converters unnecessary.

It was my impression that it is difficult to properly isolate an electronic system to use voltages in this way (hence computers sort of "cut" voltages into bits 0/1 using a step function).

Have these limitations been overcome or do they not matter as much, as neural networks can work with more fuzzy data?

Interesting to imagine such a processor though.
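
A toy numeric illustration of the analog trick quoted above (sizes and units made up): voltages times conductances give currents, the currents on each output line sum on their own, so the matrix-vector product comes essentially for free.

    import numpy as np

    V = np.random.rand(8)             # input activities encoded as voltages
    G = np.random.rand(4, 8) * 1e-3   # weights encoded as conductances
    dt = 1e-6                         # integration time

    I = G @ V                         # Ohm's law + Kirchhoff: per-line currents
    Q = I * dt                        # accumulated charge per output line,
                                      # i.e. a scaled copy of the product W @ x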


Photonic/optical neural networks are an interesting related area of research, using light interference to implement convolution and other operations without (I believe?) needing a bitwise representation of intensity.

https://www.nature.com/articles/s41467-020-20719-7

https://opg.optica.org/optica/fulltext.cfm?uri=optica-5-7-86...


The small deltas resulting from electrical noise generally aren't an issue for probabilistic computations. Interestingly, people have quantized many large DL models down to 8/16 bits, and accuracy reduction is often on the order of 2-5%. Additionally, adding random noise to weights during training tends to act as a form of regularization.


There's been unhappiness in some quarters that back propagation doesn't seem to be something that biology does. That may be part of the motivation here.


The paragraph about Mortal Computation is worth repeating:

1.) If these FF networks can be proven to scale, or made to scale similarly to BP networks, this would enable making hardware several orders of magnitude more efficient, at the price of losing the ability to make exact copies of models to other computers. (The loss of reproducibility sits well with the tradition of scientific papers anyway /s ;)

2.) How does this paper relate to Hinton's feedback alignment from 5 years ago? I remember it was feedback without derivatives. What are the key new ideas? To adjust the output of each individual layer to be big for positive cases and small for negative cases, without any feedback? Have these approaches been combined?


Discussion last month when the preprint was released: https://news.ycombinator.com/item?id=33823170



Not a deep learning expert, but: it seems that without backpropagation for model updates, the communication costs should be lower. And that will enable models that are easier to parallelize?

Nvidia isn't creating new versions of its NVLink/NVSwitch products just for the sake of it; better communication must be a key enabler.

Can someone with deeper knowledge comment on this? Is communication a bottleneck, and will this algorithm uncover a new design space for NNs?


It’s more that without backpropagation, you no longer need to store your forward activations across many layers to compute the backwards pass which usually is dependent on a forward pass. When a network is hundreds of layers, and batches are very large, the forwards and backwards accumulations add up in terms of memory required.

Communication across GPUs doesn't solve this, but instead lets you either run many models in parallel on different GPUs to increase batch size, or spread many layers across GPUs to increase model size. Quick communication is critical to keep training times from becoming astronomical.


> will this algorithm uncover a new design space for NNs?

No.

Hinton "discovered" stacking ensembles, gave it a new name and fancy analogies to biological brains, and then made it worse.

The gist of this is that you can select a computational unit, be it a linear layer, or a collection of layers, compute the derivative of the output with respect to the parameters, and update them.

Each computational unit is independent, meaning that you don't calculate gradients going outside of it.

This is the same as training a bunch of networks, computing predictions, and then using another layer to combine the predictions. This is called stacking, and the networks are called an "ensemble". You can do this multiple times and have N levels of meta estimators.

Instead of fitting the ensemble and then the meta estimator, Hinton proposes training both simultaneously but without allowing gradients to flow through.

That is stupid, because if you don't allow gradients to flow through, you will see context drift as the data distribution changes. Hinton observed this context drift; to deal with it, he proposed normalizing the data.

On one extreme, you can use individual linear units as the models, and on the other extreme, you can combine all units into a single neural network and treat that as a module.

So no, this does not open any new design space; it's an old idea, worsened and wrapped in fancy words and post-facto reasoning.

If you are curious how a linear layer is an ensemble, observe that each vector is its own linear estimator, making the linear mapping an ensemble of estimators.
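
A small sketch of the reading above, where each block has its own objective and optimizer and the second block only ever sees a detached copy of the first block's output; sizes and the local loss are placeholders, not taken from the paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    A = nn.Linear(10, 20)
    B = nn.Linear(20, 2)
    opt_a = torch.optim.SGD(A.parameters(), lr=0.01)
    opt_b = torch.optim.SGD(B.parameters(), lr=0.01)

    x = torch.randn(32, 10)
    y = torch.randint(0, 2, (32,))

    h = torch.relu(A(x))
    loss_a = h.pow(2).mean()          # placeholder local objective for A
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # B trains on A's detached output, so no gradient ever reaches A from B
    loss_b = F.cross_entropy(B(h.detach()), y)
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()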


Yeah, no. Reading the paper I don't really see anything but a superficial resemblance to stacking. Hinton was active back when Wolpert introduced stacking and I'm fairly sure he is aware of it. If anything it much more closely resembles his own prior work in Boltzmann machines, unsurprisingly (and which he cites), or even his prior work on capsules. I don't know if this will really pan out into anything that different or useful for the field, but it's unfair and inaccurate to dismiss it as derivative of stacking.


A single linear layer is, for all intents and purposes, equivalent to running an ensemble of linear estimators. If gradients are not allowed to flow between two layers A and B when computing (B . f . A)(x), with f a non-linearity, then the second layer is an ensemble of linear estimators of the outputs of the first, and for all intents and purposes the output of (f . A)(x) is just preprocessing for B.

Since gradients don't flow from B to A in (B.f.A)(x), A is trained independently of B, meaning that the training distribution of B changes without B influencing it, i.e. context drift. B doesn't know the difference, and B doesn't influence it.

For all intents and purposes, you could compute all of A's outputs as its training happens (or train A to completion first), then feed them into B, and B would still compute the same outputs and derivatives as it did before.

To deal with context drift, Hinton proposes normalizing the data, so the distribution does not change significantly.

Whatever he proposed is not "backprop-free" either. It still involves backprop, but the number of layers gradients flow through is 1, the layer itself.

The argument that you can still train through non-differentiable operations is not particularly convincing either; the reparameterization trick shows that it is trivial to pass gradients through non-differentiable operations if we are smart about it.

Given a non-differentiable operator Z: R^N -> R^N and linear layers A, B, C: R^N -> R^N, the composition B(Z(C(x)) * A(C(x))) allows gradients to flow through B and A all the way to C. The output of Z is, for all intents and purposes, a Hadamard product with (A . C)(x) that is constructed at runtime and might as well be part of the input.

You can even run Z(C(x)) through a neural network and learn how to transform it, and still provide useful and informative gradients back to C(x) via (A . C).
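
A sketch of that construction in PyTorch; the dimensions and the choice of Z are made up:

    import torch
    import torch.nn as nn

    N = 8
    A, B, C = nn.Linear(N, N), nn.Linear(N, N), nn.Linear(N, N)
    x = torch.randn(4, N)

    cx = C(x)
    z = torch.sign(cx).detach()       # stand-in for a non-differentiable operator Z
    out = B(z * A(cx))                # B(Z(C(x)) * A(C(x)))
    out.sum().backward()

    # gradients reach C through the A branch even though Z blocks them
    print(C.weight.grad is not None)  # True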


I'm not sure what the main point is here. The paper is definitely sketchy on details, and the main idea is definitely simple enough to resemble a lot of other work. I wouldn't be surprised if someone (maybe a certain Swiss researcher) comes out and says, actually, this is the same as this other paper from the early 90s. If you squint hard enough a lot of ideas (especially simple ones) can be seen as being the same as other, older ideas. I'm not too interested in splitting those hairs, really. I'm more curious on whether this eventually leads to something that sets it apart from the SOTA in some interesting way.


My claim is that this work is simply worse ensembles wrapped in biologically inspired claims, and that the arguments the author makes in its favor over other approaches are simply not sound.

Looked at from that perspective, the issues with the approach become evident, and in my opinion they are fundamental.


You're accusing one of the foundational figures of modern AI of being either a fraud or incompetent. At best that seems short-sighted, no?


You're obviously new to Hacker News :-D


Unfortunately, I've been here for a decade in one form or another. Every now and then someone writes something so pompous that I just can't help myself but post. Back to lurking now. Cheers!


If we can't publicly scrutinize people who have great sway in the industry, what does that say about us as a research community?

The fact that I argued why I found it bogus based on well-established principles, and got shit on by people who have provided nothing to this conversation except suppressing criticism or throwing ad-hominems, should tell you all about the quality of the discourse.

Dismissing criticism, not by arguments, but by the mere name of the person does a disservice to everyone.

If the research can't stand on its own, independent of the author, then it is not good research.


We are all looking forward to your research paper that disproves his claims. Or you know, any proof.


I argued for it and all I got was downvoted without criticism of the substance of my arguments, only ad-hominems and fallacies.

If you can point to _fundamental_ criticism of my arguments, and not fallacies or attacks, I'd be more than happy to discuss them.


The hard dismissal with 'No' is likely why you got downvoted. I am not able to do that.

With that kind of tonal promise, especially considering the source you are dismissing outright is important in their field, you have to show, not just tell.

If you just left that No out, and gave room for the chance that you are wrong, people wouldn't downvote, they'd upvote. People like to hear smart arguments. No one wants to hear outright dismissal. Especially of known experts.


I am neither the first nor the last to believe that the Laureates have not done their due diligence properly with respect to citing sources.

I could name many other people who have actually been more influential in the field.


That's a lot less cheaty, biologically speaking, than full backprop. This Hinton guy sounds like he knows what he's talking about.

"Context drift as data distribution changes" sounds a hell of a lot like real life to me.

Normalized = hedonic treadmill on long view

At the micro scale, data that overflows the normalization is stored in emotional state, creating an orthogonal source of truth that makes up for the lack of full connected learning.


[flagged]


A lot can change in 60 days.


They do have a master's degree according to the post.

The claims made are not that deep for researchers.


The use of the word masters is now considered not cool according to Stanford... :-)


It's not like I have been doing many things outside reading papers and books over the past few years... The post that person used as an ad-hominem even says so.


Ad-hominems are not a particularly nice way to argue about correctness of a claim.


Maybe you know something I don't, but I believe the comment you are replying to was a compliment and not a sarcastic dig.


This person has made a similar remark and purposefully looked into my post history - which doesn't make any claims about my knowledge or skills - lied about it, and made a sarcastic remark, see "everything".


This is an interesting approach, and I have read that it is closer to how our brains work.

We extract learning while we are taking in the data, and there seems to be no mechanism in the brain that favors a backprop-like learning process.


Fact: Geoffrey Hinton has discovered how the brain works. Every few years actually.


Once a year for the last 30 years.


yeah, whatever happened to capsule networks?


Capsule networks were conceptually an early attempt at transformers.


Hinton's networks become the neuron of novel networks. It is important to know that these types of weights don't learn features, they map a compressed representation of the learned info, which is the input. Classification through error correction. That is actually what labels do for supervised learning (IOW they learn many ways to represent the label, and that is what the weights are). Modern AI do that plus learn features, but the weights are nevertheless a representation of what was learned, plus a fancy way to encode and decode into that domain.

What Hinton and Deepmind will do is use neural-network learned-data, or perhaps the weights, as input to this kind of network. In other words, the output of another NN is labeled a priori, ergo you can use it "unsupervised" networks, which this research expounds. This will allow them to cook the input network into a specific dish, by labels even. Now give me my phd.

edit: edit


There's an open source implementation of the paper in pytorch https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler... by @diegofiori_

He also wrote an interesting thread on the memory usage of this algo versus backprop https://twitter.com/diegofiori_/status/1605242573311709184?s...


Direct link to an implementation on GitHub: https://github.com/nebuly-ai/nebullvm/tree/main/apps/acceler...

--

The popular-science title is almost an understatement: the Forward-Forward algorithm is an alternative to backpropagation.

Edit: sorry, the previous formulation of the above in this post, relative to the advantages, was due to a misreading. Hinton writes:

> The Forward-Forward algorithm (FF) is comparable in speed to backpropagation but has the advantage that it can be used when the precise details of the forward computation are unknown. It also has the advantage that it can learn while pipelining sequential data through a neural network without ever storing the neural activities or stopping to propagate error derivatives....The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning


What exactly is the negative data? Seems like it's just scrambled nonsense (aka what the truth is not)


I think it is scrambled nonsense but it's scrambled in a way that still makes it look like a plausible sample. I remember watching a video of Hinton saying that just using white noise or similarly randomized data does not work but I'm now forgetting why.


@dang

Meta question on HN implementation: why does submitting a previously submitted resource sometimes link automatically to the previous discussion, while other times it is considered a new submission?


As far as I know it's a simple string match on the url. If the url is different (for example a new anchor tag is added) then it's considered a new submission.


If you click on "past" under this submission, you see two identical URLs:

https://hn.algolia.com/?query=Geoffrey%20Hinton%20publishes%...


Which is odd, because I checked in the minutes following the submission and I remember "[past]" did not return anything.


I believe one factor is the amount of time between the submissions.


Having read through the forward-forward paper, it feels like it's Oja's rule adapted for supervised learning but I can't really articulate why...


It’s incredible to think that dreams are just our brains generating training data, and lack of sleep causes us to overfit on our immediate surroundings.


We tend to start hallucinating when we don't have enough sleep. So generating training data is necessary, but way safer when our muscles are turned off.


Thanks for this little comment thread folks!

This makes me cheerful because it suggests a way that studying systems which appear intelligent might be able to teach us more about how human intelligence works.


in addition: during pre-electricity times, humans woke up after 4 hours of sleep, stayed awake for some time, and then continued to sleep. My guess is that this sleep pattern is better for learning.


It's called biphasic sleep for people that want to read up on it.

> My guess, this sleep pattern is better for learning.

That might be true. One of the techniques to induce lucid dreaming works similarly – sleep for 4-5 hours, wake up, stay awake for 15-60mins then go back to sleep. It's called "wake back to bed" technique. Many lucid dreamers report increased capacity for learning in dreams.


> during pre-electricity times, humans woke up after 4 hours of sleep, stayed awake for some time, and then continued to sleep.

The confusing thing about this claim is: what did people actually do during this time, given only bad (and expensive!) lighting?


Thinking, talking to peers. Smartphones are a recent invention, and for a long time night was a dangerous time.


We definitely do not know nearly enough to say anything like that with confidence.

Most of the "training process" of our brain likely occurred prior to our birth, in the evolutionarily optimized structure of the brain.


Unlikely. The human genome comprises only billions of bits, much of which is low-information repetition. The amount of information sensed over a lifetime is vastly greater. To sense less than a billion bits over a 30-year development period (roughly 10^9 seconds) would imply less than one bit per second, and we clearly perceive more than one bit per second. For this reason, it seems likely that more information comes from learning post-birth than is pre-conditioned by evolution pre-birth. (Though of course post-birth learning cannot take place without the fantastic foundation set by evolution.)


> The human genome comprises only billions of bits, much of which is low-information repetition.

We constantly find out that certain things are actually really important even though we thought they were junk. Recall that our best way to test genes is to knock them out one by one and try to observe the effect.

The brain is composed of many extremely specialized subsystems and formulas for generating knowledge. We don't know English at birth, sure, but we do have a language-processing capability. The training baked into the brain is a level of abstraction higher, establishing frameworks to learn other things. It may not be as heavy in stored data, but it's much harder to arrive at, and it is the bulk of the learning process (learning to learn).


Isn't this similar to how GANs learn? Edit: yes, there is a small section in the paper comparing it to GANs.


Interesting, my first take was that it's like contrastive divergence in Restricted Boltzmann Machines (RBMs). There's also a section on that.


It seems that the point is that the objective function is applied layer-wise; it still computes gradients to get the update direction, it's just that gradients don't propagate to previous layers (detached tensors).

As far as I can tell, this is almost the same as stacking multiple layers of ensembles, except worse as each ensemble is trained while previous ensembles are learning. This is causing context drift.

To deal with the context drift, Hinton proposes to normalise the output.

This isn't anything new or novel. Expressing "ThIs LoOkS sImIlAr To HoW cOgNiTiOn WoRkS" to make it sound impressive doesn't make it impressive or good by any stretch of the imagination.

Hinton just took something that existed for a long time, made it worse, gave it a different name and wrapped it in a paper under his name.

With every paper I am more convinced that the Laureates don't deserve the award.

Sorry, this "paper" smells from a mile away, and the fact that it is upvoted as much shows that people will upvote anything if they see a pretty name attached.

Edit:

Due to the apparent controversy of my criticism, I can't respond with a reply, so here is my response to the comment below asking what exactly makes this worse.

> As far as I can tell, this is almost the same as stacking multiple layers of ensembles

It isn't new. Ensembling is used and has been used for a long time. Kaggle competitions are routinely won with ensembles and even ensembles of ensembles. It is a well-studied field.

> except worse as each ensemble is trained while previous ensembles are learning.

Ensembles exhibit certain properties, but only if they are trained independently of each other. This is well studied; you can read more about it in Bishop's Pattern Recognition book.

> This is causing context drift.

Context drift occurs when a distribution changes over time. This changes the loss landscape, which means the global minima move.

> To deal with the context drift, Hinton proposes to normalise the output.

So not only is what Hinton built a variation of something that already existed; he made it worse by training the models simultaneously, and then, to handle the fact that it is worse, he adds additional computation to deal with that issue.


Since you seem to understand what he is saying, can you explain to me what the per-layer objective function looks like?

I don't get what he means by inserting the label into the input and what labels he is using per layer.


You train the network to detect correlations between the values of the first ten pixels and the rest of the image. Imagine you have a bunch of images of digits. For images with the digit three you set the third pixel to white, for images of the digit four you set the fourth pixel to white, and so on (actually, with zero-indexing it's the fourth and fifth pixel for digits three and four, but whatever). The other nine pixels among the first ten you set to black. These are positive samples, and training the network with them will make it output a big number when it encounters them. Then you swap the pixels so that the images with the digit three have the fourth pixel set to white and the images with the digit four have the third pixel set to white. These are negative samples, and they cause the network to output a small number. Thus, the only difference between positive and negative samples is the location of the white pixel. So for an image you want to classify, you run it through the network ten times, each time shifting the location of the white pixel. The location for which the network outputs the biggest number is the predicted class.

Obviously, this method is problematic if you have thousands of labels or if your network is not a classifier.
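
A sketch of that procedure; the 784-pixel MNIST-style input and squared activity as the "goodness" score are assumptions on my part:

    import torch
    import torch.nn as nn

    def overlay(images, labels):                  # images: (B, 784), labels: (B,)
        x = images.clone()
        x[:, :10] = 0.0                           # blank the first ten pixels
        x[torch.arange(x.size(0)), labels] = 1.0  # set the label pixel to white
        return x

    def predict(net, images):
        goodness = []
        for lbl in range(10):
            fill = torch.full((images.size(0),), lbl, dtype=torch.long)
            h = net(overlay(images, fill))
            goodness.append(h.pow(2).sum(dim=1))  # squared activity as "goodness"
        # the label whose overlay produces the biggest output wins
        return torch.stack(goodness, dim=1).argmax(dim=1)

    # usage: predict(nn.Sequential(nn.Linear(784, 500), nn.ReLU()), torch.randn(32, 784))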


A model is a function F that minimizes error in y_i = F(X_i) + error. Inserting a label simply means a function F(X_i, y_j). Then you optimize it in some way to separate true labels from false labels, e.g. F(X_i, y_j) = (y_i == y_j) - (y_i != y_j) + error.


This comment is a lot of words to say "I don't like it" without giving any reason to believe you.

If it's not new or novel, why aren't people using it? If it's bad, what's wrong with it?


It performs worse than backprop.


So the initial “hey, this might be a good idea” implementation performs slightly worse than something that has had literally billions of dollars thrown at it?


The question was "If it's not new or novel, why aren't people using it?". For example, take a look at this paper: https://arxiv.org/abs/1905.11786. It was published 3 years ago, it also does parallel layer-wise optimization, it even talks about being inspired by biology, though the objective function is different from Hinton's. Why aren't people using it? Because it performs worse. It is that simple. Is it an interesting area to explore? Probably. There are millions of interesting areas to explore. It doesn't mean it is worth using, at least yet.


It is slower than the plain backprop that we have used for decades now.

No comparisons to AdamW were made.

In fact, this algorithm uses backprop at its core, but propagating through 0 layers.


Very significantly worse.


Is it slower and less accurate? Or just slower?


It is far less accurate compared to SOTA models. The paper says it should train faster, but it doesn't provide any metrics; so it's hard to make any "pound for pound" comparisons.


That's the same impression I had. I was afraid I am not getting something or missing a bigger picture. I am glad I am not the only one who feels that way.


Is the derivative calculated by forward-forward as analytic as in backpropagation?


Is this a game changer?


This is old news already.


Quantum computing would absolutely change everything in the DL/ML space.


It turns out there’s a whole subfield for quantum ML. I don’t know much about it, but it’s neat that there’s any applicability. It’s not obvious that there was any connection.


Most problems are less computationally intensive than one would think


I remember a long time ago there was a collection of Geoffrey Hinton facts. Like "Geoffrey Hinton once wrote a neural network that beat Chuck Norris."


I saw the same for Jeff Dean



