Hacker News
Gradients are not all you need (arxiv.org)
163 points by bundie on April 23, 2023 | 68 comments



Seems to me like the whole history of neural nets is basically crafting models with well-behaved gradients to make gradient descent work well. That, and models that can achieve high utilization of available hardware. The surprising thing is that models exist where the gradients are so well-behaved that we can learn GPT-4 level stuff.


There's plenty of interesting neural network designs out there, but they're being overshadowed by transformers due to their recent success. I personally think that the main reason transformers work so well is that they actually step away from the multi-layer perceptron stuff and introduce some structure and, in a way, sparsity.


Also, multi-head attention strikes me as being about as close to how language semantics seems to actually work in human brains as I've seen.

Lots of caveats there, of course. First off, I don't know much about the neurology; I just have an amateur interest in second language acquisition research that sometimes brings me into contact with this sort of thing. On the ANN side, which is closer to my actual wheelhouse, we definitely don't have any way of knowing whether the actual mechanism is all that close, and I'm guessing it probably isn't, since ANNs don't actually work that similarly to brains. Nor does it need to be. But, intuitively, there's still something promising about an ANN architecture that's vaguely capable of mimicking the behavior of modules in an existing system (human brains) that's well known to be capable of doing the job. I'm not super wild about the bidirectional recurrent layers, either, because they impose some restrictions that clearly aren't great, such as the hard limit on input size, et cetera. But it still strikes me as another big step in a good direction.


I'm currently working on a variation of a spiking neural network that learns by making and purging connections between neurons, which so far has been pretty interesting, though I'm having a hard time getting it to output anything more than just the patterns it recognised. I did play around with adding its outputs to the input list, making it sort of recurrent, but it's practically impossible to decode anything that's going on inside of the network. I'm thinking of tracing the inputs through it to see what it's doing right now; might be interesting to see it generate some sort of tree-like structure.
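For anyone curious what "making and purging connections" can look like in the simplest possible form, here's a toy sketch (not my actual network; the update rules and thresholds are made up for illustration): co-active neuron pairs grow a connection, and connections that stop carrying activity get purged.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 32
    w = (rng.random((n, n)) < 0.1).astype(float) * 0.5   # sparse random initial wiring
    use = np.ones((n, n))                                # recency-weighted usage of each edge

    def step(x, grow=0.2, decay=0.95, prune_below=0.05, thresh=0.5):
        """Propagate activity, grow/strengthen co-active edges, purge stale ones."""
        global w, use
        a = (x @ w > thresh).astype(float)    # binary post-synaptic activity
        co = np.outer(x, a)                   # 1 where pre and post both fired
        w += grow * co                        # make / strengthen co-active connections
        use = decay * use + co                # usage decays unless the edge keeps firing
        w[use < prune_below] = 0.0            # purge connections that have gone stale
        return a

    x = rng.integers(0, 2, size=n).astype(float)
    for _ in range(200):
        x = step(x)
    print("surviving connections:", int((w > 0).sum()))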


Are you familiar with the edge popup algorithm introduced in "What's Hidden in a Randomly Weighted Neural Network?" https://arxiv.org/abs/1911.13299v2

Seems relevant to what you're working on. It starts with a randomly initialized, overparameterized neural net, but instead of training the weights with gradient descent, it learns which connection edges to keep and which to delete.
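As I understand the paper, the weights stay frozen at their random init and only a per-edge score is trained (still with a gradient, via a straight-through trick); the top-scoring edges survive and the rest are effectively deleted. A rough numpy sketch of that score-and-mask idea on a single linear layer (the layer sizes, target, and learning rate here are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, keep = 8, 4, 0.5                 # keep the top 50% of edges

    W = rng.standard_normal((d_in, d_out))        # weights frozen at random init
    scores = rng.standard_normal((d_in, d_out))   # per-edge scores: the only trained quantity

    def topk_mask(s, frac):
        """1 for edges whose score is in the top `frac` fraction, 0 elsewhere."""
        cutoff = np.quantile(s, 1.0 - frac)
        return (s >= cutoff).astype(float)

    x = rng.standard_normal((32, d_in))
    y = x @ rng.standard_normal((d_in, d_out))    # toy regression target

    lr = 0.1
    for _ in range(200):
        m = topk_mask(scores, keep)
        err = x @ (W * m) - y                     # forward pass uses only surviving edges
        # straight-through: gradient w.r.t. scores pretends the hard top-k mask isn't there
        scores -= lr * (x.T @ err / len(x)) * W   # W itself is never updated
    print("final loss:", float((err ** 2).mean()))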


I haven't read it, thanks a lot! I'm probably going to use it in an essay I'm writing about the topic.


That's probably true for most kinds of NN architectures, including convolutional layers and older recurrent architectures (LSTM, etc). Fully connected networks do not seem to be a necessary, and certainly not an efficient, way to represent the mechanisms that operate in the "real world", so clever ways to make the networks sparse are an important key.

But it's equally important to create architectures that allow efficient backpropagation of errors.

It does seem like transformers are pretty good at both, already.

I kind of hope we're not getting something radically better anytime soon, because it seems like AGI is already approaching faster than we can prepare for.

Then again, I would expect that someone somewhere is already using transformer based networks to develop some brand new architecture that does in fact provide such a leap.


>There's plenty of interesting neural network designs out there

Where could a person learn more about these?


It's less about enumerating the architectures that have been tried before, and more about recognizing the modularity of NN components and the different perspectives on what those modules might represent.


> gradients are so well-behaved that we can learn GPT-4 level stuff

What are "well-behaved" gradients?

What type of GPT-4 level stuff?


> Gradients Are Not All You Need

Sometimes you need to peek at the Hessian.

Seriously though, what is intelligence if not creative unrolling of the first few terms of the Taylor expansion?


Of what?


`a_0 + a_1 x + a_2 x^2 + ...`


That's the expansion. What does it represent?


Given the context of the original statement, the Taylor series expansion represents the statistical learning process of any topic anywhere.


I think you mean loss function, not process. And I'm failing to make any connection between intelligence and polynomial approximations to loss functions.
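For concreteness, this is what a low-order polynomial (Taylor) approximation of a loss looks like on a toy 1-D example (the function and expansion point are arbitrary); I still don't see where intelligence comes into it:

    import numpy as np

    def loss(w):                                  # toy nonconvex "loss"
        return np.sin(3 * w) + 0.5 * w ** 2

    def grad(w):                                  # d loss / dw
        return 3 * np.cos(3 * w) + w

    def hess(w):                                  # d^2 loss / dw^2
        return -9 * np.sin(3 * w) + 1.0

    w0 = 0.4
    for dw in (0.01, 0.1, 0.3):
        exact = loss(w0 + dw)
        first = loss(w0) + grad(w0) * dw          # gradient-only view
        second = first + 0.5 * hess(w0) * dw ** 2 # plus the peek at the Hessian
        print(f"step {dw}: exact {exact:.4f}, 1st-order {first:.4f}, 2nd-order {second:.4f}")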


Intelligence is rather ill defined which I suppose is why you are having difficulties.


Yes.

(And it goes down all the way)


No that's turtles.


Global optimization techniques which don't rely on gradients seem theoretically superior in all instances, except that we haven't found super fast ways to run these kinds of optimizers.

The cartpoll demo famously tripped up derivative-based reinforcement learning for a while.


> Global optimization techniques which don't rely on gradients seem theoretically superior in all instances, except that we haven't found super fast ways to run these kinds of optimizers.

Did you mean "Global optimization techniques which do rely on gradients..."? Because exact gradient-based global optimization (GBD or branch-and-bound based) methods for general nonconvex problems are theoretically superior (bounding with McCormick relaxations etc.) but also more challenging to practically deploy than say stochastic methods or metaheuristics like local search.
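For anyone who hasn't seen the deterministic flavor, here's a toy 1-D branch-and-bound sketch; a simple Lipschitz lower bound stands in for the McCormick-style relaxations, and the test function and constant are made up:

    import heapq
    import math

    def f(x):                                     # toy nonconvex objective on [0, 6]
        return math.sin(5 * x) + 0.1 * (x - 2) ** 2

    L = 6.0                                       # valid Lipschitz constant: |f'| <= 5 + 0.2*4

    def lower_bound(a, b):
        # f(x) >= f(a) - L*(x - a) and f(x) >= f(b) - L*(b - x) on [a, b]
        return 0.5 * (f(a) + f(b)) - 0.5 * L * (b - a)

    best_x, best_f = 0.0, f(0.0)
    heap = [(lower_bound(0.0, 6.0), 0.0, 6.0)]
    while heap:
        lb, a, b = heapq.heappop(heap)
        if lb > best_f - 1e-4:                    # nothing left can beat the incumbent
            break
        m = 0.5 * (a + b)
        if f(m) < best_f:
            best_x, best_f = m, f(m)
        heapq.heappush(heap, (lower_bound(a, m), a, m))
        heapq.heappush(heap, (lower_bound(m, b), m, b))

    print(f"certified global min on [0, 6]: f({best_x:.4f}) = {best_f:.4f}")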


Could you point me toward any more info or keywords on "the cartpoll demo famously tripped up derivative-based reinforcement learning for a while"? It sounds like an interesting bit of history, but my Google searches are only bringing up Python tutorials.


My guess: the cart pole is an inverted pendulum, and requires multiple left-right-swing-up movements to bring it from the “hanging” position to the “standing” position. Finding this action sequence using gradients of “where is the tip” vs “where should it be” is very hard, as swinging the pendulum to the left and right goes against the gradient.

Instead, using stochastic gradient approximations (policy gradient method such as proximal policy optimization) might be better suited to solving these kinds of problems. Effectively, they do not compute the exact gradient locally, but rather kind of a global approximation by trying out random sequences of actions and determining which of them are closest to the desired outcome.

Hence, stochastic gradient approximations might be considered some kind of hybrid between greedy local optimization (such as following the exact gradient) and global optimization.
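The simplest version of that idea, stripped of everything PPO adds on top, is a score-function / evolution-strategies estimate: perturb the parameters randomly, see which perturbations do better than the baseline, and average. A toy sketch, with a made-up objective standing in for "return of one rollout":

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout_return(theta):                    # stand-in for "return of one rollout"
        return -np.sum(np.sin(theta) + 0.1 * theta ** 2)

    theta = rng.standard_normal(5)
    sigma, n_samples = 0.1, 500

    # weight each random direction by how much it improved on the unperturbed baseline
    eps = rng.standard_normal((n_samples, theta.size))
    baseline = rollout_return(theta)
    g_hat = np.mean(
        [(rollout_return(theta + sigma * e) - baseline) * e for e in eps], axis=0
    ) / sigma

    g_exact = -(np.cos(theta) + 0.2 * theta)      # analytic gradient of the toy objective
    print("estimate:", np.round(g_hat, 2))
    print("exact:   ", np.round(g_exact, 2))

The estimate converges to the gradient of a smoothed version of the objective, which is exactly why it can step "against" a deceptive local gradient.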


Sorry, I meant "Mountain Car", not cartpoll - https://www.gymlibrary.dev/environments/classic_control/moun...

The reason for this is that the algorithm doesn't like to have to "spend" energy, which reduces its score. Without huge amounts of trickery to get the gradient descent algorithm to stop getting stuck in the center, this is never solved, because a local optimizer is being used for a global optimization problem (finding good weights in an NN).
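A toy version of why "do nothing" is a local optimum there (numbers are made up, but shaped like the continuous mountain-car reward: a per-step penalty on action effort plus a big bonus only if the goal is reached):

    def episode_return(effort, reached_goal, steps=200, effort_cost=0.1, goal_bonus=100.0):
        # made-up reward shape: pay for effort every step, get a bonus only at the goal
        return -effort_cost * effort ** 2 * steps + (goal_bonus if reached_goal else 0.0)

    print(episode_return(0.0, False))   # sit still:                    0.0
    print(episode_return(0.5, False))   # flail but never reach goal:  -5.0
    print(episode_return(1.0, True))    # full swing-up to the flag:    80.0

Any small increase in effort makes the return strictly worse until the swings are big enough to reach the flag, which is exactly the kind of landscape a local optimizer parks in.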


Chatgpt:

What does this refer to: cartpoll demo famously tripped up derivative based reinforcement learning

The phrase "cartpole demo famously tripped up derivative-based reinforcement learning" is likely referring to a classic problem in the field of reinforcement learning, which involves balancing a pole on a cart. The pole is attached to the cart via a hinge, and the goal is to keep the pole upright by moving the cart left or right in response to its angle. This problem is often used as a benchmark for testing reinforcement learning algorithms.

The phrase suggests that derivative-based reinforcement learning algorithms, which rely on computing gradients of a function with respect to its parameters, were not successful at solving this problem. This could be due to the fact that the problem is highly non-linear and requires precise control, which may be difficult to achieve with gradient-based methods.

Edit: Bard got it too, with more detail, which is surprising.


You should be really careful with asking this kind of question to ChatGPT, because now you think you've learned the answer, but in fact there are two options with very different outcome:

- ChatGPT was trained on a corpus of data containing the answer and is able to give you a decent answer

- ChatGPT was never exposed to the answer and will hallucinate a plausible-sounding response, and because it will answer in a really convincing way, you'll get tricked into believing complete bullshit



Interesting paper, thanks for bringing this up! I have been working on methods for trajectory optimization using both analytic gradient computations and black-box stochastic gradient approximations (proximal policy optimization).

I was always wondering about a question that is touched on in the paper: despite the analytic gradient computation being intuitively more efficient and mathematically correct, it is much harder to learn a policy with it than with the “brute force trial-and-error” black-box methods.

This paper brings many new perspectives on why.
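The failure mode the paper is about can be reproduced in a few lines: differentiate through an unrolled chaotic system and the gradient magnitude explodes with the horizon. A toy stand-in for a real simulator, using the logistic map and a hand-rolled forward-mode derivative:

    # Toy dynamics: the logistic map x_{t+1} = r * x_t * (1 - x_t), chaotic at r = 3.9.
    # The chain rule gives d x_T / d r, propagated forward alongside the state.
    def grad_through_rollout(r, x0=0.3, T=100):
        x, dx_dr = x0, 0.0
        for _ in range(T):
            dx_dr = x * (1 - x) + r * (1 - 2 * x) * dx_dr   # differentiate using the old x
            x = r * x * (1 - x)
        return dx_dr

    for T in (10, 25, 50, 100):
        print(T, grad_through_rollout(3.9, T=T))            # magnitude grows roughly exponentially in T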


> chaos based failure mode

I studied this in undergrad, but it’s not the same thing the paper is talking about


My one wish is that machine learning papers would use titles that actually described what the paper was about. I suppose there is a certain 'evolutionary pressure' where clever titles 'outcompete' drier, more descriptive titles (or it seems that way). But I don't like it.


Remember the days after Rowhammer and Heartbleed, when every new security vulnerability needed its own catchy name and domain name website? This is the science version of that.

Branding is eating the world.


It's inescapable.

40 years ago, when articles went into print publications, you'd just get your paper into a key print journal and then trust that everyone who gets it would at least look through the article headlines and read the abstracts of articles that seemed relevant to them. And it was manageable because you'd only have a few new issues rolling in per month.

But arXiv had an average of 167 CS papers being submitted per day in 2021. An academic who wants to keep their career alive needs to resort to every trick in the book to be heard above that din.


Isn't this more a problem of insufficient curation? There are more papers now than ever, but the signal-to-noise ratio is probably getting worse and worse.


How does one, practically speaking, curate article publication in a world with arXiv?

Curation kind of requires gatekeepers, and good gatekeeping is a full-time job that needs to be paid for. In the days of print journals, there was both a motivation, because you can only print so many articles per month due to materials costs, and a steady revenue stream to pay for it, because everyone has to pay for the journals to get any access to articles at all. The Internet's been eroding both of those factors for a good quarter century now.


Hmm, sounds like a good use case for the ChatGPT API.


> Remember the days after Rowhammer and Heartbleed, when every new security vulnerability needed its own catchy name and domain name website? This is the science version of that. Branding is eating the world.

Sorry, are you referring to brandpocalypse?


> Branding is eating the world.

Not realizing that it always has is why it's still effective.

History is written by the victors, and victors tend to have the best marketing.


Is branding the concept in play? I don’t see it being a very good metaphorical fit in comparison to my suggestion below. Branding requires a common theme over multiple things (such as products or events).

I think we’re seeing something closer to title optimization in the service of marketing. But even marketing isn’t a great fit.

Upon reflecting on the Wikipedia definition of marketing, it strikes me that title optimization is only a tiny part of what “academic research marketing” could be. In the most generous sense, it could mean reaching out to media sources fairly early in the research/writing process to help craft a paper that will be interesting to the desired audience.

Wikipedia quotes:

> Marketing is the process of exploring, creating, and delivering value to meet the needs of a target market in terms of goods and services; potentially including selection of a target audience; selection of certain attributes or themes to emphasize in advertising; operation of advertising campaigns; attendance at trade shows and public events; design of products and packaging attractive to buyers; defining the terms of sale, such as price, discounts, warranty, and return policy; product placement in media or with people believed to influence the buying habits of others; agreements with retailers, wholesale distributors, or resellers; and attempts to create awareness of, loyalty to, and positive feelings about a brand.

> A brand is a name, term, design, symbol or any other feature that distinguishes one seller's good or service from those of other sellers. Brands are used in business, marketing, and advertising for recognition and, importantly, to create and store value as brand equity for the object identified, to the benefit of the brand's customers, its owners and shareholders.

> Personal branding is the conscious and intentional effort to create and influence public perception of an individual by positioning them as an authority in their industry, elevating their credibility, and differentiating themselves from the competition, to ultimately advance their career, widen their circle of influence, and have a larger impact.


Your comment reminded me of this talk [1] by David Duvenaud, where he discusses his work on Neural ODEs, introduced in the paper [2] titled "Neural Ordinary Differential Equations" (this was one of the best papers at NeurIPS 2018). Apparently the original name of the paper was "Training Infinitesimally-Layered Neural Networks by Backpropagating through Black-box ODE Solvers". I found that to be a much better description of the paper. It seems he thinks so too, but eventually the catch-phrase-y title won out. In the video [1], this bit is at 00:48-01:22.

PS: [1] is a good talk in general where he discusses some of the limitations of the paper and things that could have been done better.

[1] https://www.youtube.com/watch?v=YZ-_E7A3V2w

[2] https://papers.nips.cc/paper_files/paper/2018/file/69386f6bb...


> "Understanding Limitations and Chaos-Based Failures in Gradient-Based Optimization Methods." (gpt4)

Fixed with ML.


I’ve provided the PDF URL to ChatGPT 4.0 and asked it to summarize the article with alternative titles, but for some reason it keeps getting the original title, authors, abstract, and body wrong. What prompt did you use?


That's because ChatGPT doesn't have the ability to retrieve PDFs from the internet. (Unless maybe you have early access to the version with plugins?)

Bing chat does have the ability to read opened PDFs when used from the Edge side bar.


I asked a bunch of things. So initially I posted the abstract and title, then prompted just: "Explain". Then I asked "Give me background knowledge on spectrum of the Jacobian" and "Explain the title of the paper" and in the end "Reformulate the title in a more explicit manner". Maybe you can skip directly to the last prompt.


There's quite a long history of these titles, even before ML:

"Sometime’ is Sometimes ‘Not Never’" - Lamport

"Reflections on trusting trust" - Thompson

"On the cruelty of really teaching computer science" - Dijkstra

I'm sure there are more


The clever titles are more brandable. See the citations for "Attention is all you need" or "Chinchilla limit" versus more mundane titles.


I'm pretty sure Attention is all you need would have gotten a lot of citations even if it had a "boring" title. It was a groundbreaking paper with lots of good ideas.


The Chinchilla paper was called "Training Compute-Optimal Large Language Models", which is exactly on point.


I mean, it would be slightly more accurate if it had been “Compute-Optimal Training of Large Language Models”, since the models so-trained aren’t the thing that is compute-optimal, the training is compute-optimal.

But… yeah, it's hardly a title chosen for marketing rather than description.


Interestingly, the models themselves also have compute cost associated with them, namely the cost of running them (inference). The smaller a model is, the less compute it needs for inference. Which is an interesting trade-off, because you can overtrain (rather than undertrain, as in the past) a model, i.e. use more tokens, fewer parameters, and ultimately more compute than optimal during training, to get lower inference cost.

https://www.harmdevries.com/post/model-size-vs-compute-overh...

This website has an interesting graph which visualizes this:

> For example, the compute overhead for 75% of the optimal model size is only 2.8%, whereas for half of the optimal model size, the overhead rises to 20%. As we move towards smaller models, we observe an asymptotic trend, and at 25% of the compute-optimal model size, the compute overhead increases rapidly to 188%.

So if you train your model you shouldn't just look at compute-optimal training but try to anticipate how much the model will probably be used for inference, to minimize total compute cost. Basically, the less inference you expect to do with your model, the closer the training should be to Chinchilla-optimality.
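A back-of-the-envelope version of that trade-off, using the usual rules of thumb (training FLOPs of roughly 6*N*D, inference FLOPs of roughly 2*N per token, Chinchilla-optimal data of roughly 20 tokens per parameter; all of these are rough approximations):

    def total_flops(n_params, train_tokens, inference_tokens):
        # rough rules of thumb: ~6*N*D FLOPs to train, ~2*N FLOPs per inference token
        return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

    train_budget = 6 * 70e9 * (20 * 70e9)          # training compute of a Chinchilla-optimal 70B model
    served_tokens = 1e12                           # hypothetical lifetime inference volume
    for n in (70e9, 35e9, 17.5e9):                 # same training compute, smaller models
        d = train_budget / (6 * n)                 # tokens you can now afford to train on
        print(f"{n / 1e9:5.1f}B params, {d / 1e12:.2f}T train tokens, "
              f"total {total_flops(n, d, served_tokens):.2e} FLOPs")

This ignores the quality gap: per the linked post, matching the bigger model's loss with a smaller one costs extra training compute (the 2.8% / 20% / 188% overhead figures), and the question is whether the inference savings pay for that overhead.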


Stupid question: is training considered a part of the model? Or is this only common parlance for GPT (the P)?


But we all know what you mean when you say “the Chinchilla paper”


That's better than other meme titles, but compute is still not a noun.


Is there a good replacement in noun form?


computing power


There is a lot of jargon in ML; an example is found in the object detection literature, where you will often find sentences like this: "we present a novel simple non-hierarchical feature network backbone with window-shifted attention based learning accommodating neck+regression head fine-tuning or masked cascade RCNN second stage." I'm half joking. Surveys are often a godsend.


The major ML conferences all have pretty tight page limits, so more expository sentences usually get cut. This also means that papers usually only explain how their work is different from previous work, so they assume you are familiar with the papers they cite or are willing to read the cited papers.

This means that people who have up-to-date knowledge of a given subfield can quickly get a lot out of new papers. Unfortunately, it also means that it usually takes a pretty decent stack of papers to get up to speed on a new subfield, since you have to read the important segments of the commonly cited papers in order to gain the common knowledge that new papers are being diffed against.

Traditionally, this issue is solved by textbooks, since the base set of ideas in a given field or subfield is pretty stable. ML has been moving pretty fast in recent years, so there is still a sizable gap between the base knowledge required for productive paper reading and what you can get out of a textbook. For example, Goodfellow et al [1] is a great intro to the core ideas of deep learning, but it was published before transformers were invented, so it doesn’t mention them at all.

[1] https://www.deeplearningbook.org/


You forgot to toss in a few "state-of-the-art"


At some point you run into the problem where titles become useless because there are 100 papers on the same exact topic with very slight variation in the title. At this point people use surnames and date to cite papers.

But then the title can become something catchy that will give you more visibility.


Getting cited is probably the largest individual incentive. To various degrees, authors also want to get their ideas and reputation “out there”. To that end, authors want their work to be (i) skimmed (the abstract at least); (ii) shared; (iii) read. A catchy title almost always helps (right?); there doesn’t seem to be significant downside to catchy titles.

So how do we get more desirable system behavior? It seems we have a collective action problem.


I'd like to imagine quirky titles make them harder for other researchers to find, leading to lower citation counts, but that may be wishful thinking.


I don’t think it makes much difference. If anything, it might help? It’s easier to search for a paper if you remember its name. If not, you can search on author or words in the abstract.

The problem isn’t quirky titles, it’s with websites like Hacker News that only display a headline and not the abstract.


I think this trend is happening because of all the buzz around OpenAI's LLMs and that 2017 Google research paper.


Ironically, attention is what you need to get high citation counts

Whether quirky titles help or hurt with that, I think, depends on name recognition and the publication venue.


This also happens with security vulnerabilities as well (Heartbleed, POODLE, GHOST, Shellshock, FREAK, CRIME, SMACK)


Agree -- Owl talk, of Winnie-the-Pooh childhood fame, sucks.


I asked GPT: "https://arxiv.org/abs/2111.05803 make a good useful title that's not silly for this paper and what it's about"

I dunno if it even read the article lols

====

Title: “Deep Reinforcement Learning for Multi-Agent Navigation in Complex Environments” This paper proposes a deep reinforcement learning approach for multi-agent navigation in complex environments. The proposed method uses a centralized critic and decentralized actor architecture, where each agent has its own policy network and learns from its local observations, while the critic network takes in the global state information to provide a centralized value estimate. The method is evaluated on a variety of benchmark environments and achieves state-of-the-art performance compared to existing methods.


ChatGPT doesn't read URLs on the fly, but it will pretend it does.


lol



