Seems to me like the whole history of neural nets is basically crafting models with well-behaved gradients to make gradient descent work well. That, and models that can achieve high utilization of available hardware. The surprising thing is that models exist where the gradients are so well-behaved that we can learn GPT-4 level stuff.
There are plenty of interesting neural network designs out there, but they're being overshadowed by transformers due to their recent success. I personally think the main reason transformers work so well is that they actually step away from the plain multi-layer perceptron approach and introduce some structure and, in a way, sparsity.
Also, multi-head attention strikes me as being about as close to how language semantics seems to actually work in human brains as I've seen.
Lots of caveats there, of course. First off, I don't know much about the neurology; I just have an amateur interest in second language acquisition research that sometimes brings me into contact with this sort of thing. On the ANN side, which is closer to my actual wheelhouse, we don't have any way of knowing whether the actual mechanism is all that close, and I'm guessing it probably isn't, since ANNs don't work all that similarly to brains. Nor does it need to be. But, intuitively, there's still something promising about an ANN architecture that's vaguely capable of mimicking the behavior of modules in an existing system (human brains) that's well known to be capable of doing the job. I'm not super wild about the bidirectional recurrent layers, either, because they impose some restrictions that clearly aren't great, such as the hard limit on input size, et cetera. But it still strikes me as another big step in a good direction.
I'm currently working on a variation of a spiking neural network that learns by making and purging connections between neurons, which so far has been pretty interesting, though I'm having a hard time getting it to output anything more than just the patterns it recognised. I did play around with adding its outputs to the input list, making it sort of recurrent, but it's practically impossible to decode anything that's going on inside the network. I'm thinking of tracking the inputs around to see what it's doing right now; might be interesting to see it generate some sort of tree-like structure.
Are you familiar with the edge-popup algorithm introduced in "What's Hidden in a Randomly Weighted Neural Network?" (https://arxiv.org/abs/1911.13299v2)?
Seems relevant to what you're working on. It starts with a randomly initialized, overparameterized neural net, but instead of gradient descent backpropagation, it learns by deleting connection edges.
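If it helps, here's a rough toy paraphrase of the core mechanism in PyTorch (my own sketch of my reading of the paper, not the authors' code; the class names are made up). The weights stay frozen at their random initialization, each edge gets a learnable score, and the forward pass only uses the top-k fraction of edges by score. The scores themselves are still updated with backprop via a straight-through trick, so it isn't gradient-free, but all the learning is in which edges survive:

    import torch
    import torch.nn as nn

    class GetSubnet(torch.autograd.Function):
        # Straight-through top-k mask: the forward pass keeps only the
        # best-scoring fraction k of edges; the backward pass sends the
        # gradient straight through to the scores.
        @staticmethod
        def forward(ctx, scores, k):
            mask = torch.zeros_like(scores)
            n_keep = int(k * scores.numel())
            idx = torch.topk(scores.view(-1), n_keep).indices
            mask.view(-1)[idx] = 1.0
            return mask

        @staticmethod
        def backward(ctx, grad):
            return grad, None

    class EdgePopupLinear(nn.Module):
        def __init__(self, in_features, out_features, k=0.5):
            super().__init__()
            # Weights are frozen at their random init and never trained.
            self.weight = nn.Parameter(0.1 * torch.randn(out_features, in_features),
                                       requires_grad=False)
            # Each edge gets a learnable score; only the scores get gradients.
            self.scores = nn.Parameter(torch.rand(out_features, in_features))
            self.k = k

        def forward(self, x):
            mask = GetSubnet.apply(self.scores, self.k)
            return x @ (self.weight * mask).t()

You'd then train only the scores with a normal optimizer while the random weights never move.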
That's probably true for most kinds of NN architectures, including convolutional layers and older recurrent architectures (LSTMs, etc.). Fully connected networks don't seem to be a necessary, and certainly not an efficient, way to represent the mechanisms that operate in the "real world", so clever ways to make networks sparse are an important key.
But it's equally important to create architectures that allow efficient backpropagation of errors.
It does seem like transformers are pretty good at both, already.
I kind of hope we're not getting something radically better anytime soon, because it seems like AGI is already approaching faster than we can prepare for.
Then again, I would expect that someone somewhere is already using transformer based networks to develop some brand new architecture that does in fact provide such a leap.
It's less about enumerating the architectures that have been tried before, and more about recognizing the modularity of NN components and the different perspectives on what those modules might represent.
I think you mean loss function, not process. And I'm failing to make any connection between intelligence and polynomial approximations to loss functions.
Global optimization techniques which don't rely on gradients seem theoretically superior in all instances, except that we haven't found super fast ways to run these kinds of optimizers.
The cartpole demo famously tripped up derivative-based reinforcement learning for a while.
> Global optimization techniques which don't rely on gradients seem theoretically superior in all instances, except that we haven't found super fast ways to run these kinds of optimizers.
Did you mean "Global optimization techniques which do rely on gradients..."? Because exact gradient-based global optimization (GBD or branch-and-bound based) methods for general nonconvex problems are theoretically superior (bounding with McCormick relaxations etc.) but also more challenging to practically deploy than say stochastic methods or metaheuristics like local search.
Could you point me toward any more info or keywords on "the cartpole demo famously tripped up derivative-based reinforcement learning for a while"? It sounds like an interesting bit of history, but my Google searches are only bringing up Python tutorials.
My guess: the cart pole is an inverted pendulum, and requires multiple left-right-swing-up movements to bring it from the “hanging” position to the “standing” position.
Finding this action sequence using gradients of “where is the tip” vs “where should it be” is very hard, as swinging the pendulum to the left and right goes against the gradient.
Instead, using stochastic gradient approximations (policy gradient method such as proximal policy optimization) might be better suited to solving these kinds of problems. Effectively, they do not compute the exact gradient locally, but rather kind of a global approximation by trying out random sequences of actions and determining which of them are closest to the desired outcome.
Hence, stochastic gradient approximations might be considered some kind of hybrid between greedy local optimization (such as following the exact gradient) and global optimization.
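To make the "try out random sequences of actions and keep the ones that score well" idea concrete, here is a minimal REINFORCE-style sketch (much simpler than PPO; the network size, hyperparameters, and the assumed Gym-style `env` interface are all placeholder choices of mine):

    import torch
    import torch.nn as nn

    # Minimal REINFORCE sketch; PPO layers clipping, value baselines, etc.
    # on top of the same idea. `env` is assumed to follow a Gym-style
    # interface; exact reset()/step() signatures vary between versions.

    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    def run_episode(env):
        log_probs, rewards = [], []
        obs, done = env.reset(), False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()  # sample an action instead of following a local gradient
            obs, reward, done, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        return log_probs, rewards

    def update(log_probs, rewards, gamma=0.99):
        # Discounted return-to-go for every step of the trajectory.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Push up the log-probability of actions in trajectories that did well.
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

The key point: the gradient flows through the log-probabilities of the sampled actions, weighted by how well the whole trajectory turned out, not through the pendulum dynamics, which is why swing-up maneuvers that temporarily move "against" the tip-position gradient don't confuse it.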
The reason for this is that the algorithm doesn't like to have to "spend" energy, since that reduces its score. Without huge amounts of trickery to get the gradient descent algorithm to stop getting stuck in the center, this is never solved, due to using a local optimizer for a global optimization problem (finding good weights in a NN).
What does this refer to: "cartpole demo famously tripped up derivative-based reinforcement learning"?
The phrase "cartpole demo famously tripped up derivative-based reinforcement learning" is likely referring to a classic problem in the field of reinforcement learning, which involves balancing a pole on a cart. The pole is attached to the cart via a hinge, and the goal is to keep the pole upright by moving the cart left or right in response to its angle. This problem is often used as a benchmark for testing reinforcement learning algorithms.
The phrase suggests that derivative-based reinforcement learning algorithms, which rely on computing gradients of a function with respect to its parameters, were not successful at solving this problem. This could be due to the fact that the problem is highly non-linear and requires precise control, which may be difficult to achieve with gradient-based methods.
Edit: bard got it too, with more detail, which is surprising
You should be really careful with asking this kind of question to ChatGPT, because now you think you've learned the answer, but in fact there are two options with very different outcome:
- ChatGPT was trained on a corpus of data containing the answer and is able to give you a decent answer
- ChatGPT was never exposed to the answer and will hallucinate a plausible-sounding response, and because it will answer in a really convincing way, you'll get tricked into believing complete bullshit
Interesting paper, thanks for bringing this up! I have been working on methods for trajectory optimization using both analytic gradient computations and black-box stochastic gradient approximations (proximal policy optimization).
I've always wondered about a question that is touched on in the paper: despite the analytic gradient computation being intuitively more efficient and mathematically correct, it is much harder to learn a policy with it than with the "brute force trial-and-error" black-box methods.
My one wish is that machine learning papers would use titles that actually described what the paper is about. I suppose there is a certain 'evolutionary pressure' where clever titles 'outcompete' drier, more descriptive titles (or it seems that way). But I don't like it.
Remember the days after Rowhammer and Heartbleed, when every new security vulnerability needed its own catchy name and domain name website? This is the science version of that.
40 years ago, when articles went into print publications, you'd just get your paper into a key print journal and then trust that everyone who gets it would at least look through the article headlines and read the abstracts of articles that seemed relevant to them. And it was manageable because you'd only have a few new issues rolling in per month.
But arXiv had an average of 167 CS papers being submitted per day in 2021. An academic who wants to keep their career alive needs to resort to every trick in the book to be heard above that din.
How does one, practically speaking, curate article publication in a world with arXiv?
Curation kind of requires gatekeepers, and good gatekeeping is a full-time job that needs to be paid for. In the days of print journals, there was both a motivation, because you can only print so many articles per month due to materials costs, and a steady revenue stream to pay for it, because everyone has to pay for the journals to get any access to articles at all. The Internet's been eroding both of those factors for a good quarter century now.
> Remember the days after Rowhammer and Heartbleed, when every new security vulnerability needed its own catchy name and domain name website? This is the science version of that. Branding is eating the world.
Is branding the concept in play? I don’t see it being a very good metaphorical fit in comparison to my suggestion below. Branding requires a common theme over multiple things (such as products or events).
I think we’re seeing something closer to title optimization in the service of marketing. But even marketing isn’t a great fit.
Upon reflecting on the Wikipedia definition of marketing, it strikes me that title optimization is only a tiny part of what “academic research marketing” could be. In the most generous sense, it could mean reaching out to media sources fairly early in the research/writing process to help craft a paper that will be interesting to the desired audience.
Wikipedia quotes:
> Marketing is the process of exploring, creating, and delivering value to meet the needs of a target market in terms of goods and services; potentially including selection of a target audience; selection of certain attributes or themes to emphasize in advertising; operation of advertising campaigns; attendance at trade shows and public events; design of products and packaging attractive to buyers; defining the terms of sale, such as price, discounts, warranty, and return policy; product placement in media or with people believed to influence the buying habits of others; agreements with retailers, wholesale distributors, or resellers; and attempts to create awareness of, loyalty to, and positive feelings about a brand.
> A brand is a name, term, design, symbol or any other feature that distinguishes one seller's good or service from those of other sellers. Brands are used in business, marketing, and advertising for recognition and, importantly, to create and store value as brand equity for the object identified, to the benefit of the brand's customers, its owners and shareholders.
> Personal branding is the conscious and intentional effort to create and influence public perception of an individual by positioning them as an authority in their industry, elevating their credibility, and differentiating themselves from the competition, to ultimately advance their career, widen their circle of influence, and have a larger impact.
Your comment reminded me of this talk [1] by David Duvenaud where he discusses his work on Neural ODEs, introduced in the paper [2] titled "Neural Ordinary Differential Equations" (this was one of the Best Papers at NeurIPS 2018). Apparently the original name of the paper was "Training Infinitesimally-Layered Neural Networks by Backpropagating through Black-box ODE Solvers". I found that to be a much better description of the paper. It seems he thinks so too, but eventually the catch-phrase-y title won out. In the video [1], this bit is at 00:48-01:22.
PS: [1] is a good talk in general where he discusses some of the limitations of the paper and things that could have been done better.
I’ve provided the PDF URL to ChatGPT 4.0 and asked it to summarize the article with alternative titles, but for some reason it keeps getting the original title, authors, abstract, and body wrong. What prompt did you use?
That's because ChatGPT doesn't have the ability to retrieve PDFs from the internet. (Unless maybe you have early access to the version with plugins?)
Bing chat does have the ability to read opened PDFs when used from the Edge side bar.
I asked a bunch of things. So initially I posted the abstract and title, then prompted just: "Explain". Then I asked "Give me background knowledge on spectrum of the Jacobian" and "Explain the title of the paper" and in the end "Reformulate the title in a more explicit manner". Maybe you can skip directly to the last prompt.
I'm pretty sure "Attention Is All You Need" would have gotten a lot of citations even if it had a "boring" title. It was a groundbreaking paper with lots of good ideas.
I mean, it would be slightly more accurate if it had been “Compute-Optimal Training of Large Language Models”, since the models so-trained aren’t the thing that is compute-optimal, the training is compute-optimal.
But… yeah, it's hardly a title chosen for marketing rather than description.
Interestingly, the models themselves also have compute cost associated with them, namely the cost of running them (inference). The smaller a model is, the less compute it needs for inference. Which is an interesting trade-off, because you can overtrain (rather than undertrain, as in the past) a model, i.e. use more tokens, fewer parameters, and ultimately more compute than optimal during training, to get lower inference cost.
This website has an interesting graph which visualizes this:
> For example, the compute overhead for 75% of the optimal model size is only 2.8%, whereas for half of the optimal model size, the overhead rises to 20%. As we move towards smaller models, we observe an asymptotic trend, and at 25% of the compute-optimal model size, the compute overhead increases rapidly to 188%.
So if you train your model you shouldn't just look at compute-optimal training but try to anticipate how much the model will probably be used for inference, to minimize total compute cost. Basically, the less inference you expect to do with your model, the closer the training should be to Chinchilla-optimality.
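To put rough numbers on it (a back-of-the-envelope sketch only; the ~6*N*D FLOPs for training and ~2*N FLOPs per inference token are the usual rules of thumb, the Chinchilla point and the ~20% training overhead for a half-size model come from the figures quoted above, and the inference volume is made up):

    def total_flops(n_params, train_tokens, inference_tokens):
        # training ~ 6*N*D, inference ~ 2*N per token (common rules of thumb)
        return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

    inference_tokens = 5e12  # hypothetical lifetime inference volume

    # Chinchilla-style compute-optimal point: ~70B params on ~1.4T tokens.
    optimal = total_flops(70e9, 1.4e12, inference_tokens)

    # Half-size model trained with ~20% more training compute (per the quote
    # above), which works out to roughly 3.4T tokens at 35B params.
    overtrained = total_flops(35e9, 3.4e12, inference_tokens)

    print(f"compute-optimal training: {optimal:.3g} total FLOPs")
    print(f"smaller, overtrained:     {overtrained:.3g} total FLOPs")

With that much inference, the smaller overtrained model comes out ahead on total compute (roughly 1.06e24 vs 1.29e24 FLOPs in this toy example), and the break-even point moves with how much inference you actually expect to do.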
There is a lot of jargon in ML. An example is found in the object detection literature, where you will often find sentences like this: "we present a novel simple non-hierarchical feature network backbone with window-shifted attention based learning accommodating neck+regression head fine-tuning or masked cascade RCNN second stage." I'm half joking. Surveys are often a godsend.
The major ML conferences all have pretty tight page limits, so more expository sentences usually get cut. This also means that papers usually only explain how their work is different from previous work, so they assume you are familiar with the papers they cite or are willing to read the cited papers.
This means that people who have up-to-date knowledge of a given subfield can quickly get a lot out of a new paper. Unfortunately, it also means that it usually takes a pretty decent stack of papers to get up to speed on a new subfield, since you have to read the important segments of the commonly cited papers in order to gain the common knowledge that papers are being diffed against.
Traditionally, this issue is solved by textbooks, since the base set of ideas in a given field or subfield is pretty stable. ML has been moving pretty fast in recent years, so there is still a sizable gap between the base knowledge required for productive paper reading and what you can get out of a textbook. For example, Goodfellow et al [1] is a great intro to the core ideas of deep learning, but it was published before transformers were invented, so it doesn’t mention them at all.
At some point you run into the problem where titles become useless because there are 100 papers on the same exact topic with very slight variation in the title. At this point people use surnames and date to cite papers.
But then the title can become something catchy that will give you more visibility.
Getting cited is probably the largest individual incentive. To various degrees, authors also want to get their ideas and reputation “out there”. To that end, authors want their work to be (i) skimmed (the abstract at least); (ii) shared; (iii) read. A catchy title almost always helps (right?); there doesn’t seem to be significant downside to catchy titles.
So how do we get more desirable system behavior? It seems we have a collective action problem.
I don’t think it makes much difference. If anything, it might help? It’s easier to search for a paper if you remember its name. If not, you can search on author or words in the abstract.
The problem isn’t quirky titles, it’s with websites like Hacker News that only display a headline and not the abstract.
Title: “Deep Reinforcement Learning for Multi-Agent Navigation in Complex Environments”
This paper proposes a deep reinforcement learning approach for multi-agent navigation in complex environments. The proposed method uses a centralized critic and decentralized actor architecture, where each agent has its own policy network and learns from its local observations, while the critic network takes in the global state information to provide a centralized value estimate. The method is evaluated on a variety of benchmark environments and achieves state-of-the-art performance compared to existing methods.