Simplifying Transformer Blocks

low_tech_love · on Nov 28, 2023

“This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable.”

I am by no means an expert and I can’t verify the authors’ claims about reduced speed and untrainability, but this reflects an impression I’ve been having on the papers I read and review. The field of ML research is moving so fast that people don’t even take time anymore to explain the design decisions behind their architectures. It’s basically “we got nice results, and here is the architecture of the model” (proceeds to show a figure with a hundred coloured blocks connected together in some seemingly random complex way).

It used to be that such a thing would get backlash from reviewers, and they would require you to actually justify the design. I don’t see that anymore. The problem with this for me is that we fail to build a nice, crisp understanding of the effects of each design decision in the final outcomes, which hurts the actual “science” of it. It also opens up the field for bogus and unreproducible claims.

But at least other people are picking up on the thread and doing that in follow-up papers, which is good.

loveparade · on Nov 28, 2023

It has always been like this in DL research. Also, providing justification doesn't necessarily help, because they are usually just guesses. The problem goes deeper than this. It's easy, and common, to make up justifications after the fact. You can try a bunch of random stuff, some thing randomly works, and then you make up an explanation for why it works that sounds plausible, but in reality is nothing more than a fictional narrative to convince reviewers (and possibly yourself). The underlying problem here are the incentives of the academic system. Reviewers reward good narratives and stories. Also, in DL especially, it's quite difficult to perform ablation studies or get statistical significance numbers due experiments taking a long time to run.

I hate to say it, because that's not how science should work, but with the flood of papers, looking at who wrote a paper is probably one of the more reliable indicators of quality. The reason is that people with a reputation are putting their reputation at stake when publishing something. Publishing BS would reflect badly on them, so they tend to go through more quality control. An unknown researcher at an unknown lab has nothing to lose by flooding arxiv with BS papers, which on the surface are indistinguishable from high quality work.

gumby · on Nov 28, 2023

> Also, in DL especially, it's quite difficult to perform ablation studies or get statistical significance numbers due experiments taking a long time to run.

Aww, the poor babies. This the the nature of working with complex systems (life sciences, physics, social sciences, etc): teasing out signal from very noisy data that is quite expensive (in time at least) to obtain.

On the engineering side things can (but don't always) move faster with a different kind of rigour. And that applies to observational reports in the sciences as well.

All of which is to say that there's nothing wrong with the flood of papers that have a bunch of colored boxes and some running code. But consider those to be the equivalent of most of the reports in the 18th century Royal Society: "I have just returned from being a passenger on the HMS Foobar; while on the Malabar coast I saw an odd insect that looked like this drawing"

And conversely spend referees' precious time on actual science.

godelski · on Nov 28, 2023

I mean I'm with you on the side of "ML needs more math" because we're about as sciencey as psychology, but I'm not quite on the "aww poor babies" side (yet). It's just the nature of how sciences evolve. The truth is that with deep rigorous theory, things like physics are easier. I'd say the same for psych too. It's a crazy amount of stats you need to do to account for all the unknowns and uncertainties but I'm not sure why these sciences push the theory people away when they can't even be bothered to add an epsilon to their models (which are always linear), but I digress. The thing that concerns me more though is the push against math. We see it a lot in ML but weirdly enough reviews will reject you if you don't have equations so you get tons of papers just copy pasting math that adds nothing to the paper. It's really weird and a clear indication of the laziness of reviews and area chairs.

What I tell my students is "you don't need math to make good models but you need math to know why your models are wrong." This seems to be becoming more important than ever. Hell, how many people know about the Whitney embedding theorem? That itself helps answer half my undergrad questions and I think it's relatively unknown.

But you need both. The thinkers and tinkerers. Referring to my longer rant, industry has captured the research field. I'm really happy they're involved but we're responding in insane ways. Let academia be the thinkers and let them make small models and don't expect a $100M budget for every paper. It's the railroading that annoys me and I think it's anti science. But no one wants to even acknowledge it's existence. I'll go ahead and say there's not much use to GPT papers unless you work for OpenAI (not to be confused with LLM papers). Let people explore new architectures and don't make them compete with sota models that have billions of dollars and man hours put into them. Hype is killing our field and it's hard to see because we're a high functioning addict.

jprafael · on Nov 28, 2023

> I hate to say it, because that's not how science should work

I have the opposite view. Science should be incremental and authors should be incentivized to share their (interesting) findings early and often. This makes the community as whole move faster because you get more visibility, funding, man-hours dedicated to things that are on the leading edge of research. Consider a scenario where a researcher is required to explain exactly why some phenomenon happens. Maybe it took 1 year to find the original phenomenon and then it takes 10 years to explain it to a reasonable level. Everyone only gets the benefit of this research 11 years after. Now consider the opposite scenario. After 1 year the author publishes and gets the attention of fellow colleagues. Some of them will collaborate together adding more man-hours / year, reducing the total time. Some of them might have already discovered something similar and thus avoid all repeated work. Some of them might be better positioned to solve the explanation piece based on their field of expertise, personal interests, availability. All of this makes the innovation happen faster.

What you often see (or should see in high quality papers) is an hypothesis of why something happens. This in itself is valuable. Many hypotheses are unproven until today. If you assume that these hypotheses are true you'll often find better results or find them faster; and if you don't you have discovered something interesting to report on.

psyklic · on Nov 28, 2023

I'm of the opposite view - publish only if you can prove it. If you reward half results, the field will be deluged with half truths. If someone publishes a hypothesis you're working on with only a half proof (and stands to gain the future credit), then there is little incentive to continue doing the work to prove it.

This was actually a major issue one of my PhD advisors had, since it led to poor foundations for the field with little incentive to ensure their validity.

rdedev · on Nov 28, 2023

We need something in between. The paper author may not know why something is happening but has showed that it is statistically significant. Maybe he just does not have the context or background but someone else along the way. Of course the assumption is that people would do replication studies but no one is incentivized to do them so better to be on the safer side

loveparade · on Nov 28, 2023

I don't think we disagree. All I am saying is that if your goal is to filter papers by their likelihood of having a useful result, the best signal is the author and their reputation, not necessarily the content (unless there is perhaps some obviously amazing demo). We don't disagree that for the community as a whole the system of publishing early and often is much better than the alternative of imposing restrictions.

tpoacher · on Nov 28, 2023

This guys agree with you: https://www.octopus.ac/

Alas, whether their idea will withstand the current incentives structure in academia, remains to be seen.

low_tech_love · on Nov 28, 2023

That works well only if people have no motivation to fabricate or embellish results, and/or if they fully share the code and experimental setup. But in general I agree.

lossolo · on Nov 28, 2023

> The field of ML research is moving so fast that people don’t even take time anymore to explain the design decisions behind their architectures

Often, they don't fully understand why certain methods work, they simply experiment and observe the outcomes. This sentiment is echoed by one of the authors of the Transformer paper in another publication[1], where he states:

"We offer no explanation as to why these architectures seem to work; we attribute their success, as with all else, to divine benevolence."

[1] https://arxiv.org/pdf/2002.05202.pdf

low_tech_love · on Nov 28, 2023

Hah! Thanks for the quote and the link, this is definitely going to be in my next lecture.

lossolo · on Nov 28, 2023

I love that quote too! :) Enjoy!

naasking · on Nov 28, 2023

> It used to be that such a thing would get backlash from reviewers, and they would require you to actually justify the design.

We still lack a theoretical framework to understand and predict such things. ML research is empirical, observational research. It's like publishing a physics paper, "we saw a current in a wire when we moved it past a magnet" before we really understood electromagnetism. Theoretical understanding will hopefully come later.

zoogeny · on Nov 28, 2023

People opine that "we don't know what NNs are doing" and I think this is overblown while having some truth. But undeniably there are a host of unanswered questions arising from this latest surge in AI capabilities.

It makes a lot of sense to me that we are just throwing a lot at the wall right now to see what sticks. Once we have a nice collection of sticky results I have optimism that the theories explaining that stickiness will appear.

sytelus · on Nov 28, 2023

There are rarely any design "decisions". Typically, you throw many things at wall and something sticks which becomes paper. Transformer paper has approximately zero "design decisions" apart from attention block. I can imagine they just tried out various combinations, kept adding projections and went with what worked the best.

gradys · on Nov 28, 2023

Attention itself was the key idea of that paper and, as you sort of acknowledge, was definitely not just throwing things at the wall. It was the culmination of a long line of work gradually progressing toward fully dynamic routing via attention, and it was motivated, if not by deep theory, at least deep intuition from linguistics. The other details of transformers are perhaps sort of arbitrary, but made sense to everyone at the time. There was no claim that those other details were optimal - just that they were one way of surrounding the attention mechanism with computing machinery that worked.

galangalalgol · on Nov 28, 2023

I hear people say that a lot, but is that really how people at the leading edge of research do this? Those I know who are coming up with new stuff and not just new applications for old architectures, are either building loosely on animal models, or designing based off a traditional algorithm with some room for the training to take advantage of complex interactions the traditional algorithms don't.

dartos · on Nov 28, 2023

Most advances in NLP with transformers over the last 2 years has been random trial and error and just throwing more compute at transformers.

Some models like RWKV (which isn’t a transformer model) explain the design decisions in their paper, but generally that’s not the case.

Nobody knows why, nobody seems to be looking into it.

We’re all just trying to figure out how to improve the outcome atm.

cma · on Nov 28, 2023

There have been huge advances in the mathematics of neural networks from Greg Yang (formerly of Microsoft). This allowed predictable training-hyperparameter transfer from smaller versions of GPT-4 where they could be tuned, to the final large model.

https://www.microsoft.com/en-us/research/uploads/prod/2021/1...

He has proofs and theorems about frontiers of maximal feature learning before things devolve into equivalent to kernel methods, and more: a whole bunch of breakthrough math making deep links with random matrix theory.

skummetmaelk · on Nov 28, 2023

Optimization by graduate student descent.

yobbo · on Nov 28, 2023

In principle it's not different from discovering that some certain probability distribution fits some certain type of data.

It's not random or bogus but it's also not exactly natural science in the sense that it answers "why" or "how".

godelski · on Nov 28, 2023

> The field of ML research is moving so fast that people don’t even take time anymore to explain the design decisions behind their architectures.

Has it ever? I've read a lot of "old" papers and idk if there's ever been a strong theoretical framework. I'd say it is strongest today but not in main papers and those papers are getting rejected because of lack of experiments (wtf is going on with reviewing and why are our solutions "authors have to show reviewers used LLMs"[-1] not "reviewers have to provide meaningful review"? What a joke!). Here's a more recent Universal Approximation paper[0]. Cyberko paper[1] is good, but I think we're moving in the right direction, at least with respect to these two works. Other UA papers I've seen are pretty handwavy.

But simple also doesn't mean "not useful" or "not worthy of publish" and I think this is something people are forgetting lately. Probably because we're asked to review too much and annoyed, but don't take it out on authors. Some examples are GELU[2] (read the end, there's drama in this paper), Clean-FID (CVPR 22)[3], ViT (ICLR 21) and DeiT (ICML 21)[4,5], and even ResNet (CVPR 16)[6] and tons of ReLU papers we can list. But let's also not confuse big compute with "good work" (they correlate but aren't the same. You can definitely do more with more compute).

> he problem with this for me is that we fail to build a nice, crisp understanding of the effects of each design decision in the final outcomes, which hurts the actual “science” of it. It also opens up the field for bogus and unreproducible claims.

This is mostly an unsolvable problem in science and I think we actually have bigger problems. This can be solved without conferences due to our community being fairly good about open sourcing code and checkpoints (at least compared to any other research area), as well as just the papers themselves. Reproducibility is only able to be done when people can actually replicate works (lol sorry GPT papers...). What helps this is moving away from publish or perish or allowing people to publish works that are replication attempts (replication is the foundation of science. Why does everything have to be "novel", whatever that means).

The bigger problem I see is actually related to the above. Publishing is just insane. It is easy, especially in fast moving times, to claim any work is incremental or not novel. That it's "just x but y". My feelings are "so what?" ViT is "just a transformer on images but image patches are tokens," yet it's insanely useful. It's obvious post hoc but not a priori. If it was, it would already be in use (ViT actually being a good example given the timeline). Everything is simpler and much more obvious after you've already seen it, which is why the whole notion of novelty is ridiculous. We humans just rewrite history in our minds and we also only concentrate on what's popular (lots of incentives, especially with reviewing insanity). Poor Ross Wightman gets shadowed for replications and improving ResNets though (and poor ConvNext). But we're seeing big labs do this and get through because they can do lots of experiments and build a lot of empirical evidence towards work, despite no additional theory.

The problem? What about the academics? I don't know how people publish without connections to big labs (aka big compute) anymore. The ideas from the groups are the same (industry is a bit more hyped), but I've seen papers where the theory is good, the architecture makes sense (and is well explained) and my co-reviewers for a fucking workshop want to reject it because "not enough experiments" and "only one 'real world' experiment." These are papers that are on par or better than many I see in the main conferences (you can tell they're shifted to workshops due to rejects). Talk about holding back science. Holding back science is making authors submit works over and over and over to get orthogonal reviews every time. Holding back science is me arguing with my co-reviewers who admitted to not understanding the work that "more experiments" is not a sufficient reason to reject from a workshop (or a main conference).

Hell, I got rejected once for a distillation paper and the reviewers complained that my teacher model's performance didn't match that of the original paper (no one has replicated that work btw, and I beat the second best published result I could find). Another paper I got hard rejected because I redacted a github link and an appendix citation broke when splitting the work. Guess what the ACs did? They let that person write a stronger response and then because they were the only high confidence reviewer on that paper (others were 3/5 confidence and borderline/WA) and so rejected the work. What kind of insane system is that? I see instances like this all over the place from many different peers across many different universities. There's more that's broken before we can even begin to talk about other issues because our system of "what qualifies as meaningful work" shouldn't be a fucking slot machine run by undergrads.

CVPR has well over 10k submissions. Good luck to the <500 ACs and all the reviewers. God have mercy on the grad students who are just trying to graduate. I'm sure you're all doing good work even if the reviewer didn't read a word. I hope your pictures and tables are enough.

[-1] https://twitter.com/CVPR/status/1722384482261508498

[0] https://www.jmlr.org/papers/v23/20-1433.html

[1] https://web.njit.edu/~usman/courses/cs675_fall18/10.1.1.441....

[2] https://arxiv.org/abs/1606.08415

[3] https://www.cs.cmu.edu/~clean-fid/

[4] https://arxiv.org/abs/2010.11929

[5] https://arxiv.org/abs/2012.12877

[6] https://arxiv.org/abs/1512.03385

Bayes7 · on Nov 28, 2023

"[...] modern neural network (NN) architectures have complex designs with many components [...]"

I find the Transformer architecture actually very simple compared to previous models like LSTMs or other recurrent models. You could argue that their vision counterparts like ViT are conceptually maybe even simpler than ConvNets?

Also, can someone explain why they are so keen to remove the skip connections? At least when it comes to coding, nothing is simpler than adding a skip connection and computationally the effect should be marginal?

SuchAnonMuchWow · on Nov 28, 2023

Skip connection increase the live range of one intermediate result across the whole part of the network skiped: the tensor at the beginning of a skip connection must be stored in memory for longer while unrelated computation happen: it increase the pressure on the memory hierarchy (either the L2, or scratchpad memory).

This is especially true for example for inference for vision transformers, where it decrease the batch size you can use before hitting the L2 capacity wall.

Bayes7 · on Nov 28, 2023

Okay, I see that for inference. But for training it shouldn't matter because I need to hold on to all my activations for my backwards pass anyways? But yeah, fair point!

jksk61 · on Nov 28, 2023

also removing skip connections leads to a rougher loss landscape, hence it should be harder to find the optimal weights.

sdenton4 · on Nov 28, 2023

Yes there's very good theoretical reasons for skip connections. If your initial matrix M is noise centered at 0, then 1+M is a noisy identity operation, while 0+M is a noisy deletion... It's better to do nothing if you don't know what to do, and avoid destroying information.

I appreciate the sibling comment perspective that memory pressure is a problem, but that can be mediated by using fewer/longer skip connections across blocks of layers.

imjonse · on Nov 28, 2023

"While we have demonstrated the efficacy of our simplifications across architectures, datasets, and tasks, the models we have considered (100-300M parameters) are small relative to the largest transformers."

University researchers without a big lab's backing cannot try out such experiments on really large models.

quickthrower2 · on Nov 28, 2023

300M params will train in a Google Colab most likely, or for a few bucks on vast.ai, or a few hours on a fairly decent card. So that is super small!

buildbot · on Nov 28, 2023

Errr 20M is a few hours to train 10K iters on 8x V100s for what I’ve been doing. 300M is in the range of 4-8 nodes of that scale.

quickthrower2 · on Nov 28, 2023

I'll check my wandb account when I can to see what I got (it was months ago). At the moment I can't log in "An application error occurred. Click to refresh the page." :-o

I was using a 3090 though - I remember that because I was using vast.ai and a cheapskate! That card has serious bang for buck.

quickthrower2 · on Nov 29, 2023

Just a thought: Are people confusing million and billion?

buildbot · on Nov 29, 2023

Or the library I’ve been using for this work is … not very memory efficient possibly? I’ve been referring to millions at least intentionally

baq · on Nov 28, 2023

By 'fairly decent' do you mean a civilian 4090 or one of the weapons-grade DC cards?

buildbot · on Nov 28, 2023

In my experience you’d need several weapons grade DC nodes (unless you are fine-tuning with LoRa, totally different game then!)

quickthrower2 · on Nov 28, 2023

3080, 4090, etc.

chessgecko · on Nov 28, 2023

Not sure if I read it correctly, but it seems like the skip connections are kinda still present in the skipless block because they added I after the softmax and v= the previous hidden state. Still a cool paper if they really managed to get rid of those projections without degrading the quality.

karmasimida · on Nov 28, 2023

There could be a CISC vs RISC argument for Transformer, which I can see, but my bet is the traditional transformer has established enough fundamentals, it is going to take a long time for alternative architectures to prove they are indeed alternatives on a wide range of tasks.

Pretty much at this point, to succeed transformer, one need to showcase, the alternative model could achieve chatgpt level performance with either significant reduction of compute or data requirements.

WithinReason · on Nov 28, 2023

This is a nice start, but what would really help is for someone who understands the GPU programming model in-depth to give this a shot, with the goal of reducing DRAM bandwidth and fitting the layers exactly onto a GPU's memory hierarchy (cache levels, local memory and registers). Basically, sizes of a target HW platform's memories should be hyper-parameters of the architecture.

machinekob · on Nov 28, 2023

Indeed it could be a target, but newer and newer co-processor for pure MatMul or even NPU are created and cache/register are changing fast from one architecture to another so it still wouldnt be optimal for most users.

Buttons840 · on Nov 28, 2023

What is "Shaped Attention"? They simplified everything except for changing "Attention" to "Shaped Attention".

Majromax · on Nov 28, 2023

That's discussed at the end of page 4, and it comes from the Noci et al (2023) reference.

samus · on Nov 28, 2023

I love reading about papers like these. They raise hopes that novel model architectures might reduce the need for computational resources to train powerful models, which would help lower the barrier of entry.

heyitsguay · on Nov 28, 2023

If you read "Attention Is All You Need", you'll see that was actually the motive behind the original Transformer too! Bit of a Jevon's paradox situation.

zozbot234 · on Nov 28, 2023

Transformer was really about making it feasible to parallelize training of large context windows. More like using large compute resources effectively when available than saving on compute. This is also why it looks quite ad-hoc compared to simpler models like RNN or LSTM, which can be trained in large batches of data (hence still making use of parallelism to some extent) but have to be serial along the context dimension.

bilsbie · on Nov 28, 2023

I wonder why only 15% faster training? Seems like simplying the main element would make a huge difference.

ofou · on Nov 28, 2023

Are anyone aware of similar simplifications?

m3kw9 · on Nov 28, 2023

Do they have a demo of this theory?