CoLT5: Faster Long-Range Transformers With Conditional Computation (arxiv.org)
123 points by optimalsolver on March 20, 2023 | 17 comments



There is another line of work on efficient Transformers that wasn't mentioned in the paper, i.e., adaptive computation at the sequence level, which pools similar tokens that are easy to predict, thereby reducing the complexity of a Transformer layer. Examples: [#1](https://arxiv.org/pdf/2211.09761.pdf) [#2](https://arxiv.org/pdf/2103.06874.pdf) [#3](https://arxiv.org/pdf/2110.13711.pdf).
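
As a rough illustration of the pooling idea, here's a minimal sketch, assuming a simple merge-adjacent-near-duplicate-tokens rule (each of the linked papers does the pooling differently, and learns it rather than hard-coding it):

```python
# Hypothetical sketch: pool adjacent, nearly identical ("easy") tokens so
# that later Transformer layers process a shorter sequence.
import numpy as np

def pool_similar_tokens(x: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """x: (seq_len, d_model) token embeddings. Returns a possibly shorter sequence."""
    pooled = [x[0]]
    for tok in x[1:]:
        prev = pooled[-1]
        cos = tok @ prev / (np.linalg.norm(tok) * np.linalg.norm(prev) + 1e-8)
        if cos > threshold:
            pooled[-1] = (prev + tok) / 2  # merge near-duplicates into one token
        else:
            pooled.append(tok)
    return np.stack(pooled)

x = np.random.randn(512, 64).astype(np.float32)
print(pool_similar_tokens(x).shape)  # at most (512, 64); shorter for redundant inputs
```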


It looks good to me, but is it right that the biggest model they tried was CoLT5-XL, with only about 5B parameters, some of which are zeroed out by sparsity? My understanding is that this isn't enough for some of the amazing emergent things the original formulation of transformers could do, but maybe the ideas of CoLT5 will still scale to the large language models?


Sounds interesting.

Another recent paper on efficient architecture for long context lengths: https://arxiv.org/abs/2302.10866


This looks very cool and is a massive improvement. If only OpenAI would also publish what they are doing to make 32K work.


The most interesting aspect of this research is that Google published it. They have a huge competitor in this area; it would make sense to stop publishing unless they think OpenAI is already doing this.


This may seem counter-intuitive, but model architecture is not the most significant differentiator in LLMs. Not publishing creates negative externalities around collaboration, recruiting etc., and any research organization worth their salt knows better than to avoid publishing. It's similar to the fallacy that Google should never have open-sourced Kubernetes because it allowed other cloud providers to out-compete them.


To me, model architecture sounds like the most important part.

Stable Diffusion was only possible because of the diffusion model architecture, then doing it in latent space instead of image space, and then upscaling.

Of course someone still needed to train it, but they did so once it was theoretically possible.


There was no competition in academic publishing until OpenAI made it so! The research probably started well before GPT-4 became a thing.


Every single decision Google makes is to avoid being seen by the government as a web advertising monopoly. Being seen as a competitor to adjacent tech companies is what they're after, since they have no competitor in display advertising.


Researchers will work for less money if they are allowed to publish.


this.


Dynamic allocation of Transformer layers per step is intuitively and computationally appealing. I can't believe this idea has been under-explored for years.


This still isn't technically dynamic allocation, since it always takes the top-k (constant k) tokens from the sequence, so it's more like dynamic routing, which was explored in Mixture-of-Experts models, but only in feed-forward blocks and with a different routing scheme.
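
To make the distinction concrete, here is a minimal sketch of that fixed top-k routing, with random matrices standing in for the learned router and branches (the actual CoLT5 layer normalizes the router scores and uses heavier attention/FFN blocks for the routed tokens):

```python
# Sketch of fixed top-k conditional routing: every token gets the light
# branch; only the k highest-scoring tokens also get the heavy branch,
# gated by their (un-normalized, stand-in) router scores.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, k = 16, 8, 4

x = rng.standard_normal((seq_len, d))
w_router = rng.standard_normal(d)            # learned in the real model
w_light = rng.standard_normal((d, d)) * 0.1  # cheap branch weights
w_heavy = rng.standard_normal((d, d)) * 0.1  # expensive branch weights

scores = x @ w_router                        # one routing score per token
top_k = np.argsort(scores)[-k:]              # constant k: hence "not truly dynamic"

out = x @ w_light                            # light branch applied to all tokens
gate = scores[top_k][:, None]                # heavy output scaled by router score
out[top_k] += gate * (x[top_k] @ w_heavy)    # heavy branch only for routed tokens
print(out.shape)                             # (16, 8)
```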


One can also make a model learn the necessary context length for each layer and head to save a huge amount of FLOPs: https://arxiv.org/abs/1905.07799
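
For reference, that paper learns a soft ramp mask per attention head that fades attention weights to zero beyond a learned span z. A minimal sketch, with an illustrative fixed z and ramp length rather than learned values:

```python
# Sketch of the adaptive attention span mask: weights for keys farther
# than the span z from the query are softly masked toward zero, so
# compute outside the span can be skipped.
import numpy as np

def span_mask(distance: np.ndarray, z: float, ramp: float = 32.0) -> np.ndarray:
    """Soft mask m_z(r) = clip((ramp + z - r) / ramp, 0, 1)."""
    return np.clip((ramp + z - distance) / ramp, 0.0, 1.0)

seq_len = 1024
dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]  # query-key distance

logits = np.random.randn(seq_len, seq_len)  # stand-in attention logits
mask = np.where(dist >= 0, span_mask(dist.astype(float), z=128.0), 0.0)  # causal + span
attn = np.exp(logits) * mask
attn /= attn.sum(axis=-1, keepdims=True) + 1e-8  # masked softmax
print(attn.shape)  # (1024, 1024), with near-zero weight beyond the span
```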


> No code open sourced

> No model weights

For shame


That would be nice, but at least they published their research.


Google releases a lot of the T5 models, for which they get insufficient credit, so CoLT5 may well be released at some point. But the process can take a while when it has to be run by lawyers and management, so check back in half a year...



