Rethinking Attention with Performers (googleblog.com)
212 points by headalgorithm on Oct 26, 2020 | 52 comments



This is about the machine learning concepts of Attention and Performers.

I was hoping for some help holding my attention and performing better, was disappointed :/


A free tip for you then, something I figured out recently. One of the easy ways to get kids to focus (for a little while) is to create stress and fear. E.g., parental pressure, yelling, draconian deadlines. So it's easy for people to develop habits where they need stress and fear to really focus.

Once I realized that a lot of my procrastination and apparent struggles to pay attention were bad childhood habits (driven by bad parental habits), things got better in two ways. One, I started structuring work in ways that created modest, even pressure. In my case, that's a kanban system with small units of work and frequent delivery, so I always have a next deadline that's pretty close. Two, I have looked for other sources of energy and have tried to build habits around that. For me it's that I enjoy being productive and feeling like I'm making a difference. So I've built habits around that.

And in case it helps, the habit-building secret for me is that if an experience is at least mildly pleasant or rewarding each time I do it, it'll turn into a habit. But if it's a negative experience each time, I'll build up an aversion that makes a mule look reasonable.

I hope that helps!


Inspiration or Desperation - the two human motivators.

Inspiration can easily lead to pursuing tangents, but conversely desperation can lead to tunnel vision and fear paralysis.

There's a sweet spot in the middle. Convince yourself you're passionate about something by visualizing the end goal, but also convince yourself that there is a real need to make progress.


> is that if an experience is at least mildly pleasant or rewarding each time I do it, it'll turn into a habit

Would you mind elaborating and giving some recent examples?


Sure! I did not grow up with a taste for exercise. Between my bookish nature and some terrible gym teachers, I loathed it. But once I moved to San Francisco, everybody was so healthy here! So I decided to try it out.

For quite a while, I'd start on something like running. I'd do it doggedly but in retrospect too intensely. Each time, it was kinda awful. Eventually my motivation would run out and I'd quietly drop it.

But now my approach is that my first goal should be enjoying it. For example, after an injury I stopped all exercise for 6 months or so, and was totally averse to returning to it. Eventually I grabbed my laptop and just went on a long walk, stopping at coffeeshops, restaurants, bookstores, etc every time I felt my enjoyment flagging. After doing that a few Sundays in a row, I shifted away from the stops, making it a pure long walk. Again, focused on enjoying it. Then I added bits of running here and there, upping the challenge until I was back doing a 7.5 mile long run every Sunday.

I apply the same lesson to learning new technologies. Every time I dive into some new language or toolkit, I find it easy to get intimidated. So I structure it as a series of small, rewarding increments. I pay attention to when I'm tired or too frustrated, taking breaks and rethinking my approach so it ends up as a sequence of small rewarding moments.

I think the same lesson lies behind Kanban approaches for teams as well. Small units of value moving quickly through a pipeline become very satisfying. It turns the work week into a series of modest accomplishments that over the long haul add up to real results, building good habits along the way.

Is that helpful?


I'm not the one who asked, but I wanted to thank you for this write up. I find it very helpful and I want to apply a similar approach to exercise. These types of comments are a big part of why I come to HN.


Positive reinforcement is really the only way to motivate. Honestly I think the best managers would be ABA therapists and psychologists who are well trained in this area.


Talk to a doctor. I had problems my whole life with attention and memory. It turned out to be a neurological problem and now it's mostly under control thanks to medication and therapy. If you're randomly commenting about it on HN it's probably affecting more of your life than you think it is; that's certainly the way it was for me, and I am far happier having talked to a doctor about what I thought was just a thing everyone had to deal with.


Are you referring to ADHD or some other neurological problem?


Attention deficit disorder. Rather than the predominantly hyperactive-impulsive subtype that everyone imagines, though, I have the predominantly inattentive subtype. Replace all of the hyperactivity symptoms with a double helping of forgetfulness and attention fatigue. Then add a heaping pile of self-loathing because everyone condemns your "laziness" as a moral failing that you can get over by focusing harder. Which you are, of course, physically incapable of. It's great fun.

That said, I'm not going to diagnose OP based on two lines of text. Talk to a doctor.


I'm afraid there's not much to be done about that, other than finding something you're actually interested in.

Me, I've accepted that I probably won't become much better at what I do, and there's more important things to worry about than my performance. Like beer and video games.


> there's more important things to worry about than my performance

Yes!

> Like beer and video games.

No!

Some activities are like sending energy and money to /dev/null. Others push us out of our comfort zone and open doors to better things.


Every activity is ultimately like sending money and energy to /dev/null. Meditating on a mountaintop is not inherently more fulfilling than playing a great video game, just because it is more "respectable".

I think people get trapped by always feeling the need to perfectly optimize their lives, without asking for what they are optimizing.


Disagree. Playing video games is so much fun and relaxing compared to the stress of work and Covid and daily life that I'm 100% sure it's net-positive if it's <= 10 hours/week.


Not everyone wants to spend their life the same way, and that's fine.

There's nothing wrong with pursuing comfort and enjoyment.


It is that, but for AIs, not humans.


Performers look similar to Lambda Networks in characteristics, but they don't mention it in the paper (although Lambda Networks are used to model images). I wonder what the main differences in the ideas are.

Anyways, congratulations for beating Moore's law again!



A number of groups worked on this at the same time and of course because it is machine learning, everyone had to come up with their own name for it. Even though most of these are basically elementary linear algebra dressed up by fancy language and lots of computationally expensive experiments.
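To make the "elementary linear algebra" point concrete, here is a minimal sketch of the reordering trick these linear-attention variants share: because matrix products are associative, (φ(Q)φ(K)ᵀ)V can be computed as φ(Q)(φ(K)ᵀV), which avoids building the L×L score matrix. The feature map `phi` below is a hypothetical placeholder (an elementwise exp) chosen only to keep the weights positive; it is not the random-feature map the Performer paper uses to approximate softmax, and the function names are mine.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def phi(m):
    """Hypothetical positive feature map (elementwise exp), for illustration only."""
    return [[math.exp(x) for x in row] for row in m]

def attention_quadratic(q, k, v):
    """Builds the full L x L score matrix first: quadratic in sequence length."""
    scores = matmul(phi(q), transpose(phi(k)))   # L x L
    out = matmul(scores, v)                      # L x d
    # Normalize each output row by the sum of that row's scores.
    return [[x / sum(s) for x in row] for row, s in zip(out, scores)]

def attention_linear(q, k, v):
    """Same result, reordered so the L x L matrix is never materialized."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)                # m x d summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]       # column sums of phi(K), length m
    out = matmul(fq, kv)                         # L x d
    norm = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / n for x in row] for row, n in zip(out, norm)]
```

Both functions return the same values up to floating-point error; only the order of multiplication, and therefore the cost in sequence length, differs.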


You are right, but at the same time it's great that it finally seems we have an efficient unified model for language and image modelling. Having CNNs and RNNs separately was relatively hard to work with (especially RNNs). It would be bad if we needed different hardware accelerator architectures for the two most important / different fields of AI. Also having a unified architecture will help with mixed tasks.

https://github.com/lucidrains/lambda-networks

https://www.youtube.com/watch?v=3qxJ2WD8p4w

https://openreview.net/pdf?id=xTJEN-ggl1b


OT: I want to get into transformers for NLP, what's the best way?

About me: Mostly done TS the last years. Dipped into Python, a bit pandas, a bit numpy, a bit Kaggle for the last 3-4 weeks.

Why I ask: It's so easy to get lost, this field is wide, e.g. I spent days with spaCy, CoreNLP, etc. before I learned that transformer-based models exist and outperform the former.


I've just recently been on a journey to understand transformers inside and out; here are the resources that managed to drill it into my head:

1. Chapters 7, 9, 10

https://web.stanford.edu/~jurafsky/slp3/

This was really useful for building up to the concepts of attention (although the actual attention section is still brief).

2. https://jalammar.github.io/visualizing-neural-machine-transl...

This was great for visualization and understanding that attention wasn't exclusively for transformers and actually was for RNNs first.

3. https://jalammar.github.io/illustrated-transformer/

Getting to understand how transformers actually work visually.

4. https://www.youtube.com/watch?v=S27pHKBEp30

This lecture by Leo Dirac was extremely helpful to finish off with, not only because it actually includes some pseudocode but it also revisits some key topics and covers why transformers are needed.

One of the big confusion points for me was that the concepts of ATTENTION and SELF-ATTENTION are not the same thing.
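A minimal sketch of that distinction, assuming plain scaled dot-product attention and leaving out the learned projection matrices a real transformer applies: attention in general mixes one sequence's values according to queries that may come from a different sequence (as in RNN encoder-decoder models), while self-attention is the special case where queries, keys and values all come from the same sequence. The function names are mine, not from any of the linked resources.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mix of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def self_attention(x):
    """Self-attention: queries, keys and values all come from the same sequence.
    Learned projections are omitted to keep the sketch short."""
    return attention(x, x, x)

# Cross-attention: one decoder state queries two encoder states.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_state = [[1.0, 0.0]]
cross = attention(decoder_state, encoder_states, encoder_states)

# Self-attention: the encoder states attend to each other.
mixed = self_attention(encoder_states)
```

Here `cross` has one row (one query) and `mixed` has two (one per position in the sequence).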

Hope this helps.


These look like great resources for transformers, thanks!


Everyone has a different way of approaching. I like video walk-throughs of papers. If you are the same, for the theory bit you may find Yannic Kilcher's videos [0,1] helpful.

On the implementation side, the best project out there is [2], which has good documentation. Also, spaCy has a transformer package [3] (I personally haven't tried it), so maybe it will be easier for you to jump in if you have prior experience with spaCy.

[0] https://www.youtube.com/watch?v=iDulhoQ2pro

[1] https://www.youtube.com/watch?v=-9evrZnBorM

[2] https://huggingface.co/transformers/

[3] https://explosion.ai/blog/spacy-transformers


Every time I research transformers it seems so hand wavy. Is there a simple description, maybe a bit of pseudo code?

Or at the other extreme they dump me into formula land without exposing what all the letters in the formula represent.


This is quite a good explanation of transformers that gets shared a lot. [link](http://jalammar.github.io/illustrated-transformer/)

And here's a super simple implementation of GPT by Andrej Karpathy. [link](https://github.com/karpathy/minGPT/blob/master/mingpt/model....)


Transformers are kinda similar to state vectors. They are tracking the current state of the world. The input becomes the output, which is the input to the next iteration. The transformer transforms the input to the output ad infinitum until a stop token is reached.
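As rough pseudocode for that loop (this is the autoregressive decoding step, not the internals of a transformer layer), here is a sketch in which `toy_next_token` is a hypothetical stand-in for a trained model: each predicted token is appended to the input and fed back in until a stop token appears.

```python
STOP = "<eos>"

def toy_next_token(tokens):
    """Hypothetical stand-in for a trained model: counts down, then stops."""
    last = tokens[-1]
    return STOP if last <= 1 else last - 1

def generate(prompt, next_token=toy_next_token, max_steps=50):
    """Feed the growing sequence back into the model until it emits STOP."""
    tokens = list(prompt)
    for _ in range(max_steps):   # hard cap so a model that never stops still terminates
        tok = next_token(tokens)
        if tok == STOP:
            break
        tokens.append(tok)       # the output becomes part of the next input
    return tokens
```

With the toy model above, `generate([3])` returns `[3, 2, 1]`. A real model would replace `toy_next_token` with a forward pass plus a sampling or argmax step over the vocabulary.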


Re spacy-transformers: I really wouldn't recommend it. I tried using it but it was a nightmare. They had a dependency on a previous major version of Thinc (spaCy's NN backend) but removed the documentation for that version. I wasted a week trying to deal with it until I gave up and went pure PyTorch.

Spacy v3 seems to have integrated the package functionality, so I'd go for the nightly release instead of this.


Sorry you lost time on this!

We took a long time to get Thinc documented and stable, because there was a long period where I wasn't sure where I wanted the library to go. The deep learning ecosystem in 2018 was pretty hard to predict, and we didn't want to encourage spaCy users to adopt Thinc as their machine learning code if we weren't sure what its status would be. So we actually never really got Thinc v7 stabilised and documented.

This actually became a real issue in the previous version of spacy-transformers. It meant we were pushed into a design for spacy-transformers that really didn't work well. The library wasn't flexible enough, because there was no good way to interact with the transformers at the modelling level.

Pretrained transformers are interesting from an API perspective because you really don't want to put the neural network in a box behind a higher-level API. You can use the intermediate representations in many different ways, so long as you can backprop to them. So you want to expose the neural networking.

Thinc v8 was redesigned and finally documented earlier this year: https://thinc.ai . We now have a clear vision for the library: you can write your models in the library of your choice and easily wrap them in Thinc, so spaCy isn't limited to one particular library. For spaCy's own models, we try to implement them in "pure Thinc" rather than a library like PyTorch or Tensorflow, to keep spaCy itself lightweight (and to stop you from having to juggle competing libraries at the same time).

So, it's not quite true that we removed the docs for Thinc v7. We actually didn't have a good solution to do the things you needed to do in the previous spacy-transformers, which prompted a big redesign.


Hey thanks for the super detailed response!

Yeah I was trying to do something that didn't quite fit with the spacy-transformers API at the time. I did get a bit of a headache trying to use thinc at the time, which was just when you guys did the redesign I think, so the docs were different from what I was seeing. I might not have searched enough though.

I didn't try it yet, but it seems that transformers got added to spacy v3 with first class support.

I did gain something from rummaging through the spaCy source though! NN layers were composed into module-like pieces, then added to this REGISTRY variable through a decorator. That way some things could be defined at runtime. It was super elegant.

I nicked the concept of that for my data preprocessing pipeline. Saved me a lot of time when trying new things.
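For anyone curious what that pattern looks like, here is a minimal sketch of a decorator-based registry applied to a preprocessing pipeline. It is not spaCy's or Thinc's actual registry code; `REGISTRY`, `register`, `build_pipeline` and the step names are all hypothetical, made up for illustration.

```python
REGISTRY = {}

def register(name):
    """Decorator factory: store the decorated function in REGISTRY under `name`."""
    def decorator(fn):
        if name in REGISTRY:
            raise ValueError(f"step {name!r} is already registered")
        REGISTRY[name] = fn
        return fn
    return decorator

@register("lowercase")
def lowercase(text):
    return text.lower()

@register("strip_punct")
def strip_punct(text):
    # Keep only letters, digits and whitespace.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def build_pipeline(step_names):
    """Look steps up by name at runtime, e.g. from a config file."""
    steps = [REGISTRY[name] for name in step_names]
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

pipeline = build_pipeline(["lowercase", "strip_punct"])
```

Calling `pipeline("Hello, World!")` gives `"hello world"`. Since steps are resolved by string name, trying a new preprocessing variant only means registering one more function and changing the list of names.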


No worries, and glad it wasn't a total loss! Yeah the registry solution is something we've been very happy with.


What would I miss if I went all transformers without spaCy? I don't get the idea of a wrapper API through spaCy.

I'd like to be as close as possible to the core transformers API without any intermediate layers. Nothing against spaCy but also when looking at huggingface's side and all the pre-trained models... it feels that nobody talks about/uses spaCy if they use transformers already.


I think spaCy offers a lot of things to connect the models to the rest of your application.

spaCy's Doc object is pretty helpful for using the outputs, for instance you can iterate over the sentences and then iterate over the entities within each sentence, and look at the tokens within them, or get the dependency children of the words in the entity. The Doc object is backed by Cython data structures, so it's more memory efficient and faster than Python equivalents you'd likely write yourself.

I also think our pipeline stuff is a bit more mature than the one in transformers. The transformers pipeline class is relatively new, so I do think our Language object offers a better developer experience.

I think the new training config and improved train command will also be appealing to people, especially with the projects workflow.

The improved transformers support in v3 is very new, it's only just released in beta form. I do hope people find it useful, but of course no library or solution is ideal for every use-case, so I definitely encourage people to pick the mix of libraries that seems right to them.


Missed this news, thanks! OP if you wish to use spacy try v3.

https://explosion.ai/blog/spacy-v3-nightly


Yannic Kilcher just released a new video[0] on the Performers paper. It will be useful to watch it after going through the above videos on transformers.

[0] https://youtu.be/xJrKIPwVwGM


+1 for Yannic Kilcher videos!


Shameless plug - I’m teaching an intense NLP training course over 4 half days, which covers transformers and KNN the latter 2 days. https://opensourceconnections.com/training/natural-language-...


Spacy v3.0 nightly is out, which has integration with transformer models. So if you already have some familiarity with the package it might be worth a look.

It should be very similar to normal spacy usage, just instead of downloading "en_core_web_sm" etc, it's "en_trf_foo_bar"


If you know Tensorflow, hugging face is the best way to get started. It's got easy ways to transfer learn from the big models.


Huggingface is the best, Tensorflow or Pytorch.


surprised no one's mentioned fairseq, which is probably the easiest way to train and use a transformer model. With huggingface etc you still have to write a bit of code for preprocessing input, training scheduling, batch inference and multi-gpu but fairseq has all that covered with built-in scripts.


If you have some background on neural networks McCormick's post is a good start.

https://mccormickml.com/2019/11/11/bert-research-ep-1-key-co...


Try the spaCy 3.0 alpha; it integrates the https://github.com/huggingface/transformers library. You should almost always use XLNet large in order to achieve the best accuracy.


The limitation on sequence length in these architectures is important. For example if you ask someone about a book they just read they can probably give you a summary fairly easily. This is beyond Xformer models today. Note this is distinct from training on a book or series of books which networks handle fairly well.

More practically, let's say I want to do an English language -> SQL translation task. If I feed a schema to GPT-3 along with the English language this actually works amazingly well. Unfortunately, with only a couple thousand tokens to work with as input budget, my schemas have to be short. I can't feed in a ton of schemas and metadata and have it work out complicated joins.


This is sort of right. Reformer has taken books as a single input and not just in split batches. There are pretrained book-trained models on HuggingFace now. I'm not aware of book-length summarization models yet, but that's not due to the input length issues of the previous generation of xformer models (maybe due to lack of an obvious training set?). So your SQL example should be a thing of the past before long.


I think they didn't release any benchmark results for Reformer, so yes, it can take a whole book as input, but the quality is unknown.


I don't know what to think about their protein sequence example.

.35 accuracy is sure "better than chance" but is it good enough for a real purpose? (e.g. pharmaceutical target screening) Is this a task which in principle could be done to near-perfect accuracy (did the patient in the case study die?) vs one which could not be (did they like the movie a lot, moderately, somewhat, almost, ?) How well can you do with simple heuristics?

You can follow the traces to the literature and find the answers to those questions and you will find that people are sanguine about fundamental issues such as "does this track of development converge on an asymptote of 99.8% accuracy or an asymptote of 75% accuracy."

Folks who know how to take a predictor and hook it up to a Kelly bettor make better money working for banks or trading on their own account than Google pays.
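For readers who haven't met the Kelly criterion, here is a minimal sketch of the standard formula for a single binary bet, f* = p - (1 - p) / b, where p is the predictor's win probability and b is the net payout per unit staked. The function name and the clamping to zero are my choices; real position sizing involves much more than this.

```python
def kelly_fraction(p_win, net_odds):
    """Fraction of bankroll to stake on a bet that wins with probability p_win
    and pays net_odds units of profit per unit staked."""
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability")
    if net_odds <= 0:
        raise ValueError("net_odds must be positive")
    edge = p_win * net_odds - (1.0 - p_win)   # expected profit per unit staked
    return max(0.0, edge / net_odds)          # never bet when the edge is negative
```

A predictor that is right 35% of the time is worth nothing at even odds (`kelly_fraction(0.35, 1.0)` is 0), but at a 3-to-1 payout it justifies staking about 13% of the bankroll, which is why accuracy alone doesn't say whether a predictor is useful.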


A lot of protein folding work is brute forcing folds. A 35% heuristic is probably pretty good.


If anyone wants to test it, it is implemented in this great library and works perfectly: https://github.com/idiap/fast-transformers


Is there an Explain-Like-I-Know-Matrix-Decompositions?


I didn't understand a word :(


So does the performer improve the state of the art on some NLP tasks?



