
What you describe here sounds a little like the line of work centered around Universal Transformers, which basically run the input embeddings through a single shared transformer block multiple times, with a separate module deciding when the embeddings have been "cooked" enough and can be pulled out of the oven, so to speak.
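
If it helps make that concrete, here's a rough PyTorch sketch of the idea as I understand it. The module names, the sigmoid halting head, and the halting threshold are all made up for illustration, not taken from the actual Universal Transformer / ACT papers:

  import torch
  import torch.nn as nn

  class UniversalTransformerSketch(nn.Module):
      """Sketch: one shared transformer block applied repeatedly, with a small
      halting head deciding per position when the embeddings are "done"."""

      def __init__(self, d_model=256, n_heads=4, max_steps=8, halt_threshold=0.99):
          super().__init__()
          self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
          self.halt_head = nn.Linear(d_model, 1)   # per-position halting score
          self.max_steps = max_steps
          self.halt_threshold = halt_threshold

      def forward(self, x):                        # x: (batch, seq, d_model)
          # cumulative halting probability per position
          halted = torch.zeros(x.shape[:2], device=x.device)
          for _ in range(self.max_steps):
              x = self.block(x)                    # same weights every step
              p = torch.sigmoid(self.halt_head(x)).squeeze(-1)
              # only positions that haven't halted yet keep accumulating
              halted = halted + p * (halted < self.halt_threshold).float()
              if (halted >= self.halt_threshold).all():
                  break                            # everything is "cooked"
          return x

The real thing also has to weight the per-step outputs so the halting decision stays differentiable, which this toy loop glosses over.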

Even more in line with the idea of "experts", there's a paper from last year on Sparse Universal Transformers, in which they combine a universal transformer with a sparse mixture of experts, so it's up to the gating mechanism to decide which transformer blocks are used, and in which order, to shape the embeddings.
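
And a similarly rough sketch of that routing idea: a gate picks one of several expert blocks per token at each recurrence step. The top-1 per-token routing here is a simplification I made up for illustration; the actual paper's mechanism is more involved:

  import torch
  import torch.nn as nn

  class SparseBlockRouterSketch(nn.Module):
      """Sketch: at each step a learned gate routes each token to one of
      several expert transformer blocks (top-1 routing, for simplicity)."""

      def __init__(self, d_model=256, n_heads=4, n_experts=4, n_steps=4):
          super().__init__()
          self.experts = nn.ModuleList(
              nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
              for _ in range(n_experts)
          )
          self.gate = nn.Linear(d_model, n_experts)
          self.n_steps = n_steps

      def forward(self, x):                        # x: (batch, seq, d_model)
          for _ in range(self.n_steps):
              scores = self.gate(x)                # (batch, seq, n_experts)
              choice = scores.argmax(dim=-1)       # top-1 expert per token
              out = torch.zeros_like(x)
              for i, expert in enumerate(self.experts):
                  mask = (choice == i)
                  if mask.any():
                      # run the expert, keep its output only for routed tokens
                      out = torch.where(mask.unsqueeze(-1), expert(x), out)
              x = out
          return x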

This really isn't my specialty, but from what I gathered these are tricky to train properly and require more overall compute during inference to reach results comparable to their vanilla transformer counterparts. It's an interesting direction nonetheless; having a fixed upper bound on the number of computation steps per token is, in my opinion, one of the major downsides of the classical transformer architecture.



