FlashAttention: Fast Transformer training with long sequences (adept.ai)
148 points by kristianp 9 months ago | 10 comments



The same author, Tri Dao, released FlashAttention-2 in July.

https://together.ai/blog/tri-dao-flash-attention


Here is a recent interview with the author of FlashAttention, Tri Dao:

https://www.youtube.com/watch?v=J4-qZ6KBalk


It's insane that FlashAttention was released 16 months ago. It feels like a decade.


Has anybody used FlashAttention in their model? Are there any benchmark numbers on the quality impact?


The result is identical to regular attention in transformers, but training can be about four times faster, so there is almost no reason not to use it.
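
For anyone who wants to sanity-check that claim, here's a minimal sketch (assuming PyTorch 2.x on a CUDA GPU, where F.scaled_dot_product_attention can dispatch to the FlashAttention kernel) comparing the flash path against the plain math path; the shapes and tolerance below are just illustrative:

    import torch
    import torch.nn.functional as F

    # Illustrative shapes; the flash kernel wants fp16/bf16 on a CUDA GPU.
    q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
               for _ in range(3))

    # Force the FlashAttention backend.
    with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                        enable_math=False,
                                        enable_mem_efficient=False):
        out_flash = F.scaled_dot_product_attention(q, k, v)

    # Force the plain (unfused) math backend for comparison.
    with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                        enable_math=True,
                                        enable_mem_efficient=False):
        out_math = F.scaled_dot_product_attention(q, k, v)

    # The two paths agree to within fp16 rounding.
    print(torch.allclose(out_flash, out_math, atol=1e-3))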


Not quite. There can be non-deterministic race conditions, and it has some strange head-size and sequence-length requirements.


Yes. For a model within the limits of the head requirements, however, you wouldn't be able to see a quality difference from regular attention. Non-determinism is a price paid for performance; regular transformer implementations can suffer from it too, depending on how they're written.


It’s basically a way of making more efficient use of memory transfers during the calculation of the attention blocks in a transformer. You transfer a block at a time, increasing inference throughput because less time is spent overall fetching things from slow memory.
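
In case it helps, a toy re-implementation of that blocked idea (single head, plain PyTorch with an online softmax; the tile size and shapes are made up, and the real kernel does this tiling in on-chip SRAM with a CUDA kernel) looks roughly like:

    import math
    import torch

    def blocked_attention(q, k, v, block=64):
        # q, k, v: (seq_len, head_dim) for one head; illustrative only.
        n, d = q.shape
        scale = 1.0 / math.sqrt(d)
        out = torch.empty_like(q)
        for i in range(0, n, block):
            qi = q[i:i + block]                              # query tile
            m = torch.full((qi.shape[0],), float("-inf"), dtype=q.dtype)
            l = torch.zeros(qi.shape[0], dtype=q.dtype)
            acc = torch.zeros_like(qi)
            for j in range(0, n, block):                     # stream K/V tiles
                s = (qi @ k[j:j + block].T) * scale
                m_new = torch.maximum(m, s.max(dim=1).values)
                p = torch.exp(s - m_new[:, None])
                corr = torch.exp(m - m_new)                  # rescale old partials
                l = l * corr + p.sum(dim=1)
                acc = acc * corr[:, None] + p @ v[j:j + block]
                m = m_new
            out[i:i + block] = acc / l[:, None]              # final normalization
        return out

    # Quick check against the straightforward full-matrix attention.
    q, k, v = (torch.randn(256, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) / math.sqrt(64), dim=-1) @ v
    print(torch.max((blocked_attention(q, k, v) - ref).abs()))   # ~1e-6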


Also, isn't the author Tri Dao at Together AI now as their chief scientist?


Related:

FlashAttention-2, 2x faster than FlashAttention - https://news.ycombinator.com/item?id=36761988 - July 2023 (18 comments)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - https://news.ycombinator.com/item?id=31568090 - May 2022 (3 comments)



