Yes. For a model within FlashAttention's head-dimension limits, though, you wouldn't see any quality difference from regular attention, since it computes the exact same result. The nondeterminism is a price paid for performance, and regular transformer implementations can suffer from it too, depending on the implementation.
It's basically a way of making more efficient use of memory transfers during the computation of the attention blocks in a transformer. You transfer a block at a time into fast on-chip memory, increasing inference throughput because less time is spent overall fetching things from slow memory.
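Roughly, in NumPy (a toy sketch of the tiling plus online-softmax idea, not the actual fused CUDA kernel; the function name and block size here are just illustrative):

```python
import numpy as np

def blocked_attention(Q, K, V, block_size=64):
    """Tiled attention with an online softmax, FlashAttention-style.

    Mathematically equivalent to softmax(Q @ K.T / sqrt(d)) @ V, but the
    keys/values are streamed one block at a time, so the full n x n
    attention matrix is never materialized in slow memory.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)  # running max of each row's logits
    row_sum = np.zeros(n)          # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]      # fetch one block of keys...
        Vb = V[start:start + block_size]      # ...and the matching values
        S = (Q @ Kb.T) * scale                # partial logits for this block

        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])      # numerically stable exp
        rescale = np.exp(row_max - new_max)   # fix up earlier partial sums
        row_sum = row_sum * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Matches a naive implementation up to floating-point rounding:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
naive = np.exp(S - S.max(axis=1, keepdims=True))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blocked_attention(Q, K, V), naive)
```

Combining partial sums in different orders is exactly the kind of floating-point reduction-order effect behind the nondeterminism mentioned above.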
https://together.ai/blog/tri-dao-flash-attention