As far as I understand, the main issue for LLM inference is memory bandwidth and capacity. Tensor cores are already an ASIC for matmul, and they idle half the time waiting on memory.
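As a rough illustration of why that is, here's a back-of-the-envelope roofline sketch for a single-token decode step (the H100-class numbers are assumed ballpark figures, not measurements):

    # Roofline check for batch-1 decode (all hardware numbers are assumed ballpark figures).
    peak_flops = 1.0e15              # ~1 PFLOP/s dense FP16 tensor-core throughput (assumed)
    mem_bw     = 3.3e12              # ~3.3 TB/s HBM bandwidth (assumed)
    balance    = peak_flops / mem_bw # FLOPs per byte needed to keep tensor cores busy

    # One decode step over a W @ x matvec with an [n, n] FP16 weight matrix:
    n = 8192
    flops  = 2 * n * n               # one multiply-add per weight
    bytes_ = 2 * n * n               # each FP16 weight read once from HBM (activations negligible)
    intensity = flops / bytes_       # ~1 FLOP per byte

    print(f"hardware balance point: {balance:.0f} FLOPs/byte")
    print(f"decode matvec intensity: {intensity:.0f} FLOPs/byte")
    # With arithmetic intensity (~1) far below the balance point (~300), throughput is set
    # by memory bandwidth, and the tensor cores sit mostly idle at small batch sizes.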
