Also, stuff like this makes it hard to take the results seriously:
* To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.
* All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
They did everything they could to make sure AMD comes out faster.
You need two H100s to have enough VRAM for the model, whereas you need only one MI300X. Doubling the total throughput (across all completions) of one MI300X to simulate the numbers for a duplicated system is reasonable.
They should probably also show the per-completion throughput separately, since tensor parallelism is often used to speed up individual completions in addition to doubling the available VRAM.
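The extrapolation being debated can be sketched as simple arithmetic. The function and all the numbers below are hypothetical, just to show why doubling is fair for aggregate throughput but says nothing about per-completion speed:

```python
# Hypothetical sketch of the extrapolation; the throughput figures
# below are made-up illustrative numbers, not benchmark results.

def extrapolate_throughput(per_system_tput: float, replicas: int) -> float:
    """Scale one system's throughput by the number of independent replicas.

    Valid for throughput-oriented serving, where replicas handle disjoint
    request streams; per-completion latency stays the same.
    """
    return per_system_tput * replicas

# One TP=2 pair of H100s vs. one MI300X extrapolated to match GPU count.
h100_pair_tput = 100.0  # tokens/s for the 2xH100 system (hypothetical)
mi300x_tput = 60.0      # tokens/s for a single MI300X (hypothetical)

mi300x_x2 = extrapolate_throughput(mi300x_tput, replicas=2)
print(mi300x_x2)  # 120.0 in aggregate, but each individual completion
# still runs at the single-GPU rate, whereas the TP=2 H100 pair may
# finish each individual completion sooner.
```

The caveat in the comment is exactly the last line: replication doubles aggregate tokens/s, while tensor parallelism can also cut the latency of a single completion, and the two shouldn't be conflated in one number.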
I don't understand why they should have used TensorRT. vLLM is much more popular, and it was originally written for Nvidia. It also supports AMD hardware, so it's the appropriate tool for the comparison.
The way I see it, they did everything they could to compare a specific code path. If your workload scales with FP16 but not with tensor cores, then this is the correct way to test. What else do you need for LLM inference?
vLLM inference of Mixtral in fp16 is a real workload. I guess those details are there because different inference engines were used. You want the compute tasks to be as similar as possible, but the compute kernels can't be identical, since in the end they have to run on different hardware.
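For reference, the kind of run being discussed, vLLM serving Mixtral in fp16 with tensor parallelism, looks roughly the same on either vendor's hardware. This is a sketch of vLLM's serve command; the model name is one plausible Mixtral checkpoint and the parallelism degree depends on the setup:

```shell
# Sketch of a vLLM fp16 serving run. Set --tensor-parallel-size to the
# GPU count: e.g. 2 for a pair of H100s, 1 for a single MI300X.
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --dtype float16 \
    --tensor-parallel-size 2
```

Because vLLM dispatches to CUDA kernels on Nvidia and ROCm kernels on AMD, the same invocation exercises comparable but necessarily different kernels on each platform, which is the point the comment is making.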