It might be tough to make more efficient, but $15k seems about right for "stick 6 4090s in a decent box and throw in a couple grand for my troubles", rather than any revolutionary hardware configuration. The way it advertises running fp16 Llama 70B feels a bit contrived too, given the prevalence of quantizing to 8 bit at minimum.
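For reference, a rough back-of-envelope on why the fp16 framing matters (the 70B parameter count and bytes-per-weight are standard figures; the rest is just arithmetic):

    # Approximate weight memory for Llama 70B at different precisions.
    params = 70e9  # nominal parameter count

    for name, bytes_per_weight in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        gb = params * bytes_per_weight / 1e9
        print(f"{name}: ~{gb:.0f} GB of weights")

    # fp16: ~140 GB -> needs ~6x 24GB 4090s (144 GB total VRAM)
    # int8: ~70 GB  -> fits in far less hardware
    # int4: ~35 GB  -> fits in 2x 4090s

So the fp16 requirement is precisely what forces the 6-GPU build; quantize to 8 bit and the hardware story changes completely.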
In my opinion the best hardware for running big models is a Mac Studio Ultra. It has 192GB of unified memory, which can run pretty much every available model without losing performance, and it would cost you half that price.
According to this [1] article (currently at the top of HN), memory bandwidth is typically the limiting factor, so as long as your batch size isn't huge you probably aren't losing too much performance.
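To make the bandwidth point concrete, here's a crude ceiling estimate, assuming each decoded token has to stream the full weights from memory once (ignoring KV cache traffic and any overlap):

    # Crude upper bound on single-stream decode speed from bandwidth alone.
    def tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_gb

    # Approximate published specs: M2 Ultra unified memory ~800 GB/s,
    # RTX 4090 ~1000 GB/s per card.
    print(tokens_per_sec(140, 800))  # fp16 70B on a Mac Studio Ultra: ~5.7 tok/s
    print(tokens_per_sec(70, 800))   # int8 70B: ~11 tok/s

By this measure the Mac and the 4090 box are in the same ballpark per token, which is why the unified-memory option looks so good on price.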