I've just ordered one to ditz around with. We're early in the process of deploying our own internal server for Mixtral work, but it'll be interesting to see how this performs (in raw terms, almost certainly worse, since we can roll out a 4090 without getting into trouble, but Hetzner has a better connection and handles the maintenance, so...). The order has been accepted but not deployed yet. I'll update when I have some initial inference numbers.
I got it after a few hours. You need to install the drivers and CUDA yourself, but it's all very straightforward. Unfortunately, with only 20GB of VRAM, I'm limited to mixtral:8x7b-instruct-v0.1-q2_K, but it runs fine, generating at about 40 tok/s (65 tok/s for eval). As per the official specs, it's running maxed out at 70W (being an SFF card).
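For anyone wanting to compare like for like, here is a rough sketch of how tok/s figures can be derived, assuming they come from the stats Ollama returns with a non-streaming `/api/generate` response (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, with durations in nanoseconds). The helper names and the sample values are made up for illustration and are not my measurements.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return count / (duration_ns / 1e9)


def summarize(stats: dict) -> dict:
    """Compute generation and prompt-processing rates from Ollama-style stats."""
    return {
        "generation_tok_s": tokens_per_second(
            stats["eval_count"], stats["eval_duration"]
        ),
        "prompt_tok_s": tokens_per_second(
            stats["prompt_eval_count"], stats["prompt_eval_duration"]
        ),
    }


# Hypothetical sample stats (placeholders, not real measurements).
sample = {
    "eval_count": 300,
    "eval_duration": 6_000_000_000,
    "prompt_eval_count": 120,
    "prompt_eval_duration": 1_500_000_000,
}

rates = summarize(sample)
print(f"generation: {rates['generation_tok_s']:.1f} tok/s")
print(f"prompt:     {rates['prompt_tok_s']:.1f} tok/s")
```

In practice the `sample` dict would be the JSON body of a real response; running `ollama run <model> --verbose` prints the same rates without any code.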
(I've now tried running the Q4 mixtral which is 26GB. 18GB is on GPU, 8GB through CPU. Gets about 11 tok/s.)
To compare, running Ollama with Q4 Mixtral on my 24GB 3090 Ti locally (with 22.5GB of the 26GB model on the GPU), I get 14 tok/s generation, so the Hetzner server and the 4000 really aren't bad at all (notably, my 3090 Ti also draws 70W during inference).
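To put the two Q4 runs side by side, a quick back-of-the-envelope sketch using only the figures quoted above (GB on the GPU, total model size, generation tok/s). The function name and the dictionary layout are just for illustration.

```python
def gpu_fraction(on_gpu_gb: float, model_gb: float) -> float:
    """Share of the model weights resident in VRAM."""
    return on_gpu_gb / model_gb


# (GB on GPU, model size in GB, generation tok/s), as quoted in the comments above.
runs = {
    "Hetzner 4000": (18.0, 26.0, 11.0),
    "local 3090 Ti": (22.5, 26.0, 14.0),
}

for name, (on_gpu, total, tok_s) in runs.items():
    print(f"{name}: {gpu_fraction(on_gpu, total):.0%} of model on GPU, {tok_s} tok/s")

# Relative generation speed of the Hetzner box versus the local card.
relative_speed = runs["Hetzner 4000"][2] / runs["local 3090 Ti"][2]
print(f"Hetzner runs at {relative_speed:.0%} of the local speed")
```

This only restates the numbers as ratios. It says nothing about how much of the gap is due to the CPU-offloaded layers versus the cards themselves.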
Possibly. It's not a bad idea! The performance is pretty good in our initial tests, though cloud platforms do scale further in terms of concurrency.
Given how quickly things are changing, using more generic hardware (i.e. PCs and Nvidia cards) or leasing cloud-based services might make more business sense in the long term, so investigating fallbacks that give us more control, like this server, is a worthwhile experiment anyway.
Further, while I've been an Apple superfan for a couple of decades now, their recent attitude towards developers stinks, and I'm loath to give them any more money than I need to right now, so I might be migrating away from the platform in any case.