
I've just ordered one to ditz around with. We're early in the process of deploying our own internal server for Mixtral work, but it'll be interesting to see how this performs (in raw terms, almost certainly worse, since we can roll out a 4090 without getting into trouble – but Hetzner has a better connection and handles the maintenance, so..) Order has been accepted but not deployed yet. I'll update when I have some initial inference numbers.



I got it after a few hours. You need to install the drivers/CUDA yourself, but it's all very straightforward. Unfortunately, since the card only has 20GB of VRAM, I'm limited to mixtral:8x7b-instruct-v0.1-q2_K, but it runs fine, generating at about 40 tok/s (65 tok/s for eval). As per the official specs, it maxes out at 70W (being an SFF card).

(I've now tried running the Q4 mixtral which is 26GB. 18GB is on GPU, 8GB through CPU. Gets about 11 tok/s.)
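That ~18GB/8GB split is roughly what you'd expect if whole layers are offloaded evenly until VRAM runs out. A toy estimate (the layer count, overhead reserve, and even-split assumption are mine, not Ollama's actual allocator):

```python
def gpu_offload_split(model_gb, vram_gb, n_layers=32, overhead_gb=2.0):
    """Estimate (layers_on_gpu, gb_on_gpu, gb_on_cpu) assuming an even
    per-layer size and some VRAM reserved for KV cache and scratch space."""
    per_layer = model_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0)
    layers_on_gpu = min(n_layers, int(usable // per_layer))
    gb_on_gpu = layers_on_gpu * per_layer
    return layers_on_gpu, gb_on_gpu, model_gb - gb_on_gpu
```

With the numbers above (26GB model, 20GB card), this lands on 22 layers and ~17.9GB on the GPU with ~8.1GB left for the CPU, close to the reported split.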


For comparison: running Ollama with Q4 Mixtral locally on my 24GB 3090 Ti (with 22.5GB of the 26GB model on the GPU), I get 14 tok/s generation. So the Hetzner server and the RTX 4000 really aren't bad at all (notably, my 3090 also draws about 70W during inference).
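Since both cards sit around 70W during inference, the throughput gap translates directly into energy per generated token. A quick back-of-envelope using the figures reported in this thread:

```python
def joules_per_token(watts, tok_per_s):
    """Energy cost of generating one token at a given power draw."""
    return watts / tok_per_s

# Both cards at ~70W running Q4 Mixtral, per the numbers above.
rtx_4000_sff = joules_per_token(70, 11)  # Hetzner RTX 4000 SFF Ada
rtx_3090_ti = joules_per_token(70, 14)   # local 3090 Ti
```

That works out to roughly 6.4 J/token on the RTX 4000 vs 5 J/token on the 3090 Ti, so the efficiency gap is smaller than the raw throughput gap might suggest.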


I've never deployed one professionally, but wouldn't a Mac Studio be better value for money if you're only doing inference?


Possibly. It's not a bad idea! The performance is pretty good in our initial tests, though cloud platforms do scale further in terms of concurrency.

Given how quickly things are changing, using more generic hardware (i.e. PCs and Nvidia cards) or leasing cloud-based services might make more business sense in the long term, so investigating fallbacks with more control, like this server, is a worthwhile experiment anyway.

Further, while I've been an Apple superfan for a couple of decades now, their recent attitude towards developers stinks, and I'm loath to give them any more money than I need to right now, so I might be migrating away from the platform in any case.



