A 7B model at 8-bit quantization takes up 7 GB of RAM. Less if you use a 6-bit quantization, which is nearly as good. Otherwise it's just a question of having enough system RAM and CPU cores, plus maybe a small discrete GPU.
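Back-of-the-envelope, that figure is just parameters times bits per weight; a quick sketch (real quantized files come out slightly larger because of scales and metadata):

    # Rough model size: parameters (in billions) x bits-per-weight / 8 = GB
    def model_size_gb(params_billions, bits_per_weight):
        return params_billions * bits_per_weight / 8

    print(model_size_gb(7, 8))  # ~7.0 GB at 8-bit
    print(model_size_gb(7, 6))  # ~5.25 GB at 6-bit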
You'll need a bit more than 7 GB (roughly an extra 1 GB), even at 8-bit quantization, because of the KV cache. LLM inference is notoriously inefficient without it, because generation is autoregressive.
Some projects such as lmdeploy[0] can quantize the KV cache[1] as well to save some VRAM.
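For a sense of scale, the KV cache grows linearly with context length and batch size. Here's a rough sketch assuming llama2-7B's shape (32 layers, 32 heads, head dim 128), not lmdeploy's exact accounting:

    # Rough KV-cache size: 2 (K and V) x layers x heads x head_dim
    # x tokens x bytes per element x batch size.
    def kv_cache_gb(layers, heads, head_dim, tokens, bytes_per_elem, batch=1):
        return 2 * layers * heads * head_dim * tokens * bytes_per_elem * batch / 1e9

    # llama2-7B shape, 2048-token context:
    print(kv_cache_gb(32, 32, 128, 2048, 2))  # ~1.07 GB at fp16
    print(kv_cache_gb(32, 32, 128, 2048, 1))  # ~0.54 GB with an int8 KV cache

That lines up with the rough 1 GB figure above, and quantizing the cache to int8 halves it.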
Speaking of lmdeploy, it doesn't seem to be widely known, but it also supports quantization with AWQ[2], which appears to be superior to the more widely used GPTQ.
The serving backend is NVIDIA Triton Inference Server. Not only is Triton extremely fast and efficient, lmdeploy also ships a custom TurboMind backend for it. With this setup, lmdeploy delivers the best performance I've seen[3].
On my development workstation with an RTX 4090, llama2-chat-13b, AWQ int4, and KV cache int8:
8 concurrent sessions (batch 1): 580 tokens/s
1 concurrent session (batch 1): 105 tokens/s
This is out of the box, I haven't spent any time further optimizing it.
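For anyone who wants to try a similar setup, here's roughly what it looks like through lmdeploy's Python pipeline API rather than the Triton serving path described above. This is a sketch: the model path is a placeholder, and parameter names have moved around between lmdeploy versions, so check the docs for yours.

    from lmdeploy import pipeline, TurbomindEngineConfig

    # model_format="awq" loads AWQ int4 weights; quant_policy=8 turns on the
    # int8 KV cache. The model path is a placeholder.
    engine = TurbomindEngineConfig(model_format="awq", quant_policy=8)
    pipe = pipeline("path/to/llama2-chat-13b-awq", backend_config=engine)

    print(pipe(["Explain the KV cache in one paragraph."]))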
6-bit quantizations are supposed to be nearly equivalent to 8-bit, and that chops about 1.5 GB off the model size. I think a 6-bit model should therefore fit; if it doesn't, a 5-bit medium or 5-bit small quantization surely will.
There is always an option to go down the list of available quantizations notch by notch until you find the largest model that works. llama.cpp offers a lot of flexibility in that regard.
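As a rough guide to that ladder, you can plug approximate bits-per-weight for the common llama.cpp quant types into the same size estimate. The figures below are ballpark numbers, and actual GGUF files run a bit larger:

    # Approximate bits-per-weight for common llama.cpp (GGUF) quant types.
    QUANTS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q5_K_S": 5.5, "Q4_K_M": 4.8}

    def walk_ladder(params_billions, budget_gb):
        # Walk down the ladder and report which quants fit the memory budget.
        for name, bpw in sorted(QUANTS.items(), key=lambda kv: -kv[1]):
            size = params_billions * bpw / 8
            print(f"{name}: ~{size:.1f} GB", "fits" if size <= budget_gb else "too big")

    walk_ladder(13, budget_gb=10)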
On a Ryzen 5600X, 7B and 13B run quite fast. Off the top of my head, pure CPU performance is roughly 25% slower than with an NVIDIA GPU of some kind; I don't remember the exact numbers, but generation speed only starts to get annoying for 33B+ models.
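If you want to reproduce the CPU numbers yourself, the llama-cpp-python bindings get you there in a few lines. The GGUF path is a placeholder, and n_threads should match your physical core count (6 on a 5600X):

    from llama_cpp import Llama

    # Point model_path at whichever quantized GGUF fits your RAM.
    llm = Llama(model_path="models/llama-2-13b-chat.Q5_K_M.gguf",
                n_ctx=2048, n_threads=6)

    out = llm("Q: Name three uses for a 13B model on CPU. A:", max_tokens=64)
    print(out["choices"][0]["text"])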