An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I’m still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement—those speeds would make even the most patient person throw their hands up.
And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.
On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.
"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
Oh, the repetition issue only shows up on the non-dynamic quants :) If you use the dynamically quantized 1.58-bit model, the repetition issue fully disappears!
min_p = 0.05 was a way I found to counteract the 1.58-bit model generating the occasional incorrect token, which happens around once per 8,000 tokens!
I think most model creators set the temperature so high (0.6-0.7) in their usage examples simply because that's what a lot of the client apps use. IMO this is WAY too high unless you're doing creative writing.
Generally I set temp to 0-0.4 at absolute most.
min_p actually needs a little temperature to work effectively, so with min_p I almost always use 0.2.
Yeah, lower temp is also good :) Tbh it's all trial and error - I found temp=1.5, min_p=0.1 to be very useful for pass@k-type workloads, i.e. calling the LLM multiple times and aggregating.
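Rough shape of what I mean by that, with `generate` as a hypothetical stand-in for whatever client or API you call:

```python
from collections import Counter

def pass_at_k_answer(prompt, generate, k=8):
    # Sample k diverse completions (high temp for diversity, min_p to keep
    # out garbage tokens), then majority-vote on the final answers.
    answers = [generate(prompt, temperature=1.5, min_p=0.1) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```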
temp=0 is also good for single outputs. For classification tasks, it's better to actually inspect the logits.
But my go-to setting is always min_p of at least 0.01 or 0.05! It vastly suppresses rare, incorrect random tokens from being generated, and it helps massively!
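If anyone's wondering what min_p actually does under the hood, here's a minimal numpy sketch (not llama.cpp's implementation; real samplers differ in how they order temperature and the filters, but the core idea is the same):

```python
import numpy as np

def sample_min_p(logits, temperature=0.2, min_p=0.05, rng=np.random.default_rng()):
    # Temperature-scaled softmax over the raw logits.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # min_p: drop any token whose probability is below
    # min_p * (probability of the most likely token).
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```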
> That said, I’m still skeptical about how practical this really is for most people.
I've been running Open WebUI for months now for myself and some friends as a front-end to one of the API providers (DeepInfra in my case, but there are many others, see https://artificialanalysis.ai/).
Having the 1.58-bit version is very practical for me. I'm very much looking forward to the API provider adding this model to their system. They also added a quantized Llama turbo a few months back, so I have good hopes.
>Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup?
The AMD Strix Halo APU will have quad-channel memory and will launch soon, so expect these kinds of setups to be available for much less. Apple is charging an arm and a leg for memory upgrades; hopefully we get competition soon. From what I saw at CES, OEMs are paying attention to this use case as well - hopefully not following suit on RAM markups.
Keep in mind the Strix Halo APU has a 256-bit wide memory bus, while the Mac Ultra has a 1024-bit wide memory bus.
Here's hoping the Nvidia Digits (GB10 chip) has a 512-bit or 1024-bit wide interface; otherwise the Strix Halo will be the best you can do if you don't get a Mac Ultra.
I mean, it remains to be seen whether it will be compute- or bandwidth-bound; I'm sure the Mac Ultra will also have double or triple the compute.
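For intuition on the bandwidth-bound case, a back-of-the-envelope sketch (the bandwidth figures and the ~37B active-parameter / ~2-bit numbers are rough assumptions, not official specs):

```python
# Decode is roughly bandwidth-bound: each generated token streams the active
# weights through memory once, so tokens/s <= bandwidth / bytes_read_per_token.
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, bytes_per_param):
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: ~256 GB/s for a 256-bit LPDDR5X bus (Strix Halo class),
#          ~800 GB/s for a 1024-bit unified-memory bus (Mac Ultra class).
for name, bw in [("256-bit bus", 256), ("1024-bit bus", 800)]:
    # e.g. an R1-style MoE with ~37B active params at ~2 bits (0.25 bytes) per weight
    print(f"{name}: ~{decode_ceiling_tok_s(bw, 37, 0.25):.0f} tok/s ceiling")
```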
But in either case it's going to do much better than currently available CPUs with easily upgradeable RAM. I would not be surprised to see 128GB configurations for around $3k (going off the ASUS G13's announced pricing of around $2k for the 32GB version, and them saying it will go up to 128GB).
At that point, sure, it might not compete with the Mac, but it's at a much more acceptable price point. It won't be a device you get just for the AI, but a mobile workstation that you can also run some local models on, for normal money. We'll need to wait and see. I know I'm not buying anything from ASUS either way.
Highly depends on how accessible AMD makes these boards; if a lot of OEMs pick it up there will be good deals for sure. DDR5 prices are nowhere near Apple's markups.
At my work, we self-host some models and have found that for anything remotely similar to RAG, or for very specific use cases, the quantized models have proven to be more than sufficient. This helps us keep them running on smaller infra and at generally lower cost.
Personally I've noticed major changes in performance between different quantisations of the same model.
Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.
This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize many components, i.e. attention is left at 4 or 6 bits, and just the MoE parts are 1.58-bit (-1, 0, 1).
There are distilled versions like Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B, but those are distilled - if you want to run the original R1, then the quants are currently the only way.
But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
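To give a feel for what "selectively" means, here's a toy policy sketch (illustrative tensor names, not Unsloth's actual conversion code, which lives in the llama.cpp-based tooling):

```python
def pick_bits(tensor_name: str) -> float:
    # Sensitive pieces (attention, down_proj) stay at higher precision;
    # the bulk of the MoE expert weights go to ~1.58 bits, i.e. ternary
    # values in {-1, 0, 1} (log2(3) ≈ 1.58 bits per weight).
    if "attn" in tensor_name or "down_proj" in tensor_name:
        return 4.0        # 4- or 6-bit for the sensitive layers
    if "experts" in tensor_name:
        return 1.58       # ternary MoE expert weights
    return 8.0            # everything else stays near full precision

for name in ["layers.0.self_attn.q_proj",
             "layers.3.mlp.experts.17.gate_proj",
             "layers.3.mlp.experts.17.down_proj"]:
    print(f"{name} -> {pick_bits(name)} bits")
```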
I just ran it on 48GB (2x 3090) + overflow into CPU RAM and it runs at around 4 tok/s (only a small 8k context, though), which, while absolutely not something I'd personally use daily, is actually usable.
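If anyone wants to try the same GPU + CPU split, the rough shape of it with the llama-cpp-python bindings (the filename and layer count are placeholders; bump n_gpu_layers until your VRAM is full):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder: point at the first GGUF shard
    n_gpu_layers=30,   # layers that fit across the 2x 3090s; the rest spill to system RAM
    n_ctx=8192,        # the small context I mentioned
)

out = llm("Explain what a mixture-of-experts layer does, briefly.", max_tokens=128)
print(out["choices"][0]["text"])
```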
Not everyone needs the largest model. There are variations of R1 with fewer parameters that can easily run on consumer hardware. With an 80% size reduction you could run a 70B model at 8-bit on an RTX 3090.
Other than that, if you really need the big one you can get six 3090s and you're good to go. It's not cheap, but you're running a ChatGPT-equivalent model from your basement. A year ago this was a wet dream for most enthusiasts.
There’s a huge difference, both in capabilities and in meaning, between “variations of R1” and “R1 distills”. Ollama is intentionally misleading people on this, but the distills are much, much worse.
I ran whatever version Ollama downloaded on a 3070ti (laptop version). It's reasonably fast. Generative stuff can get weird if you do prompts like "in the style of" or "a new episode of" because it doesn't seem to have much pop culture in its training data. It knows the Stargate movie, for example, and seems to have the IMDB info for the series, but goes absolutely ham trying to summarize the series.
This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:
>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries
online."
An excerpt from the generated summary:
>> "Set in the 23rd century during a Z-Corp invasion, the series features action sequences, strategic thinking, and humor. It explores themes of international espionage, space warfare, and humanity's role in the cosmos. The show incorporates musical numbers and catchy theme songs for an engaging viewing experience. The plot involves investigating alien warships and their secret base on Kessari planet while addressing personal conflicts and philosophical questions about
space."
"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.
Ollama has been deliberately misrepresenting R1 distill models as "R1" for marketing purposes. A lot of "AI" influencers on social media are unabashedly doing the same. Ollama's default "R1" model is a 4-bit RTN quantized 7B model, which is nowhere close to the real R1 (a 671B parameter fp8 MoE).
No, they are not; they intentionally removed every reference to this not being R1 from the CLI and changed the names from the ones both DeepSeek and Hugging Face used.
> DeepSeek's first-generation of reasoning models with comparable performance to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
Well I guess if you are in the Enterprise Java naming model you would expect something like "VisitorModelUtilsListGetterAdapterInterceptorMessageManagerDrivenObserverPool"
If you look at their API docs you will see:
model: name of the model to push in the form of <namespace>/<model>:<tag>
I don't think there is any reason to jump to the conclusion that it is some type of conspiracy here; it's just naming things based on an API that probably didn't have distillation in mind when it was created.
Yeah, they're so clear in fact that they call the distilled models "R1" in the url and everywhere on the page[1], instead of using the "DeepSeek-R1-Distill-" prefix, as DeepSeek themselves do[2].
It's fairly clear that R1-Llama or R1-Qwen is a distill, and they're all coming directly from DeepSeek.
As an aside, at least the larger distilled models (I'm mostly running r1-llama-distill-70b) are definitely not the same thing as the base llama/qwen models. I'm getting better results locally, admittedly with the slower inference time as it does the whole "<think>" section.
Surprisingly, the content in the <think> section is actually quite useful on its own. If you're using the model to spitball or brainstorm, getting to see it do that process is just flat-out useful - sometimes more so than the actual answer it finally produces.
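If you want to keep that brainstorming around, a tiny helper along these lines works (assuming the model emits literal <think>...</think> tags, which the R1 family does):

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    # Separate the model's <think>...</think> scratchpad from its final answer.
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>Compare approach A vs B...</think>Go with B.")
print(reasoning)  # the brainstorming, often useful on its own
print(answer)     # the final answer
```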
It should take the same amount of memory as the one you currently have.
In my experience the Llama version performs much better at adhering to the prompt, understanding data in multiple languages, and going in-depth in its responses.
It's a model called Qwen, trained by Alibaba, into which the DeepSeek team has "distilled" knowledge from their own (roughly 100x bigger) model.
Think of it as forcing a junior Qwen to listen in while the smarter, PhD-level model was asked thousands of tough problems. It will acquire some of that knowledge and learn a lot of the reasoning process.
It cannot become exactly as smart, for the same reason a dog can learn lots of tricks from a human but not become human-level itself: it doesn't have enough neurons/capacity. Here, Qwen is a 7B model, so it can't cram as much data into 7 billion parameters as you can cram into 671 billion. It can only learn roughly 1% as much, BUT the distillation process is cleverly built and lets it focus on the "right" 1%.
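For the technically curious, the classic "listen in on the teacher" recipe looks roughly like this soft-label loss. Note this is purely an illustration of the general idea; DeepSeek's actual distills were fine-tuned on reasoning traces generated by R1 rather than on raw teacher logits.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Push the student's next-token distribution toward the teacher's,
    # softened by temperature T so the small probabilities ("dark knowledge")
    # still carry a training signal.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (T * T)
```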
Then this now-smarter Qwen is quantized. This means that we take its parameters (16-bit floats, super precise numbers) and truncate them to make them use less memory space. This also makes it less precise.
Think of it as taking a super high resolution movie picture and compressing it into a small GIF. You lose some information, but the gist of it is preserved.
As a result of both of these transformations, you get something that can run on your local machine — but is a bit dumber than the original — because it's about 400 times smaller than the real deal.
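To make the "truncating" concrete, here's a toy round-trip (real formats like GGUF's K-quants work blockwise and are cleverer about outliers, but the gist is the same):

```python
import numpy as np

def fake_quantize(weights: np.ndarray, bits: int = 4) -> np.ndarray:
    # Snap each weight to a small symmetric integer grid (absmax scaling),
    # then map back to floats: same shape, far less precision per value.
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

w = np.random.randn(8).astype(np.float32)
print(w)                     # original 32-bit values
print(fake_quantize(w, 4))   # visibly coarser approximation
```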
"Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud."
And I think they, the DeepSeek team, fine-tuned Qwen 7B on DeepSeek R1's outputs. That is how I understood it.
Which apparently makes it quite good for a 7B model. But, again, if I understood it correctly, it is still just Qwen, without the reasoning of DeepSeek.
In my application, code generation, the distilled DeepSeek models (7B to 70B) perform poorly. They imitate the reasoning of the r1 model, but their conclusions are not correct.
The real r1 model is great, better than o1, but the distilled models are not even as good as the base models that they were distilled from.
It is hilariously bad at writing erotica when I've used jailbreaks on it. Its knowledge is the equivalent of a 1980s college kid with no access to pornography who watched an R-rated movie once.
That's like trying to assemble an Ikea bookshelf with a bulldozer. All that extra power is doing nothing for the task you're asking of it, and there are plenty of lightweight alternatives.
I have MLC Chat on my old Note 9 phone. It is actually still a great phone, but it has 5GB of RAM. Running an on-device model is the first and only use case where that RAM actually matters.
And it has a headphone jack, OK? I just hate Bluetooth earbuds. And yeah, it isn't a problem, but I digress.
When I run a 2.5B model, I get respectable output. It takes a minute or two to process the context, then output begins at somewhere on the order of 4 to 10 tokens per second.
So, I just make a query and give it a few and I have my response.
Here is how I see it:
That little model (which is Gemma 2 2B, sorry) knows a lot of stuff. It has knowledge I don't, and it gives it to me in a reasonable, though predictable, way. Answers always have a certain "teacher reminding the student how it all goes" tone.
I don't care. Better is nice, but if I were stuck somewhere with no network, being able to query that model is amazing!
First aid, how to make fires, materials and their uses, fixing stuff, theories of operation, what things mean, and more - it's all in that thing, ready for me to take advantage of.
I consider what I have fast. And it will get one or two orders of magnitude faster over the next few years too.
I did it on a lark (ask the model what that means) and was surprised to see I gained a nice tool.
> First aid, how to make fires, materials and uses
This scares me more than it should...
Please do not trust an AI in actual life-and-death situations... Sure, if it is literally your only option, but this implies you have a device on you that could make a phone call to an emergency number, where a real human with real training and actually correct knowledge can assist you.
Even as an avid hiker, the number of times I've been out of cell service is minuscule, and I absolutely refresh my knowledge of first aid regularly, along with any potential threats, before a hike somewhere new.
I fired the model up at 8:39, and it was ready to chat just a bit before 8:40, at which point I input the following prompt:
Please define "lark" for me, as in "I did it on a lark."
Output began a bit before 8:41, and was complete before 8:43:
---Begin Model Output
"On a lark" means without much planning or reason, impulsively or unexpectedly.
Here's how it works:
"I did it on a lark." This means the person acted without a specific plan or reason.
"A lark" refers to the lark bird, a small songbird known for its playful nature.
So, "on a lark" suggests something spontaneous, casual, and often with a touch of playful whimsy.
Here are some other ways to understand it:
Without much thought: The person was likely to do it quickly and without much consideration.
Based on a whim: It's something they decided to do without a solid plan.
---End Model Output
You know that's not the point. You can't even ask a dictionary app a question. You more or less just input a word and it'll give you what it has.
I can ask this model some pretty intense things, I get meaningful output back, it doesn't take very long, and I don't need an internet connection. Like I said, it's spiffy.
I have used it, and will use it again, sans network connection, and be happy I had the option.
Better will come. That is just fine. I have something very useful today, right now.
Layman here, but I am hopeful for 1.58-bit plus custom silicon to be the Holy Grail. I suppose I am setting high expectations on Apple to integrate it into their next "A" chip.
I do want a 192GB Mac Ultra, but I'm hoping the Nvidia Digits achieves something similar at $3,000. Sadly there are no specifications or benchmarks yet, so tokens/sec is just a guess at this point.