An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I’m still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement—those speeds would make even the most patient person throw their hands up.
And then there’s the whole repetition issue. Infinite loops with "Pygame’s Pygame’s Pygame’s" kind of defeats the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn’t that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.
On the flip side, I love that they’re making this accessible on Hugging Face... and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for MoEs and leaving sensitive layers like down_proj at higher precision—super clever. Feels like they’re squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can’t afford OpenAI-scale hardware.
"accessible" still comes with an asterisk. Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup? For that price, I’d rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it’d be a power hog). Cool tech—no doubt—but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
Oh, the repetition issue only shows up on the non-dynamic quants :) If you use the dynamically quantized 1.58-bit model, the repetition issue fully disappears!
min_p = 0.05 was a way I found to counteract the 1.58-bit model generating the occasional incorrect token, which happens around once per 8,000 tokens!
I think most model creators set the temperature so high (0.6-0.7) in their usage examples simply because that's what a lot of the client apps use. IMO this is WAY too high unless you're doing creative writing.
Generally I set temp to 0-0.4 at absolute most.
min_p actually needs a little temperature to work effectively, so with min_p I almost always use 0.2.
Yeah, lower temp is also good :) Tbh it's all trial and error - I found temp=1.5, min_p=0.1 to be very useful for pass@k-type workloads, i.e. calling the LLM multiple times and aggregating.
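Rough shape of what I mean by that, with `generate` as a hypothetical stand-in for whatever client or API you call:

```python
from collections import Counter

def pass_at_k_answer(prompt, generate, k=8):
    # Sample k diverse completions (high temp for diversity, min_p to keep
    # out garbage tokens), then majority-vote on the final answers.
    answers = [generate(prompt, temperature=1.5, min_p=0.1) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```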
temp=0 is also good for single outputs. For classification tasks, it's better to actually inspect the logits.
But my go-to setting is always min_p of at least 0.01 or 0.05! It vastly suppresses rare, incorrect random tokens from being generated, and it helps massively!
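If anyone's wondering what min_p actually does under the hood, here's a minimal numpy sketch (not llama.cpp's implementation; real samplers differ in how they order temperature and the filters, but the core idea is the same):

```python
import numpy as np

def sample_min_p(logits, temperature=0.2, min_p=0.05, rng=np.random.default_rng()):
    # Temperature-scaled softmax over the raw logits.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # min_p: drop any token whose probability is below
    # min_p * (probability of the most likely token).
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```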
> That said, I’m still skeptical about how practical this really is for most people.
I've been running Open WebUI for months now for myself and some friends as a front-end to one of the API providers (DeepInfra in my case, but there are many others, see https://artificialanalysis.ai/).
Having the 1.58-bit version is very practical for me. I'm very much looking forward to the API provider adding this model to their system. They also added a quantized Llama turbo a few months back, so I have good hopes.
>Like, I get that shared memory architectures like a 192GB Mac Ultra are a big deal, but who’s dropping $6,000+ on that setup?
The AMD Strix Halo APU will have quad-channel memory and will launch soon, so expect these kinds of setups to be available for much less. Apple is charging an arm and a leg for memory upgrades; hopefully we get competition soon. From what I saw at CES, OEMs are paying attention to this use case as well - hopefully not following suit on RAM markups.
Keep in mind the Strix Halo APU has a 256-bit wide memory bus, while the Mac Ultra has a 1024-bit wide memory bus.
Here's hoping the Nvidia Digits (GB10 chip) has a 512-bit or 1024-bit wide interface; otherwise the Strix Halo will be the best you can do if you don't get a Mac Ultra.
I mean, it remains to be seen whether it will be compute- or bandwidth-bound; I'm sure the Mac Ultra will also have double or triple the compute.
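For intuition on the bandwidth-bound case, a back-of-the-envelope sketch (the bandwidth figures and the ~37B active-parameter / ~2-bit numbers are rough assumptions, not official specs):

```python
# Decode is roughly bandwidth-bound: each generated token streams the active
# weights through memory once, so tokens/s <= bandwidth / bytes_read_per_token.
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion, bytes_per_param):
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: ~256 GB/s for a 256-bit LPDDR5X bus (Strix Halo class),
#          ~800 GB/s for a 1024-bit unified-memory bus (Mac Ultra class).
for name, bw in [("256-bit bus", 256), ("1024-bit bus", 800)]:
    # e.g. an R1-style MoE with ~37B active params at ~2 bits (0.25 bytes) per weight
    print(f"{name}: ~{decode_ceiling_tok_s(bw, 37, 0.25):.0f} tok/s ceiling")
```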
But in either case it's going to do much better than currently available CPUs with easily upgradeable RAM. I would not be surprised to see 128GB configurations for around $3k (going off the ASUS G13's announced pricing of around $2k for the 32GB version, and them saying it will go up to 128GB).
At that point, sure, it might not compete with the Mac, but it's at a much more acceptable price point. It won't be a device you get just for the AI, but a mobile workstation that you can also run some local models on, for normal money. We'll need to wait and see. I know I'm not buying anything from ASUS either way.
Highly depends on how accessible AMD makes these boards; if a lot of OEMs pick it up there will be good deals for sure. DDR5 prices are nowhere near Apple's markups.
At my work, we self-host some models and have found that for anything remotely similar to RAG, or for very specific use cases, the quantized models have proven to be more than sufficient. This helps us keep them running on smaller infra and at generally lower cost.
Personally I've noticed major changes in performance between different quantisations of the same model.
Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.
This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize many components, i.e. attention is left at 4 or 6 bits, and just the MoE parts are 1.58-bit (-1, 0, 1).
There are distilled versions like Qwen 1.5B, 7B, 14B, 32B and Llama 8B, 70B, but those are distilled - if you want to run the original R1, then the quants are currently the only way.
But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
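To give a feel for what "selectively" means, here's a toy policy sketch (illustrative tensor names, not Unsloth's actual conversion code, which lives in the llama.cpp-based tooling):

```python
def pick_bits(tensor_name: str) -> float:
    # Sensitive pieces (attention, down_proj) stay at higher precision;
    # the bulk of the MoE expert weights go to ~1.58 bits, i.e. ternary
    # values in {-1, 0, 1} (log2(3) ≈ 1.58 bits per weight).
    if "attn" in tensor_name or "down_proj" in tensor_name:
        return 4.0        # 4- or 6-bit for the sensitive layers
    if "experts" in tensor_name:
        return 1.58       # ternary MoE expert weights
    return 8.0            # everything else stays near full precision

for name in ["layers.0.self_attn.q_proj",
             "layers.3.mlp.experts.17.gate_proj",
             "layers.3.mlp.experts.17.down_proj"]:
    print(f"{name} -> {pick_bits(name)} bits")
```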
I just ran it on 48GB (2x 3090) + overflow into CPU RAM and it runs at around 4 tok/s (only a small 8k context, though), which, while absolutely not something I'd personally use daily, is actually usable.
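If anyone wants to try the same GPU + CPU split, the rough shape of it with the llama-cpp-python bindings (the filename and layer count are placeholders; bump n_gpu_layers until your VRAM is full):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder: point at the first GGUF shard
    n_gpu_layers=30,   # layers that fit across the 2x 3090s; the rest spill to system RAM
    n_ctx=8192,        # the small context I mentioned
)

out = llm("Explain what a mixture-of-experts layer does, briefly.", max_tokens=128)
print(out["choices"][0]["text"])
```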
Not everyone needs the largest model. There are variations of R1 with fewer parameters that can easily run on consumer hardware. With an 80% size reduction you could run a 70B model at 8-bit on an RTX 3090.
Other than that, if you really need the big one you can get six 3090s and you're good to go. It's not cheap, but you're running a ChatGPT-equivalent model from your basement. A year ago this was a wet dream for most enthusiasts.
There’s a huge difference, both in capabilities and in meaning, between “variations of R1” and “R1 distills”. Ollama is intentionally misleading people on this, but the distills are much, much worse.
I ran whatever version Ollama downloaded on a 3070ti (laptop version). It's reasonably fast. Generative stuff can get weird if you do prompts like "in the style of" or "a new episode of" because it doesn't seem to have much pop culture in its training data. It knows the Stargate movie, for example, and seems to have the IMDB info for the series, but goes absolutely ham trying to summarize the series.
This line in the stuff inside the <think> section suggests it's also been trained on YouTube clips:
>> "I'm not entirely sure if I got all the details right, but this is what I remember from watching clips and summaries
online."
An excerpt from the generated summary:
>> "Set in the 23rd century during a Z-Corp invasion, the series features action sequences, strategic thinking, and humor. It explores themes of international espionage, space warfare, and humanity's role in the cosmos. The show incorporates musical numbers and catchy theme songs for an engaging viewing experience. The plot involves investigating alien warships and their secret base on Kessari planet while addressing personal conflicts and philosophical questions about
space."
"It explores themes of international espionage, space warfare, and humanity's role in the cosmos" is the closest to correct line in the whole output.
Ollama has been deliberately misrepresenting R1 distill models as "R1" for marketing purposes. A lot of "AI" influencers on social media are unabashedly doing the same. Ollama's default "R1" model is a 4-bit RTN quantized 7B model, which is nowhere close to the real R1 (a 671B parameter fp8 MoE).
No, they are not; they intentionally removed every reference to this not being R1 from the CLI and changed the names from the ones both DeepSeek and Hugging Face used.
> DeepSeek's first-generation of reasoning models with comparable performance to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
Well I guess if you are in the Enterprise Java naming model you would expect something like "VisitorModelUtilsListGetterAdapterInterceptorMessageManagerDrivenObserverPool"
If you look at their API docs you will see:
model: name of the model to push in the form of <namespace>/<model>:<tag>
I don't think there is any reason to jump to the conclusion that it is some type of conspiracy here; it's just naming things based on an API that probably didn't have distillation in mind when it was created.
Yeah, they're so clear in fact that they call the distilled models "R1" in the url and everywhere on the page[1], instead of using the "DeepSeek-R1-Distill-" prefix, as DeepSeek themselves do[2].
It's fairly clear that R1-Llama or R1-Qwen is a distill, and they're all coming directly from DeepSeek.
As an aside, at least the larger distilled models (I'm mostly running r1-llama-distill-70b) are definitely not the same thing as the base llama/qwen models. I'm getting better results locally, admittedly with the slower inference time as it does the whole "<think>" section.
Surprisingly, the content in the <think> section is actually quite useful on its own. If you're using the model to spitball or brainstorm, getting to see it do that process is just flat-out useful - sometimes more so than the actual answer it finally produces.
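If you want to keep that brainstorming around, a tiny helper along these lines works (assuming the model emits literal <think>...</think> tags, which the R1 family does):

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    # Separate the model's <think>...</think> scratchpad from its final answer.
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>Compare approach A vs B...</think>Go with B.")
print(reasoning)  # the brainstorming, often useful on its own
print(answer)     # the final answer
```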
It should take the same amount of memory as the one you currently have.
In my experience the Llama version performs much better at adhering to the prompt, understanding data in multiple languages, and going in-depth in its responses.
It's a model called Qwen, trained by Alibaba, into which the DeepSeek team has "distilled" knowledge from their own (roughly 100x bigger) model.
Think of it as forcing a junior Qwen to listen in while the smarter, PhD-level model was asked thousands of tough problems. It will acquire some of that knowledge and learn a lot of the reasoning process.
It cannot become exactly as smart, for the same reason a dog can learn lots of tricks from a human but not become human-level itself: it doesn't have enough neurons/capacity. Here, Qwen is a 7B model, so it can't cram as much data into 7 billion parameters as you can cram into 671 billion. It can only learn roughly 1% as much, BUT the distillation process is cleverly built and lets it focus on the "right" 1%.
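For the technically curious, the classic "listen in on the teacher" recipe looks roughly like this soft-label loss. Note this is purely an illustration of the general idea; DeepSeek's actual distills were fine-tuned on reasoning traces generated by R1 rather than on raw teacher logits.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Push the student's next-token distribution toward the teacher's,
    # softened by temperature T so the small probabilities ("dark knowledge")
    # still carry a training signal.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (T * T)
```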
Then this now-smarter Qwen is quantized. This means that we take its parameters (16-bit floats, super precise numbers) and truncate them to make them use less memory space. This also makes it less precise.
Think of it as taking a super high resolution movie picture and compressing it into a small GIF. You lose some information, but the gist of it is preserved.
As a result of both of these transformations, you get something that can run on your local machine — but is a bit dumber than the original — because it's about 400 times smaller than the real deal.
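To make the "truncating" concrete, here's a toy round-trip (real formats like GGUF's K-quants work blockwise and are cleverer about outliers, but the gist is the same):

```python
import numpy as np

def fake_quantize(weights: np.ndarray, bits: int = 4) -> np.ndarray:
    # Snap each weight to a small symmetric integer grid (absmax scaling),
    # then map back to floats: same shape, far less precision per value.
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

w = np.random.randn(8).astype(np.float32)
print(w)                     # original 32-bit values
print(fake_quantize(w, 4))   # visibly coarser approximation
```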
"Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud."
And I think they, the DeepSeek team, fine-tuned Qwen 7B on DeepSeek R1's outputs. That is how I understood it.
Which apparently makes it quite good for a 7B model. But, again, if I understood it correctly, it is still just Qwen, without the reasoning of DeepSeek.
In my application, code generation, the distilled DeepSeek models (7B to 70B) perform poorly. They imitate the reasoning of the r1 model, but their conclusions are not correct.
The real r1 model is great, better than o1, but the distilled models are not even as good as the base models that they were distilled from.
It is hilariously bad at writing erotica when I've used jailbreaks on it. Its knowledge is the equivalent of a 1980s college kid with no access to pornography who watched an R-rated movie once.
That's like trying to assemble an Ikea bookshelf with a bulldozer. All that extra power is doing nothing for the task you're asking of it, and there are plenty of lightweight alternatives.
I have MLC Chat on my old Note 9 phone. It is actually still a great phone, but it has 5GB of RAM. Running an on-device model is the first and only use case where that RAM actually matters.
And it has a headphone jack, OK? I just hate Bluetooth earbuds. And yeah, it isn't a problem, but I digress.
When I run a 2.5B model, I get respectable output. It takes a minute or two to process the context, then output begins at somewhere on the order of 4 to 10 tokens per second.
So, I just make a query and give it a few and I have my response.
Here is how I see it:
That little model (which is Gemma 2 2B, sorry) knows a lot of stuff. It has knowledge I don't, and it gives it to me in a reasonable, though predictable, way. Answers always have a certain "teacher reminding the student how it all goes" tone.
I don't care. Better is nice, but if I were stuck somewhere with no network, being able to query that model is amazing!
First aid, how to make fires, materials and their uses, fixing stuff, theories of operation, what things mean, and more - it's all in that thing, ready for me to take advantage of.
I consider what I have fast. And it will get one or two orders of magnitude faster over the next few years too.
I did it on a lark (ask the model what that means) and was surprised to see I gained a nice tool.
> First aid, how to make fires, materials and uses
This scares me more than it should...
Please do not trust an AI in actual life-and-death situations... Sure, if it is literally your only option, but this implies you have a device on you that could make a phone call to an emergency number, where a real human with real training and actually correct knowledge can assist you.
Even as an avid hiker, the number of times I've been out of cell service is minuscule, and I absolutely refresh my knowledge of first aid regularly, along with any potential threats, before a hike somewhere new.
I fired the model up at 8:39, and it was ready to chat just a bit before 8:40, at which point I input the following prompt:
Please define "lark" for me, as in "I did it on a lark."
Output began a bit before 8:41, and was complete before 8:43:
---Begin Model Output
"On a lark" means without much planning or reason, impulsively or unexpectedly.
Here's how it works:
"I did it on a lark." This means the person acted without a specific plan or reason.
"A lark" refers to the lark bird, a small songbird known for its playful nature.
So, "on a lark" suggests something spontaneous, casual, and often with a touch of playful whimsy.
Here are some other ways to understand it:
Without much thought: The person was likely to do it quickly and without much consideration.
Based on a whim: It's something they decided to do without a solid plan.
---End Model Output
You know that's not the point. You can't even ask a dictionary app a question. You more or less just input a word and it'll give you what it has.
I can ask this model some pretty intense things, I get meaningful output back, it doesn't take very long, and I don't need an internet connection. Like I said, it's spiffy.
I have used it, and will use it again, sans network connection, and be happy I had the option.
Better will come. That is just fine. I have something very useful today, right now.
Layman here, but I am hopeful for 1.58-bit plus custom silicon to be the Holy Grail. I suppose I am setting high expectations on Apple to integrate it into their next "A" chip.
I do want a 192GB Mac Ultra, but I'm hoping the Nvidia Digits achieves something similar at $3,000. Sadly there are no specifications or benchmarks yet, so tokens/sec is just a guess at this point.