Codestral Mamba (mistral.ai)
485 points by tosh 56 days ago | 138 comments



What are the steps required to get this running in VS Code?

If they had linked to the instructions in their post (or better yet a link to a one click install of a VS Code Extension), it would help a lot with adoption.

(BTW, I consider it malpractice that they are at the top of Hacker News with a model that is of great interest to a large portion of the users here, and they do not have a monetizable call to action on the featured page.)


If you can run this using ollama, then you should be able to use https://www.continue.dev/ with both IntelliJ and VSCode. Haven’t tried this model yet - but overall this plugin works well.


They say no llama.cpp support yet, so no ollama yet (which uses llama.cpp)


Correct. The only back-end that Ollama uses is llama.cpp, and llama.cpp does not yet have Mamba2 support. The issues to track Mamba2 and Codestral Mamba support are here:

https://github.com/ggerganov/llama.cpp/issues/8519

https://github.com/ggerganov/llama.cpp/issues/7727

Mamba support was added in March of this year:

https://github.com/ggerganov/llama.cpp/pull/5328

I have not yet seen a PR to address Mamba2.



They meant that there is no support for Codestral Mamba for llama.cpp yet.


Unrelated: all my devices freeze when accessing this page (desktop Firefox and Chrome, mobile Firefox and Brave). Is this the best alternative for code AI helpers in VS Code besides GitHub Copilot and Google Gemini?


I've been using it for a few months (with Starcoder 2 for code, and GPT-4o for chat). I find the code completion actually better than GitHub Copilot.

My main complaint is that the chat sometimes fails to correctly render some GPT-4o output (e.g. LaTeX expressions), but it's mostly fixed with a custom system prompt. It also significantly reduces the battery life of my MacBook M1, but that's expected.


I'm quite happy with Cody from Sourcegraph https://marketplace.visualstudio.com/items?itemName=sourcegr...


"All you need is users" doesn't seem optimal IMHO, Stability.ai providing an object lesson in that.

They just released weights, and being a for profit, need to optimize for making money, not eyeballs. It seems wise to guide people to the API offering.


On top of Hacker News (the target demographic for coders) without an effective monetizable call to action? What a missed opportunity.

GitHub Copilot makes $100M+/year, if not way, way more.

Having a VS Code extension for Mistral would be a revenue stream if it were one-click and better or cheaper than GitHub Copilot. It is malpractice in my mind not to be doing this if you are investing in creating coding models.


How the hell does Copilot make $100M/yr? That seems an order of magnitude higher than I would expect at the high end.


I was thinking the opposite... remember there are enterprise subscriptions and multi-million-dollar contracts with single companies.


if we’re talking individual subscriptions that’s ~1M paying subscribers. honestly that number would not totally shock me?

plus they’ve got some kinda enterprise/team offering; assuming they charge extra there, I could easily see $100M ARR

but that’s pure conjecture, and generous at that; I don’t think we have any hard numbers


???


yeah exactly only $100M/yr? barely covers expenses


I see, that makes sense: make an extension and charge for it.

I assumed they meant free and local. It doesn't seem rational to make this one paid: it's significantly smaller than their better model, and even more so than Copilot's.


But they also signal competence in the space, which means M&A. Or big nation states could, in future, hire them to produce country models once the space matures, as was Emad's vision.


Did Emad's vision end up manifesting? E.g., did a nation-state end up paying Stability for a country model?

Would it help signal competency? They're a small team focused on making models, not VS Code extensions.

Would they do M&A? The founding team is ex-Googlers and has attracted significant attention in the MBA world by being an EU champion.


What does a "country model" mean? Optimized for that country's specific language, or with state propaganda or something else?


If you believe LLMs are going to end up built into everything and doing everything, from moderating social media to writing novels and history books, making such a model will be the most political thing that has ever happened.

If your country believes guns=bad nipples=good war=hell but you get your novels and history books written by an LLM trained by people who believe guns=good nipples=bad war=heroic it would be naive to expect the output to reflect your values and not theirs.

Even close allies of the US would be nervous to have such power in the hands of American multinational corporations alone - so the French state could be very eager for Mistral to produce a competitive product.


More or less; it was about as serious as your median Elon product tweet the last decade, or median coin nonsense.

Half-baked idea that obviously the models would need to be tuned for different languages / for specific knowledge, therefore countries would pay to do that.

There were many ideas like that, none of them panned out, hence the defenestration. All love for the guy, he did a very, very good thing. It's just meaningless to invoke it here: not only is it completely off-topic (if anything, that's already the play as the EU champion), but the Stability gentleman was just thinking out loud, nothing more.


I feel like local models could be an amazing coding experience because you could disconnect from the internet. Usually I need to open ChatGPT or Google every so often to solve some issue or generate some function, but this also introduces so many distractions. Imagine being able to turn off the internet completely and only have a chat assistant that runs locally. I fear, though, that it is just going to be a bit too slow at generating tokens on CPU not to be annoying.


I don't have a gut feel for how much difference the Mamba arch makes to inference speed, nor how much quantisation is likely to ruin things, but as a rough comparison Mistral-7B at 4 bits per param is very usable on CPU.

The issue with using any local models for code generation comes up with doing so in a professional context: you lose any infrastructure the provider might have for avoiding regurgitation of copyright code, so there's a legal risk there. That might not be a barrier in your context, but in my day-to-day it certainly is.
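
For a rough sense of what "very usable on CPU" looks like in practice, here's a minimal sketch using llama-cpp-python with a 4-bit GGUF (the file name, thread count, and prompt are placeholders, and this is for a regular Mistral-7B quant, not Codestral Mamba, which llama.cpp can't run yet anyway):

    # Minimal sketch, assuming llama-cpp-python is installed and a 4-bit GGUF
    # of a 7B model has already been downloaded (path below is a placeholder).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,    # context window
        n_threads=8,   # CPU threads; tune for your machine
    )

    out = llm.create_completion(
        prompt="Write a Python function that reverses a string.",
        max_tokens=128,
        temperature=0.2,
    )
    print(out["choices"][0]["text"])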


Looking through the Quickstart docs, they have an API that can generate code. However, I don't think they have a way to do "Day 2" code editing.

Also, it doesn't seem to have a freemium tier... you need to start paying even before trying it out?

"Our API is currently available through La Plateforme. You need to activate payments on your account to enable your API keys."


I signed up when codestral was first available and put my payment details in. Been using it daily since then with continue.dev but my usage dashboard shows 0 tokens, and so far have not been billed for anything... Definitely not clear anywhere, but it seems to be free for now? Or some sort of free limit that I am not hitting.


Through codestral.mistral.ai? It's free until August 1st: https://docs.mistral.ai/capabilities/code_generation/

>Monthly subscription based, free until 1st of August


Currently the best (most user-friendly) way to run models locally is to use Ollama with Continue.dev. This one is not available yet, though: https://github.com/ggerganov/llama.cpp/issues/8519


The website codegpt.co also has a plugin for both VS Code and IntelliJ. When the model becomes available in Ollama, you can connect the plugin in VS Code to a local Ollama instance.


Maybe not this model, but check out TabbyML for offline/self-hosted LLMs in VS Code.


Also looks like an older version of Codestral works well with TabbyML: https://tabby.tabbyml.com/blog/2024/07/09/tabby-codestral/

Thank you for sharing, this is almost exactly what I've been looking for, for ages!


I kinda just want something that can keep up with the original version of Copilot. It was so much better than the crap they’re pumping out now (keeps messing up syntax and only completing a few characters at a time).


Have you tried supermaven? (https://supermaven.com). I find it much better than copilot. Using it daily.


Supposedly they were training on feedback provided by the plugin itself but that approach doesn't make sense to me because:

- I don't remember the shortcuts most of the time.

- When I run completions I double take and realise they're wrong.

- I am not a good source of data.

All this information is being fed back into the model as positive feedback. So perhaps that's a reason for it to have gone downhill.

I recall it being amazing at coding back in the day, now I can't trust it.

Of course, it's anecdotal which is also problematic in itself but I have definitely noticed the issue where it will fail and stop autocompleting or provide completely irrelevant code.


It could also be that back in the day they were training with a bit more code than they should have been (eg private repos) and now the lawyers are more involved the training set is smaller/more sanitized.

Pure speculation of course.


Have you tried supermaven? It replaced copilot for me a couple of months ago.


I tried it; it uses GPT-4o, and the $10 sign-up credit disappeared in a few hours of intense coding. I'm not paying $500/mo for a fancy autocomplete. Manual instruct-style chat about code with Claude-Sonnet-3.5 is the best price/perf I've tried so far; through poe.com I use around 30k credits per day of coding out of the 1M monthly allotment, and I think it was $200/y. It's not available directly in my country. I've tried a bunch of local models too, but Claude is just next level and inference is very cheap.


Supermaven does not use GPT-4o for their autocompletions, they have their own model for that and for your paid subscription you get unlimited use of autocomplete.

There is a separate, optional feature called Supermaven Chat which is basically a wrapper around GPT-4o / Claude 3.5 where you can chat about fragments of your code, refactor them etc. This costs money, but you can also supply your own API key for OpenAI/Anthropic models. Or just use the web interface of any LLM if you just want to chat about your codebase.


On the pricing page it says $10/month, what you describe sounds like pay-as-you-go, are you sure it was this service you tried?


Yes, that's the same; in the "Your Account" section of the service there is a "Chat Credits" field, which was seeded with $10 initially, and I got it down to $0.11 in one evening. You are right about the pricing page, it does say it's $10/month, but I believe it's actually $10/month up to some amount of queries, because they obviously can't offer infinite inference for $10. Maybe I was "holding it wrong", sending too many complete files as part of the prompt so the token use and context window skyrocketed, but I absolutely used up my initial $10 in one evening, unlike with Claude...


From what I can tell inside the product, the public pricing seems deceptive.

It appears that for the $10/month you just get access to additional features (e.g. bigger context) + a budget of $5/month of credits. The credits possibly translate 1:1 to the usage costs of the underlying models you chose.

I asked it to add some missing documentation with the Claude 3.5 Sonnet model to a medium-sized Python file (2k lines) as a first test and that used up 13 cents of the credits. If it were working in a really productive way (where I'd also include more files as context), I'd probably burn through $20-50/day.

I'm wondering why they are not publicly showing some more transparent pricing. Yeah, this method will bump their signup numbers for investors, but it also absolutely wrecks their churn numbers, when people immediately cancel in the first hour.


Supermaven has a $10/month subscription which covers unlimited use of the in editor code autocomplete stuff.

It also has a pay as you go chat which is integrated into the IDE w/ hotkeys, however the above subscription gives you $5 free credits for it a month. This doesn't call their own model but stuff like gpt-4o which is presumably why they don't offer unlimited free usage.


Does anyone have a favorite FIM-capable model? I've been using codellama-13b through Ollama with a vim extension I wrote and it's okay but not amazing; I definitely get better code most of the time out of Gemma-27b, but no FIM (and for some reason codellama-34b has broken inference for me).


I use deepseek-coder-7b-instruct-v1.5 & DeepSeek-Coder-V2-Lite-Instruct when I want speed & codestral-22B-v0.1 when I want smartness.

All of those are FIM capable, but especially deepseek-v2-lite is very picky with its prompt template so make sure you use it correctly...

Depending on your hardware codestral-22B might be fast enough for everything, but for me it's a bit too slow...

If you can run it, deepseek v2 non-Lite is amazing, but it requires loads of VRAM.


IIRC the codestral FIM tokens aren't properly implemented in llama.cpp/ollama; what backend are you using to run them? I'd probably have to drop down to iq2_xxs or something for the full-fat deepseek, but I'll definitely look into codestral. I'm a big fan of mixtral, hopefully a MoE code model with FIM comes along soon.

EDIT: nvm, my mistake looks like it works fine https://github.com/ollama/ollama/issues/5403


Is the extension you wrote public?


No, but it's a super janky and simple hodgepodge of Stack Overflow and gemma:27b generated code, so I'll just put it in the comment here. You just need curl on your path and a vim that's compiled with some specific flag.

    " Collect up to a:n lines of context before and after the cursor.
    function! GetSurroundingLines(n)
        let l:current_line = line('.')
        let l:start_line = max([1, l:current_line - a:n])
        let l:end_line = min([line('$'), l:current_line + a:n])

        let l:lines_before = getline(l:start_line, l:current_line - 1)
        let l:lines_after = getline(l:current_line + 1, l:end_line)

        return [l:lines_before, l:lines_after]
    endfunction

    function! AIComplete()
        let l:n = 256
        let [l:lines_before, l:lines_after] = GetSurroundingLines(l:n)

        " CodeLlama fill-in-the-middle prompt: <PRE> prefix <SUF>suffix <MID>
        let l:prompt = '<PRE>' . join(l:lines_before, "\n") . ' <SUF>' . join(l:lines_after, "\n") . ' <MID>'

        let l:json_data = json_encode({
            \ 'model': 'codellama:13b-code-q6_K',
            \ 'keep_alive': '30m',
            \ 'stream': v:false,
            \ 'prompt': l:prompt
        \ })

        " POST to the local Ollama server and pull the completion out of the JSON.
        let l:response = system('curl -s -X POST -H "Content-Type: application/json" -d ' . shellescape(l:json_data) . ' http://localhost:11434/api/generate')
        let l:completion = json_decode(l:response)['response']

        " Insert the completion after the cursor, preserving the user's paste setting.
        let l:paste_mode = &paste
        set paste
        execute "normal! a" . l:completion
        let &paste = l:paste_mode
    endfunction

    nnoremap <leader>c :call AIComplete()<CR>


Thanks!


It's great to see a high-profile model using Mamba2!


The MBPP column should bold DeepSeek as it has a better score than Codestral.


Which means Codestral Mamba and DeepSeek both lead four benchmarks. Kinda takes the air out of the announcement a bit.


It should be corrected but the interesting aspect of this release is the architecture. To stay competitive while only needing linear inference time and supporting 256k context is pretty neat.


THIS. People don't realize the importance of Mamba competing on par with transformers.


Linear attention is terrible for chatbot-style request-response applications, but if you're giving the model the prompt and then letting it scan the codebase and fill in the middle, linear attention should work pretty decently. The performance benefit should also have a much bigger impact, since you're reprocessing the same code over and over again.


They're in roughly the same class but totally different architectures

Deepseek uses a 4k sliding window compared to Codestral Mamba's 256k+ tokens


codegeex4-all-9b beats them "on paper" so that's why it's not in the benchmarks.


They announce the model is on HuggingFace but don't link to it. Here it is: https://huggingface.co/mistralai/mamba-codestral-7B-v0.1


The link is already there in the text, they probably just fixed it.


So Mamba is supposed to be faster and the article claims that. But they don't have any latency numbers.

Has anyone tried this? And then, is it fast(er)?


Any recommended product primers to Mamba vs Transformers - pros/cons etc?


A very good primer on state-space models (which Mamba is based on) is The Annotated S4 [1]. If you want to dive into the code, I wrote a minimal single-file implementation of Mamba-2 here [2].

[1]: https://srush.github.io/annotated-s4/

[2]: https://github.com/tommyip/mamba2-minimal
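
For intuition (this is a toy sketch, not the actual Mamba-2 code), the core idea is a recurrence over a fixed-size state: each token updates the state once, so generation is linear in sequence length and constant in memory per token, unlike attention's quadratic cost:

    import numpy as np

    # Toy diagonal linear state-space scan; real Mamba-2 adds input-dependent
    # (selective) parameters, gating, and a hardware-efficient parallel scan.
    def ssm_scan(x, A, B, C):
        """x: (T, d_in); A: (d_state,); B: (d_state, d_in); C: (d_in, d_state)."""
        h = np.zeros(A.shape[0])
        ys = []
        for x_t in x:
            h = A * h + B @ x_t   # recurrent state update, O(1) per token
            ys.append(C @ h)      # readout
        return np.stack(ys)

    T, d_in, d_state = 8, 4, 16
    rng = np.random.default_rng(0)
    y = ssm_scan(rng.normal(size=(T, d_in)),
                 np.full(d_state, 0.9),
                 0.1 * rng.normal(size=(d_state, d_in)),
                 0.1 * rng.normal(size=(d_in, d_state)))
    print(y.shape)  # (8, 4)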


This video is good: https://www.youtube.com/watch?v=N6Piou4oYx8. As are the other videos on the same YouTube account.


For those who are text oriented: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

The paper author has a blog series, but I don't think it's for the general public: https://tridao.me/blog/2024/mamba2-part1-model/


https://www.youtube.com/watch?v=X5F2X4tF9iM

This is what introduced me to them. May be a bit outdated at this point.


> Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length

> We have tested Codestral Mamba on in-context retrieval capabilities up to 256k tokens

Why only 256k tokens? Gemini's context window is 1 million or more and it's (probably) not even using Mamba.


Gemini is probably using ring attention. But scaling to that size requires more engineering effort in terms of interlink that goes beyond the purpose of this release from Mistral.


Just did a quick test in the https://model.box playground, and it looks like the completion length is noticeably shorter than other models' (e.g., gpt-4o). However, the response speed meets expectations.


Does anyone have a video or written article that would get one up to speed with a bit of the history/progression and current products that are out there for one to try locally?

This is coming from someone that understands the general concepts of how LLMs work but only used the general publicly available tools like ChatGPT, Claude, etc.

I want to see if I have any hardware I can stress and run something locally, but don’t know where to start or even what are the available options.


If I understand correctly what you are looking for, Ollama might be a solution (https://ollama.com/)?. I have no affiliation, but I lazily use this solution when I want to run a quick model locally.


Better yet, install Open WebUI and Ollama at the same time via Docker. Most people will want a familiar GUI rather than the terminal.

https://github.com/open-webui/open-webui

This will install Ollama and Open WebUI.

For GPU support, run:

    docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

For CPU-only support:

    docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama


Why do people recommend this instead of the much better oobabooga text-gen-webui?

https://github.com/oobabooga/text-generation-webui

It's like you hate settings, features, and access to many backends!


To each their own; how are you using these extra features? I personally am not looking to spend a bunch on API credits and don't have the hardware to run models larger than 7-8b parameters. I use local LLMs almost exclusively for formatting notes and as a reading assistant/summarizer, and therefore don't need these features.



Here's a summary of what's happened the past couple of years and what tools are out there.

After ChatGPT released, there was a lot of hype in the space but open source was far behind. Iirc the best open foundation LLM that existed was GPT-2 but it was two generations behind.

A while later Meta released LLaMA[1], a well-trained base foundation model, which brought an explosion to open source. It was soon implemented in the Hugging Face Transformers library[2] and the weights were spread across the Hugging Face website for anyone to use.

At first, it was difficult to run locally. Few developers had the system or money to run it. It required too much RAM and IIRC Meta's original implementation didn't support running on the CPU, but developers soon came up with methods to make it smaller via quantization. The biggest project for this was Llama.cpp[3], which probably is still the biggest open source project today for running LLMs locally. Hugging Face Transformers also added quantization support through bitsandbytes[4].

Over the next months there was rapid development in open source. Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy on more and more systems. Tools came out that were capable of finetuning LLaMA and there were hundreds of LLaMA finetunes that came out which finetuned LLaMA on instruction following, RLHF, and chat datasets which drastically increased accuracy even further. During this time, Stanford's Alpaca, Lmsys's Vicuna, Microsoft's Wizard, 01ai's Yi, Mistral, and a few others made their way onto the open LLM scene with some very good LLaMA finetunes.

A new inference engine (software for running LLMs like Llama.cpp, Transformers, etc) called vLLM[5] came out which was capable of running LLMs in a more efficient way than was previously possible in open source. Soon it would even get good AMD support, making it possible for those with AMD GPUs to run open LLMs locally and with relative efficiency.

Then Meta released Llama 2[6]. Llama 2 was by far the best open LLM for its time. Released with RLHF instruction finetunes for chat and with human evaluation data that put its open LLM leadership beyond doubt. Existing tools like Llama.cpp and Hugging Face Transformers quickly added support and users had access to the best LLM open source had to offer.

At this point in time, despite all the advancements, it was still difficult to run LLMs. Llama.cpp and Transformers were great engines for running LLMs but the setup process was difficult and required a lot of time. You had to find the best LLM, quantize it in the best way for your computer (or figure out how to identify and download one from Hugging Face), setup whatever engine you wanted, figure out how to use your quantized LLM with the engine, fix any bugs you made along the way, and finally figure out how to prompt your specific LLM in a chat-like format.

However, tools started coming out to make this process significantly easier. The first one of these that I remember was GPT4All[7]. GPT4All was a wrapper around Llama.cpp which made it easy to install, easy to select the LLM that you want (pre-quantized options for easy download from a download manager), and a chat UI which made LLMs easy to use. This significantly reduced the barrier to entry for those who were interested in using LLMs.

The second project that I remember was Ollama[8]. Also a wrapper around Llama.cpp, Ollama gave most of what GPT4All had to offer but in an even simpler way. Today, I believe Ollama is bigger than GPT4All although I think it's missing some of the higher-level features of GPT4All.

Another important tool that came out during this time is called Exllama[9]. Exllama is an inference engine with a focus on modern consumer Nvidia GPUs and advanced quantization support based on GPTQ. It is probably the best inference engine for squeezing performance out of consumer Nvidia GPUs.

Months later, Nvidia came out with another new inference engine called TensorRT-LLM[10]. TensorRT-LLM is capable of running most LLMs and does so with extreme efficiency. It is the most efficient open source inferencing engine that exists for Nvidia GPUs. However, it also has the most difficult setup process of any inference engine and is made primarily for production use cases and Nvidia AI GPUs so don't expect it to work on your personal computer.

With the rumors of GPT-4 being a Mixture of Experts LLM, research breakthroughs in MoE, and some small MoE LLMs coming out, interest in MoE LLMs was at an all-time high. The company Mistral had proven itself in the past with very impressive LLaMA finetunes, and capitalized on this interest by releasing Mixtral 8x7b[11]: the best accuracy-for-its-size LLM that the local LLM community had seen to date. Eventually MoE support was added to all inference engines and it was a very popular mid-to-large sized LLM.

Cohere released their own LLM as well called Command R+[12] built specifically for RAG-related tasks with a context length of 128k. It's quite large and doesn't have notable performance on many metrics, but it has some interesting RAG features no other LLM has.

More recently, Llama 3[13] was released which similar to previous Llama releases, blew every other open LLM out of the water. The smallest version of Llama 3 (Llama 3 8b) has the greatest accuracy for its size of any other open LLM and the largest version of Llama 3 released so far (Llama 3 70b) beats every other open LLM on almost every metric.

Less than a month ago, Google released Gemma 2[14], the largest of which, performs very well under human evaluation despite being less than half the size of Llama 3 70b, but performs only decently on automated benchmarks.

If you're looking for a tool to get started running LLMs locally, I'd go with either Ollama or GPT4All. They make the process about as painless as possible. I believe GPT4All has more features like using your local documents for RAG, but you can also use something like Open WebUI[15] with Ollama to get the same functionality.

If you want to get into the weeds a bit and extract some more performance out of your machine, I'd go with using Llama.cpp, Exllama, or vLLM depending upon your system. If you have a normal, consumer Nvidia GPU, I'd go with Exllama. If you have an AMD GPU that supports ROCm 5.7 or 6.0, I'd go with vLLM. For anything else, including just running it on your CPU or M-series Mac, I'd go with Llama.cpp. TensorRT-LLM only makes sense if you have an AI Nvidia GPU like the A100, V100, A10, H100, etc.
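
If you go the vLLM route, basic offline inference looks roughly like this (a sketch under the assumption that you have a supported GPU with enough VRAM; the model id is just an example):

    # Hedged sketch of vLLM offline batch inference.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model id
    params = SamplingParams(temperature=0.2, max_tokens=128)
    outputs = llm.generate(["Write a haiku about GPUs."], params)
    print(outputs[0].outputs[0].text)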

[1] https://ai.meta.com/blog/large-language-model-llama-meta-ai/

[2] https://github.com/huggingface/transformers

[3] https://github.com/ggerganov/llama.cpp

[4] https://github.com/bitsandbytes-foundation/bitsandbytes

[5] https://github.com/vllm-project/vllm

[6] https://ai.meta.com/blog/llama-2/

[7] https://www.nomic.ai/gpt4all

[8] http://ollama.ai/

[9] https://github.com/turboderp/exllamav2

[10] https://github.com/NVIDIA/TensorRT-LLM

[11] https://mistral.ai/news/mixtral-of-experts/

[12] https://cohere.com/blog/command-r-plus-microsoft-azure

[13] https://ai.meta.com/blog/meta-llama-3/

[14] https://blog.google/technology/developers/google-gemma-2/

[15] https://github.com/open-webui/open-webui


Overall a good write-up, but I have a few quibbles:

> A while later Meta released LLaMA[1],

I think Stable Diffusion was first to release a SOTA model (August 2022) that worked locally, not in language but image generation, but it set the tone for Meta. LLaMA only came in February 2023.

> The company Mistral had proven itself in the past with very impressive LLaMA finetunes

Mistral is not a finetune of LLaMA; it is a model trained from scratch. Also, Mistral was better than LLaMA most of the time during this period.

> Quantization techniques improved which meant LLaMA was able to run with less and less RAM with greater and greater accuracy

Quantization does not improve accuracy, except if you trade off precision for longer context maybe, but not on similar prompts. It is like JPEG compression, the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than say... a PNG.


> I think Stable Diffusion was first to release a SOTA model (August 2022) that worked locally, not in language but image generation, but it set the tone for Meta. LLaMA only came in February 2023.

Sure, I was only covering LLMs though. If I wanted to cover image generation models and tools as well, the comment would be double its size.

> Mistral is not a finetune of LLaMA; it is a model trained from scratch. Also, Mistral was better than LLaMA most of the time during this period.

Oh, that's right. Iirc it was just the Llama 2 architecture that was used with sliding window attention.

> Quantization does not improve accuracy, except if you trade off precision for longer context maybe, but not on similar prompts. It is like JPEG compression, the original is always better for a specific image, but for the same byte size you get more resolution from JPEG than say... a PNG.

I'm well aware of how quantization works. I meant quantization methods were increasingly able to retain accuracy. Such as methods which quantize less important weights more heavily, improving accuracy for the same LLM size.


This is one of the most useful and informative comments I have ever come across on HN. Thank you very much.


Thank you! Very helpful as a newbie coming in.


Great info. Do you also know the state of the code assistants? Any thoughts on copilot versus others?


All the main IDE-integrated ones seem very much on par (Copilot, Sourcegraph Cody, Continue.dev), with cursor.sh liked by some as it has code assistant-first UI.

I've personally gone back to the browser with Claude 3.5 Sonnet (and the projects + artifacts feature), as it is one of the most industrious ones, and I really like the UX of artifacts + it integrates new code well into existing code you paste into it.

In the end I think it also often comes down to what languages/frameworks you are using and how well the LLM/product handles it, so I'd still recommend to test around. E.g. some of the main frameworks I'm working with on a daily basis went through big refactors/interface changes 1-2 years ago, and I stopped using ChatGPT because it had a strong tendency to produce code based on the old interfaces/paradigms.

Aider[0] is also quite interesting, especially when it comes to more significant refactorings in the codebase, and has gotten quite good at that with the last few bigger model releases, but it takes some time to get used to and doesn't have good IDE integration.

[0]: https://github.com/paul-gauthier/aider


I've been following the state of things, but I'm not sure which ones are the best. There's Meta's CodeLlama[1], Mistral's Codestral[2], DeepSeek AI's DeepSeek-Coder-V2-Instruct[3], CodeGemma[4], Alibaba's CodeQwen[5], and Microsoft's WizardCoder[6].

I'm pretty sure CodeLlama is out of date now. I've heard DeepSeek LLMs are good and DeepSeek-Coder-V2-Instruct was released recently. With the good reputation and its massive size (236b) I'd guess it is the best coding LLM, but if it's not being trained efficiently, maybe Codestral and Codestral Mamba come close.

I don't think the best coding LLMs are close to GitHub Copilot but I could be wrong since I'm just relaying information that I've heard secondhand.

[1] https://ai.meta.com/blog/code-llama-large-language-model-cod...

[2] https://mistral.ai/news/codestral/

[3] https://github.com/deepseek-ai/DeepSeek-Coder-V2

[4] https://developers.googleblog.com/en/gemma-family-expands-wi...

[5] https://qwenlm.github.io/blog/codeqwen1.5/

[6] https://github.com/nlpxucan/WizardLM


try THUDM/codegeex4-all-9b


Wow very useful comment, thank you very much for all the work to write it!


Most of the 7b instruct models are very bad outside very simple queries.

You can run a 7b on most modern hardware. How fast will vary.

To run 30-70b models you're getting in the realm of needing 24gb or more of vRAM.


>Most of the 7b instruct models are very bad outside very simple queries.

I can't agree with "very bad". Maybe your standards are set by the best, largest models, but have a little perspective: a modern 7b model is a friggin magical piece of software. Fully in the realm of sci-fi until basically last Tuesday. It can reliably summarize documents, bash a 30 minute rambling voice note into a terse proposal, and give you social counseling at least on par with r/Relationship_Advice. It might not always get facts exactly right but it is smart in a way that computers have never been before. And for all this capability, you can get it running on a computer a decade old, maybe even a Raspberry Pi or a smartphone.

To answer the parent: Download a "gguf" file (blob of weights) of a popular model like Mistral from Hugging Face. Git pull and compile llama.cpp. Run ./main -m path/to/gguf -p "prompt"


Even better, install ollama and then do "ollama run llama3", it works like docker, pulls the model locally and starts a chat session right there in the terminal. No need to compile. Or just run the docker image "ollama/ollama".
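
If you'd rather script it than chat in the terminal, the ollama Python client talks to the same local server (a sketch, assuming the server is running, the model has already been pulled, and the client still returns the usual message dict):

    # pip install ollama; assumes the Ollama server is running and llama3 is pulled.
    import ollama

    reply = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": "Summarize what a GGUF file is."}],
    )
    print(reply["message"]["content"])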


I'm looking to run something on a 24gb GPU for the purpose of running wild with agentic use of LLMs. Is there anything worth trying that would fit on that amount of vRAM? Or are all the open-source PC-sized LLMs laughable still?


You can run the llama 70b based models faster than 10 tkn/s on 24gb vram. I've found that the quality of this class of LLMs is heavily swayed by your configuration and system prompting and results may vary. This Reddit post seems to have some input on the topic:

https://www.reddit.com/r/LocalLLaMA/comments/1cj4det/llama_3...

I haven't used any agent frameworks other than messing around with langchain a bit, so I can't speak to how that would affect things.


You would probably get the same tokens per second with llama 3 70b if you just unplugged the 24gb GPU. For something that actually fits in 24gb of VRAM, I recommend gemma 2 27b up to q6. I use q4 and it works quite well for my needs.


If you mean LLM in general, maybe try llamafile first

https://github.com/Mozilla-Ocho/llamafile


For running LLMs, I think most people just dive into https://www.reddit.com/r/LocalLLaMA/ and start reading.

Not sure what the equivalent is for image generation; it's either https://www.reddit.com/r/StableDiffusion/ or one of the related subreddits it links to.

Sadly, I've yet to find anyone doing "daily ML-hobbyist news" content creation, summarizing the types of articles that appear on these subreddits. (Which is a surprise to me, as it's really easy to find e.g. "daily homelab news" content creators. Please, someone, start a "daily ML-hobbyist news" blog/channel! Given that the target audience would essentially be "people who will get an itch to buy a better GPU soon", the CPM you'd earn on ad impressions would be really high...)

---

That being said, just to get you started, here's a few things to know at present about "what you can run locally":

1. Most models (of the architectures people care about today) will probably fit on a GPU which has something like 1.5x the VRAM of the model's parameter-weights size. So e.g. a "7B" (7 billion parameter-weights) model will fit on a GPU that has 12GB of VRAM; rough arithmetic for this is sketched after the list. (You can potentially squeeze even tighter if you have a machine with integrated graphics + dedicated GPU, and you're using the integrated graphics as graphics, leaving the GPU's VRAM free to only hold the model.)

2. There are models that come in all sorts of sizes. Many open-source ML models are huge (70B, 120B, 144B — things you'd need datacenter-class GPUs to run), but then versions of these same models get released which have been heavily cut down (pruned and/or quantized), to force them to fit into smaller VRAM sizes. There are 5B, 3B, 1B, even 0.5B models (although the last two are usually special-purpose models.)

3. Surprisingly, depending on your use-case, smaller models (or small quants of larger models) can "mostly" work perfectly well! They just have more edge-cases where something will send them off the rails spiralling into nonsense — so they're less reliable than their larger cousins. You might have to give them more prompting, and try regenerating their output from the same prompt several times, to get good results.

4. Apple Silicon Macs have a GPU and TPU that read from/write to the same unified memory that the CPU does. While this makes these devices slower for inference than "real" GPUs with dedicated VRAM, it means that if you happen to own a Mac with 16GB of RAM, then you own something that can run 7B models. AS Macs are, oddly enough, the "cheapest" things you can buy in terms of model-capacity-per-dollar. (Unlike a "real" GPU, they won't be especially quick and won't have any capacity for concurrent model inference, so you'd never use one as a server backing an Inference-as-a-Service business. But for home use? No real downsides.)
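
To put rough numbers on point 1, here's the arithmetic as a quick sketch (the 1.5x factor is just the heuristic above, not a hard spec, and real usage varies with context length and runtime):

    # Back-of-the-envelope VRAM estimate: weights at a given precision plus
    # ~50% headroom for KV cache, activations, and runtime buffers.
    def vram_estimate_gb(params_billions, bits_per_param, overhead=1.5):
        weights_gb = params_billions * bits_per_param / 8  # GB of raw weights
        return weights_gb * overhead

    for label, params, bits in [("7B @ 8-bit", 7, 8), ("7B @ 4-bit", 7, 4), ("70B @ 4-bit", 70, 4)]:
        print(f"{label}: ~{vram_estimate_gb(params, bits):.1f} GB")
    # 7B @ 8-bit: ~10.5 GB, 7B @ 4-bit: ~5.2 GB, 70B @ 4-bit: ~52.5 GB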


not sure about the history/progression part, but there's ollama which makes it possible to run models locally. The UX of ollama is similar to docker.


Is there a good explanation of the Mamba architecture?




There's a paper: https://arxiv.org/abs/2312.00752

I haven't seen any good non-paper explainers yet.


But I JUST switched from GPT4o to Claude! :( Kidding, but it isn't clear how to use this thing, as others have pointed out.


What made you switch?


Claude Projects which allow attaching a bunch of files to fill up the 200k context. I wrote up a script to dump a bunch of code and documentation files to markdown as context and I add them to a bunch of Claude projects on a per topic basis.

For example, I'm currently working on a Rust/Qt desktop app so I have a project with the whole Qt6 book attached to ask questions about Qt, a project with my SQL schema and ORM/Sqlite docs to ask questions about the app's data and generate models without dealing with hallucinations, a project with all my QML files and Rust QML element code, a project with a bunch of Rust crate docs, and so on and on.

GPTs allow attaching files too but Claude Projects dump the entire contents of the files into the context rather than trying to do some hacky RAG that never works like I want it to.
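
The dump script doesn't need to be fancy; here's a rough sketch of the idea (illustrative only, not the exact script; the extensions and output path are arbitrary and worth tailoring to your codebase):

    # Concatenate project files into one markdown file to paste into a project.
    from pathlib import Path

    EXTENSIONS = {".rs", ".qml", ".sql", ".md"}  # adjust for your project

    def dump_to_markdown(root, out_file="context.md"):
        with open(out_file, "w", encoding="utf-8") as out:
            for path in sorted(Path(root).rglob("*")):
                if path.is_file() and path.suffix in EXTENSIONS:
                    out.write(f"\n## {path}\n\n```{path.suffix.lstrip('.')}\n")
                    out.write(path.read_text(encoding="utf-8", errors="ignore"))
                    out.write("\n```\n")

    dump_to_markdown("src")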


I was under the impression that with LLMs, in order to get high-quality answers, it's always best to keep context short. Is that not the case anymore? Does Claude under this usage paradigm not struggle with very long contexts in ways as for example described in the "lost in the middle" paper (https://arxiv.org/abs/2307.03172)?


The conclusion you walked away with is the opposite of what usually works in practice.

The more context you give the llm, the better.

The key takeaway from that paper is to keep your instructions/questions/direction in the beginning or at the end of the context. Any information can go anywhere.

Not to be too dismissive, it's a good paper, but we're one year further and in practice this issue seems to have been tackled by training on better data.

This can differ a lot depending on what model you're using, but in the case of claude sonnet 3.5, more relevant context is generally better for anything except for speed.

It does remain true that you need to keep your most important instructions at the beginning or at the end however.


At the beginning it was true: the longer the context, the more the LLM got lost. But now the new models can retrieve information anywhere in the context.

c.f.

https://pbs.twimg.com/media/GH2NJMxbYAAcRL3?format=jpg&name=...


I don't have the time to evaluate the effects of context length on my use cases so I have no idea. There might be some degradation when I attach the Qt book which is probably already in Claude's training data but when using it against my private code base, it's not like I have any other choice.

The UX of drag and dropping a few monolithic markdown files to include entire chunks of a large project outweighs the downsides of including irrelevant context in my experience.


No, you need to provide as much information in context as possible. Otherwise you are sampling from the mode. "Write me an essay about cows" = garbage boring and probably 200 words. "here are twenty papers about cow evolution, write me an overview of findings" = yes


Claude is much better. Overwhelmingly better. It not only implements deep learning models for me, it has great suggestions on evolving them to actually work.


lol no it’s not, the benchmarks don’t show that at all. Both have issues in different ways


Benchmarks are pretty flawed IMO, in particular their weakness here seems to be that they are poor at evaluating long-tail multiturn conversations. 4o often gives a great first response, then spirals into a repetition. Sonnet 3.5 is much better at seeing the big picture in a longer conversation IMO.


I made a mobile app the other day using LLMs (I had never used React or TypeScript before, and I built an app with React Native). I was pretty disappointed; both Sonnet 3.5 and gpt-4-turbo performed pretty poorly, making mistakes like missing a closing bracket somewhere, which meant I had to revert because I had no idea where they meant to put it.

Also they did the thing that junior developers tend to do, where you have a race condition of some sort, and they just work around it by adding some if checks. The app is at around 400 lines right now, it works but feels pretty brittle. Adding a tiny feature here or there breaks something else, and GPT does the wrong thing half the time.

All in all, I'm not complaining, because I made an app in two days, but it won't replace a developer yet, no matter how much I want it to.


Repetition in multiturn conversations is actually Sonnet's fatal flaw, both 3 and 3.5. 4o is also repetitive to an extent. Opus is way better than both at being non-repetitive.


Which benchmarks are you looking at? It is very competitive with GPT4o in the table of metrics I just built at work. Have you used it to code? Qualitatively, it is much better - once it can execute Python it will be supzors.


Benchmarks claim GPT4o is better than GPT4 so anyone who's actually used the software knows benchmarks mean nothing.


it's the consensus of people that use both that claude.ai is superior for practical use despite benchmark results which are mostly one-shot based prompts.


I'm using both, been doing that for months now. I can confidently assert that while Claude is getting better and better, GPT-4 and 4o seem to be getting dumbed down for some unexplained reason. Claude is now my go-to for anything code. (I do Ruby and C#, btw, others might have a different experience.)


I guess they are distilling the models so that they can save $$$ on serving.


GPT4o is way behind sonnet 3.5


Huh I guess all the benchmarks are wrong then


3.5 Sonnet outperforms GPT-4o on most benchmarks.


Agreed.


I think it depends on what your use case is and how optimized your prompts are.


Is this the active Codestral model on Le Chat? I got quite some mixed results from it tonight.


Any sort of evals on how it compares to closed models like GPT-4 or open ones like WizardLM?


How does this work in vim?


Similarly, is there a way to use it with Kate or Sublime Text?


Weird that they compare to deepseek-coder v1.5 when we already have v2.0. Any advantage to using codestral mamba apart from it being lighter in weights?


obviously because they can't beat it... There will be zero reason to use it when you have better transformer based models that can fit the existing infrastructure.


deepseek-coder 2.0 is big, even lite version.

I do wish they compared it to codegeex4-all-9b.


You can try this model out using OpenAI's API format with this TypeScript SDK: https://github.com/token-js/token.js

You just need a Mistral API key: https://console.mistral.ai/api-keys/


The first sentence is wrong. The website says:

> As a tribute to Cleopatra, whose glorious destiny ended in tragic snake circumstances

but according to Wikipedia this is not true:

> When Cleopatra learned that Octavian planned to bring her to his Roman triumphal procession, she killed herself by poisoning, contrary to the popular belief that she was bitten by an asp.


Yes, that seems to be a myth, but exact circumstances seem rather uncertain according to the Wikipedia article [1]:

> [A]ccording to the Roman-era writers Strabo, Plutarch, and Cassius Dio, Cleopatra poisoned herself using either a toxic ointment or by introducing the poison with a sharp implement such as a hairpin. Modern scholars debate the validity of ancient reports involving snakebites as the cause of death and whether she was murdered. Some academics hypothesize that her Roman political rival Octavian forced her to kill herself in a manner of her choosing. The location of Cleopatra's tomb is unknown. It was recorded that Octavian allowed for her and her husband, the Roman politician and general Mark Antony, who stabbed himself with a sword, to be buried together properly.

I think this rounds to “nobody really knows.”

The “glorious destiny” seems kind of shaky, too. It’s just a throwaway line anyway.

[1] https://en.m.wikipedia.org/wiki/Death_of_Cleopatra


What bothers me more is that the legend is that she was killed by an asp, not a mamba.


I believe this is in dispute among sources.


Maybe Octavian was the snake?


I know it's just a throwaway line, but the bit about Cleopatra at the top feels in poor taste. It's completely inaccurate in that no one has ever attributed her death to a "mamba", and even the asp that some sources claim has been disputed. But even aside from that, it just feels weird that a human being's death has turned into a random reference you can make in a throwaway joke while advertising a product.

They're certainly not the first to use Cleopatra this way nor the most egregious, but there are plenty of other random mamba jokes that could have filled in there and both made more sense and been less crass.


Too soon!



Maybe it was generated :)


Also, saying that she had a "glorious destiny" is incredibly inaccurate historically. She was kinda crap at her job.


I do agree, but there’s worse.

Ever heard of Patrice Lumumba? He was a Congolese politician involved in the country's independence and democratization. He was shot with the involvement of Western forces. There's a drink named after him: hot chocolate with a shot of rum.




