
I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version).

I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...

More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/

I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.




The Llama 3.2 vision models don't seem that great when compared to Claude 3 Haiku or GPT-4o-mini. For an open alternative I would use the Qwen-2-72B model; it's smaller than the 90B and seems to perform quite a bit better. Also Qwen2-VL-7B as an alternative to Llama-3.2-11B: smaller, better in visual benchmarks, and also Apache 2.0.

Molmo models: https://huggingface.co/collections/allenai/molmo-66f379e6fe3..., also seem to perform better than Llama-3.2 models while being smaller and Apache 2.0.


1. Ignore the benchmarks. I've been A/Bing 11B today with Molmo 72B [1], which itself has an Elo neck-and-neck with GPT-4o, and it's even. Because everyone in open source tends to train on validation benchmarks, you really cannot trust them.

2. The method of tokenization/adapter is novel and uses far fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) in memory/compute with sequence length.

[1] https://molmo.allenai.org/blog


> I've been A/Bing 11B today with Molmo 72B

How are you testing Molmo 72B? If you are interacting with https://molmo.allenai.org/, they are using Molmo-7B-D.


It’s not just open source that trains on the validation set. The big labs have already forgotten more about gaming MMLU down to the decimal than the open source community ever knew. Every once in a while they get sloppy and Claude does a faux pas with a BIGBENCH canary string or some other embarrassing little admission of dishonesty like that.

A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.


I tried some OCR use cases, Claude Sonnet just blows Molmo.


When you say "blows," do you mean in a subservient sense or more like, "it blows it out of the water?"


yeah does it suck or does it suck?


How does its performance compare to Qwen-2-72B tho?


Refer to the blog post I linked. Molmo is ahead of Qwen2 72b.


What interface do you use for a locally-run Qwen2-VL-7B? Inspired by Simon Willison's research[1], I have tried it out on Hugging Face[2]. Its handwriting recognition seems fantastic, but I haven't figured out how to run it locally yet.

[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B


MiniCPM-V 2.6 is based on Qwen 2 and is also great at handwriting. It works locally with KoboldCPP. Here are the results I got with a test I just did.

Image:

* https://imgur.com/wg0kdQK

Output:

* https://pastebin.com/RKvYQasi

OCR script used:

* https://github.com/jabberjabberjabber/LLMOCR/blob/main/llmoc...

Model weights: MiniCPM-V-2_6-Q6_K_L.gguf, mmproj-MiniCPM-V-2_6-f16.gguf

Inference:

* https://github.com/LostRuins/koboldcpp/releases/tag/v1.75.2


Should the line "p.o. 5rd w/ new W5 533" say "p.o. 3rd w/ new WW 5W .533R"?

What does p.o. stand for? I can't make out the first letter. It looks more like an f, but the notch on the upper left only fits a p. All the other p's look very different though.


'Replaced R436, R430 emitter resistors on right-channel power output board with new wire-wound 5watt .33ohm 5% with ceramic lead insulators'


Thx :). I thought the 3 looked like a b but didn't think brd would make any sense. My reasoning has led me astray.


Yeah. If you realize that a large part of the LLM's 'OCR' is guessing from context (token prediction) rather than actually recognizing the characters exactly, you can see that it is indeed pretty impressive, because the log it is reading uses pretty unique terminology that it couldn't know from training.


I'd say as an LLM it should know this kind of stuff from training, unlike me, for whom this is out-of-domain data. Anyhow, I don't think the AI did a great job on that line. It would require better performance to be useful to me. I think larger models might actually be better at this than I am, which would be very useful.


Be aware that a lot of this also has to do with prompting and sampler settings. For instance, changing the prompt from 'write the text on the image verbatim' to something like 'this is an electronics repair log using shorthand...' and being specific about it will give the LLM context in which to make decisions about characters and words.
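To make that concrete, here's a rough sketch of the kind of request I mean, assuming your local server (KoboldCPP, llama.cpp server, etc.) exposes an OpenAI-compatible chat endpoint that accepts base64 images; the port, model name, and file are placeholders for whatever you're actually running:

    # A context-rich OCR prompt instead of a bare "transcribe this".
    # Assumes an OpenAI-compatible endpoint that accepts base64 images
    # (check your server's docs; port 5001 is KoboldCPP's default).
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

    with open("repair_log.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "This is an electronics repair log written in shorthand. "
        "Transcribe it verbatim, preserving part numbers and abbreviations."
    )

    response = client.chat.completions.create(
        model="minicpm-v-2.6",  # placeholder; use whatever model your server loaded
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        temperature=0.0,  # low temperature = fewer 'creative' character guesses
    )
    print(response.choices[0].message.content)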


Thanks for the hint. Will try that out!


If you are in the US, you get 1 billion tokens a DAY with Gemini (Google) completely free of cost.

Gemini Flash is fast, with up to a 4 million token context.

Gemini Flash 002 improved in math and logical abilities, surpassing Claude and GPT-4o.

You can simply use Gemini Flash for code completion, a git review tool, and much more.
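For example, something like this is all it takes with the google-generativeai package (a minimal sketch; swap in your own key and file):

    # Minimal sketch: a tiny "git review" style call to Gemini Flash with the
    # google-generativeai package (pip install google-generativeai).
    # Free API keys are available from Google AI Studio.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash-002")

    diff = open("changes.diff").read()
    response = model.generate_content(
        "Review this diff and point out likely bugs:\n\n" + diff
    )
    print(response.text)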


Is this sustainable though, or are they just trying really hard to attract users? If I build all of my tooling on it, will they start charging me thousands of dollars next year once the subsidies dry up? With a local model running with open source software, at least I can know that as long as my computer can still compute, the model will still run just as well and just as fast as it did on day 1, and cost the same amount of electricity


Facts. Google did the same thing you describe with Maps a few years ago.


It's not just Google; literally every new service does this. Prices will always go up once they have enough customers and the bean counters start pointing at spreadsheets. Ergo, local is the only option if you don't want to be held for ransom afterwards. As goes for web servers, scraper bots, and whatever, so goes for LLMs.


I think there are a few things to consider:

They make a ton of money on large enterprise package deals through Google Cloud. That includes API access but also support and professional services. Most orgs that pay for this stuff don't really need it, but they buy it anyways, as is consistent with most enterprise sales. That can give Google a significant margin to make up the cost elsewhere.

Gemini Flash is probably super cheap to run compared to other models. The cost of inference for many tasks has gone down tremendously over the past 1.5 years, and it's still going down. Every economic incentive aligns with running these models more efficiently.


Aren't API calls essentially swappable between vendors now?

If you wanted to switch from Gemini to ChatGPT you could copy/paste your code into ChatGPT and ask it to switch to their API.

Disclaimer: I work at Google but not on Gemini.


Not the number of tokens allowed per user. Google has the largest token windows.


Different APIs and models are going to come with different capabilities and restrictions.


It's Google. You know the answer ;)


I mean, there’s no need to dry up subsidies when the underlying product can just be deprecated without warning.


Run test queries on all platforms using something like litellm [1] and langtrace [2].

You may not be able to match large queries, but testing will help you transition to other services.

[1] https://github.com/BerriAI/litellm

[2] https://langtrace.ai/
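A rough sketch of what that can look like with litellm's Python API (assumes the relevant API keys are in your environment and Ollama is running for the local entry; model names may need adjusting):

    # Run the same prompt across several providers to compare answers/cost/latency
    # before committing to one. Model names are examples and may need updating.
    from litellm import completion

    PROMPT = [{"role": "user", "content": "Summarize this ticket in two sentences: ..."}]

    for model in [
        "gemini/gemini-1.5-flash",      # Google AI Studio
        "gpt-4o-mini",                  # OpenAI
        "claude-3-haiku-20240307",      # Anthropic
        "ollama/llama3.2",              # local, via a running Ollama server
    ]:
        resp = completion(model=model, messages=PROMPT)
        print(model, "->", resp.choices[0].message.content[:100])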


Google has deep pockets and SOTA hardware for training and inference.


It's "free cloud picture video storage" rush all over again


Are you asking whether giving away $5/day/user (what OpenAI charges) in compute is sustainable?


This is great for experimentation, but as others have pointed out recently, there are persistent issues with Gemini that prevent use in actual products. The recitation/self-censoring issue results in random failures:

https://github.com/google/generative-ai-docs/issues/257


I had this problem too, but I think 002 solves it (not tested exhaustively). I've not run into any problems since 002, and Vertex with "block all" on all safety settings is now working fine; earlier I had problems with "block all" in the safety settings and the API throwing errors.

I am using it in https://github.com/zerocorebeta/Option-K (currently it doesn't have the lowest safety settings because the API wouldn't allow it, but now I am going to push a new update with safety disabled).

Why? I have another application which has been working since yesterday's 002 launch. I have safety settings set to none, and while it would not answer certain questions before, since yesterday it answers everything.


And yet - if Gemini actually bothers to tell you when it detects verbatim copying of copyrighted content, how often must that occur on other AIs without notice?


Free of cost != free open model. Free of cost means all your requests are logged for Google to use as training data and whatnot.

Llama3.2 on the other hand runs locally, no data is ever sent to a 3rd party, so I can freely use it to summarize all my notes regardless of one of them being from my most recent therapy session and another being my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images, I can use Llama3.2 11B locally to interpret any photo I've taken without having to worry about getting consent from the people in the photo to share it with a 3rd party, or whether the photo is of my passport for some application I had to file or a receipt of something I bought that I don't want Google to train their next vision model OCR on.

TL;DR - Google free of cost models are irrelevant when talking about local models.


The free tier API isn't US-only; Google removed the free tier restriction for UK/EEA countries a while ago, with the added bonus of not training on your data if making a request from the UK/CH/EEA.


Not locked to the US: you get 1 billion tokens per month per model with Mistral since their recent announcement: https://mistral.ai/news/september-24-release/ (1 request per second is quite a harsh rate limit, but hey, free is free).

I'm pretty excited about what all these services adopting free tiers will do to the landscape, as that should allow for a lot more experimentation and a lot more hobby projects transitioning into full-time projects, which previously felt a lot more risky/unpredictable with pricing.


I saw that you mentioned https://github.com/simonw/llm/. Hadn't seen this before. What is its purpose? And why not use ollama instead?


llm is Simon's command-line front-end to a lot of the LLM APIs, local and cloud-based. Along with aider-chat, it's my main interface to any LLM work -- it works well with a chat model, one-off queries, and piping text or output into an LLM chain. For people who live on the command line, or are just put off by web interfaces, it's a godsend.

About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331


I use a fair amount of aider - what does Simon's solution offer that aider doesn't? I am usually using a mix of aider and the ChatGPT window. I use ChatGPT for one off queries that aren't super context heavy for my codebase, since pricing can still add up for the API and a lot of the times the questions that I ask don't really need deep context about what I'm doing in the terminal. But when I'm in flow state and I need deep integration with the files I'm changing I switch over to aider with Sonnet - my subjective experience is that Anthropic's models are significantly better for that use case. Curious if Simon's solution is more geared toward the first use case or the second.


The llm command is a general-purpose tool for writing shell scripts that use an LLM somehow. For example, generating some LLM output and sending it through a Unix pipeline. You can also use it interactively if you like working on the command line.

It’s not specifically about chatting or helping you write code, though you could use it for that if you like.
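It also has a Python API if you'd rather call it from a script than shell out; roughly like this (a sketch -- model IDs depend on which plugins and API keys you have set up):

    # Minimal sketch of llm's Python API (pip install llm). The model ID here
    # assumes you've configured an OpenAI key; plugins add local/other models.
    import llm

    model = llm.get_model("gpt-4o-mini")
    response = model.prompt("Summarize this in one sentence: " + open("notes.txt").read())
    print(response.text())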


I've only used ollama over cli. As per the parent poster -- do you know if there are advantages over ollama for CLI use? Have you used both?


Ollama can’t talk to OpenAI / Anthropic / etc. LLM gives you a single interface that can talk to both hosted and local models.

It also logs everything you do to a SQLite database, which is great for further analysis.

I use LLM and Ollama together quite a bit, because Ollama are really good at getting new models working and their server keeps those models in memory between requests.


You can run llamafile as a server, too, right? Still need to download gguf files if you don't use one of their premade binaries, but if you haven't set up llm to hit the running llamafile server I'm sure that's easy to do


I haven't used Ollama, but from what I've seen, it seems to operate at a different level of abstraction compared to `llm`. I use `llm` to access both remote and local models through its plugin ecosystem[1]. One of the plugins allows you to use Ollama-served local models. This means you can use the same CLI interface with Ollama[2], as well as with OpenAI, Gemini, Anthropic, llamafile, llamacpp, mlc, and others. I select different models for different purposes. Recently, I've switched my default from OpenAI to Anthropic quite seamlessly.

[1] - https://llm.datasette.io/en/stable/plugins/directory.html#pl... [2] - https://github.com/taketwo/llm-ollama


The llm CLI is much more unixy, letting you pipe data in and out easily. It can use hosted and local models, including ollama.


It looks like a multi-purpose terminal utility for bridging the terminal and your scripts or programs to both local and remote LLM providers.

And it looks very handy! I'll use this myself, because I do want to invoke OpenAI and other cloud providers just like I do with ollama and pipe things around, and this accomplishes that, and more.

https://llm.datasette.io/en/stable/

I guess you can also accomplish similar results, if you're just looking for `/chat/completions` and such, by configuring something like LiteLLM and connecting that to ollama or any other service.
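For example, Ollama exposes an OpenAI-compatible endpoint, so anything that already speaks `/chat/completions` can be pointed straight at it; a minimal sketch (port and model name are whatever your local setup uses):

    # Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint
    # (or at a LiteLLM proxy sitting in front of several providers).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored

    resp = client.chat.completions.create(
        model="llama3.2",
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.choices[0].message.content)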


There is a recent podcast episode with the tool's author https://newsletter.pragmaticengineer.com/p/ai-tools-for-soft...

It's worth listening to for context on how the tool is used.


I'm new to this game. I played with Gemma 2 9B in an agent-like role before and was pleasantly surprised. I just tried some of the same prompts with Llama 3.2 3B and found it doesn't stick to my instructions very well.

Since I'm a n00b, does this just mean Llama 3.2 3B instruct was "tuned more softly" than Gemma 2 instruct? That is, could one expect to be able to further fine-tune it to more closely follow instructions?


What are people using to check the token length of code bases? I'd like to point certain app folders at a local LLM, but I have no idea how that stuff is calculated. Seems like some strategic prompting (eg: this is a Rails app, here is the folder structure with file names, and btw here are the actual files to parse) would be more efficient than just giving it the full app folder? No point giving it stuff from /lib and /vendor for the most part, I reckon.


I use my https://github.com/simonw/ttok command for that - you can pipe stuff into it for a token count.

Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough though.
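Since ttok is tiktoken under the hood, you can also do it in a few lines of Python if you want to point it at specific folders (a sketch; the folder and glob assume the Rails layout from your question):

    # Count tokens for selected folders of a codebase with tiktoken.
    # Counts for Llama/Gemini etc. will differ a bit, but it's close enough
    # for rough context-window budgeting.
    from pathlib import Path
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    total = 0
    for path in Path("app").rglob("*.rb"):   # only app/, so /lib and /vendor are skipped
        n = len(enc.encode(path.read_text(errors="ignore")))
        total += n
        print(f"{n:>8}  {path}")
    print(f"{total:>8}  TOTAL")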


You can use llama.cpp server's tokenize endpoint to tokenize and count the tokens: https://github.com/ggerganov/llama.cpp/blob/master/examples/...
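Roughly like this, assuming a llama.cpp server running on the default port (the endpoint takes a JSON body with a "content" field and returns token IDs):

    # Count tokens using a running llama.cpp server's /tokenize endpoint,
    # so the count matches the tokenizer of the model you'll actually prompt.
    import requests

    text = open("app/models/user.rb").read()
    resp = requests.post("http://localhost:8080/tokenize", json={"content": text})
    print(len(resp.json()["tokens"]), "tokens")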


You can try Gemini's token count API: https://ai.google.dev/api/tokens
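A minimal sketch with the google-generativeai package (needs an API key):

    # Count tokens with the Gemini API's count_tokens call.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")
    print(model.count_tokens(open("app/models/user.rb").read()).total_tokens)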


Hi Simon, is there a way to run the vision model easily on my Mac locally?


Not that I've seen so far, but Ollama say a solution for that is coming "soon".


I doubt the Ollama team can do much about it. Ollama is just a wrapper on top of the heavy lifter.


The draft PRs are already up in the repo.


You can run it with LitServe (MPS GPU), here is the code - https://lightning.ai/lightning-ai/studios/deploy-llama-3-2-v...


Llama 3.0, 3.1, and 3.2 all use the TikToken tokenizer, which is OpenAI's open source tokenizer.


GP is talking about context windows, not the number of tokens used by the tokenizer.


Somewhat confusingly, it appears the tokenizer vocabulary as well as the context length are both 128k tokens!


Yup, that's why I wanted to clarify things.


This obsession with using AI to help with programming is short sighted.

We discover gold and you think of gold pickaxes.


If we make this an analogy to video games, gold pickaxes can usually mine more gold much faster.

What could be short sighted about using tools to improve your daily work?


We should be thinking about building golden products, not golden tools.



