
I'm absolutely amazed at how capable the new 1B model is, considering it's just a 1.3GB download (for the Ollama GGUF version).

I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...

More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/

I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.




The Llama 3.2 vision models don't seem that great when compared to Claude 3 Haiku or GPT-4o-mini. For an open alternative I would use the Qwen-2-72B model; it's smaller than the 90B and seems to perform quite a bit better. Also Qwen2-VL-7B as an alternative to Llama-3.2-11B: smaller, better in visual benchmarks, and also Apache 2.0.

Molmo models: https://huggingface.co/collections/allenai/molmo-66f379e6fe3..., also seem to perform better than Llama-3.2 models while being smaller and Apache 2.0.


1. Ignore the benchmarks. I've been A/Bing 11B today with Molmo 72B [1], which itself has an Elo neck-and-neck with GPT-4o, and it's even. Because everyone in open source tends to train on validation benchmarks, you really cannot trust them.

2. The method of tokenization/adapter is novel and uses far fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) in memory/compute with sequence length.

[1] https://molmo.allenai.org/blog


> I've been A/Bing 11B today with Molmo 72B

How are you testing Molmo 72B? If you are interacting with https://molmo.allenai.org/, they are using Molmo-7B-D.


It’s not just open source that trains on the validation set. The big labs have already forgotten more about gaming MMLU down to the decimal than the open source community ever knew. Every once in a while they get sloppy and Claude does a faux pas with a BIGBENCH canary string or some other embarrassing little admission of dishonesty like that.

A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.


I tried some OCR use cases, Claude Sonnet just blows Molmo.


When you say "blows," do you mean in a subservient sense or more like, "it blows it out of the water?"


yeah does it suck or does it suck?


How does its performance compare to Qwen-2-72B tho?


Refer to the blog post I linked. Molmo is ahead of Qwen2 72b.


What interface do you use for a locally-run Qwen2-VL-7B? Inspired by Simon Willison's research[1], I have tried it out on Hugging Face[2]. Its handwriting recognition seems fantastic, but I haven't figured out how to run it locally yet.

[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B


MiniCPM-V 2.6 is based on Qwen 2 and is also great at handwriting. It works locally with KoboldCPP. Here are the results I got with a test I just did.

Image:

* https://imgur.com/wg0kdQK

Output:

* https://pastebin.com/RKvYQasi

OCR script used:

* https://github.com/jabberjabberjabber/LLMOCR/blob/main/llmoc...

Model weights: MiniCPM-V-2_6-Q6_K_L.gguf, mmproj-MiniCPM-V-2_6-f16.gguf

Inference:

* https://github.com/LostRuins/koboldcpp/releases/tag/v1.75.2


Should the line "p.o. 5rd w/ new W5 533" say "p.o. 3rd w/ new WW 5W .533R"?

What does p.o. stand for? I can't make out the first letter. It looks more like an f, but the notch on the upper left only fits a p. All the other p's look very different though.


'Replaced R436, R430 emitter resistors on right-channel power output board with new wire-wound 5watt .33ohm 5% with ceramic lead insulators'


Thx :). I thought the 3 looked like a b but didn't think brd would make any sense. My reasoning has led me astray.


Yeah. If you realize that a large part of the LLM's 'OCR' is guessing from context (token prediction) rather than actually recognizing the characters exactly, you can see that it is indeed pretty impressive, because the log it is reading uses pretty unique terminology that it couldn't know from training.


I'd say as an LLM it should know this kind of stuff from training, unlike me, for whom this is out-of-domain data. Anyhow, I don't think the AI did a great job on that line. It would require better performance to be useful to me. I think larger models might actually be better at this than I am, which would be very useful.


Be aware that a lot of this also has to do with prompting and sampler settings. For instance, changing the prompt from 'write the text on the image verbatim' to something like 'this is an electronics repair log using shorthand...' and being specific about it will give the LLM context in which to make decisions about characters and words.
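To make that concrete, here's a rough sketch of the kind of request I mean, assuming your local server (KoboldCPP, llama.cpp server, etc.) exposes an OpenAI-compatible chat endpoint that accepts base64 images; the port, model name, and file are placeholders for whatever you're actually running:

    # A context-rich OCR prompt instead of a bare "transcribe this".
    # Assumes an OpenAI-compatible endpoint that accepts base64 images
    # (check your server's docs; port 5001 is KoboldCPP's default).
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

    with open("repair_log.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "This is an electronics repair log written in shorthand. "
        "Transcribe it verbatim, preserving part numbers and abbreviations."
    )

    response = client.chat.completions.create(
        model="minicpm-v-2.6",  # placeholder; use whatever model your server loaded
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        temperature=0.0,  # low temperature = fewer 'creative' character guesses
    )
    print(response.choices[0].message.content)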


Thanks for the hint. Will try that out!


If you are in the US, you get 1 billion tokens a DAY with Gemini (Google) completely free of cost.

Gemini Flash is fast, with up to a 4 million token context.

Gemini Flash 002 improved in math and logical abilities, surpassing Claude and GPT-4o.

You can simply use Gemini Flash for code completion, a git review tool, and much more.
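For example, something like this is all it takes with the google-generativeai package (a minimal sketch; swap in your own key and file):

    # Minimal sketch: a tiny "git review" style call to Gemini Flash with the
    # google-generativeai package (pip install google-generativeai).
    # Free API keys are available from Google AI Studio.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash-002")

    diff = open("changes.diff").read()
    response = model.generate_content(
        "Review this diff and point out likely bugs:\n\n" + diff
    )
    print(response.text)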


Is this sustainable though, or are they just trying really hard to attract users? If I build all of my tooling on it, will they start charging me thousands of dollars next year once the subsidies dry up? With a local model running with open source software, at least I can know that as long as my computer can still compute, the model will still run just as well and just as fast as it did on day 1, and cost the same amount of electricity


Facts. Google did the same thing you describe with Maps a few years ago.


It's not just Google; literally every new service does this. Prices will always go up once they have enough customers and the bean counters start pointing at spreadsheets. Ergo, local is the only option if you don't want to be held for ransom afterwards. As goes for web servers, scraper bots, and whatever, so goes for LLMs.


I think there are a few things to consider:

They make a ton of money on large enterprise package deals through Google Cloud. That includes API access but also support and professional services. Most orgs that pay for this stuff don't really need it, but they buy it anyways, as is consistent with most enterprise sales. That can give Google a significant margin to make up the cost elsewhere.

Gemini Flash is probably super cheap to run compared to other models. The cost of inference for many tasks has gone down tremendously over the past 1.5 years, and it's still going down. Every economic incentive aligns with running these models more efficiently.


Aren't API calls essentially swappable between vendors now?

If you wanted to switch from Gemini to ChatGPT you could copy/paste your code into ChatGPT and ask it to switch to their API.

Disclaimer: I work at Google but not on Gemini.


Not the number of tokens allowed per user. Google has the largest token windows.


Different APIs and models are going to come with different capabilities and restrictions.


It's Google. You know the answer ;)


I mean, there’s no need to dry up subsidies when the underlying product can just be deprecated without warning.


Run test queries on all platforms using something like litellm [1] and langtrace [2].

You may not be able to match large queries, but testing will help you transition to other services.

[1] https://github.com/BerriAI/litellm

[2] https://langtrace.ai/
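A rough sketch of what that can look like with litellm's Python API (assumes the relevant API keys are in your environment and Ollama is running for the local entry; model names may need adjusting):

    # Run the same prompt across several providers to compare answers/cost/latency
    # before committing to one. Model names are examples and may need updating.
    from litellm import completion

    PROMPT = [{"role": "user", "content": "Summarize this ticket in two sentences: ..."}]

    for model in [
        "gemini/gemini-1.5-flash",      # Google AI Studio
        "gpt-4o-mini",                  # OpenAI
        "claude-3-haiku-20240307",      # Anthropic
        "ollama/llama3.2",              # local, via a running Ollama server
    ]:
        resp = completion(model=model, messages=PROMPT)
        print(model, "->", resp.choices[0].message.content[:100])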


Google has deep pockets and SOTA hardware for training and inference.


It's "free cloud picture video storage" rush all over again


Are you asking whether giving away $5/day/user (what OpenAI charges) in compute is sustainable?


This is great for experimentation, but as others have pointed out recently, there are persistent issues with Gemini that prevent use in actual products. The recitation/self-censoring issue results in random failures:

https://github.com/google/generative-ai-docs/issues/257


I had this problem too, but I think 002 solves it (not tested exhaustively). I've not run into any problems since 002, and Vertex with "block all" on all safety settings is now working fine; earlier I had problems with "block all" in the safety settings and the API throwing errors.

I am using it in https://github.com/zerocorebeta/Option-K (currently it doesn't have the lowest safety settings because the API wouldn't allow it, but now I am going to push a new update with safety disabled).

Why? I have another application which has been working since yesterday's 002 launch. I have safety settings set to none, and while it would not answer certain questions before, since yesterday it answers everything.


And yet - if Gemini actually bothers to tell you when it detects verbatim copying of copyrighted content, how often must that occur on other AIs without notice?


Free of cost != free open model. Free of cost means all your requests are logged for Google to use as training data and whatnot.

Llama3.2 on the other hand runs locally, no data is ever sent to a 3rd party, so I can freely use it to summarize all my notes regardless of one of them being from my most recent therapy session and another being my thoughts on how to solve a delicate problem involving politics at work. I don't need to pre-classify all the input to make sure it's safe to share. Same with images, I can use Llama3.2 11B locally to interpret any photo I've taken without having to worry about getting consent from the people in the photo to share it with a 3rd party, or whether the photo is of my passport for some application I had to file or a receipt of something I bought that I don't want Google to train their next vision model OCR on.

TL;DR - Google free of cost models are irrelevant when talking about local models.


The free tier API isn't US-only; Google removed the free tier restriction for UK/EEA countries a while ago, with the added bonus of not training on your data if making a request from the UK/CH/EEA.


Not locked to the US: you get 1 billion tokens per month per model with Mistral since their recent announcement: https://mistral.ai/news/september-24-release/ (1 request per second is quite a harsh rate limit, but hey, free is free).

I'm pretty excited about what all these services adopting free tiers will do to the landscape, as that should allow for a lot more experimentation and a lot more hobby projects transitioning into full-time projects, which previously felt a lot more risky/unpredictable with pricing.


I saw that you mentioned https://github.com/simonw/llm/. Hadn't seen this before. What is its purpose? And why not use ollama instead?


llm is Simon's command-line front-end to a lot of the LLM APIs, local and cloud-based. Along with aider-chat, it's my main interface to any LLM work -- it works well with a chat model, one-off queries, and piping text or output into an LLM chain. For people who live on the command line, or are just put off by web interfaces, it's a godsend.

About the only thing I need to look further abroad for is when I'm working multi-modally -- I know Simon and the community are mainly noodling over the best command line UX for that: https://github.com/simonw/llm/issues/331


I use a fair amount of aider - what does Simon's solution offer that aider doesn't? I am usually using a mix of aider and the ChatGPT window. I use ChatGPT for one off queries that aren't super context heavy for my codebase, since pricing can still add up for the API and a lot of the times the questions that I ask don't really need deep context about what I'm doing in the terminal. But when I'm in flow state and I need deep integration with the files I'm changing I switch over to aider with Sonnet - my subjective experience is that Anthropic's models are significantly better for that use case. Curious if Simon's solution is more geared toward the first use case or the second.


The llm command is a general-purpose tool for writing shell scripts that use an LLM somehow. For example, generating some LLM output and sending it through a Unix pipeline. You can also use it interactively if you like working on the command line.

It’s not specifically about chatting or helping you write code, though you could use it for that if you like.
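It also has a Python API if you'd rather call it from a script than shell out; roughly like this (a sketch -- model IDs depend on which plugins and API keys you have set up):

    # Minimal sketch of llm's Python API (pip install llm). The model ID here
    # assumes you've configured an OpenAI key; plugins add local/other models.
    import llm

    model = llm.get_model("gpt-4o-mini")
    response = model.prompt("Summarize this in one sentence: " + open("notes.txt").read())
    print(response.text())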


I've only used ollama over cli. As per the parent poster -- do you know if there are advantages over ollama for CLI use? Have you used both?


Ollama can’t talk to OpenAI / Anthropic / etc. LLM gives you a single interface that can talk to both hosted and local models.

It also logs everything you do to a SQLite database, which is great for further analysis.

I use LLM and Ollama together quite a bit, because Ollama are really good at getting new models working and their server keeps those models in memory between requests.


You can run llamafile as a server, too, right? Still need to download gguf files if you don't use one of their premade binaries, but if you haven't set up llm to hit the running llamafile server I'm sure that's easy to do


I haven't used Ollama, but from what I've seen, it seems to operate at a different level of abstraction compared to `llm`. I use `llm` to access both remote and local models through its plugin ecosystem[1]. One of the plugins allows you to use Ollama-served local models. This means you can use the same CLI interface with Ollama[2], as well as with OpenAI, Gemini, Anthropic, llamafile, llamacpp, mlc, and others. I select different models for different purposes. Recently, I've switched my default from OpenAI to Anthropic quite seamlessly.

[1] - https://llm.datasette.io/en/stable/plugins/directory.html#pl... [2] - https://github.com/taketwo/llm-ollama


The llm CLI is much more unixy, letting you pipe data in and out easily. It can use hosted and local models, including ollama.


It looks like a multi-purpose terminal utility for bridging the terminal and your scripts or programs to both local and remote LLM providers.

And it looks very handy! I'll use this myself, because I do want to invoke OpenAI and other cloud providers just like I do with ollama and pipe things around, and this accomplishes that, and more.

https://llm.datasette.io/en/stable/

I guess you can also accomplish similar results, if you're just looking for `/chat/completions` and such, by configuring something like LiteLLM and connecting that to ollama or any other service.
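For example, Ollama exposes an OpenAI-compatible endpoint, so anything that already speaks `/chat/completions` can be pointed straight at it; a minimal sketch (port and model name are whatever your local setup uses):

    # Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint
    # (or at a LiteLLM proxy sitting in front of several providers).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored

    resp = client.chat.completions.create(
        model="llama3.2",
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.choices[0].message.content)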


There is a recent podcast episode with the tool's author https://newsletter.pragmaticengineer.com/p/ai-tools-for-soft...

It's worth listening to for context on how the tool is used.


I'm new to this game. I played with Gemma 2 9B in an agent-like role before and was pleasantly surprised. I just tried some of the same prompts with Llama 3.2 3B and found it doesn't stick to my instructions very well.

Since I'm a n00b, does this just mean Llama 3.2 3B instruct was "tuned more softly" than Gemma 2 instruct? That is, could one expect to be able to further fine-tune it to more closely follow instructions?


What are people using to check the token length of code bases? I'd like to point certain app folders at a local LLM, but I have no idea how that stuff is calculated. Seems like some strategic prompting (eg: this is a Rails app, here is the folder structure with file names, and btw here are the actual files to parse) would be more efficient than just giving it the full app folder? No point giving it stuff from /lib and /vendor for the most part, I reckon.


I use my https://github.com/simonw/ttok command for that - you can pipe stuff into it for a token count.

Unfortunately it only uses the OpenAI tokenizers at the moment (via tiktoken), so counts for other models may be inaccurate. I find they tend to be close enough though.
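Since ttok is tiktoken under the hood, you can also do it in a few lines of Python if you want to point it at specific folders (a sketch; the folder and glob assume the Rails layout from your question):

    # Count tokens for selected folders of a codebase with tiktoken.
    # Counts for Llama/Gemini etc. will differ a bit, but it's close enough
    # for rough context-window budgeting.
    from pathlib import Path
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    total = 0
    for path in Path("app").rglob("*.rb"):   # only app/, so /lib and /vendor are skipped
        n = len(enc.encode(path.read_text(errors="ignore")))
        total += n
        print(f"{n:>8}  {path}")
    print(f"{total:>8}  TOTAL")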


You can use llama.cpp server's tokenize endpoint to tokenize and count the tokens: https://github.com/ggerganov/llama.cpp/blob/master/examples/...
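Roughly like this, assuming a llama.cpp server running on the default port (the endpoint takes a JSON body with a "content" field and returns token IDs):

    # Count tokens using a running llama.cpp server's /tokenize endpoint,
    # so the count matches the tokenizer of the model you'll actually prompt.
    import requests

    text = open("app/models/user.rb").read()
    resp = requests.post("http://localhost:8080/tokenize", json={"content": text})
    print(len(resp.json()["tokens"]), "tokens")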


You can try Gemini's token count API: https://ai.google.dev/api/tokens
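A minimal sketch with the google-generativeai package (needs an API key):

    # Count tokens with the Gemini API's count_tokens call.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")
    print(model.count_tokens(open("app/models/user.rb").read()).total_tokens)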


Hi Simon, is there a way to run the vision model easily on my Mac locally?


Not that I've seen so far, but Ollama say a solution for that is coming "soon".


I doubt the Ollama team can do much about it. Ollama is just a wrapper on top of the heavy lifter.


The draft PRs are already up in the repo.


You can run it with LitServe (MPS GPU), here is the code - https://lightning.ai/lightning-ai/studios/deploy-llama-3-2-v...


Llama 3.0, 3.1, and 3.2 all use the TikToken tokenizer, which is OpenAI's open source tokenizer.


GP is talking about context windows, not the number of tokens used by the tokenizer.


Somewhat confusingly, it appears the tokenizer vocabulary as well as the context length are both 128k tokens!


Yup, that's why I wanted to clarify things.


This obsession with using AI to help with programming is short sighted.

We discover gold and you think of gold pickaxes.


If we make this an analogy to video games, gold pickaxes can usually mine more gold much faster.

What could be short sighted about using tools to improve your daily work?


We should be thinking about building golden products, not golden tools.



