
16GB of RAM can fit a 5-bit 13B model at best, which is the second-dumbest class of Llama model. If OpenOrca turns out any good then that might be enough for the time being, but you'll need more RAM to run anything serious.
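To ballpark what fits (the bits-per-weight and the ~20% overhead factor below are rough assumptions, not exact llama.cpp figures):

    # Rough RAM estimate: parameters * bits-per-weight / 8, plus slack for
    # the KV cache and runtime overhead. The 1.2x overhead is a guess.
    def est_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
        weight_bytes = params_billion * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 2**30

    print(f"13B @ 5-bit: ~{est_ram_gb(13, 5):.1f} GB")  # ~9 GB, fits in 16 GB
    print(f"15B @ 4-bit: ~{est_ram_gb(15, 4):.1f} GB")  # ~8 GB, also fits
    print(f"33B @ 4-bit: ~{est_ram_gb(33, 4):.1f} GB")  # ~18 GB, needs more RAM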

Here's a handy model comparison chart (this is a coding benchmark, so coding-only models tend to rank higher): https://i.imgur.com/AqSjjj2.jpeg




Your benchmark is missing the current #2: https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder

It beats Claude and Bard.

You could probably get a 4-bit 15B model going in 16GB of RAM and be approaching GPT-4 in capability.

...on an old laptop, lol

Let's eat OpenAI's lunch! They deserve it for trying to steal this tech by "privatizing" a charity, hiding scientific data that was supposed to be shared with us by said charity whose purpose was to help us all, and dishonestly trying to persuade the government not to let us compete with them.


Yeah, I mean I wouldn't really include coding models in this list since they're not general-purpose models and have an obvious fine-tuning edge compared to the rest. But WizardCoder is definitely something to look at as a Copilot replacement.

I'd post a more well-rounded benchmark, but the problem is that all non-coding benchmarks are currently more or less complete garbage, especially the Vicuna benchmark that rates everything as 99.7% of GPT-3.5, lol.


The benchmark you linked measures "programming performance", not generic LLM "intelligence".

The situation for the little guy is wildly better than most people imagine.


Yep, that's what I'm saying: programming performance seems to be very indicative of model intelligence (assuming the model is tuned well enough to run the benchmark at all). Coding is an exercise in problem solving and abstract thinking, after all.

There are exceptions, of course: a few models (e.g. Vicuna, Baize) don't do well at coding at all but otherwise perform well for chat, and the coding models I mentioned game the benchmark by sacrificing performance in all other areas.

If you exclude those, it's a very accurate comparison of overall reasoning level; at least it matches best with the performance I've seen on various tasks when testing out individual models. The only other valid non-coding benchmarks are the SAT and LSAT tests that OpenAI runs on all of their models, but afaik there isn't an open version in wide use.



