
16GB of RAM can fit a 5-bit 13B model at best, which is the second-dumbest class of Llama model. If OpenOrca turns out any good then that might be enough for the time being, but you'll need more RAM to run anything serious.
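To ballpark what fits (the bits-per-weight and the ~20% overhead factor below are rough assumptions, not exact llama.cpp figures):

    # Rough RAM estimate: parameters * bits-per-weight / 8, plus slack for
    # the KV cache and runtime overhead. The 1.2x overhead is a guess.
    def est_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
        weight_bytes = params_billion * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 2**30

    print(f"13B @ 5-bit: ~{est_ram_gb(13, 5):.1f} GB")  # ~9 GB, fits in 16 GB
    print(f"15B @ 4-bit: ~{est_ram_gb(15, 4):.1f} GB")  # ~8 GB, also fits
    print(f"33B @ 4-bit: ~{est_ram_gb(33, 4):.1f} GB")  # ~18 GB, needs more RAM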

Here's a handy model comparison chart (this is a coding benchmark, so coding-only models tend to rank higher): https://i.imgur.com/AqSjjj2.jpeg




Your benchmark is missing the current #2: https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder

It beats Claude and Bard.

You could probably get a 4-bit 15B model going in 16GB of RAM and be approaching GPT-4 in capability.

...on an old laptop, lol

Let's eat OpenAI's lunch! They deserve it for trying to steal this tech by "privatizing" a charity, hiding scientific data that was supposed to be shared with us by said charity whose purpose was to help us all, and dishonestly trying to persuade the government not to let us compete with them.


Yeah, I mean I wouldn't really include coding models in this list since they're not general-purpose models and have an obvious fine-tuning edge compared to the rest. But WizardCoder is definitely something to look at as a Copilot replacement.

I'd post a more well-rounded benchmark, but the problem is that all non-coding benchmarks are currently more or less complete garbage, especially the Vicuna benchmark that rates everything as 99.7% of GPT-3.5, lol.


The benchmark you linked measures "programming performance", not generic LLM "intelligence".

The situation for the little guy is wildly better than most people imagine.


Yep, that's what I'm saying: programming performance seems to be very indicative of model intelligence (assuming the model is tuned well enough to run the benchmark at all). Coding is an exercise in problem solving and abstract thinking, after all.

There are exceptions, of course: a few models (e.g. Vicuna, Baize) don't do well at coding at all but otherwise perform well for chat, and the coding models I mentioned game the benchmark by sacrificing performance in all other areas.

If you exclude those, it's a very accurate comparison of overall reasoning level; at least it matches best with the performance I've seen on various tasks when testing out individual models. The only other valid non-coding benchmarks are the SAT and LSAT tests that OpenAI runs on all of their models, but afaik there isn't an open version in wide use.



