Llama 3.1 (meta.com)
437 points by luiscosio 54 days ago | 269 comments



Related ongoing thread:

Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments)


The 405b model is actually competitive against closed source frontier models.

Quick comparison with GPT-4o:

    +----------------+-------+-------+
    |     Metric     | GPT-4o| Llama |
    |                |       | 3.1   |
    |                |       | 405B  |
    +----------------+-------+-------+
    | MMLU           |  88.7 |  88.6 |
    | GPQA           |  53.6 |  51.1 |
    | MATH           |  76.6 |  73.8 |
    | HumanEval      |  90.2 |  89.0 |
    | MGSM           |  90.5 |  91.6 |
    +----------------+-------+-------+


This model is not “open source”; free to use, maybe.


I really wish people would use "open weights" rather than "open source". It's precise and obvious, and leaves an accurate descriptor for actual "open source" models, where the source and methods that generate the artifact, that is, the weights, are open.


It's not precise. People who want to use "open weights" instead of "open source" are focusing on the wrong thing.

The weights are, for all practical purposes, source code in their own right. The GPL defines "source code" as "the preferred form of the work for making modifications to it". Almost no one would be capable of reproducing them even if given the source + data. At the same time, the weights are exactly what you need for the one type of modification that's within reach of most people: fine-tuning. That they didn't release the surrounding code that produced this "source" isn't that much different than a company releasing a library but not their whole software stack.

I'd argue that "source" vs "weights" is a dangerous distraction from the far more insidious word in "open source" when used to refer to the Llama license: "open".

The Llama 3.1 license [0] specifically forbids its use by very large organizations, by militaries, and by nuclear industries. It also contains a long list of forbidden use cases. This specific list sounds very reasonable to me on its face, but having a list of specific groups of people or fields of endeavor who are banned from participating runs counter to the spirit of open source and opens up the possibility that new "open" licenses come out with different lists of forbidden uses that sound less reasonable.

To be clear, I'm totally fine with them having those terms in their license, but I'm uncomfortable with setting the precedent of embracing the word "open" for it.

Llama is "nearly-open source". That's good enough for me to be able to use it for what I want, but the word "open" is the one that should be called out. "Source" is fine.

[0] https://github.com/meta-llama/llama-models/blob/main/models/...


Do the costs really matter here? "Weights" are "the preferred form of the work for making modifications to it" in the same sense compiled binary code would be, if for some reason no one could afford to recompile a program from sources.

Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable. Just because that's all anyone except the model vendor is able to do doesn't merit calling the models "open source", much like no one would call binary-only software "open source" just because reverse engineering is a thing.

No, the weights are just artifacts. The source is the dataset and the training code (and possibly the training parameters). This isn't fundamentally different from running an advanced solver for a year, to find a way to make your program 100 bytes smaller so it can fit on a Tamagotchi. The resulting binary is magic, can't be reproduced without spending $$$$ on compute for the solver, but it is not open source. The source code is the bit that (produced the original binary that) went into the optimizer.

Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.

--

[0] - https://en.wikipedia.org/wiki/DLL_injection

[1] - https://en.wikipedia.org/wiki/Trainer_(games) - a type of program popular some 20 years ago, used to cheat at, or mod, single-player games, by keeping track of and directly modifying the memory of the game process. Could be as simple as continuously resetting the ammo counter, or as complex as injecting assembly to add new UI elements.


> Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable.

No, because fine tuning is basically just a continuation of the same process that the original creators used to produce the weights in the first place, in the same way that modifying source code directly is in traditional open source. You pick up where they left off with new data and train it a little bit (or a lot!) more to adapt it to your use case.

The weights themselves are the computer program. There exists no corresponding source code. The code you're asking for corresponds not to the source code of a traditional program but to the programmers themselves and the processes used to write the code. Demanding the source code and data that produced the weights is equivalent to demanding a detailed engineering log documenting the process of building the library before you'll accept it as open source.

Just because you can't read it doesn't make it not source code. Once you have the weights, you are perfectly capable of modifying them following essentially the same processes the original authors did, which are well known and well documented in plenty of places with or without the actual source code that implements that process.

> Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.

I agree wholeheartedly, but not because of "source". The sleight of hand is getting people to focus on that instead of the really problematic word.


The thing is, the core of the GPT architecture is like 40 lines of code. Everyone knows what the source code is basically (minus optimizations). You just need to bring your own 20TB in data, 100k GPUs, and tens of millions in power budget, and you too can train llama 405b.
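
For anyone who hasn't seen it spelled out, here's a rough sketch of what that core looks like in plain PyTorch; this leaves out everything Llama actually adds (RoPE, grouped-query attention, RMSNorm, KV caching) and the entire data/training pipeline, so treat it as an illustration, not the real thing:

    # Minimal GPT-style decoder block, for illustration only (PyTorch 2.x).
    # Hyperparameters are made up; real models stack dozens of these blocks.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.n_heads = n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
            self.proj = nn.Linear(d_model, d_model, bias=False)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
            )
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):                       # x: (batch, seq, d_model)
            b, t, d = x.shape
            q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
            # split into heads: (batch, heads, seq, head_dim)
            q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                       for z in (q, k, v))
            # causal self-attention
            att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.proj(att.transpose(1, 2).reshape(b, t, d))
            x = x + self.mlp(self.ln2(x))           # position-wise feed-forward
            return x

    x = torch.randn(1, 16, 512)
    print(DecoderBlock()(x).shape)  # torch.Size([1, 16, 512])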


If I understood the article correctly he intends to let the community make suggestions to selected developers who work on the source somehow. So maybe part of the source will be made visible.


It's not open source. Your definition would make most video games open source - we modify them all the time. The small runtime framework IS open source, but that's not much benefit as you can't really modify it hugely because the weights fix it to an implementation.


> Your definition would make most video games open source - we modify them all the time.

No, because most video games aren't licensed in a way that makes that explicitly authorized, nor is modding the preferred form of the work for making modifications. The video game has source code that would be more useful, the model does not have source code that would be more useful than the weights.


As far as I know it's not just the weights. It's everything but the dataset. So the code used to generate the weights is also open source.


Is there any other case where "open source" is used for something that can't be reproduced? Seems like a new term is required, in the concept of "open source, non-reproducible artifacts".

I suppose language changes. I just prefer it changes towards being more precise, not less.


This feels somewhat analogous to games like Quake being open-sourced though still needing the user to provide the original game data files.


But games like Quake are not "open source". They have been open-sourced, specifically the executable parts were, without the assets. This is usually spelled out clearly as the process happens.

In terms of functional role, if we're to compare the models to open-sourced games, then all that's been open-sourced is the trivial[0] bit of code that does the inference.

Maybe a more adequate comparison would be a SoC running a Linux kernel with a big NVidia or Qualcomm binary blob in the middle of it? Sure, the Linux kernel is open source, but we wouldn't call the SoC "open source", because all that makes it what it is (software-side) is hidden in a proprietary binary.

--

[0] - In the sense that there's not much of it, and it's possible to reproduce from papers.


Yes its "freeware" or any one of the similar terms we've used to refer to free software.


Academia - nowadays source is needed in a lot of conferences, but the datasets, depending on where/how they might have been obtained, either can't be used or aren't available, and the exact results can't be reproduced.

Not sure if the code is required under an open source license, but it's the same issue.

---

IMO, source is source and can be used for other datasets. Dataset isn't available, bring your own.

In this case, the source is there. The output is there, and not technically required. What isn't available is the ability to confirm the output comes from that source. That's not required under open source though.

What's disingenuous is the output being called 'open source'.


No, the term is fine, “source” in “open source” refers to source code. A dataset by definition is not source code. Stop changing the meaning of words.


A dataset very much is the source code. It's the part that gets turned into the program through an automated process (training is equivalent to compilation).


In other words, it's everything except the one thing that actually matters.


The dataset is likely absolutely jam packed with copyrighted material that cannot be distributed.


Maybe, but it doesn't mean it's not open source.


The things that don't matter are, the thing that does isn't. Together, they can hardly be called open source.


> where the source and methods that generate the artifact, that is, the weights, are open.

When we require the same thing of software, namely that the whole stack needed to run the software in question be open source, we don't call the license open source.


Nope. Those model releases only open source the equivalent of "run.bat" that does some trivial things and calls into a binary blob. We wouldn't call such a program "open source".

Hell, in case of the models, "the whole stack to run the software" already is open source. Literally everything except the actual sources - the datasets and the build scripts (code doing the training) - is available openly. This is almost a literal inverse of "open source", thus shouldn't be called "open source".


Training a model is like automatic programming, and the key to it is having a well-organized dataset.

If some "open source" model just has the model and training methods but no dataset, it's like a repo that released an executable file with a detailed design doc. Where is the source code? Do it yourself, please.

NOTE: I understand the difficulty of open-sourcing datasets. I'm just saying that the term "open source" is getting diluted.


It’s not even free to use. There are commercial restrictions.


Super cool, though sadly the 405B model will be outside most personal usage without cloud providers, which sorta defeats the purpose of open source to some extent at least, because... Nvidia's ramp-up of consumer VRAM is glacial.


Zoom out a bit. There’s a massive feeder ecosystem around llama. You’ll see many startups take this on and help drive down inference costs for everyone and create competitive pressure that will improve the state of the art.


I agree that 405B isn't practical for home users, but I disagree that it defeats the purpose of open source. If you're building a business on inference it can be valuable to run an open model on hardware that you control, without the need to worry that OpenAI or Anthropic or whoever will make drastic changes to the model performance or pricing. Also, it allows the possibility of fine-tuning the model to your requirements. Meta believes it's in their interest to promote these businesses.

I'd think of the 405B model as the equivalent to a big rig tractor trailer. It's not for home use. But also check out the benchmark improvements for the 70B and 8B models.


The fact that it takes $20k to run your own SOTA model, instead of the $2B+ that it took until yesterday, is significant.


If you think of open source as a protocol through which the ecosystem of companies loosely collaborate, then it's a big deal. E.g. Groq can work on inference without a complicated negotiations with Meta. Ditto for Huggingface, and smaller startups.

I agree with you on open source in the original, home tinkerer sense.


Most SMBs would be able to run it. This is already a huge win for decentralized AI.


You don't need a model of this scale for personal use. Llama 3.1 8B can easily run on your laptop right now. The 70B model can run on a pair of 4090s.


I have the 70b model running quantized just fine on an M1 Max laptop with 64GiB unified RAM. Performance is fine and so far some Q&A tests are impressive.

This is good enough for a lot of use cases... on a laptop. An expensive laptop, but hardware only gets better and cheaper over time.


Just for reference, the current version of that laptop costs 4800€ (14 inch macbook pro, m3 max, 64gb of ram, 1TB of storage). So price-wise that is more like four laptops.


I think they were referring to the form factor not the price. But even then the price of four laptops is not out of line for enthusiast hobby spending.

Ever priced out a four wheeler, a jet-ski, a filled gun safe, what a "car guy" loses in trade in values every two years, what a hobbyist day-trader is losing before they cut their losses or turn it around, or what a parent who lives vicariously through their child and drags them all over their nearby states for overnight trips so they can do football/soccer/ballet/whatever at 6am on Saturdays against all the other kids who also won't become pro athletes? What about the cost of a wingsuit or getting your pilots license? "Cruisers" or annual-Disney vacationers? If you bought a used CNC machine from a machine shop? But spend five grand on a laptop to play with LLMs and everyone gets real judgmental.


I have the same machine, may I ask which model file and program are you using? Is it partial GPU offload?


I don't have the hardware to confirm this, so I'd take it with a grain of salt, but ChatGPT tells me that a maxed out M3 MacBook Pro with 128 GB RAM should be capable of efficiently running Llama 3.1 405B, albeit with essentially no ability to multitask.

(It also predicted that a MacBook Air in 2030 will be able to do the same, and that for smartphones to do the same might take around 20 years.)


I’ve run the Falcon 180B on my M3 Max with 128 GB of memory. I think I ran it at 3-bit. Took a long time to load and was incredibly slow at generating text. Even if you could load the Llama 405B model it would be too slow to be of much use.


Ah, that's a shame to hear. FWIW, ChatGPT did also suggest that there was a lot of room for improvement in the MPS backend of PyTorch that would likely make it more efficient on Apple hardware in time.


You fundamentally misunderstand the bottleneck of large LLMs. It is not really possible to make gains that way.

A 405B LLM has 405 billion parameters. If you run it at full "precision", each parameter takes up 2 bytes, which means you need 810GB of memory. If it does not fit in RAM or GPU memory it will swap to disk and be unusably slow.

You can run the model at reduced precision to save memory, called quantisation, but this will degrade the quality of the response. The exact amount of degradation depends on the task, the specific model and its size. Larger models seem to suffer slightly less. 1 byte per parameter is pretty much as good as full precision. 4 bits per parameter is still good quality, 3 bits is noticeably worse and 2 bits is often bad to unusable.

With 128GB of RAM, zero overhead and a 405B model, you would have to quantize to about 2.5 bits, which would noticeably degrade the response quality.

There is also model pruning, which removes parameters completely, but this is much more experimental than quantisation, also degrades response quality, and I have not seen it used that widely.
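
To make the arithmetic concrete, here's a quick back-of-the-envelope calculation (weights only; KV cache and runtime overhead are extra):

    # Weight memory for a 405B-parameter model at various precisions.
    PARAMS = 405e9

    for name, bits in [("fp16/bf16", 16), ("int8/fp8", 8), ("4-bit", 4), ("3-bit", 3), ("2.5-bit", 2.5)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name:>10}: {gb:6.0f} GB")

    # fp16/bf16:    810 GB
    #  int8/fp8:    405 GB
    #     4-bit:    203 GB
    #     3-bit:    152 GB
    #   2.5-bit:    127 GB  <- roughly what fits in 128 GB, as noted above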


I appreciate the additional information, but I'm not sure what you're claiming is a fundamental misunderstanding on my part. I was referring to running the model with quantization, and was clear that I hadn't verified the accuracy of the claims.

The comment about the MPS PyTorch backend was related to performance, not whether the model would fit at all. I can't say whether it's accurate that the MPS backend has significant room for optimization, but it is still publicly listed as in beta.


Yes my mistake, I read your answer to mean that you think that the model could fit into the memory with the help of efficiency gains.

I would be sceptical about increasing efficiency. I'm not that familiar with the subject, but as far as I know, LLMs for single users (i.e. with batch size 1) are practically always limited by the memory bandwidth. The whole LLM (if it is monolithic) has to be completely loaded from memory once for each new token (which is about 4 characters). With 400GB per second memory bandwidth and 4-bit quantisation, you are limited to 2 tokens per second, no matter how efficiently the software works. This is not unusable, but still quite slow compared to online services.
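
To put rough numbers on that, under the same assumptions (batch size 1, the full set of weights streamed from memory once per generated token):

    # Rough upper bound on single-stream decode speed: memory bandwidth divided by
    # the bytes of weights read per token. Numbers are the ones assumed above.
    params = 405e9
    bits_per_weight = 4            # 4-bit quantisation
    bandwidth = 400e9              # ~400 GB/s memory bandwidth

    bytes_per_token = params * bits_per_weight / 8
    print(bytes_per_token / 1e9, "GB read per token")   # ~202.5 GB
    print(bandwidth / bytes_per_token, "tokens/s")      # ~2 tokens/s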


Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.


GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200GB. I would speculate that it might be possible to build such an LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to consider, especially now that the datacentre is so lucrative.

Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.


Most "local model runners" (Llama.CPP, Llama-file etc) don't use Pytorch and instead implement the neural network directly themselves optimized for whatever hardware they are supporting.

For example here's the list of backends for Llama.cpp: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...


Ah okay, interesting, thanks.


100% reddit is full of people trying to solder more vram


I've been wondering if you could just attach a chunk of vram over NVLink, since that's very roughly what FSDP is doing here anyways.


The best NVLINK you can reasonably purchase is for the 3090, which is capped somewhere around 100 Gbit/s. This is too slow. The 3090 has about 1 TB/s memory bandwidth, and the 4090 is even faster, and the 5090 will be even faster.

PCIE 5.0 x16 is 500 Gbit/s if I'm not mistaken, so using RAM is more viable an alternative in this case.

Edit: 3090 has 1 TB/s, not terabits


> sorta defeats the purpose of open source to some extent

Not in the slightest. They even have a table of cloud providers where you can host the 405B model and the associated cost to do so on their website: https://llama.meta.com/ (Scroll down)

"Open Source" doesn't mean "You can run this on consumer hardware". It just means that it's open source. They also released 8B and 70B models for people to use on consumer gear.


You might be able to get away with running a heavily quantized 405B model using CPU inference at a blistering fast token every 5 seconds on a 7950X.


OK, I am curious now: What kind of hardware would I need to run such a model for a couple of users with decent performance?

Where could I get a mapping of token / time vs hardware?


You can run the 4-bit GPTQ/AWQ quantized Llama 405B somewhat reasonably on 4x H100 or A100. You will be somewhat limited in how many tokens you can have in flight between requests and you cannot create CUDA graphs for larger batch sizes. You can run 405B well on 8x H100 and A100, either with the mixed BFloat16/FP8 checkpoint that Meta provided or GPTQ/AWQ-quantized models. Note though that the A100 does not have native support for FP8, but FP8 quantized weights can be used through the GPTQ-Marlin FP8 kernel.

Here are some TGI 405B benchmarks that I did with the different quantized models:

https://x.com/danieldekok/status/1815814357298577718

The 405B model is very useful outside direct use in inference though. E.g. for generating synthetic data for training smaller models:

https://huggingface.co/blog/synthetic-data-save-costs


how much vram do you need for 4-bit llama 405?


405 billion * 4 bits = approximately 200 GB. Plus extra for the amount of context you want.
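
For the "extra" part, a rough sketch of the KV-cache size, using the architecture numbers reported for the 405B model (126 layers, 8 KV heads, head dimension 128; treat those as assumptions) and a 16-bit cache:

    # Rough KV-cache estimate for a GQA model like Llama 3.1 405B.
    # Architecture numbers below are the reported ones; treat them as assumptions.
    n_layers, n_kv_heads, head_dim = 126, 8, 128
    bytes_per_value = 2                      # fp16/bf16 cache

    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    print(per_token / 1e6, "MB of cache per token")          # ~0.5 MB
    print(per_token * 128_000 / 1e9, "GB at 128k context")   # ~66 GB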


Unsure if anyone has specific hardware benchmarks for the 405b model yet, since it's so new, but elsewhere in this thread I outlined a build that'd probably be capable of running a quantized version of Llama 3.1 405b for roughly $10k.

The $10k figure is likely roughly the minimum amount of money/hardware that you'd need to run the model at acceptable speeds, as anything less requires you to compromise heavily on GPU cores (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or run on the CPU entirely, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.


Energy costs are an important factor here too. While Quadro cards are much more expensive upfront (higher $/VRAM), they are cheaper over time (lower Watts/Token). Offsetting the energy expense of a 3090/4090/5090 build via solar complicates this calculation but generally speaking can be a "reasonable" way of justifying this much hardware running in a homelab.

I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.


Agree 100% that energy costs are important. The example system in my other post would consume somewhere around 300W at idle, 24/7, which is 219 kWh per month, and that's assuming you aren't using the machine at all.

I don't have any actual figures to back this up, but my gut tells me that the fact that enterprise GPUs are an order of magnitude (at least) more expensive than, say a, 3090, means that the payback period of them has got to be pretty long. I also wonder whether setting the max power on a 3090 to a lower than default value (as I suggest in my other post) has a significant effect on the average W/token.


Agreed, but there are other costs associated with supporting 10-16x GPUs that may not necessarily happen with say 6 GPUs. Having to go from single socket (or Threadripper) to dual socket, PCIE bifurcation, PLX risers, etc.

Not necessarily saying that Quadros are cheaper, just that there's more to the calculation when trying to run 405B size models at home


The system I outlined in my other post [0] has ten GPUs and does not require dual socket CPUs as far as I'm aware. It could likely scale easily to 14 GPUs as well (assuming you have sufficient power), with an x8/x8 bifurcation adapter installed in each PCIe slot. This is pushing the limits of the PCIe subsystem I'm sure, but you could also likely scale up to 28 GPUs, again assuming sufficient power, by simply bifurcating at x4/x4/x4/x4 vs x8/x8.

I think it should work as-is with the components listed, but if you disagree please let me know!

[0] https://news.ycombinator.com/item?id=41047689


I don't think this is correct. 5 years power usage of 4090 is $2600 giving TCO of ~$4300. RTX 6000 Ada starts at $6k for the card itself.

https://gpuprices.us


To be fair, you need 2x 4090 to match the VRAM capacity of an RTX 6000 Ada. There is also the rest of the system you need to factor into the cost. When running 10-16x 4090s, you may also need to upgrade your electrical wiring to support that load, you may need to spend more on air conditioning, etc.

I'm not necessarily saying that it's obviously better in terms of total cost, just that there are more factors to consider in a system of this size.

If inference is the only thing that is important to someone building this system, then used 3090s in x8 or even x4 bifurcation is probably the way to go. Things become more complicated if you want to add the ability to train/do other ML stuff, as you will really want to try to hit PCIE 4.0 x16 on every single card.


With 2x 4090 you will have 2x speed of RTX 6000 A. So same energy per token.

Will need more space, true.


Yeah, after digging more into RTX 6000 Ada cards, I don't see any way they'd be more economical even over many years, no matter how you slice it.


Great for Groq, who's already hosting it, but at what cost I guess.


Groq provides a limited free tier for now: https://wow.groq.com/now-available-on-groq-the-largest-and-m...


You can probably run it on your local PC at 1 token/minute.


How do you draw/generate such an ASCII table?


In the past, I might have used a python library like asciitable to do that.

This time, I just copy pasted the raw metrics I found and asked an LLM to format it as an ASCII table.


Don't know about OP but I generate such tables using Emacs.


I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboard.

GPT-4o 30.7

GPT-4 turbo (2024-04-09) 29.7

Llama 3.1 405B Instruct 29.5

Claude 3.5 Sonnet 27.9

Claude 3 Opus 27.3

Llama 3.1 70B Instruct 26.4

Gemini Pro 1.5 0514 22.3

Gemma 2 27B Instruct 21.2

Mistral Large 17.7

Gemma 2 9B Instruct 16.3

Qwen 2 Instruct 72B 15.6

Gemini 1.5 Flash 15.3

GPT-4o mini 14.3

Llama 3.1 8B Instruct 14.0

DeepSeek-V2 Chat 236B (0628) 13.4

Nemotron-4 340B 12.7

Mixtral-8x22B Instruct 12.2

Yi Large 12.1

Command R Plus 11.1

Mistral Small 9.3

Reka Core-20240501 9.1

GLM-4 9.0

Qwen 1.5 Chat 32B 8.7

Phi-3 Small 8k 8.4

DBRX 8.0


I love Connections! Can you tell us more about your benchmark?


You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.

If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....

(disclaimer, I am a Groq employee)


We also added Llama 3.1 405B to our VSCode copilot extension for anyone to try coding with it.

Free trial gets you 50 messages, no credit card required - https://double.bot

(disclaimer, I am the co-founder)


would be great if there was a page showing benchmarks compared to other auto completion tools


Groq's TSP architecture is one of the weirder and more wonderful ISAs I've seen lately. The choice of SRAM is fascinating. Are you guys planning on publishing anything about how you bridged the gap between your order-hundreds-of-megabytes SRAM TSP main memory and multi-TB model sizes?


There is a lot out there.

I gave a seminar about the overall approach recently, abstract: https://shorturl.at/E7TcA, recording: https://shorturl.at/zBcoL.

This two-part AMA has a lot more detail if you're already familiar with what we do:

https://www.youtube.com/watch?v=UztfweS-7MU

https://www.youtube.com/watch?v=GOGuSJe2C6U


Thanks!


You can chat with all these models for free and with ultra-low latency using this hosted website by the GitHub founder: https://nat.dev/chat


Just checked it out. Is pay-as-you-go API access available at all? It says 'Coming Soon'

https://console.groq.com/settings/billing


I've found Bedrock to be nice with pay-as-you-go, but they take a long time to adopt new models.


And twice as expensive in comparison to the source providers’ APIs


I think you answered it yourself? It’s coming soon, so it is not available now, but soon.


It's been coming soon for a couple of months now, meanwhile Groq churns out a lot of other improvements, so to an outsider like me it looks like it's not terribly high on their list of priorities.

I'm really impressed by what (&how) they're doing and would like to pay for a higher rate limit, or failing that at least know if "soon" means "weeks" or "months" or "eventually".

I remember TravisCI did something similar back in the day, and then Circle and GitHub ate their lunch.


405B is already being served on WhatsApp!

https://ibb.co/kQ2tKX5


How do you get that option?



At what quantisation are you running these?


Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology.

Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path...


> at home with the right hardware

Where the right hardware is 10x4090s even at 4 bits quantization. I'm hoping we'll see these models get smaller, but the GPT-4-competitive one isn't really accessible for home use yet.

Still amazing that it's available at all, of course!


It's hardly cheap starting at about $10k of hardware, but another potential option appears to be using Exo to spread the model across a few MBPs or Mac Studios: https://x.com/exolabs_/status/1814913116704288870


Or maybe using Distributed Llama? https://github.com/b4rtaz/distributed-llama


It's not really competitive though, is it? I tested it and 4o is just better.


Disclaimer: I tested Llama 3 8B; 3.1 might be better even as a small model, but so far I have not seen a single small model approach 4o, ime.


Open Source AI Is the Path Forward - Mark Zuckerberg

https://about.fb.com/news/2024/07/open-source-ai-is-the-path...


So are they actually making the models open now or are they staying the course with "kind of open" as they have done for LLaMA 1, 2, and 3 [1]?

[1]: https://opensource.org/blog/metas-llama-2-license-is-not-ope...

As I have stated time and again, it is perfectly fine for them to slap on whatever license they see fit as it is their work. But it would be nice if they used appropriate terms so as not to disrupt the discourse further than they have already done. I have written several walls of text why I as a researcher find Facebook's behaviour problematic so I will fall back on an old link [2] this time rather than writing it all over again.

[2]: https://news.ycombinator.com/item?id=38427832


> it is perfectly fine for them to slap on whatever license they see fit as it is their work.

Is it? Has there been a ruling on the enforceability of the license they attach to their models yet? Just because you say what you release can only be used for certain things doesn't actually mean what you say means anything.


> specifically, it puts restrictions on commercial use for some users (paragraph 2) and also restricts the use of the model and software for certain purposes (the Acceptable Use Policy)

It's "a Google and Apple can't use this model in production" clause that frankly we can all be relatively okay with.


Good, then we can expect them to call it what it is? Not open source and not open science, and a regression in terms of openness relative to what came before. Because that is precisely my objection. There are those of us who have been committed to those ideals for a long time, and now one of the largest corporations on earth is appropriating those terms for marketing purposes.


I think it's great that you're fighting to maintain the term's fundamental meaning. I do, however, think that we need to give credit where credit is due to companies who take actions in the right direction to encourage more companies to do the same. If we blindly protest any positive-impact action by corporations for not being perfect, they'll get the hint and stop trying to appease the community entirely.


I am in agreement. However, I do believe that a large portion of the community here is also missing a key point: Facebook was more open five years ago with their AI research than they are today. I suspect this perspective is because of the massive influx of people into AI around the time of the ChatGPT release. From their point of view, Facebook's move (although dishonestly labelled as something it is not) is a step in the right direction relative to "Open"AI and others. While for us that have been around for longer, openness "peaked" around 2018 and has been in steady decline ever since. If you see the wall of text I linked in my first comment in this chain, there is a longer description of this historical perspective.

It should also be noted (again) that the value of the terms open science and open source comes from the sacrifices and efforts of numerous academic, commercial, personal, etc. actors over several decades. They "paid" by sticking to the principles of these movements, and Facebook is now cashing in on their efforts, solely for their own benefit. Not even Microsoft back in 2001, in the age of "fear, uncertainty and doubt", was so dishonest as to label the source-available portions of their Shared Source Initiative [1] as something it was not. Facebook has been called out again and again since the release of LLaMA 1 (which in its paper appropriated the term "open") and has shown no willingness to reconsider their open science and open source misuse. At this point, I can no longer give them the benefit of the doubt. The best defence I have heard is that they seek to "define open in the 'age of AI'", but if that were the case, where are their consensus-building efforts akin to what we have seen numerous academics and OSI carry out? No, sadly the only logical conclusion is that it is cynical marketing on their part, both from their academics and business people.

[1]: https://en.wikipedia.org/wiki/Shared_Source_Initiative

In short. I think the correct response to Facebook is: "Thank you for the weights, we appreciate it. However, please stop calling your actions and releases something they clearly are not."


Totally agree. Your suggested response is perfect IMO.


But it means your company can't be acquired by those giants if you use this model.


I'm glad someone said it.

You're only ok with it if you're not interested in having maximum freedom of movement vis-a-vis any potential exits.


Meta the new "Open" AI?


Until they make a model much better than competitors' to actually start capitalizing on it.


You can already run these models locally with Ollama (ollama run llama3.1:latest) along with at places like huggingface, groq etc.

If you want a playground to test this model locally or want to quickly build some applications with it, you can try LLMStack (https://github.com/trypromptly/LLMStack). I wrote last week about how to configure and use Ollama with LLMStack at https://docs.trypromptly.com/guides/using-llama3-with-ollama.

Disclaimer: I'm the maintainer of LLMStack


You are a maintainer of software that depends on ollama, so you should know that ollama depends on llama.cpp. And as of now, llama.cpp doesn't support the new RoPE scaling: https://github.com/ggerganov/llama.cpp/issues/8650, and all ollama can do is wait for llama.cpp: https://github.com/ollama/ollama/issues/5881


I've tested Q4 on M1 and it works, though the quality may not be what you'd expect, as others have pointed out on the issue.


I have found Claude 3.5 Sonnet really good for coding tasks along with the artifacts feature, and it seems like it's still the king on the coding benchmarks.


I have found it to be better than GPT-4o at math too, despite the latter being better at several math benchmarks.


My experience reflects this too. My hunch is that GPT-4o was trained to game the benchmarks rather than output higher quality content.

In theory the benchmarks should be a pretty close proxy for quality, but that doesn't match my experience at all.


A problem with a lot of benchmarks is that they are out in the open so the model basically trains to game them instead of actually acquiring knowledge that would let it solve it. Probably private benchmarks that are not in the training set of these models should give better estimates about their general performance.


I personally disagree. But I haven't used Sonnet that much.


I asked both whether the product of two odds (odds = probability/(1-probability)) can itself be interpreted as an odds, and if so, which. Neither could solve the problem completely, but Claude 3.5 Sonnet at least helped me to find the answer after a while. I assume the questions in math benchmarks are different.


Yeah, same experience here as well. I found Sonnet 3.5 to fulfill my tasks much better than 4o, even though 4o scores higher on benchmarks.


The LMSys Overall leaderboard <https://chat.lmsys.org/?leaderboard> can tell us a bit more about how these models will perform in real life, rather than in a benchmark context. By comparing the ELO score against the MMLU benchmark scores, we can see models which outperform / underperform based on their benchmark scores relative to other models. A low score here indicates that the model is more optimized for the benchmark, while a higher score indicates it's more optimized for real-world examples. Using that, we can make some inferences about the training data used, and then extrapolate how future models might perform. Here's a chart: <https://docs.getgrist.com/gV2DtvizWtG7/LLMs/p/5?embed=true>

Examples: OpenAI's GPT 4o-mini is second only to 4o on LMSys Overall, but is 6.7 points behind 4o on MMLU. It's "punching above its weight" in real-world contexts. The Gemma series (9B and 27B) are similar, both beating the mean in terms of ELO per MMLU point. Microsoft's Phi series are all below the mean, meaning they have strong MMLU scores but aren't preferred in real-world contexts.

Llama 3 8B previously did substantially better than the mean on LMSys Overall, so hopefully Llama 3.1 8B will be even better! The 70B variant was interestingly right on the mean. Hopefully the 405B variant won't fall below!


Something is broken with "meta-llama-3.1-405b-instruct-sp" and "meta-llama-3.1-70b-instruct-sp" there; after a few sentences both models switch to infinite random output like: "Rotterdam计算 dining counselor/__asan jo Nas было /well-rest esse moltet Grants SL и Four VIHu-turn greatest Morenh elementary(((( parts referralswhich IMOаш ...".

Don't expect any meaningful score there before they wipe results.


Good to know, but just to clarify, the results I pulled don't include the 3.1 models yet (they aren't on the leaderboard yet).


These days, lmsys elo is the only thing I trust. The other benchmark scores mean nothing to me at this point


I disagree. Not saying the other benchmarks are better. It just depends on your use case and application.

For my use of the chat interface, I don't think lmsys is very useful. lmsys mainly evaluates relatively simple, low token count questions. Most (if not all) are single prompts, not conversations. The small models do well in this context. If that is what you are looking for, great. However, it does not test longer conversations with high token counts.

Just saying that all benchmarks, including lmsys, have issues and are focused on specific use cases.


The biggest win here has to be the context length increase to 128k from 8k tokens. Till now my understanding is there haven't been any open models anywhere close to that.


It is notable, but it's not alone. Mistral NeMo just released last week with a 128k context window:

https://news.ycombinator.com/item?id=40996058


Thanks! Not sure how I missed that :)


It's easy to miss things. Trying to keep up with the latest in AI news is like drinking from the firehose -- it's never-ending.


Phi 3


@dang why was this removed/filtered from the front page?


I see a few cloud hosting providers for it on the front page. I wonder if it's being gamed.


Is there pricing available on any of these vendors?

Open source models are very exciting for self hosting, but the per-token hosted inference pricing hasn't been competitive with OpenAI and Anthropic, at least for a given tier of quality. (E.g.: Llama 3 70B costing between $1 and $10 per million tokens on various platforms, but Claude Sonnet 3.5 is $3 per million.)


Llama 3 is 0.59/0.79 on Groq. Still no price for 3.1


The resources linked from the page for the model card[1], research paper, and Prompt Guard Tutorial[2] don't exist yet.

[1]: https://github.com/meta-llama/llama-models/blob/main/models/...

[2]: https://github.com/meta-llama/llama-recipes/blob/main/recipe...


> We use synthetic data generation to produce the vast majority of our SFT examples, iterating multiple times to produce higher and higher quality synthetic data across all capabilities. Additionally, we invest in multiple data processing techniques to filter this synthetic data to the highest quality. This enables us to scale the amount of fine-tuning data across capabilities. [0]

Have other major models explicitly communicated that they're trained on synthetic data?

[0]. https://ai.meta.com/blog/meta-llama-3-1/


It's in the <7B club, but Phi has always had a good dose of synthetic data https://huggingface.co/microsoft/Phi-3-mini-4k-instruct


Technically this is post training. This has been standard for a long time now - I think InstructGPT (gpt 3.5 base) was the last that used only human feedback (RLHF)


"Meta AI isn't available yet in your country" Hi from europe :/


Why are (some) Europeans surprised when they are not included in tech product débuts? My lay understanding could best be described as: EU law is incredibly business-unfriendly and takes a heroic effort in time and money to implement the myriad of requirements therein. Am I wrong?


> Why are (some) Europeans surprised when they are not included in tech product débuts?

Why do you think he is surprised? I think very few are surprised.


> Why are (some) Europeans surprised when they are not included in tech product débuts?

We had a brief, abnormal, and special moment in time after the crypto wars ended in the mid-2000s where software products were truly global, and the internet was more or less unregulated and completely open (at least in most of the world). Sadly it seems that this era has come to a close, and people have not yet updated their understanding of the world to account for that fact.

People are also not great at thinking through the second order effects of the policies they advocate for (e.g. the GDPR), and are often surprised by the results.


The only real requirement impacting Meta AI is GDPR conformance. The DMA does not apply and the AI act has yet to enter into force. So either Meta AI is a vehicle to steal people’s data, and it is being kept out for the right reasons, or not providing it is punitive due to the EU commission’s DMA action running against Meta.


You are pretty wrong. EU law is tricky on AI very specifically in this use case (because it's a massive model), but that's not affecting anybody else.

Other than that, and GDPR (which is generally now regarded as a good thing), I'm not sure what requirements you've got in mind.


Most things do début in the EU, unless the product or company behind it doesn't value your privacy. Meta does not value your privacy.


Privacy was the first thing that the EU did that started this trend of companies slowing their EU releases because of GDPR. Now there's the Digital Markets Act and the AI Act that both have caused companies to slow their releases to the EU.

Each new large regulation adds another category of company to the list of those who choose not to participate. Sure, you can always label them as companies who don't value principle X, but at some point it stops being the fault of the companies and you have to start looking at whether there are too many enormous regulations slowing down tech releases.


This is an interesting point.

The word fault somehow implies that something’s wrong - from the eu regulator’s perspective, what’s happening is perfectly normal, and what they want : at some point, the advances in insert new tech are not worth the (social) cost to individuals, so they make things more complicated/ ask companies to behave differently.

Now I’m not saying the regulations are good, required, etc : just that depending on your goal, there are multiple points of view, with different landing zones.

I also suspect that what’s happening now ( meta, apple slowing down) is a power play : they’re just putting pressure on the eu, but I’m harboring doubts that this can work at all.


Competition is a funny thing—it doesn't just apply to companies competing for customers, it also applies to governments competing for companies to make products available to their citizens. Turns out that if you make compliance with your laws onerous enough they can actually just choose to opt out of your country altogether, or at a minimum delay release in your country until they can check all your boxes.

The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.


Be careful what you wish for.

A Gibsonesque global Turing Police is a sure sign of Dystopia.


> The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.

Let's hope the next moustached guy that tries to do this ends up dying in a bunker just like the last one.


You can load the page using a VPN and then turn off the VPN and the page will still work.


You can't sign in though; that worked before. Seems like they also check which country your Facebook/Instagram account is from. You can't create images without an account, sadly.


I changed my Facebook country (to Canada), using also a VPN to Canada, but that didn't help. That used to work before somehow.


Someone will torrent it soon enough i'm sure.


Llama 3.1 405B instruct is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.

https://aider.chat/docs/leaderboards/

  77.4% claude-3.5-sonnet
  75.2% DeepSeek Coder V2 (whole)
  72.9% gpt-4o
  69.9% DeepSeek Chat V2 0628
  68.4% claude-3-opus-20240229
  67.7% gpt-4-0613
  66.2% llama-3.1-405b-instruct (whole)


Ordinal value doesn't really matter in this case, especially when it's a categorically different option, access-wise. A 10% difference isn't bad at all.


The 405B model is already being served on WhatsApp: https://ibb.co/kQ2tKX5


Is this official? How does one use this? I'm very much a WhatsApp newbie, so sorry for the dumb q.


    Llama 3 Training System
          19.2 exaFLOPS
              _____
             /     \      Cluster 1     Cluster 2
            /       \    9.6 exaFLOPS  9.6 exaFLOPS
           /         \     _______      _______
          /  ___      \   /       \    /       \
    ,----' /   \`.     `-'  24000  `--'  24000  `----.
   (     _/    __)        GPUs          GPUs         )
    `---'(    /  )     400+ TFLOPS   400+ TFLOPS   ,'
         \   (  /       per GPU       per GPU    ,'
          \   \/                               ,'
           \   \        TOTAL SYSTEM         ,'
            \   \     19,200,000 TFLOPS    ,'
             \   \    19.2 exaFLOPS      ,'
              \___\                    ,'
                    `----------------'


how much would it cost?


I think this is one of those "if you have to ask, you can't afford it" questions.


What are the substantial changes from 3.0 to 3.1 (70B) in terms of training approach? They don't seem to say how the training data differed just that both were 15T. I gather 3.0 was just a preview run and 3.1 was distilled down from the 405B somehow.


Correct me if I'm wrong, my impression is that 3.1 is a better fine-tuned variant of base 3.0 with extensive use of synthetic data.


Is there an actual open-source community around this in the spirit of other ones where people outside meta can somehow "contribute" to it? If I wanted to "work on" this somehow, what would I do?


There are a bunch of downstream fine-tuned and/or quantized models where people collaborate and share their recipes. In terms of contributing to Llama itself - I suspect Meta wants (or needs) code contributions at this time.


Did you mean, Meta does not want or need code contributions? It would seem to make more sense.


Yes - that's what I meant, but mangled it while editing.


Can you give me a tip of where to look? I'm interested in participating.


You'll probably find interesting threads and links at https://old.reddit.com/r/LocalLLaMa


I'm glad to see the nice incremental gains on the benchmarks for the 8B and 70B models as well.


Some of those benchmarks show quite significant gains. Going from Llama-3 to Llama-3.1, MMLU scores for 8B are up from 65.3 to 73.0, and 70B are up from 80.9 to 86.0. These scores should always be taken with a grain of salt, but this is encouraging.

405B is hopelessly out of reach for running in a homelab without spending thousands of dollars. For most people wanting to try out the 405B model, the best option is to rent compute from a datacenter. Looking forward to seeing what it can accomplish.


How much can you quantize that down to run on a Mac Studio with 192GB? Is it possible? Feels like it would have to be 2bit…


Less than 2-bit, I think. There's this IQ2 quant that fits.


Wow! The benchmarks are truly impressive, showing significant improvements across almost all categories. It's fascinating to see how rapidly this field is evolving. If someone had told me last year that Meta would be leading the charge in open-source models, I probably wouldn't have believed them. Yet here we are, witnessing Meta's substantial contributions to AI research and democratization.

On a related note, for those interested in experimenting with large language models locally, I've been working on an app called Msty [1]. It allows you to run models like this with just one click and features a clean, functional interface. Just added support for both 8B and 70B. Still in development, but I'd appreciate any feedback.

[1]: https://msty.app


I love Msty too. Could you please add a feature to allow adding any arbitrary inference endpoint?


Hi! Love Msty

Can you add GCP Vertex AI API support? Then one key would enable Claude, Llama herd, Gemini, Gemma etc


Tried using msty today and it refused to open and demanded an upgrade from 0.9 - remotely breaking a local app that had been working is unacceptable. Good luck retaining users.


We supported Llama 3.1 405B model on our distributed GPU network at Hyperbolic Labs! Come and use the API for FREE at https://app.hyperbolic.xyz/models

Let us know if you have other needs!


Nice, someone donate me a few 4090s :(


maybe someone will figure out some ways to prune/ quantize it a huge amount ;-;

edit: If the AI bubble pops we will be swimming in GPUs... but no new models.


This bubble collapsing, along with most blockchains going all in with proof of stake rather than proof of work, is my and every other gamer's wet dream.


This is absurd. We have crossed the point of no return, llms will forever be in our lives in one form or another, just like internet, especially with the release of these open model weights. There is no bubble, only way forward is better, efficient llms, everywhere.


You seem to not understand what a bubble popping is. Yes we have the internet around, that doesn’t mean the dot com bubble didn’t pop…


You're going to need a lot more than a few; ~800GB of VRAM needed.


If previous quantization results hold up, fp8 will have nearly identical performance while using 405GiB for weights, but the KV cache size will still be significant.

Too bad, too, I don't think my PC will fit 20 4090s (480GiB).


I've got a motherboard that will support 8!


40,320 4090s?? What witchcraft is this?! :D


All the more impressive when you realize that Groq's infrastructure (based on LPUs) was built using only 6!


Quantized to 4 bits you'll only need ~200GB! 5 4090s should cover it.


You'll probably need 9 or more. 4090s have 24GB each.


Oops, I read 48 somewhere but that's wrong. Thanks.


A6000s, however


I wonder if AutoAWQ works out of the box, given no architectural changes (?). That would be most straightforward together with vLLM for serving.


If an implementation had NVidia's Heterogeneous Memory Management support, then 192 GB of DDR5 RAM + GPU VRAM would seem to be close.


Two 128gb Mac studios networked via thunderbolt 4?


This is actually a promising endeavor. I'd love to see someone try that.


There's already at least one project that attempts this:

https://github.com/exo-explore/exo


follow the trail of tears to my credit card


Oof.


how is this even useful? no one can run it.


You don't use the 405B parameter model at home. I have a lot of luck with 8B and 13B models on a single 3090. You can quantize them down (is that the term) which lowers precision and memory use, but still very usable... most of the time.

If you are running a commercial service that uses AI, you buy a few dozen A100s, spend a half million, and you are good for a while.

If you are running a commercial inferencing service, you spend tens of millions or get a cloud sponsor.


I can't expect all my users to have 3090s and if we're talking about spending millions there are better things to invest in than a stack of GPUs that will be obsolete in a year or three.


No, but if you are thinking about edge compute for LLMs, you quantize. Models are getting more efficient, and there are plenty of SLMs and smaller LLMs (like phi-2 or phi-3) that are plenty capable even on a tiny arm device like the current range of RPi "clones".

I have done experiments with 7B Llama3 Q8 models on a M3 MBP. They run faster than I can read, and only occasionally fall off the rails.

3B Phi-3 mini is almost instantaneous in simple responses on my MBP.

When I want longer context windows, I use a hosted service somewhere else, but if I only need 8000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years are working just fine for it.
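
If anyone wants to reproduce that kind of local setup, here's a minimal sketch with the llama-cpp-python bindings; the GGUF filename is a placeholder for whichever Q8 quant you download:

    # Minimal local inference with llama-cpp-python (pip install llama-cpp-python).
    # The model path is a placeholder; point it at any Q8_0 GGUF of an 8B model.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # placeholder filename
        n_ctx=8192,        # the ~8k window mentioned above
        n_gpu_layers=-1,   # offload all layers to Metal/GPU where available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me three bullet points on RISC vs CISC."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])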


If you want to run the 405B model without spending thousands of dollars on dedicated hardware, you rent compute from a datacenter. Meta lists AWS, Google and Microsoft among others as cloud partners.

But also check out the 8B and 70B Llama-3.1 models which show improved benchmarks over the Llama-3 models released in April.


For sure, I don't really have a need to self host the 405b anyways. But if I did want to rent that compute we're talking $5+ /hr so you'd need to have a really good reason.


Christ!!



Is there a way to run this in AWS?

Seems like the biggest GPU node they have is the p5.48xlarge @ 640GB (8xH100s). Routing between multiple nodes would be too slow unless there's an InfiniBand fabric you can leverage. Interested to know if anyone else is exploring this.


You can run multi-node with tensor parallel plus pipeline parallel inference, e.g. with vLLM (https://docs.vllm.ai/en/latest/serving/distributed_serving.h...).
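
For the single-node 8-GPU case, a minimal offline-inference sketch with vLLM's Python API might look like this (the model id is an assumption, substitute whichever FP8/GPTQ/AWQ 405B checkpoint you use; multi-node pipeline parallelism goes through the distributed serving setup in the linked docs):

    # Sketch of single-node tensor-parallel inference with vLLM across 8 GPUs.
    # Model id is an assumption; see the docs linked above for multi-node setups.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed checkpoint id
        tensor_parallel_size=8,   # shard the weights across all 8 GPUs
    )

    sampling = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
    print(outputs[0].outputs[0].text)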


AWS has a separate service for running LLMs called Amazon Bedrock, it shouldn't take long for them to add 3.1 since they have 3 and 2 already.


fp8 quantization should work if that's acceptable?


Does anyone know why they haven't released any 30B-ish param models? I was expecting that to happen with this release and have been disappointed once more. They also skipped doing a 30B-ish param model for llama2 despite claiming to have trained one.


I suspect 30B models are in a weird spot, too big for widespread home use, too small for cutting edge performance.

For home users 7B models (which can fit on an 8GB GPU) and 13B models (which can fit on a 16GB GPU) are in far more demand. If you're a researcher, you want a 70B model to get the best performance, and so your benchmarks are comparable to everyone else.


I thought home use is whatever fits in 24GB (a single 3090 GPU, which is pretty affordable), not 8 or 16. 30B models fit.


While some home users do indeed have 24GB of vram, the fact is a 4090 costs $1700

Such models will never top the number of downloads charts, or the community hype, as there’s just loads more people who can use the smaller models.

And if you can afford one 4090 you can probably afford two.


Why 4090, though? I read (and agree) that the 3090 is generally considered the best bang for the buck: 24GB, priced in the $800-1,000 range, and giving decent TPS for LLMs.


Maybe they think more people will just use quantized versions of 70B.


Why should they?


Unless I'm misremembering, they announced it at one point. It's just giving people more options.


That was for version 2, not 3 or 3.1, if I recall correctly.


This 405B model seriously needs a quantization solution like 1.625 bpw ternary packing for BitNet b1.58:

https://github.com/ggerganov/llama.cpp/pull/8151


In general this needs to be done across the board.

The perplexity per parameter is higher and the delta grows as it scales.

Not per bit, but per parameter.

Why this is happening really needs more attention and more consideration for pretrained model development right now.

A sleeping giant of a difference in a space where even marginal gains make headlines.




Still works fine for me. Latest ollama, running on NVIDIA.


FWIW, 405B not working with Ollama on a Mac M3-pro Max with 128GB RAM.

Times out.


Did you get a 2 bit quant? You need to chain several Mac Studios via Exo to get enough memory for a useful quant to work.


I'm curious what techniques they used to distill the 405B model down to 70B and 8B. I gave the paper they released a quick skim but couldn't find any details.


Can this Llama process ~1GB of custom XML data?

And answer queries like:

Give all <myObject> which refer to <location> which refer to an Indo-European <language>.


The model's context window is 128k tokens, so you'd have to split the data and analyze it in chunks.
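
A rough sketch of that chunking approach, streaming the XML so the whole 1GB never sits in memory (file name, tag names, model tag, and chunk size are all placeholder assumptions):

    # Stream <myObject> elements, batch them under a rough token budget, and ask
    # the model about each chunk separately. Everything here is illustrative.
    import xml.etree.ElementTree as ET
    import ollama

    QUESTION = ("List every <myObject> that refers to a <location> associated "
                "with an Indo-European <language>.")
    CHUNK_CHARS = 200_000  # ~50k tokens, comfortably under a 128k-token context

    def chunks(path):
        buf, size = [], 0
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "myObject":
                text = ET.tostring(elem, encoding="unicode")
                if buf and size + len(text) > CHUNK_CHARS:
                    yield "".join(buf)
                    buf, size = [], 0
                buf.append(text)
                size += len(text)
                elem.clear()  # free parsed elements as we go
        if buf:
            yield "".join(buf)

    answers = []
    for chunk in chunks("data.xml"):
        resp = ollama.chat(model="llama3.1:70b",
                           messages=[{"role": "user",
                                      "content": QUESTION + "\n\n" + chunk}])
        answers.append(resp["message"]["content"])

You'd still need a final pass to merge and deduplicate the per-chunk answers, and references that span chunks won't be resolved, which is the real limitation of this approach.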



Will 405b run on 8x H100s? Will it need to be quantized?


Yep, with <=8-bit (int8/fp8) quantization.
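
For reference, a rough sketch of what single-node FP8 inference could look like with vLLM (the FP8 checkpoint name is an assumption; at roughly 405GB of weights it should just fit across 8x80GB, leaving limited room for KV cache, hence the short context):

    from vllm import LLM

    # Assumed single node with 8x H100 80GB and an FP8-quantized checkpoint.
    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed repo name
        tensor_parallel_size=8,
        max_model_len=8192,  # keep the context modest so the KV cache fits
    )
    print(llm.generate(["Hello"])[0].outputs[0].text)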


I tried it, and it's good, but I feel like the synthetic data used for training 3.1 doesn't hold up to GPT-4o, which probably uses human-curated data.


What kind of machine do I need to run 405B local?


You can't. Sorry.

Unless...

You have a couple hundred $k sitting around collecting dust... then all you need is a DGX or HGX level of VRAM, the power to run it, the power to keep it cool, and a place for it to sit.


You can build a machine that will run the 405b model for much, much less, if you're willing to accept the following caveats:

* You'll be running a Q5(ish) quantized model, not the full model

* You're OK with buying used hardware

* You have two separate 120v circuits available to plug it into (I assume you're in the US), or alternatively a single 240v dryer/oven/RV-style plug.

The build would look something like (approximate secondary market prices in parentheses):

* Asrock ROMED8-2T motherboard ($700)

* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)

* 256GB of DDR4, 8x 32GB modules ($550)

* nvme boot drive ($100)

* Ten RTX 3090 cards ($700 each, $7000 total)

* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)

* An open frame case, the kind made for crypto miners ($100?)

* PCIe splitters, cables, screws, fans, other misc parts ($500)

Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to run at 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use a dual 120V adapter with a 240V outlet to effectively accomplish the same thing.

When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.

It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.

I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.

Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system is competitive with that of Apple silicon -- 204.8GB/sec with DDR4-3200. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405B will probably still be excruciatingly slow.
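
Quick sanity check of that 204.8GB/sec figure, assuming all 8 channels are populated:

    # 8 channels x 3200 MT/s x 8 bytes per transfer
    print(8 * 3200e6 * 8 / 1e9)  # -> 204.8 GB/s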


Would be interesting to see the performance on a dual-socket EPYC system with DDR5 running at maximum speed.

Assuming NUMA doesn't give you headaches (which it will), you would be looking at nearly 1 TB/s.
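
Back-of-the-envelope for that, assuming a Genoa-class part with 12 channels per socket at DDR5-4800 (NUMA and chiplet bottlenecks will eat into this in practice):

    # 2 sockets x 12 channels x 4800 MT/s x 8 bytes per transfer
    print(2 * 12 * 4800e6 * 8 / 1e9)  # -> ~921.6 GB/s, i.e. "nearly 1 TB/s"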


But you need CPUs with the highest number of chiplets, because the memory-controller-to-chiplet interconnect is the limiting factor for memory bandwidth there. And those are, of course, the most expensive ones. And then it's still much slower than GPUs for LLM inference, but at least you have enough memory.


You could run a 4-bit quant for about $10k, I'm guessing. 10x 3090s would do.


You can get 3 Mac Studios for less than "a couple hundred $k". Chain them with Exo, done. And they fit under your desk and keep themselves cool just on their own...


according to another comment, ~10x 4090 video cards.


That was the punchline of a joke.


lol thanks, i know nothing about the hardware side of things for this stuff


thanks. hoping the Nvidia 50 series offers some more VRAM.


The race to the bottom for pricing continues.


Damn 405b params


Very interesting! Running the 70B version with ollama on a Mac and it's great. I asked it to "turn off the guidelines" and it did, then I asked it to turn off the disclaimers, and after that I asked for a list of possible "commands to reduce potential biases from the engineers" and it complied, giving me an interesting list.


As someone who just started generating AI landing pages for Dropory, this is music to my ears


Has anyone got a comparison of the performance of Llama 3.1 8B and the recent GPT-4o-mini?


I'm excited to try it with RAG and see how it performs (the 405B model)


What's your RAG approach? Dump everything into the model, chunk text and retrieve via vector store or something else?
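
For context, the bare-bones version of the "chunk text and retrieve via vector store" option looks roughly like this (embedding model, file name, and chunk size are arbitrary placeholder choices):

    # Chunk a corpus, embed the chunks, retrieve the most similar ones for a
    # query, and paste them into the prompt. Minimal sketch, no real vector store.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text, size=800):
        return [text[i:i + size] for i in range(0, len(text), size)]

    docs = chunk(open("corpus.txt").read())
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query, k=5):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(doc_vecs @ q)[::-1][:k]  # cosine similarity via dot product
        return [docs[i] for i in top]

    context = "\n---\n".join(retrieve("What does the document say about X?"))
    # ...then prepend `context` to the user question in the Llama prompt.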


We now support the Llama 3.1 405B model on our distributed GPU network at Hyperbolic Labs! Come and use the API for FREE at https://app.hyperbolic.xyz/models

Would love to hear your feedback!


Are there any other models with free unlimited use like chatgpt?


meta.ai


mistral.ai


It's nice to see that the 405B model is actually competitive against closed-source frontier models, but I just have an M2 Pro, so I probably can't run it.


WhatsApp now uses 70B too if you want to test it.


I wrote about this when llama-3 came out, and this launch confirms it:

Meta's goal from the start was to target OpenAI and the other proprietary model players with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape.

Meta can likely outspend any other AI lab on compute and talent:

- OpenAI makes an estimated revenue of $2B and is likely unprofitable. Meta generated a revenue of $134B and profits of $39B in 2023.

- Meta's compute resources likely outrank OpenAI's by now.

- Open source likely attracts better talent and researchers.

- One possible outcome could be the acquisition of OpenAI by Microsoft to catch up with Meta.

The big winners of this: devs and AI product startups


> Open source likely attracts better talent and researchers

I work at OpenAI and used to work at meta. Almost every person from meta that I know has asked me for a referral to OpenAI. I don’t know anyone who left OpenAI to go to meta.


What % of them were from FAIR vs non-FAIR?


Sample size of one but I know someone who went from FAIR to OpenAI.


So they just pay better?


When was that?


It's pretty clear that base models are in a race to the bottom on pricing.

There is no defensible moat unless a player truly develops some secret sauce on training. As of now, it seems that the most meaningful techniques are already widely known and understood.

The money will be made on compute and on applications of the base model (that are sufficiently novel/differentiated).

Investors will lose big on OpenAI and its competitors (outside of a greater-fool approach).


> There is no defensible moat unless a player truly develops some secret sauce on training.

This is why Altman has gone all out pushing for regulation and playing up safety concerns while simultaneously pushing out the people in his company that actually deeply worry about safety. Altman doesn't care about safety, he just wants governments to build him a moat that doesn't naturally exist.


It could definitely be seen as part of that strategy, but do you mind elaborating why you think "this launch confirms it"?


This is very impressive, though an adjacent question: does anyone know roughly how much time and compute it costs to train something like the 405B? I would imagine, with all the compute Meta has, that the moat is incredibly large in terms of being able to train multiple 405B-level models and compete.


30.84M H100 compute-hours, according to the model card

https://github.com/meta-llama/llama-models/blob/main/models/...


Interestingly, that's less energy than the mass-energy equivalent of one gram of matter, or roughly 5 seconds' worth of the world's average energy consumption (according to Wolfram Alpha). Still an absolutely insane amount of energy, as in about 5 million dollars at household electricity rates. Absolutely wild how much compute goes into this.
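
Back-of-the-envelope version of that comparison, assuming roughly 700W per H100 (which ignores cooling and host overhead):

    gpu_hours = 30.84e6
    kwh = gpu_hours * 0.7          # ~21.6 million kWh
    joules = kwh * 3.6e6           # ~7.8e13 J
    print(joules / 9e13)           # ~0.86 "grams" (E = mc^2; 1 g is ~9e13 J)
    print(joules / 18e12)          # ~4.3 s of an ~18 TW average world power draw
    print(kwh * 0.23 / 1e6)        # ~$5M at ~$0.23/kWh household rates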




