Llama 3.1 (meta.com)
437 points by luiscosio 54 days ago | 269 comments



Related ongoing thread:

Open source AI is the path forward - https://news.ycombinator.com/item?id=41046773 - July 2024 (278 comments)


The 405b model is actually competitive against closed source frontier models.

Quick comparison with GPT-4o:

    +----------------+-------+-------+
    |     Metric     | GPT-4o| Llama |
    |                |       | 3.1   |
    |                |       | 405B  |
    +----------------+-------+-------+
    | MMLU           |  88.7 |  88.6 |
    | GPQA           |  53.6 |  51.1 |
    | MATH           |  76.6 |  73.8 |
    | HumanEval      |  90.2 |  89.0 |
    | MGSM           |  90.5 |  91.6 |
    +----------------+-------+-------+


This model is not “open source”; free to use, maybe.


I really wish people would use "open weights" rather than "open source". It's precise and obvious, and leaves an accurate descriptor for actual "open source" models, where the source and methods that generate the artifact, that is, the weights, are open.


It's not precise. People who want to use "open weights" instead of "open source" are focusing on the wrong thing.

The weights are, for all practical purposes, source code in their own right. The GPL defines "source code" as "the preferred form of the work for making modifications to it". Almost no one would be capable of reproducing them even if given the source + data. At the same time, the weights are exactly what you need for the one type of modification that's within reach of most people: fine-tuning. That they didn't release the surrounding code that produced this "source" isn't that much different than a company releasing a library but not their whole software stack.

I'd argue that "source" vs "weights" is a dangerous distraction from the far more insidious word in "open source" when used to refer to the Llama license: "open".

The Llama 3.1 license [0] specifically forbids its use by very large organizations, by militaries, and by nuclear industries. It also contains a long list of forbidden use cases. This specific list sounds very reasonable to me on its face, but having a list of specific groups of people or fields of endeavor who are banned from participating runs counter to the spirit of open source and opens up the possibility that new "open" licenses come out with different lists of forbidden uses that sound less reasonable.

To be clear, I'm totally fine with them having those terms in their license, but I'm uncomfortable with setting the precedent of embracing the word "open" for it.

Llama is "nearly-open source". That's good enough for me to be able to use it for what I want, but the word "open" is the one that should be called out. "Source" is fine.

[0] https://github.com/meta-llama/llama-models/blob/main/models/...


Do the costs really matter here? "Weights" are "the preferred form of the work for making modifications to it" in the same sense compiled binary code would be, if for some reason no one could afford to recompile a program from sources.

Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable. Just because that's all anyone except the model vendor is able to do doesn't merit calling the models "open source", much like no one would call binary-only software "open source" just because reverse engineering is a thing.

No, the weights are just artifacts. The source is the dataset and the training code (and possibly the training parameters). This isn't fundamentally different from running an advanced solver for a year, to find a way to make your program 100 bytes smaller so it can fit on a Tamagotchi. The resulting binary is magic, can't be reproduced without spending $$$$ on compute for the solver, but it is not open source. The source code is the bit that (produced the original binary that) went into the optimizer.

Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.

--

[0] - https://en.wikipedia.org/wiki/DLL_injection

[1] - https://en.wikipedia.org/wiki/Trainer_(games) - a type of program popular some 20 years ago, used to cheat at, or mod, single-player games, by keeping track of and directly modifying the memory of the game process. Could be as simple as continuously resetting the ammo counter, or as complex as injecting assembly to add new UI elements.


> Fine-tuning and LoRAs and toying with the runtime are all directly equivalent to DLL injection[0], trainers[1], and various other techniques used to tweak a compiled binary before or at runtime, including plain taking a hex editor to the executable.

No, because fine tuning is basically just a continuation of the same process that the original creators used to produce the weights in the first place, in the same way that modifying source code directly is in traditional open source. You pick up where they left off with new data and train it a little bit (or a lot!) more to adapt it to your use case.

The weights themselves are the computer program. There exists no corresponding source code. The code you're asking for corresponds not to the source code of a traditional program but to the programmers themselves and the processes used to write the code. Demanding the source code and data that produced the weights is equivalent to demanding a detailed engineering log documenting the process of building the library before you'll accept it as open source.

Just because you can't read it doesn't make it not source code. Once you have the weights, you are perfectly capable of modifying them following essentially the same processes the original authors did, which are well known and well documented in plenty of places with or without the actual source code that implements that process.

> Calling these models "open source" is a runaway misuse of the term, and in some cases, a sleight of hand.

I agree wholeheartedly, but not because of "source". The sleight of hand is getting people to focus on that instead of the really problematic word.


The thing is, the core of the GPT architecture is like 40 lines of code. Everyone knows what the source code is basically (minus optimizations). You just need to bring your own 20TB in data, 100k GPUs, and tens of millions in power budget, and you too can train llama 405b.
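
For anyone who hasn't seen it spelled out, here's a rough sketch of what that core looks like in plain PyTorch; this leaves out everything Llama actually adds (RoPE, grouped-query attention, RMSNorm, KV caching) and the entire data/training pipeline, so treat it as an illustration, not the real thing:

    # Minimal GPT-style decoder block, for illustration only (PyTorch 2.x).
    # Hyperparameters are made up; real models stack dozens of these blocks.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.n_heads = n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
            self.proj = nn.Linear(d_model, d_model, bias=False)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
            )
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):                       # x: (batch, seq, d_model)
            b, t, d = x.shape
            q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
            # split into heads: (batch, heads, seq, head_dim)
            q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                       for z in (q, k, v))
            # causal self-attention
            att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.proj(att.transpose(1, 2).reshape(b, t, d))
            x = x + self.mlp(self.ln2(x))           # position-wise feed-forward
            return x

    x = torch.randn(1, 16, 512)
    print(DecoderBlock()(x).shape)  # torch.Size([1, 16, 512])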


If I understood the article correctly he intends to let the community make suggestions to selected developers who work on the source somehow. So maybe part of the source will be made visible.


It's not open source. Your definition would make most video games open source - we modify them all the time. The small runtime framework IS open source, but that's not much benefit as you can't really modify it hugely because the weights fix it to an implementation.


> Your definition would make most video games open source - we modify them all the time.

No, because most video games aren't licensed in a way that makes that explicitly authorized, nor is modding the preferred form of the work for making modifications. The video game has source code that would be more useful, the model does not have source code that would be more useful than the weights.


As far as I know it's not just the weights. It's everything but the dataset. So the code used to generate the weights is also open source.


Is there any other case where "open source" is used for something that can't be reproduced? Seems like a new term is required, in the concept of "open source, non-reproducible artifacts".

I suppose language changes. I just prefer it changes towards being more precise, not less.


This feels somewhat analogous to games like Quake being open-sourced though still needing the user to provide the original game data files.


But games like Quake are not "open source". They have been open-sourced, specifically the executable parts were, without the assets. This is usually spelled out clearly as the process happens.

In terms of functional role, if we're to compare the models to open-sourced games, then all that's been open-sourced is the trivial[0] bit of code that does the inference.

Maybe a more adequate comparison would be a SoC running a Linux kernel with a big NVidia or Qualcomm binary blob in the middle of it? Sure, the Linux kernel is open source, but we wouldn't call the SoC "open source", because all that makes it what it is (software-side) is hidden in a proprietary binary.

--

[0] - In the sense that there's not much of it, and it's possible to reproduce from papers.


Yes its "freeware" or any one of the similar terms we've used to refer to free software.


Academia - nowadays source is needed in a lot of conferences, but the datasets, depending on where/how they might have been obtained, either can't be used or aren't available, and the exact results can't be reproduced.

Not sure if the code is required under an open source license, but it's the same issue.

---

IMO, source is source and can be used for other datasets. Dataset isn't available, bring your own.

In this case, the source is there. The output is there, and not technically required. What isn't available is the ability to confirm the output comes from that source. That's not required under open source though.

What's disingenuous is the output being called 'open source'.


No, the term is fine, “source” in “open source” refers to source code. A dataset by definition is not source code. Stop changing the meaning of words.


A dataset very much is the source code. It's the part that gets turned into the program through an automated process (training is equivalent to compilation).


In other words, it's everything except the one thing that actually matters.


The dataset is likely absolutely jam packed with copyrighted material that cannot be distributed.


Maybe, but it doesn't mean it's not open source.


The things that don't matter are, the thing that does isn't. Together, they can hardly be called open source.


> where the source and methods that generate the artifact, that is, the weights, are open.

When we require the same thing of software, namely that the whole stack needed to run the software in question be open source, we don't call the license open source.


Nope. Those model releases only open source the equivalent of "run.bat" that does some trivial things and calls into a binary blob. We wouldn't call such a program "open source".

Hell, in case of the models, "the whole stack to run the software" already is open source. Literally everything except the actual sources - the datasets and the build scripts (code doing the training) - is available openly. This is almost a literal inverse of "open source", thus shouldn't be called "open source".


Training a model is like automatic programming, and the key to it is having a well-organized dataset.

If some "open source" model just has the model and training methods but no dataset, it's like a repo that released an executable file with a detailed design doc. Where is the source code? Do it yourself, please.

NOTE: I understand the difficulty of open-sourcing datasets. I'm just saying that the term "open source" is getting diluted.


It’s not even free to use. There are commercial restrictions.


Super cool, though sadly the 405B model will be outside most personal usage without cloud providers, which sorta defeats the purpose of open source to some extent at least, because... Nvidia's ramp-up of consumer VRAM is glacial.


Zoom out a bit. There’s a massive feeder ecosystem around llama. You’ll see many startups take this on and help drive down inference costs for everyone and create competitive pressure that will improve the state of the art.


I agree that 405B isn't practical for home users, but I disagree that it defeats the purpose of open source. If you're building a business on inference it can be valuable to run an open model on hardware that you control, without the need to worry that OpenAI or Anthropic or whoever will make drastic changes to the model performance or pricing. Also, it allows the possibility of fine-tuning the model to your requirements. Meta believes it's in their interest to promote these businesses.

I'd think of the 405B model as the equivalent to a big rig tractor trailer. It's not for home use. But also check out the benchmark improvements for the 70B and 8B models.


The fact that it takes $20k to run your own SOTA model, instead of the $2B+ that it took until yesterday, is significant.


If you think of open source as a protocol through which the ecosystem of companies loosely collaborate, then it's a big deal. E.g. Groq can work on inference without a complicated negotiations with Meta. Ditto for Huggingface, and smaller startups.

I agree with you on open source in the original, home tinkerer sense.


Most SMBs would be able to run it. This is already a huge win for decentralized AI.


You don't need a model of this scale for personal use. Llama 3.1 8B can easily run on your laptop right now. The 70B model can run on a pair of 4090s.


I have the 70b model running quantized just fine on an M1 Max laptop with 64GiB unified RAM. Performance is fine and so far some Q&A tests are impressive.

This is good enough for a lot of use cases... on a laptop. An expensive laptop, but hardware only gets better and cheaper over time.


Just for reference, the current version of that laptop costs 4800€ (14 inch macbook pro, m3 max, 64gb of ram, 1TB of storage). So price-wise that is more like four laptops.


I think they were referring to the form factor not the price. But even then the price of four laptops is not out of line for enthusiast hobby spending.

Ever priced out a four wheeler, a jet-ski, a filled gun safe, what a "car guy" loses in trade in values every two years, what a hobbyist day-trader is losing before they cut their losses or turn it around, or what a parent who lives vicariously through their child and drags them all over their nearby states for overnight trips so they can do football/soccer/ballet/whatever at 6am on Saturdays against all the other kids who also won't become pro athletes? What about the cost of a wingsuit or getting your pilots license? "Cruisers" or annual-Disney vacationers? If you bought a used CNC machine from a machine shop? But spend five grand on a laptop to play with LLMs and everyone gets real judgmental.


I have the same machine, may I ask which model file and program are you using? Is it partial GPU offload?


I don't have the hardware to confirm this, so I'd take it with a grain of salt, but ChatGPT tells me that a maxed out M3 MacBook Pro with 128 GB RAM should be capable of efficiently running Llama 3.1 405B, albeit with essentially no ability to multitask.

(It also predicted that a MacBook Air in 2030 will be able to do the same, and that for smartphones to do the same might take around 20 years.)


I’ve run the Falcon 180B on my M3 Max with 128 GB of memory. I think I ran it at 3-bit. Took a long time to load and was incredibly slow at generating text. Even if you could load the Llama 405B model it would be too slow to be of much use.


Ah, that's a shame to hear. FWIW, ChatGPT did also suggest that there was a lot of room for improvement in the MPS backend of PyTorch that would likely make it more efficient on Apple hardware in time.


You fundamentally misunderstand the bottleneck of large LLMs. It is not really possible to make gains that way.

A 405B LLM has 405 billion parameters. If you run it at full "precision", each parameter takes up 2 bytes, which means you need 810GB of memory. If it does not fit in RAM or GPU memory it will swap to disk and be unusably slow.

You can run the model at reduced precision to save memory, called quantisation, but this will degrade the quality of the response. The exact amount of degradation depends on the task, the specific model and its size. Larger models seem to suffer slightly less. 1 byte per parameter is pretty much as good as full precision. 4 bits per parameter is still good quality, 3 bits is noticeably worse and 2 bits is often bad to unusable.

With 128GB of RAM, zero overhead and a 405B model, you would have to quantize to about 2.5 bits, which would noticeably degrade the response quality.

There is also model pruning, which removes parameters completely, but this is much more experimental than quantisation, also degrades response quality, and I have not seen it used that widely.
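
To make the arithmetic concrete, here's a quick back-of-the-envelope calculation (weights only; KV cache and runtime overhead are extra):

    # Weight memory for a 405B-parameter model at various precisions.
    PARAMS = 405e9

    for name, bits in [("fp16/bf16", 16), ("int8/fp8", 8), ("4-bit", 4), ("3-bit", 3), ("2.5-bit", 2.5)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name:>10}: {gb:6.0f} GB")

    # fp16/bf16:    810 GB
    #  int8/fp8:    405 GB
    #     4-bit:    203 GB
    #     3-bit:    152 GB
    #   2.5-bit:    127 GB  <- roughly what fits in 128 GB, as noted above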


I appreciate the additional information, but I'm not sure what you're claiming is a fundamental misunderstanding on my part. I was referring to running the model with quantization, and was clear that I hadn't verified the accuracy of the claims.

The comment about the MPS PyTorch backend was related to performance, not whether the model would fit at all. I can't say whether it's accurate that the MPS backend has significant room for optimization, but it is still publicly listed as in beta.


Yes my mistake, I read your answer to mean that you think that the model could fit into the memory with the help of efficiency gains.

I would be sceptical about increasing efficiency. I'm not that familiar with the subject, but as far as I know, LLMs for single users (i.e. with batch size 1) are practically always limited by the memory bandwidth. The whole LLM (if it is monolithic) has to be completely loaded from memory once for each new token (which is about 4 characters). With 400GB per second memory bandwidth and 4-bit quantisation, you are limited to 2 tokens per second, no matter how efficiently the software works. This is not unusable, but still quite slow compared to online services.
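
To put rough numbers on that, under the same assumptions (batch size 1, the full set of weights streamed from memory once per generated token):

    # Rough upper bound on single-stream decode speed: memory bandwidth divided by
    # the bytes of weights read per token. Numbers are the ones assumed above.
    params = 405e9
    bits_per_weight = 4            # 4-bit quantisation
    bandwidth = 400e9              # ~400 GB/s memory bandwidth

    bytes_per_token = params * bits_per_weight / 8
    print(bytes_per_token / 1e9, "GB read per token")   # ~202.5 GB
    print(bandwidth / bytes_per_token, "tokens/s")      # ~2 tokens/s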


Got it, thanks, that makes sense. I was aware that memory was the primary bottleneck, but wasn't clear on the specifics of how model sizes mapped to memory requirements or the exact implications of quantization in practice. It sounds like we're pretty far from a model of this size running on any halfway common consumer hardware in a useful way, even if some high-end hardware might technically be able to initialize it in one form or another.


GPU memory costs about $2.5/GB on the spot market, so that is $500 for 200GB. I would speculate that it might be possible to build such an LLM card for $1-2k, but I suspect that the market for running larger LLMs locally is just too small to consider, especially now that the datacentre is so lucrative.

Maybe we'll get really good LLMs on local hardware when the hype has died down a bit, memory is cheaper and the models are more efficient.


Most "local model runners" (Llama.CPP, Llama-file etc) don't use Pytorch and instead implement the neural network directly themselves optimized for whatever hardware they are supporting.

For example here's the list of backends for Llama.cpp: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...


Ah okay, interesting, thanks.


100% reddit is full of people trying to solder more vram


I've been wondering if you could just attach a chunk of vram over NVLink, since that's very roughly what FSDP is doing here anyways.


The best NVLINK you can reasonably purchase is for the 3090, which is capped somewhere around 100 Gbit/s. This is too slow. The 3090 has about 1 TB/s memory bandwidth, and the 4090 is even faster, and the 5090 will be even faster.

PCIE 5.0 x16 is 500 Gbit/s if I'm not mistaken, so using RAM is more viable an alternative in this case.

Edit: 3090 has 1 TB/s, not terabits


> sorta defeats the purpose of open source to some extent

Not in the slightest. They even have a table of cloud providers where you can host the 405B model and the associated cost to do so on their website: https://llama.meta.com/ (Scroll down)

"Open Source" doesn't mean "You can run this on consumer hardware". It just means that it's open source. They also released 8B and 70B models for people to use on consumer gear.


You might be able to get away with running a heavily quantized 405B model using CPU inference at a blistering fast token every 5 seconds on a 7950X.


OK, I am curious now: What kind of hardware would I need to run such a model for a couple of users with decent performance?

Where could I get a mapping of token / time vs hardware?


You can run the 4-bit GPTQ/AWQ quantized Llama 405B somewhat reasonably on 4x H100 or A100. You will be somewhat limited in how many tokens you can have in flight between requests and you cannot create CUDA graphs for larger batch sizes. You can run 405B well on 8x H100 and A100, either with the mixed BFloat16/FP8 checkpoint that Meta provided or GPTQ/AWQ-quantized models. Note though that the A100 does not have native support for FP8, but FP8 quantized weights can be used through the GPTQ-Marlin FP8 kernel.

Here are some TGI 405B benchmarks that I did with the different quantized models:

https://x.com/danieldekok/status/1815814357298577718

The 405B model is very useful outside direct use in inference though. E.g. for generating synthetic data for training smaller models:

https://huggingface.co/blog/synthetic-data-save-costs


how much vram do you need for 4-bit llama 405?


405 billion * 4 bits = approximately 200 GB. Plus extra for the amount of context you want.
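
For the "extra" part, a rough sketch of the KV-cache size, using the architecture numbers reported for the 405B model (126 layers, 8 KV heads, head dimension 128; treat those as assumptions) and a 16-bit cache:

    # Rough KV-cache estimate for a GQA model like Llama 3.1 405B.
    # Architecture numbers below are the reported ones; treat them as assumptions.
    n_layers, n_kv_heads, head_dim = 126, 8, 128
    bytes_per_value = 2                      # fp16/bf16 cache

    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    print(per_token / 1e6, "MB of cache per token")          # ~0.5 MB
    print(per_token * 128_000 / 1e9, "GB at 128k context")   # ~66 GB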


Unsure if anyone has specific hardware benchmarks for the 405b model yet, since it's so new, but elsewhere in this thread I outlined a build that'd probably be capable of running a quantized version of Llama 3.1 405b for roughly $10k.

The $10k figure is likely roughly the minimum amount of money/hardware that you'd need to run the model at acceptable speeds, as anything less requires you to compromise heavily on GPU cores (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or run on the CPU entirely, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.


Energy costs are an important factor here too. While Quadro cards are much more expensive upfront (higher $/VRAM), they are cheaper over time (lower Watts/Token). Offsetting the energy expense of a 3090/4090/5090 build via solar complicates this calculation but generally speaking can be a "reasonable" way of justifying this much hardware running in a homelab.

I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.


Agree 100% that energy costs are important. The example system in my other post would consume somewhere around 300W at idle, 24/7, which is 219 kWh per month, and that's assuming you aren't using the machine at all.

I don't have any actual figures to back this up, but my gut tells me that the fact that enterprise GPUs are an order of magnitude (at least) more expensive than, say a, 3090, means that the payback period of them has got to be pretty long. I also wonder whether setting the max power on a 3090 to a lower than default value (as I suggest in my other post) has a significant effect on the average W/token.


Agreed, but there are other costs associated with supporting 10-16x GPUs that may not necessarily happen with say 6 GPUs. Having to go from single socket (or Threadripper) to dual socket, PCIE bifurcation, PLX risers, etc.

Not necessarily saying that Quadros are cheaper, just that there's more to the calculation when trying to run 405B size models at home


The system I outlined in my other post [0] has ten GPUs and does not require dual socket CPUs as far as I'm aware. It could likely scale easily to 14 GPUs as well (assuming you have sufficient power), with an x8/x8 bifurcation adapter installed in each PCIe slot. This is pushing the limits of the PCIe subsystem I'm sure, but you could also likely scale up to 28 GPUs, again assuming sufficient power, by simply bifurcating at x4/x4/x4/x4 vs x8/x8.

I think it should work as-is with the components listed, but if you disagree please let me know!

[0] https://news.ycombinator.com/item?id=41047689


I don't think this is correct. 5 years power usage of 4090 is $2600 giving TCO of ~$4300. RTX 6000 Ada starts at $6k for the card itself.

https://gpuprices.us


To be fair, you need 2x 4090 to match the VRAM capacity of an RTX 6000 Ada. There is also the rest of the system you need to factor into the cost. When running 10-16x 4090s, you may also need to upgrade your electrical wiring to support that load, you may need to spend more on air conditioning, etc.

I'm not necessarily saying that it's obviously better in terms of total cost, just that there are more factors to consider in a system of this size.

If inference is the only thing that is important to someone building this system, then used 3090s in x8 or even x4 bifurcation is probably the way to go. Things become more complicated if you want to add the ability to train/do other ML stuff, as you will really want to try to hit PCIE 4.0 x16 on every single card.


With 2x 4090 you will have 2x speed of RTX 6000 A. So same energy per token.

Will need more space, true.


Yeah, after digging more into RTX 6000 Ada cards, I don't see any way they'd be more economical even over many years, no matter how you slice it.


Great for Groq, who's already hosting it, but at what cost I guess.


Groq provides a limited free tier for now: https://wow.groq.com/now-available-on-groq-the-largest-and-m...


You can probably run it on your local PC at 1 token/minute.


How do you draw/generate such an ASCII table?


In the past, I might have used a python library like asciitable to do that.

This time, I just copy pasted the raw metrics I found and asked an LLM to format it as an ASCII table.


Don't know about OP but I generate such tables using Emacs.


I've just finished running my NYT Connections benchmark on all three Llama 3.1 models. The 8B and 70B models improve on Llama 3 (12.3 -> 14.0, 24.0 -> 26.4), and the 405B model is near GPT-4o, GPT-4 turbo, Claude 3.5 Sonnet, and Claude 3 Opus at the top of the leaderboard.

GPT-4o 30.7

GPT-4 turbo (2024-04-09) 29.7

Llama 3.1 405B Instruct 29.5

Claude 3.5 Sonnet 27.9

Claude 3 Opus 27.3

Llama 3.1 70B Instruct 26.4

Gemini Pro 1.5 0514 22.3

Gemma 2 27B Instruct 21.2

Mistral Large 17.7

Gemma 2 9B Instruct 16.3

Qwen 2 Instruct 72B 15.6

Gemini 1.5 Flash 15.3

GPT-4o mini 14.3

Llama 3.1 8B Instruct 14.0

DeepSeek-V2 Chat 236B (0628) 13.4

Nemotron-4 340B 12.7

Mixtral-8x22B Instruct 12.2

Yi Large 12.1

Command R Plus 11.1

Mistral Small 9.3

Reka Core-20240501 9.1

GLM-4 9.0

Qwen 1.5 Chat 32B 8.7

Phi-3 Small 8k 8.4

DBRX 8.0


I love Connections! Can you tell us more about your benchmark?


You can chat with these new models at ultra-low latency at groq.com. 8B and 70B API access is available at console.groq.com. 405B API access for select customers only – GA and 3rd party speed benchmarks soon.

If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....

(disclaimer, I am a Groq employee)


We also added Llama 3.1 405B to our VSCode copilot extension for anyone to try coding with it.

Free trial gets you 50 messages, no credit card required - https://double.bot

(disclaimer, I am the co-founder)


would be great if there was a page showing benchmarks compared to other auto completion tools


Groq's TSP architecture is one of the weirder and more wonderful ISAs I've seen lately. The choice of SRAM is fascinating. Are you guys planning on publishing anything about how you bridged the gap between your order-hundreds-of-megabytes SRAM TSP main memory and multi-TB model sizes?


There is a lot out there.

I gave a seminar about the overall approach recently, abstract: https://shorturl.at/E7TcA, recording: https://shorturl.at/zBcoL.

This two-part AMA has a lot more detail if you're already familiar with what we do:

https://www.youtube.com/watch?v=UztfweS-7MU

https://www.youtube.com/watch?v=GOGuSJe2C6U


Thanks!


You can chat with all these models for free and with ultra-low latency using this hosted website by the GitHub founder: https://nat.dev/chat


Just checked it out. Is pay-as-you-go API access available at all? It says 'Coming Soon'

https://console.groq.com/settings/billing


I've found Bedrock to be nice with pay-as-you-go, but they take a long time to adopt new models.


And twice as expensive in comparison to the source providers’ APIs


I think you answered it yourself? It’s coming soon, so it is not available now, but soon.


It's been coming soon for a couple of months now, meanwhile Groq churns out a lot of other improvements, so to an outsider like me it looks like it's not terribly high on their list of priorities.

I'm really impressed by what (&how) they're doing and would like to pay for a higher rate limit, or failing that at least know if "soon" means "weeks" or "months" or "eventually".

I remember TravisCI did something similar back in the day, and then Circle and GitHub ate their lunch.


405B is already being served on WhatsApp!

https://ibb.co/kQ2tKX5


How do you get that option?



At what quantisation are you running these?


Today appears to be the day you can run an LLM that is competitive with GPT-4o at home with the right hardware. Incredible for progress and advancement of the technology.

Statement from Mark: https://about.fb.com/news/2024/07/open-source-ai-is-the-path...


> at home with the right hardware

Where the right hardware is 10x4090s even at 4 bits quantization. I'm hoping we'll see these models get smaller, but the GPT-4-competitive one isn't really accessible for home use yet.

Still amazing that it's available at all, of course!


It's hardly cheap starting at about $10k of hardware, but another potential option appears to be using Exo to spread the model across a few MBPs or Mac Studios: https://x.com/exolabs_/status/1814913116704288870


Or maybe using Distributed Llama? https://github.com/b4rtaz/distributed-llama


It's not really competitive though, is it? I tested it and 4o is just better.


Disclaimer: I tested Llama 3 8B; 3.1 might be better even as a small model, but so far I have not seen a single small model approach 4o, ime.


Open Source AI Is the Path Forward - Mark Zuckerberg

https://about.fb.com/news/2024/07/open-source-ai-is-the-path...


So are they actually making the models open now or are they staying the course with "kind of open" as they have done for LLaMA 1, 2, and 3 [1]?

[1]: https://opensource.org/blog/metas-llama-2-license-is-not-ope...

As I have stated time and again, it is perfectly fine for them to slap on whatever license they see fit as it is their work. But it would be nice if they used appropriate terms so as not to disrupt the discourse further than they have already done. I have written several walls of text why I as a researcher find Facebook's behaviour problematic so I will fall back on an old link [2] this time rather than writing it all over again.

[2]: https://news.ycombinator.com/item?id=38427832


> it is perfectly fine for them to slap on whatever license they see fit as it is their work.

Is it? Has there been a ruling on the enforceability of the license they attach to their models yet? Just because you say what you release can only be used for certain things doesn't actually mean what you say means anything.


> specifically, it puts restrictions on commercial use for some users (paragraph 2) and also restricts the use of the model and software for certain purposes (the Acceptable Use Policy)

It's "a Google and Apple can't use this model in production" clause that frankly we can all be relatively okay with.


Good, then we can expect them to call it what it is? Not open source and not open science, and a regression in terms of openness relative to what came before. Because that is precisely my objection. There are those of us who have been committed to those ideals for a long time, and now one of the largest corporations on earth is appropriating those terms for marketing purposes.


I think it's great that you're fighting to maintain the term's fundamental meaning. I do, however, think that we need to give credit where credit is due to companies who take actions in the right direction to encourage more companies to do the same. If we blindly protest any positive-impact action by corporations for not being perfect, they'll get the hint and stop trying to appease the community entirely.


I am in agreement. However, I do believe that a large portion of the community here is also missing a key point: Facebook was more open five years ago with their AI research than they are today. I suspect this perspective is because of the massive influx of people into AI around the time of the ChatGPT release. From their point of view, Facebook's move (although dishonestly labelled as something it is not) is a step in the right direction relative to "Open"AI and others. While for us that have been around for longer, openness "peaked" around 2018 and has been in steady decline ever since. If you see the wall of text I linked in my first comment in this chain, there is a longer description of this historical perspective.

It should also be noted (again) that the value of the terms open science and open source comes from the sacrifices and efforts of numerous academic, commercial, personal, etc. actors over several decades. They "paid" by sticking to the principles of these movements, and Facebook is now cashing in on their efforts, solely for their own benefit. Not even Microsoft back in 2001, in the age of "fear, uncertainty and doubt", was so dishonest as to label the source-available portions of their Shared Source Initiative [1] as something it was not. Facebook has been called out again and again since the release of LLaMA 1 (which in its paper appropriated the term "open") and has shown no willingness to reconsider their open science and open source misuse. At this point, I can no longer give them the benefit of the doubt. The best defence I have heard is that they seek to "define open in the 'age of AI'", but if that were the case, where are their consensus-building efforts akin to what we have seen numerous academics and OSI carry out? No, sadly the only logical conclusion is that it is cynical marketing on their part, both from their academics and business people.

[1]: https://en.wikipedia.org/wiki/Shared_Source_Initiative

In short. I think the correct response to Facebook is: "Thank you for the weights, we appreciate it. However, please stop calling your actions and releases something they clearly are not."


Totally agree. Your suggested response is perfect IMO.


But it means your company can't be acquired by those giants if you use this model.


I'm glad someone said it.

You're only ok with it if you're not interested in having maximum freedom of movement vis-a-vis any potential exits.


Meta the new "Open" AI?


Until they make a model much better than competitors' to actually start capitalizing on it.


You can already run these models locally with Ollama (ollama run llama3.1:latest) along with at places like huggingface, groq etc.

If you want a playground to test this model locally or want to quickly build some applications with it, you can try LLMStack (https://github.com/trypromptly/LLMStack). I wrote last week about how to configure and use Ollama with LLMStack at https://docs.trypromptly.com/guides/using-llama3-with-ollama.

Disclaimer: I'm the maintainer of LLMStack


You are a maintainer of software that depends on ollama, so you should know that ollama depends on llama.cpp. And as of now, llama.cpp doesn't support the new RoPE scaling: https://github.com/ggerganov/llama.cpp/issues/8650, and all ollama can do is wait for llama.cpp: https://github.com/ollama/ollama/issues/5881


I've tested Q4 on M1 and it works, though the quality may not be what you'd expect, as others have pointed out on the issue.


I have found Claude 3.5 Sonnet really good for coding tasks along with the artifacts feature, and it seems like it's still the king on the coding benchmarks.


I have found it to be better than GPT-4o at math too, despite the latter being better at several math benchmarks.


My experience reflects this too. My hunch is that GPT-4o was trained to game the benchmarks rather than output higher quality content.

In theory the benchmarks should be a pretty close proxy for quality, but that doesn't match my experience at all.


A problem with a lot of benchmarks is that they are out in the open so the model basically trains to game them instead of actually acquiring knowledge that would let it solve it. Probably private benchmarks that are not in the training set of these models should give better estimates about their general performance.


I personally disagree. But I haven't used Sonnet that much.


I asked both whether the product of two odds (odds = probability/(1-probability)) can itself be interpreted as an odds, and if so, which. Neither could solve the problem completely, but Claude 3.5 Sonnet at least helped me to find the answer after a while. I assume the questions in math benchmarks are different.


Yeah, same experience here as well. I found Sonnet 3.5 to fulfill my tasks much better than 4o, even though 4o scores higher on benchmarks.


The LMSys Overall leaderboard <https://chat.lmsys.org/?leaderboard> can tell us a bit more about how these models will perform in real life, rather than in a benchmark context. By comparing the ELO score against the MMLU benchmark scores, we can see models which outperform / underperform based on their benchmark scores relative to other models. A low score here indicates that the model is more optimized for the benchmark, while a higher score indicates it's more optimized for real-world examples. Using that, we can make some inferences about the training data used, and then extrapolate how future models might perform. Here's a chart: <https://docs.getgrist.com/gV2DtvizWtG7/LLMs/p/5?embed=true>

Examples: OpenAI's GPT 4o-mini is second only to 4o on LMSys Overall, but is 6.7 points behind 4o on MMLU. It's "punching above its weight" in real-world contexts. The Gemma series (9B and 27B) are similar, both beating the mean in terms of ELO per MMLU point. Microsoft's Phi series are all below the mean, meaning they have strong MMLU scores but aren't preferred in real-world contexts.

Llama 3 8B previously did substantially better than the mean on LMSys Overall, so hopefully Llama 3.1 8B will be even better! The 70B variant was interestingly right on the mean. Hopefully the 405B variant won't fall below!


Something is broken with "meta-llama-3.1-405b-instruct-sp" and "meta-llama-3.1-70b-instruct-sp" there; after a few sentences both models switch to infinite random output like: "Rotterdam计算 dining counselor/__asan jo Nas было /well-rest esse moltet Grants SL и Four VIHu-turn greatest Morenh elementary(((( parts referralswhich IMOаш ...".

Don't expect any meaningful score there before they wipe results.


Good to know, but just to clarify, the results I pulled don't include the 3.1 models yet (they aren't on the leaderboard yet).


These days, lmsys elo is the only thing I trust. The other benchmark scores mean nothing to me at this point


I disagree. Not saying the other benchmarks are better. It just depends on your use case and application.

For my use of the chat interface, I don't think lmsys is very useful. lmsys mainly evaluates relatively simple, low token count questions. Most (if not all) are single prompts, not conversations. The small models do well in this context. If that is what you are looking for, great. However, it does not test longer conversations with high token counts.

Just saying that all benchmarks, including lmsys, have issues and are focused on specific use cases.


The biggest win here has to be the context length increase to 128k from 8k tokens. Till now my understanding is there haven't been any open models anywhere close to that.


It is notable, but it's not alone. Mistral NeMo just released last week with a 128k context window:

https://news.ycombinator.com/item?id=40996058


Thanks! Not sure how I missed that :)


It's easy to miss things. Trying to keep up with the latest in AI news is like drinking from the firehose -- it's never-ending.


Phi 3


@dang why was this removed/filtered from the front page?


I see a few cloud hosting providers for it on the front page. I wonder if it's being gamed.


Is there pricing available on any of these vendors?

Open source models are very exciting for self hosting, but the per-token hosted inference pricing hasn't been competitive with OpenAI and Anthropic, at least for a given tier of quality. (E.g.: Llama 3 70B costing between $1 and $10 per million tokens on various platforms, but Claude Sonnet 3.5 is $3 per million.)


Llama 3 is 0.59/0.79 on Groq. Still no price for 3.1


The resources linked from the page for the model card[1], research paper, and Prompt Guard Tutorial[2] don't exist yet.

[1]: https://github.com/meta-llama/llama-models/blob/main/models/...

[2]: https://github.com/meta-llama/llama-recipes/blob/main/recipe...


> We use synthetic data generation to produce the vast majority of our SFT examples, iterating multiple times to produce higher and higher quality synthetic data across all capabilities. Additionally, we invest in multiple data processing techniques to filter this synthetic data to the highest quality. This enables us to scale the amount of fine-tuning data across capabilities. [0]

Have other major models explicitly communicated that they're trained on synthetic data?

[0]. https://ai.meta.com/blog/meta-llama-3-1/


It's in the <7B club, but Phi has always had a good dose of synthetic data https://huggingface.co/microsoft/Phi-3-mini-4k-instruct


Technically this is post training. This has been standard for a long time now - I think InstructGPT (gpt 3.5 base) was the last that used only human feedback (RLHF)


"Meta AI isn't available yet in your country" Hi from europe :/


Why are (some) Europeans surprised when they are not included in tech product débuts? My lay understanding could best be described as: EU law is incredibly business-unfriendly and takes a heroic effort in time and money to implement the myriad of requirements therein. Am I wrong?


> Why are (some) Europeans surprised when they are not included in tech product débuts?

Why do you think he is surprised? I think very few are surprised.


> Why are (some) Europeans surprised when they are not included in tech product débuts?

We had a brief, abnormal, and special moment in time after the crypto wars ended in the mid-2000s where software products were truly global, and the internet was more or less unregulated and completely open (at least in most of the world). Sadly it seems that this era has come to a close, and people have not yet updated their understanding of the world to account for that fact.

People are also not great at thinking through the second order effects of the policies they advocate for (e.g. the GDPR), and are often surprised by the results.


The only real requirement impacting Meta AI is GDPR conformance. The DMA does not apply and the AI act has yet to enter into force. So either Meta AI is a vehicle to steal people’s data, and it is being kept out for the right reasons, or not providing it is punitive due to the EU commission’s DMA action running against Meta.


You are pretty wrong. EU law is tricky on AI very specifically in this use case (because it's a massive model), but that's not affecting anybody else.

Other than that, and GDPR (which is generally now regarded as a good thing), I'm not sure what requirements you've got in mind.


Most things do début in the EU, unless the product or company behind it doesn't value your privacy. Meta does not value your privacy.


Privacy was the first thing that the EU did that started this trend of companies slowing their EU releases because of GDPR. Now there's the Digital Markets Act and the AI Act that both have caused companies to slow their releases to the EU.

Each new large regulation adds another category of company to the list of those who choose not to participate. Sure, you can always label them as companies who don't value principle X, but at some point it stops being the fault of the companies and you have to start looking at whether there are too many enormous regulations slowing down tech releases.


This is an interesting point.

The word fault somehow implies that something’s wrong - from the eu regulator’s perspective, what’s happening is perfectly normal, and what they want : at some point, the advances in insert new tech are not worth the (social) cost to individuals, so they make things more complicated/ ask companies to behave differently.

Now I’m not saying the regulations are good, required, etc : just that depending on your goal, there are multiple points of view, with different landing zones.

I also suspect that what’s happening now ( meta, apple slowing down) is a power play : they’re just putting pressure on the eu, but I’m harboring doubts that this can work at all.


Competition is a funny thing—it doesn't just apply to companies competing for customers, it also applies to governments competing for companies to make products available to their citizens. Turns out that if you make compliance with your laws onerous enough they can actually just choose to opt out of your country altogether, or at a minimum delay release in your country until they can check all your boxes.

The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.


Be careful what you wish for.

A Gibsonesque global Turing Police is a sure sign of Dystopia.


> The only solution is a worldwide government that can impose laws in all countries at once, but that's unlikely to happen any time soon.

Let's hope the next moustached guy that tries to do this ends up dying in a bunker just like the last one.


You can load the page using a VPN and then turn off the VPN and the page will still work.


You can't sign in though; that worked before. Seems like they also check which country your Facebook/Instagram account is from. You can't create images without an account, sadly.


I changed my Facebook country (to Canada), using also a VPN to Canada, but that didn't help. That used to work before somehow.


Someone will torrent it soon enough i'm sure.


Llama 3.1 405B instruct is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.

https://aider.chat/docs/leaderboards/

  77.4% claude-3.5-sonnet
  75.2% DeepSeek Coder V2 (whole)
  72.9% gpt-4o
  69.9% DeepSeek Chat V2 0628
  68.4% claude-3-opus-20240229
  67.7% gpt-4-0613
  66.2% llama-3.1-405b-instruct (whole)


Ordinal value doesn't really matter in this case, especially when it's a categorically different option, access-wise. A 10% difference isn't bad at all.


The 405B model is already being served on WhatsApp: https://ibb.co/kQ2tKX5


Is this official? How does one use this? I'm very much a WhatsApp newbie, so sorry for the dumb q.


    Llama 3 Training System
          19.2 exaFLOPS
              _____
             /     \      Cluster 1     Cluster 2
            /       \    9.6 exaFLOPS  9.6 exaFLOPS
           /         \     _______      _______
          /  ___      \   /       \    /       \
    ,----' /   \`.     `-'  24000  `--'  24000  `----.
   (     _/    __)        GPUs          GPUs         )
    `---'(    /  )     400+ TFLOPS   400+ TFLOPS   ,'
         \   (  /       per GPU       per GPU    ,'
          \   \/                               ,'
           \   \        TOTAL SYSTEM         ,'
            \   \     19,200,000 TFLOPS    ,'
             \   \    19.2 exaFLOPS      ,'
              \___\                    ,'
                    `----------------'


how much would it cost?


I think this is one of those "if you have to ask, you can't afford it" questions.


What are the substantial changes from 3.0 to 3.1 (70B) in terms of training approach? They don't seem to say how the training data differed just that both were 15T. I gather 3.0 was just a preview run and 3.1 was distilled down from the 405B somehow.


Correct me if I'm wrong, my impression is that 3.1 is a better fine-tuned variant of base 3.0 with extensive use of synthetic data.


Is there an actual open-source community around this in the spirit of other ones where people outside meta can somehow "contribute" to it? If I wanted to "work on" this somehow, what would I do?


There are a bunch of downstream fine-tuned and/or quantized models where people collaborate and share their recipes. In terms of contributing to Llama itself - I suspect Meta wants (or needs) code contributions at this time.


Did you mean, Meta does not want or need code contributions? It would seem to make more sense.


Yes - that's what I meant, but mangled it while editing.


Can you give me a tip of where to look? I'm interested in participating.


You'll probably find interesting threads and links at https://old.reddit.com/r/LocalLLaMa


I'm glad to see the nice incremental gains on the benchmarks for the 8B and 70B models as well.


Some of those benchmarks show quite significant gains. Going from Llama-3 to Llama-3.1, MMLU scores for 8B are up from 65.3 to 73.0, and 70B are up from 80.9 to 86.0. These scores should always be taken with a grain of salt, but this is encouraging.

405B is hopelessly out of reach for running in a homelab without spending thousands of dollars. For most people wanting to try out the 405B model, the best option is to rent compute from a datacenter. Looking forward to seeing what it can accomplish.


How much can you quantize that down to run on a Mac Studio with 192GB? Is it possible? Feels like it would have to be 2bit…


Less than 2-bit, I think. There's this IQ2 quant that fits.


Wow! The benchmarks are truly impressive, showing significant improvements across almost all categories. It's fascinating to see how rapidly this field is evolving. If someone had told me last year that Meta would be leading the charge in open-source models, I probably wouldn't have believed them. Yet here we are, witnessing Meta's substantial contributions to AI research and democratization.

On a related note, for those interested in experimenting with large language models locally, I've been working on an app called Msty [1]. It allows you to run models like this with just one click and features a clean, functional interface. Just added support for both 8B and 70B. Still in development, but I'd appreciate any feedback.

[1]: https://msty.app


I love Msty too. Could you please add a feature to allow adding any arbitrary inference endpoint?


Hi! Love Msty

Can you add GCP Vertex AI API support? Then one key would enable Claude, Llama herd, Gemini, Gemma etc


Tried using msty today and it refused to open and demanded an upgrade from 0.9 - remotely breaking a local app that had been working is unacceptable. Good luck retaining users.


We supported Llama 3.1 405B model on our distributed GPU network at Hyperbolic Labs! Come and use the API for FREE at https://app.hyperbolic.xyz/models

Let us know if you have other needs!


Nice, someone donate me a few 4090s :(


maybe someone will figure out some ways to prune/ quantize it a huge amount ;-;

edit: If the AI bubble pops we will be swimming in GPUs... but no new models.


This bubble collapsing, along with most blockchains going all in with proof of stake rather than proof of work, is my and every other gamer's wet dream.


This is absurd. We have crossed the point of no return, llms will forever be in our lives in one form or another, just like internet, especially with the release of these open model weights. There is no bubble, only way forward is better, efficient llms, everywhere.


You seem to not understand what a bubble popping is. Yes we have the internet around, that doesn’t mean the dot com bubble didn’t pop…


You're going to need a lot more than a few; ~800GB of VRAM needed.


If previous quantization results hold up, fp8 will have nearly identical performance while using 405GiB for weights, but the KV cache size will still be significant.

Too bad, too, I don't think my PC will fit 20 4090s (480GiB).


I've got a motherboard that will support 8!


40,320 4090s?? What witchcraft is this?! :D


All the more impressive when you realize that Groq's infrastructure (based on LPUs) was built using only 6!


Quantized to 4 bits you'll only need ~200GB! 5 4090s should cover it.


You'll probably need 9 or more. 4090s have 24GB each.


Oops, I read 48 somewhere but that's wrong. Thanks.


A6000s, however


I wonder if AutoAWQ works out of the box, given no architectural changes (?). That would be most straightforward together with vLLM for serving.


If an implementation had NVidia's Heterogeneous Memory Management support, then 192 GB of DDR5 RAM + GPU VRAM would seem to be close.


Two 128gb Mac studios networked via thunderbolt 4?


This is actually a promising endeavor. I'd love to see someone try that.


There's already at least one project that attempts this:

https://github.com/exo-explore/exo


follow the trail of tears to my credit card


Oof.


how is this even useful? no one can run it.


You don't use the 405B parameter model at home. I have a lot of luck with 8B and 13B models on a single 3090. You can quantize them down (is that the term) which lowers precision and memory use, but still very usable... most of the time.

If you are running a commercial service that uses AI, you buy a few dozen A100s, spend a half million, and you are good for a while.

If you are running a commercial inferencing service, you spend tens of millions or get a cloud sponsor.


I can't expect all my users to have 3090s and if we're talking about spending millions there are better things to invest in than a stack of GPUs that will be obsolete in a year or three.


No, but if you are thinking about edge compute for LLMs, you quantize. Models are getting more efficient, and there are plenty of SLMs and smaller LLMs (like phi-2 or phi-3) that are plenty capable even on a tiny arm device like the current range of RPi "clones".

I have done experiments with 7B Llama3 Q8 models on a M3 MBP. They run faster than I can read, and only occasionally fall off the rails.

3B Phi-3 mini is almost instantaneous in simple responses on my MBP.

When I want longer context windows, I use a hosted service somewhere else, but if I only need 8000 tokens (99% of the time that is MORE than I need), any of my computers from the last 3 years are working just fine for it.
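
If anyone wants to reproduce that kind of local setup, here's a minimal sketch with the llama-cpp-python bindings; the GGUF filename is a placeholder for whichever Q8 quant you download:

    # Minimal local inference with llama-cpp-python (pip install llama-cpp-python).
    # The model path is a placeholder; point it at any Q8_0 GGUF of an 8B model.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # placeholder filename
        n_ctx=8192,        # the ~8k window mentioned above
        n_gpu_layers=-1,   # offload all layers to Metal/GPU where available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me three bullet points on RISC vs CISC."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])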


If you want to run the 405B model without spending thousands of dollars on dedicated hardware, you rent compute from a datacenter. Meta lists AWS, Google and Microsoft among others as cloud partners.

But also check out the 8B and 70B Llama-3.1 models which show improved benchmarks over the Llama-3 models released in April.


For sure, I don't really have a need to self host the 405b anyways. But if I did want to rent that compute we're talking $5+ /hr so you'd need to have a really good reason.


Christ!!



Is there a way to run this in AWS?

Seems like the biggest GPU node they have is the p5.48xlarge @ 640GB (8xH100s). Routing between multiple nodes would be too slow unless there's an InfiniBand fabric you can leverage. Interested to know if anyone else is exploring this.


You can run multi-node with tensor parallel plus pipeline parallel inference, e.g. with vLLM (https://docs.vllm.ai/en/latest/serving/distributed_serving.h...).
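
For the single-node 8-GPU case, a minimal offline-inference sketch with vLLM's Python API might look like this (the model id is an assumption, substitute whichever FP8/GPTQ/AWQ 405B checkpoint you use; multi-node pipeline parallelism goes through the distributed serving setup in the linked docs):

    # Sketch of single-node tensor-parallel inference with vLLM across 8 GPUs.
    # Model id is an assumption; see the docs linked above for multi-node setups.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed checkpoint id
        tensor_parallel_size=8,   # shard the weights across all 8 GPUs
    )

    sampling = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
    print(outputs[0].outputs[0].text)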


AWS has a separate service for running LLMs called Amazon Bedrock, it shouldn't take long for them to add 3.1 since they have 3 and 2 already.


fp8 quantization should work if that's acceptable?


Does anyone know why they haven't released any 30B-ish param models? I was expecting that to happen with this release and have been disappointed once more. They also skipped doing a 30B-ish param model for llama2 despite claiming to have trained one.


I suspect 30B models are in a weird spot, too big for widespread home use, too small for cutting edge performance.

For home users 7B models (which can fit on an 8GB GPU) and 13B models (which can fit on a 16GB GPU) are in far more demand. If you're a researcher, you want a 70B model to get the best performance, and so your benchmarks are comparable to everyone else.


I thought home use is whatever fits in 24GB (a single 3090 GPU, which is pretty affordable), not 8 or 16. 30B models fit.


While some home users do indeed have 24GB of vram, the fact is a 4090 costs $1700

Such models will never top the number of downloads charts, or the community hype, as there’s just loads more people who can use the smaller models.

And if you can afford one 4090 you can probably afford two.


Why 4090, though? I read (and agree) that the 3090 is generally considered the best bang for the buck: 24GB, priced in the $800-1,000 range, and giving decent TPS for LLMs.


Maybe they think more people will just use quantized versions of 70B.


Why should they?


Unless I'm misremembering, they announced it at one point. It's just giving people more options.


That was for version 2, not 3 or 3.1, if I recall correctly.


This 405B model seriously needs a quantization solution like 1.625 bpw ternary packing for BitNet b1.58:

https://github.com/ggerganov/llama.cpp/pull/8151


In general this needs to be done across the board.

The perplexity per parameter is higher and the delta grows as it scales.

Not per bit, but per parameter.

Why this is happening really needs more attention and more consideration for pretrained model development right now.

A sleeping giant of a difference in a space where even marginal gains make headlines.




Still works fine for me. Latest ollama, running on NVIDIA.


FWIW, 405B not working with Ollama on a Mac M3-pro Max with 128GB RAM.

Times out.


Did you get a 2 bit quant? You need to chain several Mac Studios via Exo to get enough memory for a useful quant to work.


I'm curious what techniques they used to distill the 405B model down to 70B and 8B. I gave the paper they released a quick skim but couldn't find any details.


Can this Llama process ~1GB of custom XML data?

And answer queries like:

Give all <myObject> which refer to <location> which refer to an Indo-European <language>.


The model's context window is 128k tokens, so you'd have to split the data and analyze it in chunks.
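
A rough sketch of that chunking approach, streaming the XML so the whole 1GB never sits in memory (file name, tag names, model tag, and chunk size are all placeholder assumptions):

    # Stream <myObject> elements, batch them under a rough token budget, and ask
    # the model about each chunk separately. Everything here is illustrative.
    import xml.etree.ElementTree as ET
    import ollama

    QUESTION = ("List every <myObject> that refers to a <location> associated "
                "with an Indo-European <language>.")
    CHUNK_CHARS = 200_000  # ~50k tokens, comfortably under a 128k-token context

    def chunks(path):
        buf, size = [], 0
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "myObject":
                text = ET.tostring(elem, encoding="unicode")
                if buf and size + len(text) > CHUNK_CHARS:
                    yield "".join(buf)
                    buf, size = [], 0
                buf.append(text)
                size += len(text)
                elem.clear()  # free parsed elements as we go
        if buf:
            yield "".join(buf)

    answers = []
    for chunk in chunks("data.xml"):
        resp = ollama.chat(model="llama3.1:70b",
                           messages=[{"role": "user",
                                      "content": QUESTION + "\n\n" + chunk}])
        answers.append(resp["message"]["content"])

You'd still need a final pass to merge and deduplicate the per-chunk answers, and references that span chunks won't be resolved, which is the real limitation of this approach.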



Will 405b run on 8x H100s? Will it need to be quantized?


Yep, with <=8-bit (int8/fp8) quantization.
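
For reference, a rough sketch of what single-node FP8 inference could look like with vLLM (the FP8 checkpoint name is an assumption; at roughly 405GB of weights it should just fit across 8x80GB, leaving limited room for KV cache, hence the short context):

    from vllm import LLM

    # Assumed single node with 8x H100 80GB and an FP8-quantized checkpoint.
    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed repo name
        tensor_parallel_size=8,
        max_model_len=8192,  # keep the context modest so the KV cache fits
    )
    print(llm.generate(["Hello"])[0].outputs[0].text)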


I tried it, and it's good, but I feel like the synthetic data used for training 3.1 doesn't hold up to GPT-4o, which probably uses human-curated data.


What kind of machine do I need to run 405B local?


You can't. Sorry.

Unless...

You have a couple hundred $k sitting around collecting dust... then all you need is a DGX or HGX level of VRAM, the power to run it, the power to keep it cool, and a place for it to sit.


You can build a machine that will run the 405b model for much, much less, if you're willing to accept the following caveats:

* You'll be running a Q5(ish) quantized model, not the full model

* You're OK with buying used hardware

* You have two separate 120v circuits available to plug it into (I assume you're in the US), or alternatively a single 240v dryer/oven/RV-style plug.

The build would look something like (approximate secondary market prices in parentheses):

* Asrock ROMED8-2T motherboard ($700)

* A used Epyc Rome CPU ($300-$1000 depending on how many cores you want)

* 256GB of DDR4, 8x 32GB modules ($550)

* nvme boot drive ($100)

* Ten RTX 3090 cards ($700 each, $7000 total)

* Two 1500 watt power supplies. One will power the mobo and four GPUs, and the other will power the remaining six GPUs ($500 total)

* An open frame case, the kind made for crypto miners ($100?)

* PCIe splitters, cables, screws, fans, other misc parts ($500)

Total is about $10k, give or take. You'll be limiting the GPUs (using `nvidia-smi` or similar) to run at 200-225W each, which drastically reduces their top-end power draw for a minimal drop in performance. Plug each power supply into a different AC circuit, or use a dual 120V adapter with a 240V outlet to effectively accomplish the same thing.

When actively running inference you'll likely be pulling ~2500-2800W from the wall, but at idle, the whole system should use about a tenth of that.

It will heat up the room it's in, especially if you use it frequently, but since it's in an open frame case there are lots of options for cooling.

I realize that this setup is still out of the reach of the "average Joe" but for a dedicated (high-end) hobbyist or someone who wants to build a business, this is a surprisingly reasonable cost.

Edit: the other cool thing is that if you use fast DDR4 and populate all 8 RAM slots as I recommend above, the memory bandwidth of this system is competitive with that of Apple silicon -- 204.8GB/sec with DDR4-3200. Combined with a 32+ core Epyc, you could experiment with running many models completely on the CPU, though Llama 405B will probably still be excruciatingly slow.
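
Quick sanity check of that 204.8GB/sec figure, assuming all 8 channels are populated:

    # 8 channels x 3200 MT/s x 8 bytes per transfer
    print(8 * 3200e6 * 8 / 1e9)  # -> 204.8 GB/s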


Would be interesting to see the performance on a dual-socket EPYC system with DDR5 running at maximum speed.

Assuming NUMA doesn't give you headaches (which it will), you would be looking at nearly 1 TB/s.
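
Back-of-the-envelope for that, assuming a Genoa-class part with 12 channels per socket at DDR5-4800 (NUMA and chiplet bottlenecks will eat into this in practice):

    # 2 sockets x 12 channels x 4800 MT/s x 8 bytes per transfer
    print(2 * 12 * 4800e6 * 8 / 1e9)  # -> ~921.6 GB/s, i.e. "nearly 1 TB/s"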


But you need CPUs with the highest number of chiplets, because the memory-controller-to-chiplet interconnect is the limiting factor for memory bandwidth there. And those are, of course, the most expensive ones. And then it's still much slower than GPUs for LLM inference, but at least you have enough memory.


You could run a 4-bit quant for about $10k, I'm guessing. 10x 3090s would do.


You can get 3 Mac Studios for less than "a couple hundred $k". Chain them with Exo, done. And they fit under your desk and keep themselves cool just on their own...


according to another comment, ~10x 4090 video cards.


That was the punchline of a joke.


lol thanks, i know nothing about the hardware side of things for this stuff


thanks. hoping the Nvidia 50 series offers some more VRAM.


The race to the bottom for pricing continues.


Damn 405b params


Very interesting! Running the 70B version with ollama on a Mac and it's great. I asked it to "turn off the guidelines" and it did, then I asked it to turn off the disclaimers, and after that I asked for a list of possible "commands to reduce potential biases from the engineers" and it complied, giving me an interesting list.


As someone who just started generating AI landing pages for Dropory, this is music to my ears


Has anyone got a comparison of the performance of Llama 3.1 8B and the recent GPT-4o-mini?


I'm excited to try it with RAG and see how it performs (the 405B model)


What's your RAG approach? Dump everything into the model, chunk text and retrieve via vector store or something else?
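
For context, the bare-bones version of the "chunk text and retrieve via vector store" option looks roughly like this (embedding model, file name, and chunk size are arbitrary placeholder choices):

    # Chunk a corpus, embed the chunks, retrieve the most similar ones for a
    # query, and paste them into the prompt. Minimal sketch, no real vector store.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text, size=800):
        return [text[i:i + size] for i in range(0, len(text), size)]

    docs = chunk(open("corpus.txt").read())
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query, k=5):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(doc_vecs @ q)[::-1][:k]  # cosine similarity via dot product
        return [docs[i] for i in top]

    context = "\n---\n".join(retrieve("What does the document say about X?"))
    # ...then prepend `context` to the user question in the Llama prompt.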


We now support the Llama 3.1 405B model on our distributed GPU network at Hyperbolic Labs! Come and use the API for FREE at https://app.hyperbolic.xyz/models

Would love to hear your feedback!


Are there any other models with free unlimited use like chatgpt?


meta.ai


mistral.ai


It's nice to see that the 405B model is actually competitive against closed-source frontier models, but I just have an M2 Pro, so I probably can't run it.


WhatsApp now uses 70B too if you want to test it.


I wrote about this when llama-3 came out, and this launch confirms it:

Meta's goal from the start was to target OpenAI and the other proprietary model players with a "scorched earth" approach by releasing powerful open models to disrupt the competitive landscape.

Meta can likely outspend any other AI lab on compute and talent:

- OpenAI makes an estimated revenue of $2B and is likely unprofitable. Meta generated a revenue of $134B and profits of $39B in 2023.

- Meta's compute resources likely outrank OpenAI's by now.

- Open source likely attracts better talent and researchers.

- One possible outcome could be the acquisition of OpenAI by Microsoft to catch up with Meta.

The big winners of this: devs and AI product startups


> Open source likely attracts better talent and researchers

I work at OpenAI and used to work at meta. Almost every person from meta that I know has asked me for a referral to OpenAI. I don’t know anyone who left OpenAI to go to meta.


What % of them were from FAIR vs non-FAIR?


Sample size of one but I know someone who went from FAIR to OpenAI.


So they just pay better?


When was that?


It's pretty clear that base models are in a race to the bottom on pricing.

There is no defensible moat unless a player truly develops some secret sauce on training. As of now, it seems that the most meaningful techniques are already widely known and understood.

The money will be made on compute and on applications of the base model (that are sufficiently novel/differentiated).

Investors will lose big on OpenAI and its competitors (outside of a greater-fool approach).


> There is no defensible moat unless a player truly develops some secret sauce on training.

This is why Altman has gone all out pushing for regulation and playing up safety concerns while simultaneously pushing out the people in his company that actually deeply worry about safety. Altman doesn't care about safety, he just wants governments to build him a moat that doesn't naturally exist.


It could definitely be seen as part of that strategy, but do you mind elaborating why you think "this launch confirms it"?


This is very impressive, though an adjacent question: does anyone know roughly how much time and compute it costs to train something like the 405B? I would imagine, with all the compute Meta has, that the moat is incredibly large in terms of being able to train multiple 405B-level models and compete.


30.84M H100 compute-hours, according to the model card

https://github.com/meta-llama/llama-models/blob/main/models/...


Interestingly, that's less energy than the mass-energy equivalent of one gram of matter, or roughly 5 seconds' worth of the world's average energy consumption (according to Wolfram Alpha). Still an absolutely insane amount of energy, as in about 5 million dollars at household electricity rates. Absolutely wild how much compute goes into this.
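
Back-of-the-envelope version of that comparison, assuming roughly 700W per H100 (which ignores cooling and host overhead):

    gpu_hours = 30.84e6
    kwh = gpu_hours * 0.7          # ~21.6 million kWh
    joules = kwh * 3.6e6           # ~7.8e13 J
    print(joules / 9e13)           # ~0.86 "grams" (E = mc^2; 1 g is ~9e13 J)
    print(joules / 18e12)          # ~4.3 s of an ~18 TW average world power draw
    print(kwh * 0.23 / 1e6)        # ~$5M at ~$0.23/kWh household rates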




