Fine tune a 70B language model at home (answer.ai)
909 points by jph00 | 206 comments
Jeremy from Answer.AI here. This is our first project since launching our new R&D lab at the start of this year.

It's the #1 most requested thing I've been hearing from open source model builders: the ability to use multiple GPUs with QLoRA training. So that's why we decided to make it our first project.

Huge thanks to Tim Dettmers for helping us get started on this -- and of course for creating QLoRA in the first place!

Let me know if you have any questions or thoughts.




One thing I forgot to mention in the post which I think is kinda cool: at the NeurIPS Efficiency Challenge this year, where Tim Dettmers and I both did keynotes, every single top-ranked entry used QLoRA! The challenge was to create the most accurate model on a single GPU in 24 hours.

I think that is a great example of how important and useful QLoRA is. Maybe we should run a dual-GPU challenge next time now that multi-GPU is working...


It's clear that QLoRA has opened up finetuning to a wider audience with limited compute, which is a good thing.

One thing I've wondered about: what are the drawbacks to using QLoRA? For example if compute is not a limit, I'm guessing one should not use QLoRA and finetune in full precision instead?

Afaik when a model is first quantized to nf4 (before finetuning begins), model performance is degraded from baseline (see https://x.com/Tim_Dettmers/status/1661482614811918338?s=20).

Dettmers shows that after finetuning wrt the dataset, the result is as good as full precision. But afaik he never explored the effects outside the finetuning data. Assuming the finetuning dataset is small, the model will just be the degraded nf4 version, right? Or perhaps finetuning will even skew the model in weird ways (trying to fix quantization errors).

Anecdotally, models finetuned with QLoRA perform well. Does anyone have any papers or a careful analysis of this?


Are any of the NIPS resources available online?



Congrats on the release.

On Twitter someone asked you if you would provide any risk assessment reflection, and you replied that the risk would be similar to the release of a new model of pen or pencil.

That reply, while cute, isn’t accurate. A new pencil does not mean new capabilities for humanity, while the whole point of your release is that it does afford new capabilities and therefore new risks.

The woman, a professor working on the societal impacts of AI, asked a straightforward question in apparent good faith [1]. Your reply did not seem to me to be in good faith.

Can you explain the apparent disconnect? I’m less concerned as to whether or not you would release an assessment and more concerned at the dismissive attitude, especially towards a female professor studying societal impacts of AI, a fellow researcher.

[1] https://x.com/dr_atoosa/status/1765874818283385071


I believe my answer is accurate. I don't know on what basis you claim otherwise. This is my work, and I'm well placed to make the assessment. My response is based on reading a great many papers and doing many experiments over the last few years, and I am confident of it.

By way of background: I studied the philosophy of ethics at university, I co-authored a book chapter on AI/ML ethics, my wife and I quit our jobs and worked for free for years entirely focused on trying to help society adapt to and benefit from AI, I wrote the actual article we're commenting on, and the article is about LLM fine-tuning -- a field I to some extent created by developing the ULMFiT algorithm.

The person in question is, IIRC, at Governance AI, a group whose work I spent months studying in depth -- work which I believe is more likely to cause harm to society than to benefit it, as I explained here:

https://www.fast.ai/posts/2023-11-07-dislightenment.html


Above your scientific and technical contributions to humanity, thank you for being a source of light in front of these shadow philia, dark minds.


I think the power concentration problem, the successor species problem, and the harmful content problem are not particularly aligned in how they would be solved. Am I correct in guessing you believe the power concentration problem is important and the others are much less so?


[flagged]


Implicit in every reply you've given is the assumption that OP is treating the criticism from this researcher differently because she's a woman. Do you have any basis on which you're making this assumption? OP explained that they have substantive issues with the organization of which this researcher is a member.


[flagged]


> I can see why it would appear that I’m saying that, but that was not my intention.

You are talking out of both sides of your mouth. In another comment on this same thread, you say this:

> [...]women in the field are more readily dismissed, and I think they shouldn’t be. It’s a moment to check our internalized biases and make sure we’re operating in good faith.

In your original comment you explicitly accuse the OP of operating in bad faith, presumably as a result of "internalized biases" as described above. How does this not add up to an assumption that OP treated the researcher differently because she's a woman? It is exactly what you are implying.


And you would have called a simple "no" dismissive, too…


Then why's it relevant to keep mentioning it's a woman?


I think it's easier to dismiss risk with this project as it allows democratised access to AI models and furthers research in the field. The ability to generate low-quality content has been available since long before LLM technology; additionally, these 70B-param models are just barely fitting into $10,000 worth of hardware (not accounting for M-series chips).

The scaling issue with potential runaway AI can be excluded. The potential for virus writing / security exploitation perhaps but such risks are already present with existing models so this point too can be excluded. I'm not sure there's any inherent risk here compared to what's easily available with a considerably reduced amount of resource requirements. The write up here seems concerned with allowing independent and democratised research which is a greater benefit than concentrated efforts.


"I'm an expert and the other person isn't as knowledgeable as me" doesn't make your point very well. And mentioning that you worked for free for years seems irrelevant.


> especially towards a female professor studying societal impacts of AI

> especially towards a professor studying societal impacts of AI

I can't help but feel like you're trying to load this question with some expectation that sex/gender should change how people react and respond. It shouldn't, at all, positively or negatively. Everyone is human (for now).


[flagged]


I think you're saying that correcting a bias isn't itself applying bias.

I think the poster to which you're responding is saying there wasn't any visible evidence of bias in the original behaviour.

From one point of view, you're correcting a bias (in this case, one you suspect might exist), and you believe that isn't a bias.

From another point of view, you're introducing a new bias (that definitely exists) to something that was previously (presumably) agnostic and neutral, by saying that a correction should only be applied when the question-asker meets certain criteria.

Both points of view are legitimate.

PERSONALLY I'd rather we not bring the "have you thought about how this might be discriminatory" bits into the conversation unless there's at least some vague reason to think that it was, rather than presuming it's always discrimination when the genders and/or sexes and/or races line up a certain way. But that's because I think it's more important to debate the ideas in their pure form than to correct every instance of discrimination, and that's an arbitrary value judgement I made.


I completely get your take on this and don't disagree. IMO it reads as a flippant response to a potentially serious question, and I am going to assume this is unintentional, which does not mean it is unfair to call it out.

Stepping back, this is the kind of discourse that Twitter can sometimes reinforce: short responses that, lacking context, can be interpreted in polarizing ways. After a bit of reading on the participants (because both of them are working in interesting areas), my belief is that the "pencil" response is actually shorthand for a whole set of beliefs that provide a lot of context and nuance to the discussion, but if you don't know that, it sounds like saying "AI is as dangerous as a new brand of chewing gum".

In addition, without defining what risks we're talking about, it's really hard to know the scope the answer is addressing. E.g., societal risk due to AI in general? vs., say, the risks of affecting the quality of an existing model with fine-tuning?

So, I am chalking this up to a misunderstanding due to the limitations of the platform as well as the short-hand used in the response. And I could be completely wrong :)


You are reading a lot into a tweet....


I liked that you linked to renting dual 24GB GPUs for $0.60/hour, but how long would it take to fine-tune a 70b model using your system (4 bits for weights)?

If I were a consumer I would be interested in the final price of fine tuning, for example a table with model size, training size, cost of training, and expected loss of quality with this technology.

One obvious question: can you apply your technology to the recent (-1,0,1) encoding? I think you will answer that the (-1,0,1) model is not available and you can't try it, but my question is whether, once/if that model is available, answer.ai will be able to use the same technique as in this post to fine-tune a big model on two very small GPUs, and then I should ask for a new table with a cost/benefit analysis.

Edited: I should add that I find this kind of work very useful for enabling individual users like me to compete in the market for LLM applications. This is great work, and along the lines of the book "From Zero to One" (not that I like or dislike the author) in solving the kind of problem that nobody else is trying to solve.

Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and maybe some day create a new presence on HN.


> Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and maybe some day create a new presence on HN.

If you use Stylus (or any similar browser extension), I actually wrote a style to hide points for that very reason, replacing karma and scores with `•••`

This is actually the second time I've seen someone mention this need, so I've made it into a gist and published it to userstyles, but here it is also, since it's pretty short:

    @-moz-document domain("news.ycombinator.com") {
        /* Hide karma and points on replies */
        span.pagetop #karma, span.comhead span.score {
            visibility: hidden;
            position: relative;
            display: inline-block;
            height: 10px !important;
            overflow: hidden;
        }
        span.pagetop #karma {
            width: 0.8rem !important;
        }
        span.comhead span.score {
            width: 0.8rem !important;
        }
        span.pagetop #karma::before, span.comhead span.score::before {
            content: "•••";
            visibility: visible;
            overflow: hidden;
            opacity: 0.8;
            font-family: Helvetica, Arial, sans-serif !important;
        }
    }

https://gist.github.com/airstrike/62584e6ffb6104791c0ae48a8e...

https://userstyles.world/style/15164/hackernews-hide-karma-a...


I wish this was built in, but I understand why it isn't: the karma display is an intentional (and abusive) psychological exploit.


On how long: finetuning is influenced by your dataset size (more = slower), sequence length (since attention is O(N^2)), data movement etc., and most importantly how many steps you want to take. For QLoRA, some runs need only a few hundred steps, which can complete in minutes to 1 hour. Too many can overfit. So being able to fit it on consumer GPUs can be very cost effective.

On the 1.58bit paper, from what I understand, this requires a total retraining from scratch. Hopefully the researchers will open source their weights :)

On the technicals, weights are encoded in (-1, 0, 1), whilst QLoRA uses a 4bit dynamic mapping of 16 numbers. The only change required would be the torch.matmul(X, W) step, where it'll be torch.bitlinear_matmul(X, W). Before with QLoRA, one has to do torch.matmul(X, dequantize(W)). So one has to implement torch.bitlinear_matmul. The backward is torch.bitlinear_matmul(dY, W.T).
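
To make that concrete, here's a toy sketch of the QLoRA path (my own illustration, not the real bitsandbytes kernel; `codes`, `code_table`, and `scales` are made-up placeholders):

    import torch

    # Toy QLoRA-style forward: W is stored as 4-bit indices into a 16-entry
    # code table plus scales, so the matmul first dequantizes W.
    def qlora_matmul(X, codes, code_table, scales):
        W = code_table[codes.long()] * scales     # dequantize(W) to compute dtype
        return X @ W                              # torch.matmul(X, dequantize(W))

    # A BitNet-style "bitlinear" matmul would instead keep W as (-1, 0, 1)
    # and replace the multiplies with adds/subtracts inside the kernel.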


What's the magic in 1.58bit vs. 4 bit that it makes it so much more efficient (claimed)?


From what I understand, using (-1, 0, 1) removes multiplications in GPUs. Ie assume you have a weight matrix and multiply it by some activations

    [10, 20, 30] x [[-1,  0,  1],
                    [ 0,  1, -1],
                    [ 1,  1,  0]]
Instead of doing 10(-1) + 20(0) + 30(1) + ..., since we know beforehand the weights are only (-1, 0, 1), we can simply flip the sign and add, or have the hardware branch: if (-1) do subtraction, if (0) skip, if (1) do addition.
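
A tiny concrete version of that add/skip/subtract trick, in plain Python just to make the arithmetic explicit:

    # Ternary dot product: no multiplications, only add / skip / subtract.
    def ternary_dot(activations, ternary_weights):
        acc = 0
        for a, w in zip(activations, ternary_weights):
            if w == 1:
                acc += a
            elif w == -1:
                acc -= a
            # w == 0: skip entirely
        return acc

    # First column of the example above: 10*(-1) + 20*0 + 30*1 = 20
    print(ternary_dot([10, 20, 30], [-1, 0, 1]))   # -> 20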

Floating point multiplication does addition of the exponents and multiplying of the mantissa. So just simplifying:

Float16 has E=5, M=10. Ie around 5 + 10^2 space needed = 105.

Bfloat16 has E=8, M=7. So 8 + 7^2 = 57 space.

Float8(143) E=4, M=3. So 4 + 3^2 = 13 space.

1.58(16bit) E=5, M=10. Addition only, so shift E say 5 + 10 addition = 15.

1.58(8bit) E=4, M=3. Addition only, so shift E say 4 + 3 addition = 7.

Obviously I'm simplifying, but with only additions, 1.58 uses say 7 space, whilst FP8 uses 13 space, so in theory 2x more transistors can be crammed, ie 2x more FLOPs than FP8.


A really simple explanation is that for inference, feed-forward networks are threshold circuits, and by their nature ANNs have binary outputs, outputting true and false (the same as being a threshold circuit).

So if you train your models with that in mind, your weights can be reduced to -1,0,1, reducing the space complexity.

I don't think the costs in expressiveness are captured quite yet, but as perplexity doesn't care about correctness, if that is the metric that matters for you it will probably reduce memory requirements for inference.


Also, just to add: I think the 1.58-bit approach is mostly faster for inference, because training still has to multiply a lot of floating point gradients by integer activations, hold floating point weights/gradients for rounding, and deal with norms and stuff. Could be wrong about that though.


> Edited: Now that I have a total of 23 points on HN, I will change my password to some random one, just to cure my desire to look for votes and try to get some work done, and maybe some day create a new presence on HN.

The irony of making an unnecessary edit like this to virtue signal for implicit social currency by shitting on the explicit form.


As mentioned in the post, benchmarking results are coming in a later post. But in short: you can train an epoch of Alpaca in 24 hours or so, which is enough to get very significant change in model behavior.


> the recent (-1,0,1) encoding?

A side point, but this "recent" encoding goes back to a 2017 paper from the Allen Institute. These days a seven year old paper is ancient.

They went further and showed you could get away with binary; you don't even need ternary!


Goes back before then. This got popularized by BinaryConnect in 2015, and groups were training binary networks as early as 2011.

You are probably referring to XNOR net, and the novel piece there was also using binary activations (which bitnet is not).

So as far as I can tell, bitnet is basically BinaryConnect applied to LLMs.

https://arxiv.org/abs/1511.00363


Thanks for your informative comment. What HN is for!


The bitnet paper was showing worse results than fp16 transformer with the same parameter count. The shocking result in the 1.58b paper (same group) is no quality loss compared to fp16.


i think those tables could be a fascinating product. All parties involved could purchase them for private and public use.

P.S. I thought one was supposed to spend the HN points on mocking North Americans, shameless self-promotion, unpopular facts, general trolling and complaints about topics existing. I could go on but I haven't the points.


I like how you think about social media.


This is great, but one thing I really hoped would come sooner is fast training on Metal. As things are, you can get an M1/M2 Ultra (~800 GB/s memory bandwidth; for comparison, the RTX 4090 is ~1050 GB/s) Mac Studio with 128GB RAM for ~$3500. For large model inference, this is already way more affordable than stacking GPUs while being "fast enough", but training solutions are basically non-existent. I do wonder why; it feels like a low-hanging fruit.


Compute limited - an M2 Ultra has ~27 TFLOPS, a 4090 80+.


So it should just take longer..


If you don't care how long it takes you can get an old server with 128GB of RAM for a lot less than $3500.


But that isn't GPU memory right? On the Mac it is.


> But that isn't GPU memory right? On the Mac it is.

They call it that but it's really LPDDR5, i.e. normal DRAM, using a wide memory bus. Which is the same thing servers do.

The base M3, with "GPU memory", has 100GB/s, which is less than even a cheap desktop PC with dual channel DDR5-6400. The M3 Pro has 150GB/s. By comparison a five year old Epyc system has 8 channels of DDR4-3200 with more than 200GB/s per socket. The M3 Max has 300-400GB/s. Current generation servers have 12 channels of DDR5-4800 with 460GB/s per socket, and support multi-socket systems.

The studio has 800GB/s, which is almost as much as the modern dual socket system (for about the same price), but it's not obvious it has enough compute resources to actually use that.


But if you need to get 2x consumer GPUs seems to me the reason is not for the compute capabilities but rather to be able to fit the model on the VRAM of both. So what exactly does having lots of memory on a server help with this when it’s not memory the GPU can use unlike on Apple Silicon computers?


The problem with LLMs is that the models are large but the entire model has to be read for each token. If the model is 40GB and you have 80GB/s of memory bandwidth, you can't get more than two tokens per second. That's about what you get from running it on the CPU of a normal desktop PC with dual channel DDR5-5200. You can run arbitrarily large models by just adding memory but it's not very fast.

GPUs have a lot of memory bandwidth. For example, the RTX-4090 has just over 1000GB/s, so a 40GB model could get up to 25 tokens/second. Except that the RTX-4090 only has 24GB of memory, so a 40GB model doesn't fit in one and then you need two of them. For a 128GB model you'd need six of them. But they're each $2000, so that sucks.

Servers with a lot of memory channels have a decent amount of memory bandwidth, not as much as high-end GPUs but still several times more than desktop PCs, so the performance is kind of medium. Meanwhile they support copious amounts of cheap commodity RAM. There is no GPU, you just run it on a CPU with a lot of cores and memory channels.
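
A back-of-the-envelope version of that arithmetic (peak-bandwidth upper bounds only; real systems land below these):

    # Rough ceiling: tokens/sec <= memory bandwidth / bytes read per token,
    # since the whole model is read once per generated token.
    def max_tokens_per_sec(model_gb, bandwidth_gb_s):
        return bandwidth_gb_s / model_gb

    print(max_tokens_per_sec(40, 80))     # dual-channel desktop DDR5: ~2 tok/s
    print(max_tokens_per_sec(40, 1000))   # RTX 4090-class GPU:       ~25 tok/s
    print(max_tokens_per_sec(40, 460))    # 12-channel DDR5 server:   ~11 tok/s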


Got it, thanks!


It's fast enough to do realtime (7 tok/s) chat with 120b models.

And yes, of course it's not magic, and in principle there's no reason why a dedicated LLM-box with heaps of fast DDR5 couldn't cost less. But in practice, I'm not aware of any actual offerings in this space for comparable money that do not involve having to mess around with building things yourself. The beauty of Mac Studio is that you just plug it in, and it works.


> It's fast enough to do realtime (7 tok/s) chat with 120b models.

Please list quantization for benchmarks. I'm assuming that's not the full model because that would need 256GB and I don't see a Studio model with that much memory, but q8 doubles performance and q4 quadruples it (with corresponding loss of quality).

> But in practice, I'm not aware of any actual offerings in this space for comparable money that do not involve having to mess around with building things yourself.

You can just buy a complete server from a vendor or eBay, but this costs more because they'll try to constrain you to a particular configuration that includes things you don't need, or overcharge for RAM etc. Which is basically the same thing Apple does.

Whereas you can buy the barebones machine and then put components in it, which takes like fifteen minutes but can save you a thousand bucks.


6.3 tok/s has been demonstrated on q4_0 Falcon 180B on the 192GB Mac Studio: https://x.com/ggerganov/status/1699791226780975439?s=46&t=ru...


A lot of people are reporting tokens per second at low context sizes without discussing how slow it is at bigger sizes. In some cases they are also not mentioning time to first inference. If you dig around /r/LocalLLaMA you can find some posts with better benchmarking.


That is with 4-bit quantization. For practical purposes I don't see the point of running anything higher than that for inference.


That's interesting though, because it implies the machine is compute-bound. A 4-bit 120B model is ~60GB, so you should get ~13 tokens/second out of 800GB/s if it was memory-bound. 7/s implies you're getting ~420GB/s.

And the Max has half as many cores as the Ultra, implying it would be compute-bound too.


The issue here isn't specifically about the classification of memory, be it "unified memory," RAM, or VRAM. The primary concern is ensuring there's enough memory capacity for the models required for inference. The real question at hand is the Mac's value proposition in terms of inference speed, particularly for models as large as 70 billion parameters. Utilizing a 4090 GPU can facilitate real-time inference, which is the desired outcome for most users. In contrast, a Mac Studio offers close to real-time inference speeds, which might be disappointing for users expecting a real-time experience. Then, there's the option of CPU + RAM-based inference, which suits scenarios where immediate responses aren't crucial, allowing for batch processing of prompts and subsequent retrieval of responses. Considering the price points of both the Mac Studio and high-end GPUs are relatively comparable, it begs the question of the practicality and value of near real-time inference in specific use cases.


Considering that the topic is approachability and energy efficiency, that Mac Studio will do reasonably fast inference while consuming <200W at full load.

The speed is certainly not comparable to dedicated GPUs, but the power efficiency is ridiculous for a very usable speed and no hardware setup.


This, and then you get to have a Mac Studio.

I have one, where I selected an M1 Ultra and 128G RAM to facilitate just this sort of thing. But in practice, I'm spending much more time using it to edit 4K video, and as a recording studio/to develop audio plugins on, and to livestream while doing these things.

Turns out it's good at these things, and since I have the LLAMA 70b language model at home and can run it directly unquantized (not at blinding speed, of course, but it'll run just fine), I'm naturally interested in learning how to fine tune it :)


Yep, I also got mine specifically for LLMs and ended up using it as a second desktop for other things; actually strongly considering making it my primary at this point.

I still wouldn't recommend it to someone just looking for a powerful desktop, just because $3K is way overpriced for what you get (the non-replaceable 1TB SSD is so Apple!). But it's certainly great if you already have it...


"Real-time" is a very vague descriptor. I get 7-8 tok/s for 70b model inference on my M1 Mac - that's pretty real-time to me. Even Professor-155b runs "good enough" (~3 tok/s) for what I'd consider real-time chat in English.


[flagged]


I'm not a gpt. But now you could say that this is exactly how a gpt would answer and we get stuck in a loop and there's no obvious way to prove that I'm not a gpt.


Interesting. Let's review the comment.

> The issue here isn't specifically about the classification of memory, be it "unified memory," RAM, or VRAM. The primary concern is ensuring there's enough memory capacity for the models required for inference.

The comment chain is about training, not inference.

> The real question at hand is the Mac's value proposition in terms of inference speed, particularly for models as large as 70 billion parameters.

Again, wrong topic.

> Utilizing a 4090 GPU can facilitate real-time inference, which is the desired outcome for most users.

Generic statement. Semantically empty. Typical LLM style.

> In contrast, a Mac Studio offers close to real-time inference speeds, which might be disappointing for users expecting a real-time experience.

Tautological generic statement. Semantically empty. Typical LLM style.

> Then, there's the option of CPU + RAM-based inference, which suits scenarios where immediate responses aren't crucial, allowing for batch processing of prompts and subsequent retrieval of responses.

Contradicts the first sentence that "classification of memory" isn't important. Fails to recognize this is the same category as the previous statement. Subtle shift from the first sentence, which declared "primary concern is ... memory capacity", to focusing purely on performance. This kind of incoherent shift is common in LLM output.

> Considering the price points of both the Mac Studio and high-end GPUs are relatively comparable, it begs the question of the practicality and value of near real-time inference in specific use cases.

Completes shift from memory capacity to performance. Compares not really comparable things. "Specific use cases" is a tell-tale LLM marker. Semantically empty.


I feel the need to point out that people, who spend many hours writing with an LLM, will eventually start writing like the LLM.


I'm definitely guilty of this. Especially for non-native speakers, who might more easily lean towards adopting phrases from others (including GPTs) because they are not sure how to phrase things correctly.


antinf congratulations I think you have proven, beyond any doubt, that I'm a gpt.

(This is a semantically empty tautological generic statement.)


'Write me something profane?' That probably weeds out commercially available GPTs?


No, this is a common fallacy. You can tell ChatGPT, one of the most infamously hobbled GPTs, in custom instructions that you do want profanity, and it will oblige. This is not a jailbreak, this is supported behavior.


I do think that there is a certain amount of gpt bot activity present on HN. But I don't think it makes sense to call people out and saying they are a gpt just based on one comment.


I'm sorry, but I can't fulfill that request. Is there anything else I might assist you with? ;)



A 4090 is actually 300+, so >10 times faster. What would take 1 month on 2x4090 would take 2 years on the M2.


Memory limited - an M2 Ultra has >150GiB, a 4090 24GiB.


So why is nobody doing this?


My personal experience with Apple Silicon and machine learning in comparison with Nvidia is that the libraries are often missing various features on Apple, leading to headaches when trying to use various libraries and tools. Apple is working to bridge the gap and I am excited for the gap to be closed because the memory bandwidth on big M2 and M3 machines is monstrous.


sounds similar to how people have described game dev for Mac. The hardware is there. It just isn't supported.


Apple could single-handedly kill the consumer dGPU market if they released proper low-level APIs for their M1/2/3. I feel they have something huge coming in the pipe to upend the "AI" market.


M1, M2, M3 still have very low number of GPU cores. Apple should release some better hardware to take advantage of their recently released MLX library.


At this moment it looks clear to me that Apple won't go that way. It's enough for them to focus on inference and actual applications, not the heavy training part. They have probably been training models on a cluster with non-Apple silicon and making them available on their chips only for inference.


Not to mention entirely outsourcing training workloads to specialist firms. Apple does a lot of secretive outsourcing of things you might think they would or should do in-house. This contrasts with Google and Meta who seem to like keeping everything in-house.


It’s true that their GPUs are slower than Nvidia’s. But keep in mind that cores are really different and cannot be compared across architectures. You want more Gflops, not necessarily more cores.


They do, but for inference at least, it's memory bandwidth that is the primary limiting factor for home LLMs right now, not raw compute.


Wonder if the apple silicon ultra series will start using HBM3(e) on desktop in the future.


This might be the most interesting constructive approach in "Open Source" LLMs I've seen. Grounded, reasonable and inviting to replicate! I wish academia took that as a standard.

Great job!


Answer.ai is truly open AI. :)


That's what was said about OpenAI and Mistral, before the VCs and investors came in.

After that, the larger flagship AI models were closed up again and offered as a server-only product.


I doubt it, Jeremy’s been walking the walk for quite a while now, when it comes to opening up access to AI, especially with his excellent, free fast.ai course, it seems pretty clear that his primary motivations are in helping others. (If you’re in this thread, Jeremy, thanks for fast.ai, it helped me immensely in getting started in training models).


For the most part this post was easy to read, and I could feel the collective excitement of the team. I came away feeling like I'd learned something and ready to try it myself. The only time the post gets a little fuzzy is "...store the quantized parameters in a selectable data type, where that storage data type is the same data type as the “computation type” of the model". I assume "selectable datatype" is the float size of the quantization?


We've got a technical post with all the juicy details coming next week. But that bit refers to packing the 4-bit weights into a type FSDP is happy to shard (like float16 or float32) which matches the other non-quantized bits of the model. This way FSDP will happily wrap and shard all the parameters as if they were just normal floats.
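
If it helps to picture that before the technical post lands, here's a toy version of the packing idea (my own sketch, not the actual code in the repo):

    import torch

    # Pack two 4-bit weight codes per byte, then reinterpret the byte buffer
    # as the compute dtype so FSDP treats it like any other float parameter.
    def pack_4bit(codes, storage_dtype=torch.bfloat16):
        codes = codes.to(torch.uint8).flatten()
        assert codes.numel() % 4 == 0               # 2 codes/byte, 2 bytes/bf16
        packed = (codes[0::2] << 4) | codes[1::2]   # high nibble | low nibble
        return packed.view(storage_dtype)           # reinterpret bytes, no copy

    def unpack_4bit(packed):
        b = packed.view(torch.uint8)
        return torch.stack((b >> 4, b & 0xF), dim=-1).flatten()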


This is a fantastic breakthrough for those of us who fine-tune LLMs on limited hardware budgets.

I was curious about the choice of FSDP over DeepSpeed. I have been using Axolotl for fine-tuning, and FSDP has been broken there, whilst DeepSpeed is rock solid. Why FSDP over DeepSpeed jph00?


DeepSpeed has more features than FSDP, but it's much more complex to hack on -- FSDP is written directly in python using calls to the PyTorch library, whereas DeepSpeed is 20% C++ and 10% CUDA (according to the GitHub stats).

We've found that FSDP works just as well for our needs, and we appreciated the increased "hackability".

(Axolotl is terrific BTW. I hadn't heard of problems with it with FSDP before -- I'll see if that's something we can help with.)


Good news -- axolotl has just merged support for FSDP/QLoRA training, thanks to a rapid collaboration between the axolotl and Answer.AI teams!


There's a long GitHub issues thread with Teknium struggling with Mistral 7B and loss spikes. Easy to find by googling.


Yes I'm familiar with Teknium's Mistral issues, which were resolved some time ago. IIRC they weren't related to FSDP.


This is such exciting news! Huge thanks to you for your continued work in making sense of AI.

I wonder if the recent Bitnet 1.58 paper [the use of ternary bits in lieu of fp/int] might be an advancement that could further reduce the computation required for inference?


Yes, along with the many other <4 bit quant methods recently developed -- there's been a wonderful boom in low-bit quant methods in the last 6 months, and we've got our own ideas for taking them further too. Along with QLoRA/FSDP, we're likely to see big advances in model training this year on consumer hardware.


Would be cool to build an “LLM@home” project like folding@home or SETI@home (rip), where tons of folks could donate their GPUs and train something huge and FOSS. I don’t know enough about how these models are trained though. Could it be chunked up and distributed in that way, then stitched/merged back together?


https://stablehorde.net/ comes somewhat close.


Golem has been building this since 2017

https://www.golem.network/

They also have on option to get paid in crypto for your GPU power.

The challenge is that the AI software architectures are not made "to run over the Internet."



Always figured it would be too slow. Distributed training on clusters is usually done with 1+ gb/s interconnects.


If you are gonna be doing stuff like this I’m damn excited for answer.ai!

It’ll be the first time we’ll have someone who knows AI create leverage to open source it.

Way to go!


> It’ll be the first time we’ll have someone who knows AI create leverage to open source it.

It can’t be overstated how important this is. Thank you again.


What’s the best way for people to contribute to AI open source? I can’t produce things like this for many reasons so how can I and others like me do our part to keep SOTA AI open?


There is a ton you can do to help SOTA AI remain open.

Join the community building the tools - Help with UI/UX, documentation, keeping up with the latest, and evangelizing whatever method the team building it has devised to keep it sustained.

Being part of the community itself is more valuable than you realize.


Where are you finding this community?


Off the top of my head

- try to implement techniques that are doable on home hardware like the one described in OP (requires some $$$ investment) and give feedback or contribute to documentation / guides

- learn about different techniques and do educational writeups or documentation (like https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/)

- build a tool / library that wraps academic techniques and exposes them more easily to end users (like A1111 or ComfyUI for Stable Diffusion)

Anything that can translate the high end research down to something a moderately technical user can use or learn from is a gigantic win


I am a random software engineer, but from what I've learned, high-quality open source datasets seem to be an enabler. There is a shortage of golden datasets for training and evaluation in every popular and niche area you can imagine.


Have you guys looked at using sparsification? It would probably require true re-training of the foundation model, to go at high sparse ratios (say 90% weights excluded), which could be done once on expensive GPU - but fine tuning such sparse models would require less RAM hopefully.

The trick to getting more benefit from the sparse approach is to do block sparsity (iirc, Tim Dettmers used to work on this as well, a few years ago), but a large block size (say 16x16) would require much longer retraining to recover the lost accuracy…


Yes, sparsification is another useful approach for higher efficiency, although block sparse kernels are pretty complex to work with -- especially when combined with quantization and LoRA! Most of the sparsity papers I've seen use "structured" sparsity, i.e. removing layers, attention heads, and features. But the upside from this seems somewhat limited so far.


I'm not sure about structured sparsity, but for weight sparsity, in my experience going to around 50-70% excluded weights (even with block sparsity, say 4x4) did not cause any noticeable degradation to training & quality at all. (The original paper on sparsity from LeCun suggests much higher sparsity ratios, like 90%, but for DNNs I didn't find those attainable if accuracy is important.)

The block sparsity can really help with saving RAM - because you only need to keep a short array of indexes for the excluded weights. The trouble is the kernel mult functions become complex, so it's a bit of a trade-off between RAM and GPU cycles.


Has anyone seen an implementation of 'SpQR: A Sparse-Quantized Representation,' published in June 2023 by Tim Dettmers et al.? https://arxiv.org/abs/2306.03078


Found it from https://github.com/Vahe1994/SpQR Was somehow expecting it to be at https://github.com/TimDettmers/bitsandbytes. My bad.


This is great, however there were many opportunities to use the word 'nibble' in this post and they were all missed.


So, as I understand it, this is for finetuning a preexisting llm? So not actually training one from scratch. I guess that would be too much to ask for. Nonetheless, cheers to Jeremy and the gang for the work.


For now, it's for finetuning.

The issue of to what degree it might be possible to train a model from scratch using QLoRA is still an open question. The relora paper showed that it can work in some situations, but attempts to scale it up were unsuccessful. The recent DoRA paper perhaps might allow a "re-DoRA" approach to work. If so, that could be combined with quantization to do "re-QDoRA"!


The headline and introduction on the linked page say "You can now train a 70b language model at home. We’re releasing an open source system, based on FSDP and QLoRA, that can train a 70b model on two 24GB GPUs."

How does "fine tuning" differ from "training?" Reading the linked article I had assumed I could create my own trained LLM at home with two 24GB GPUs.


The article actually sneaks in a footnote that answers this (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html#fn1): "Throughout this article “training” can refer to either pre-training, or fine-tuning".

(Generally, we've told students at fast.ai since 2017 that they should almost never be starting from random weights -- most of the time it's best to start with a pretrained model and fine-tune that, even if it's from a somewhat different domain to the problem you're working on.)


Have you changed your mind on „The End of Finetuning“ (https://www.latent.space/p/fastai ) or did I simply misunderstand that?

Oh, and thanks for quirky stuff like your APL video!


The title of that podcast isn't something I actually said (IIRC). I commented in that interview that I feel we should not consider pre-training and fine-tuning to be as separate as we do now.


So you‘re generally in favor of mixing training data without separating them in phases, but when I use pretrained weights (as you recommend instead of random weights) I generally do not have access to whatever the neural net was pretrained with by someone else, so I have to make do with my finetuning data, yes?

Thank you!


Yes.


"The right way to fine-tune language models... is to actually throw away the idea of fine-tuning. There's no such thing. There's only continued pre-training."

:) i hope i didnt pervert your intent too too much for clickbait or something, i thought it was the spirit of what you said


You most definitely can; the main difference is that only a small fraction (~2%) of the parameters get updated during training. Say you start from a model like llama-70B, which already knows English and has some world knowledge based on its pretraining dataset. It might not be ideal for drastic domain shifts, such as adapting a model to learn new languages (which might require a new tokenizer and new model embeddings), but it still might be possible to some extent.
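
To put a rough number on that (hypothetical shapes and rank here, not llama-70B's actual config):

    # LoRA trains two small matrices A (r x k) and B (d x r) per adapted
    # weight matrix, instead of the full d x k matrix W itself.
    d, k, r = 8192, 8192, 16
    full_params = d * k                # ~67M if you trained W directly
    lora_params = r * (d + k)          # ~262K for the adapter pair
    print(f"{lora_params / full_params:.2%} trainable for this layer")  # ~0.39%

The overall trainable fraction depends on the rank you pick and which layers get adapters.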


Thank you for clarifying. I have been wanting to dip my toes into LLMs at home but obviously I have a steep learning curve ahead of me, and would need considerably beefier hardware!


It’s steep but manageable, absolutely go for it. The more people who understand the tech the better.


You can take an existing 70B model and train it to do a more specific task. You're teaching it the task but you're relying on a foundation model for the base understanding of the world/words/etc.


OK, that makes sense. Thank you!


Digging into the low rank structure of the gradients, instead of the weights seems like a promising direction for training from scratch with less memory requirements: https://twitter.com/AnimaAnandkumar/status/17656138151468933...


Simo linked some older papers with this same idea: https://twitter.com/cloneofsimo/status/1765796493955674286


Lit-GPT is what I have been using to pretrain models at home: https://github.com/Lightning-AI/litgpt Using the openwebtext example, I can train a 700M param model to 2.6 loss in a few days on dual 4090s. Pretty awesome!


Training a 70b model from scratch uses 80,000 GPU hours (4.6 years if you have two of those GPUs).

The electricity would cost more than 10,000€ in Germany, just for the GPUs.


> the ability to use multiple GPUs with QLoRA training.

Thorough article!

Question: What's your opinion on:

- How viable will NVIDIA's consumer cards be in the long run?

- Besides https://tinygrad.org, what other cost-effective future alternatives could there be?


Unsloth (mentioned in the Answer.AI post) is planning multi-GPU support in a future release.


Does anyone have sources, or experience, about fine-tuning primarily to teach the model some factual data, especially when it comes to later "higher level" question answering?

For example, giving the model a bunch of text (academic papers and such) about 19th century writers, then asking things like "Who were the main influences on writer X"?

Obviously simple RAG-like approaches don't work, as such information is rarely available in the text as-is, and needs to be "extrapolated" to some extent. Long context models might work (just dumping everything into the prompt), but are way too expensive for my needs.


RAG approaches should work quite well for the examples you mentioned. It's a matter of how you approach the retrieval part - you can opt for a larger recall on retrieval, and leverage the large context window for the LLM to figure out the answer. Even if it's not "as-is", semantically if it's in there, it should be able to find it.

Other things to try out is how you approach the question "expansion" part, for example using Hypothetical Document Embeddings (HyDE); or how you approach the filtering-out part, e.g. using "System 2 Attention", https://arxiv.org/abs/2311.11829.
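
In case a sketch helps, HyDE is only a few lines (pseudo-ish Python; `llm`, `embed`, and `vector_store` are stand-ins for whatever stack you're using):

    # Hypothetical Document Embeddings (HyDE), roughly:
    # 1. have the LLM hallucinate a plausible answer to the question,
    # 2. embed that hypothetical answer instead of the raw question,
    # 3. retrieve real passages near it and answer grounded in those.
    def hyde_answer(question, llm, embed, vector_store, k=10):
        hypothetical = llm(f"Write a short passage answering: {question}")
        hits = vector_store.search(embed(hypothetical), top_k=k)
        context = "\n\n".join(hit.text for hit in hits)
        return llm(f"Using only this context:\n{context}\n\nAnswer: {question}")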


I tried most of such techniques, but the point is that this information really isn't in there directly, and to perform the question expansion, the model needs to know about the domain already.

For example, imagine that one paper is about how author X was French, in the early 19th c., and how they were one of the first to write about topic T. Another paper is about how author Y was inspired by the early 19th-c. French writers writing about T. However, this second article does not mention X at all. Asking "who were the main influences on X" would not give you the second article.

Of course, I could run "multiple-hop" RAG-like process, where the model keeps asking questions itself and so on in a loop, but this becomes extremely clumsy, and the models (even GPT-4) tend to get out of hand. It is also extremely slow, of course.


I've worked with exactly this type of problem, and for me RAG works perfectly well - it may seem "clumsy", as you put it, in terms of trying to engineer or optimize the indexing, augmentation, and retrieval techniques, but it's worth it. I would claim that going down the finetuning route lends itself to a significantly higher probability of going astray, at a MUCH larger cost.

There have been some attempts to compare RAG vs FT approaches, I would recommend this paper: https://arxiv.org/abs/2403.01432


Thanks for the input! Did you implement it in this kind of "multi-hop" way, or is there some trick I'm missing?


Nice, I tried to use QLoRA+FSDP in the past with litgpt and obviously at that time it did not work. This is very useful!


This is the best news I’ve seen all month. I think one of the great near-term dangers of AI is the bulk of the economic benefit going mainly to relatively few companies. That risk seems substantially reduced if they have to compete with a great variety of models.


Besides being a great result, the quality and clarity of the technical writing here is excellent.


Any plans on supporting AMD? In Germany, the price of an 7900XTX is HALF of a NV 4090...


take a look at recent posts from https://twitter.com/__tinygrad__ re: the state of AMD for AI work


Very interesting but hard to interpret until the performance numbers / benchmarks are available. I can already fine-tune a 70B language model at home using CPU + RAM, but it would be so slow as to be almost totally impractical (~20x slower than GPU). It would be great to see a comparison to eg 8 x A100 (available for $32/hr on AWS on-demand) and also CPU + RAM. Presumably it’s somewhere in between, but hard to predict where!


Maybe I've missed it in the article - but how long would a full training run take on 2 consumer GPUs (local or rented) ? Ballpark - hours, days... ?


The author is discussing fine-tuning a base model. How long it takes really depends on the dataset, the method, and the hyperparameters. DPO, for example, can achieve some great results with a fraction of the steps of other methods.

Just like with unsloth or axolotl, the people that use this will have to make compromises that give results in a reasonable amount of time.


This article is very well written and super informative. One thing I didn't understand is:

> At Answer.AI our north star is making useful AI more accessible. $150,000 to create your own high-quality personalized model definitely doesn’t count as accessible!

Renting an A100 on RunPod is ~$1.89 / hour. So you'd need ~80,000 A100 hours to train a useful AI model?


In the post it explicitly says you can train on two 3090-level cards, which are significantly cheaper, and the headline literally says "Finetune", not "Pretrain".


So… why do people want to fine tune LLMs at home? It seems very unlikely to provide value.

* you’re probably not going to succeed at injecting new knowledge in a way that feels satisfyingly top of mind to the bot

* you’re probably not going to create such a meaningfully new style that it warrants a Lora like in images

What’s an example use case?


Hmm why would someone on a forum called hacker news want to tinker and explore an exciting new technology. Who can say? One of life’s great mysteries really.


I’m curious what they’re trying to do because I’m curious and I don’t see it. You’re the one being a dismissive dick here.


>So… why do people want to fine tune LLMs at home? It seems very unlikely to provide value.

Asking the first question is fine, but your follow-up comment sounds more dismissive than curious.

That's probably why the snarky response.


I don’t feel that’s accurate given specific bullets explaining why I feel that way and asking why others feel differently but ymmv


I find that available LLMs have difficulty recalling instances in specific works by given authors. For example, if you ask GPT-4 "In which Philip K. Dick novel does the protagonist character consider converting to Judaism and moving to Israel?" it will respond with Dick's best known book _The Man in the High Castle_ and the character Frank Fink. The answer is incorrect. Israel does not exist in the world of that novel; furthermore, the character of Fink already is Jewish. The correct answer is Angel Archer in _The Transmigration of Timothy Archer_.

I have considered the feasibility of fine-tuning an LLM on the writings of a specific author. The idea is that it could aid writing in this way: If I currently am researching a specific author across multiple of their books, I often will get a quote of theirs trapped in my head some length of time after reading it. If I have neglected to jot down (or even to highlight) the source of the quote, I could ask the model where the remembered passage came from and get back a higher-quality response.


Eh, but fine tuning is a very awkward tool to solve those knowledge problems imo.

Author style, maybe, I guess.


Illicit fan fiction. Whether it's image or text models.

It's ALWAYS illicit fan fiction.


Consensual sex between Gilligan and the Professor is not a crime.


I mean I’ve seen the expressive niches on image models of civitai, but do you really need custom fine tuned LLMs for text fanfiction?

Like sure, you need something that is not the friendly question answerer; but do you really need such a broad population as in images to suit your needs? I’m guessing no?


If I wanted to use this software to finetune a 70b model on two 3090s to write fiction, what is the maximum sequence length that would be practical? I'm at the dataset collection stage, but I'm not sure whether to aim for bigger or smaller sequence lengths at the moment.


I wonder whether LoRAs could be useful for U-Net training. Especially thinking of CNN-based U-Net models with pre-trained encoders (but randomly initialized decoders). At least, it seems possible that normal weight updates on the decoder and LoRA training on the encoder could improve efficiency.


Diffusion unet has an "extended" version nowadays that applies to the resnet part as well as the cross-attention: https://github.com/cloneofsimo/lora


Is there any framework/system that distributes the work across multiple GPUs on different computers over a network (LAN or WAN)? I'm not concerned much about latency or generation time, but would love to train or load up huge models and send jobs to run overnight.


Yes, you can run FSDP/QLoRA over multi-node. There's a slurm script in the repo showing how to do it.


OK this is going to come out as moronic because I don't have the proper vocab to phrase it:

--

Is it possible to 'array' tokenized workloads across providers of GPU?

I want to farm-out my 'compute' across [things]

More importantly can there be a marketplace for GPU resources that I can effectively point my local job at?


Yes you can use multiple computers in parallel with this script. And there's lots of GPU providers out there (the article links to an aggregator of providers and prices).


> home

> two 24GB GPUs.

geez


Nice, I've been hoping this would be possible for a while. I'll have to do a new fine-tune of Inkbot on top of one of the 70b models.

What are the max context lengths / batch sizes you can train at with this method for 2x24gb? What about 4x24gb?


This is great.

I don't think local will be competitive in future IN GENERAL...but if I have a specific use case and I have a specific training dataset...local with specific training will murder the big commercial models.


Question - can I use this to retrain an LLM's (70B) weights on my own data? I am using RAG as of now for asking questions about my text, but I always wonder if I could retrain an LLM on my own text. Thoughts?


Fine tuning is generally not the best way to teach an LLM new knowledge. RAG is still more appropriate. Fine tuning is generally more effective for controlling the format of the responses but it's not going to teach the model a lot of new concepts. The model can learn how to handle new vocabulary through fine tuning but it's not a great way to teach the model new facts. Giving it access to a knowledge base is a better way to do that.


Do the two 4090 GPUs need to be on the same machine, or is it possible to somehow use two separate machines, each with its own 4090, and link them somehow via e.g. InfiniBand?


Thank you for the repo and write up. What tools (if any) did you use for performance tuning once you achieved the main goal of being able to finetune the model?


If they can continuously train it, it could be better than a large context, as this is how an AI OS would need to work when you have constant updates to your files.


I don’t think you’d be fine-tuning a whole model in such cases. That seems over the top, no? I assume you’d get sufficiently far with big context windows, vector search, RAG. Etc.


It's an interesting question. I'm not sure we really know yet what the right mix of RAG and fine tuning is. IMO small-scale fine tuning might be under-appreciated.


Congratulations, fantastic contribution to open source AI. Why does the website headline say "train" instead of "finetune"?


NVlink + two 3090 gives 48GB relatively easy (appears as unified memory). I only skimmed the article briefly, was it considered?


NVlink does not appear as unified memory. But NVlink + two 3090 does work great with FSDP/QLoRA (I have just such a machine at home!)


Imagine the potential of a Folding@home-inspired project for AI development. What kind of powerful model could a community of gamers and GPU owners create?


Wouldn't bandwidth be a big problem? And I think models like llama already require (ten)thousands of gpus to train in a reasonable timeframe


Is it worth it though when the base model isn't even smart enough?


Does this support multimodal language models (E.g.: LLaVA)?


Can't believe they didn't name this Qolor


This is brilliant. Thank you for doing this!


4x 3080???


If you're buying new, 3x 4070 Ti Super 16GB might be better than 4x 3080 12GB:

    4070 super ti 16GB, fp32 44 tflops, Mem 672 GB/s
    3080 12gb,          fp32 30 tflops, Mem 760 GB/s
    3080 Ti 12GB        fp32 34 tflops, Mem 912 GB/s
For 50% more GDDR, a 4070 Ti Super costs less than 50% more than a base 3080. There's a tradeoff between FLOPS and memory bandwidth though, but using 3 cards instead of 4 should make up for that.

https://www.techpowerup.com/gpu-specs/geforce-rtx-3080-12-gb...

https://www.techpowerup.com/gpu-specs/geforce-rtx-3080-ti.c3...

https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti-su...


That should work great, although we haven't tried it. Let us know what you find!


[deleted]


Ah yes, 24gb top-of-the-line GPUs, I happen to have a whole warehouse full.

/s


Fair -- 24GB GPUs are still quite expensive. >10x cheaper than 80GB GPUs though!

We already have something we're testing that's gotten 70b training down to dual 16GB GPUs, and we're working it making it even smaller...


It would be great if we were a bit more respectful towards our natural resources. Using so much energy to play with language models is a waste of resources.


It would be great if I had a bathtub full of ice cream as well, and if we all lived in a world overflowing with love, respect and joy for all living things. Until then, I'm happy that these kinds of incredible tools are (and increasingly will be) in more of our hands for close to free. Upwards and onwards!


Seems like with every passing year, we are going downwards, not upwards. Perhaps it only seems the other way around to those with the greatest addictions to technology, who will justify any development to satisfy their cravings.


Well, I for one am happy that less compute is being wasted on blockchain, and if the total BTUs and tonnes of CO2 remain equal while the proportion allocated to AI goes up, that'll also be a good thing. Doing useful stuff, and becoming more efficient (eliminating high carbon wasteful human activities and replacing with AI compute using less overall carbon), is also a net win.


What is a little contradictory is that designing a system to use fewer resources can increase the number of people fine-tuning models, so that the final result can be a net global increase in total energy use. A hypothetical goal could be to reuse fine-tuning, that is, designing a knowledge graph in which you fine-tune from a previously fine-tuned model (like dynamic programming: save the results of previous computations). LoRA allows us to store the small matrices at low cost.


They are using gaming GPUs. If you want to complain about a waste of natural resources there seems to be a lot of people playing games...


Well, they serve the same function. Modern consumerist society removes almost all real autonomy from people and makes them do fairly meaningless tasks (most jobs). So, it's rather expected that we need to seek out greater and greater amusements (gaming and playing with ridiculous models) so we're kept in a somewhat happy state, instead of realizing the trick of the system that will one day come crashing down due to its unsustainability.


Video games are like the opposite of waste. This planet can go to hell if I can't enjoy art.


There are some people that consider all forms of entertainment that don't create some form of extrinsic value to be a waste of time, energy, and materials.

I feel bad for them. They're going to be the ones that lay in their death bed thinking "I wish I had allowed myself to have more fun".


> Using so much energy to play with language models is a waste of resources.

Why do you get to decide what's wasteful and what's useful?

We have more energy than we could ever hope to use from the sun and from nuclear. The solution isn't telling people they're wasting precious energy that you would put to better use. Just build more nuclear reactors and more solar.


Why do you get to decide which habitats die and which live by using all this modern technology that relies on mining and kills them?


I mean, this is a fair point but right now you're not talking to a libertarian who believes the technology inevitably governs itself, to the destruction of all around it.

You're talking to more of a civilization-type who believes you have to use regulation and, if necessary, state violence to stop types of mining that kill habitats, because the technology certainly isn't going to up and decide to do that. It's the job of society to figure this stuff out, arrive at such positions and defend them. There are plenty of good reasons to defend the protection of habitats, even for purely self-interested pragmatic reasons.


Also, whoever is married, please for the love of god merge your facebook accounts. It takes up too much space on the internet.


Beats crypto in my opinion. I feel like there's a sliding scale for this stuff, and playing with language models is genuinely interesting, though it's harder to predict what the real benefits will be.

I feel certain there will be great benefit, but not in the way AI hypesters expect there to be.


The point in the article is to use less resources. So, yes?


People think that by making a system use fewer resources, the entire use of it on a societal level will be reduced. Unfortunately, we must watch out for more efficiency making more people use it, potentially increasing the absolute quantity of energy being used.


Energy usage is good, actually. Energy scarcity and dirty energy are bad. But energy usage is good.

We should strive to use and produce orders of magnitude more (clean) energy.


I hate this more recent doomer logic. Who is the arbiter of deciding what's a waste and what's not? Why use such a basic and uncompelling narrative of telling others how they should live their lives? I try to be thoughtful about my purchases and conserve my own resources, and I'm happy to talk about it, but telling people that "doing x is a waste of resources" is a fool's errand. Humanity has always progressed as a collective group; it won't slow down now even though some individuals might drop out of it. I don't know what the future holds, but collectively we will continue to progress, and I see the bright side of all the recent momentum in renewable energy and the focus on making systems more efficient.

Not the first time this character has popped up here on HN.

"I write full-time and try to convince people of the danger of AI and advanced technology."


You are probably right...it may not have been the best approach. Well, I get emotional sometimes about the rapid advancement of technology and I do think it is a mistake of humanity to do so.


Running a powerful GPU at full load using coal-generated energy causes two orders of magnitude less emissions than flying on an airliner (per passenger). If you have ever flown anywhere in your life, I don't think you can climb on that high horse.


I started reading your blog linked from your profile. I was disappointed that you used so much energy to play with language by writing up thoughts and hosting them on the internet. Seems like a waste of resources. Why do you think you have the right to burn all that carbon just so you can share your thoughts with the world? Why not just save electricity and take a walk through nature? That's how I think you should use your resources. I think I know better than you about how you should use energy. I think you should have to follow my ideology about energy use.


You are right in a way. I hope to one day give up the internet completely...


Careful, he'll have you disappear in a puff of logic. And then get killed at the next zebra crossing. Do you want that on your conscience? :)




