
You don't need to train a ChatGPT-sized LLM; a toy nanoGPT would have been enough. You can train one of those on a consumer GPU in an afternoon.

And yes, I do disregard his research effort. There are hundreds of well-justified and well-researched "clever tricks" for improving Transformers, and almost none of them work. I'll believe it when I see the results.




Outliers only begin to appear around 3B parameters (as per the original LLM.int8 paper), so unfortunately this isn't consumer-GPU-in-an-afternoon territory if you want to prove you've managed to suppress them.
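If anyone does want to check this empirically at whatever scale they can afford, here's a rough sketch (mine, not from the paper) of how you might count outlier features with forward hooks. The 6.0 threshold is the magnitude LLM.int8 uses to define an outlier feature; everything else (which layers to hook, what batch to feed in) is a placeholder to adapt to whatever model you train.

    # Sketch: count hidden dimensions whose activations ever exceed the
    # LLM.int8 "outlier" threshold (|x| > 6.0), using forward hooks on the
    # inputs to Linear layers. Model and batch are placeholders.
    import torch
    import torch.nn as nn

    def outlier_dim_fraction(model, batch, threshold=6.0):
        max_abs = {}  # layer name -> per-hidden-dim max |activation| seen

        def make_hook(name):
            def hook(module, inputs, output):
                acts = inputs[0].detach().float()        # (batch, seq, hidden)
                per_dim = acts.abs().reshape(-1, acts.shape[-1]).max(dim=0).values
                prev = max_abs.get(name)
                max_abs[name] = per_dim if prev is None else torch.maximum(prev, per_dim)
            return hook

        handles = [m.register_forward_hook(make_hook(n))
                   for n, m in model.named_modules() if isinstance(m, nn.Linear)]
        with torch.no_grad():
            model(batch)
        for h in handles:
            h.remove()

        # Fraction of input dims per Linear layer that ever crossed the threshold
        return {name: (vals > threshold).float().mean().item()
                for name, vals in max_abs.items()}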


I tried to test this with nanoGPT in an afternoon, since the code change is pretty minimal. It's hard to get conclusive results at that scale, though: to say anything with confidence you'd need to run multiple tests, figure out whether the 'outliers' mentioned only appear above a certain scale, and find good tests for quantization performance that work on models small enough to iterate on quickly... It's doable but still a lot of work, enough that putting the idea out there and hoping others with more time and compute will try it seems a valid strategy to me :) More generally, though, I definitely agree that the trend among 'improvements' to transformers has been things that don't turn out to work in practice.


Google has apparently used it in flaxformer since 2021.


Do you know of handy testing steps? I suppose I could ask ChatGPT, but if someone has a validated "here, this is how you do it", I have a 3090 I can run it on; I'm just not keen to debug anything here.


Testing steps (based on thinking about this for 30 seconds, so they can probably be improved):

Train a Transformer-based model with and without the modified softmax (suggestions: GPT-2 or nanoGPT); see the sketch after this list for what the change might look like.

Measure performance. I'd probably start with perplexity and see whether there is any difference (we'd expect little).

Quantize both models with different quantization strategies.

Measure the perplexity of the quantized models at the different quantization settings. If this is working, we'd expect performance to drop off more quickly for the unmodified model than for the modified one.
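For step 1, assuming the "modified softmax" under discussion is the extra-1-in-the-denominator variant (softmax1), the change in a nanoGPT-style attention block really is small. A minimal sketch, with the caveat that the attention lines in the comment below are only roughly what nanoGPT's non-flash path looks like:

    import torch
    import torch.nn.functional as F

    def softmax_one(x):
        # "Softmax plus one": exp(x_i) / (1 + sum_j exp(x_j)) over the last dim.
        # Implemented by appending a zero logit and dropping it afterwards,
        # which is mathematically identical and numerically stable.
        zeros = torch.zeros_like(x[..., :1])
        return F.softmax(torch.cat([x, zeros], dim=-1), dim=-1)[..., :-1]

    # In nanoGPT's CausalSelfAttention.forward (the manual, non-flash path)
    # the swap would be roughly:
    #
    #   att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
    #   att = softmax_one(att)          # instead of F.softmax(att, dim=-1)
    #   y = att @ v

Note that with flash attention enabled you can't just drop this in, since the softmax lives inside the fused kernel.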


I was thinking about a different problem as I was typing that and got some mental memory-aliasing bug: what I actually wanted was a set of steps for training a model. My apologies.

In any case, that was an lmgtfy-level question. Here's what I found: https://til.simonwillison.net/llms/training-nanogpt-on-my-bl...

I shall try that soon.


Shaaaaameless plug:

I did a write-up like this (not as nicely as Simon, though) where I used modal.com (cloud GPUs, containers, quick starts, $30/month of free spend) for the GPUs (e.g. T4, A100).

https://martincapodici.com/2023/07/15/no-local-gpu-no-proble...

A T4 was, I think, good enough for the job; not much need for the A100.

Since that post I've been working on an easy way to do this with a script called lob.py, which requires no code changes to the nanoGPT repo (or whatever repo you are using) and runs on modal.com. The script exists but keeps getting refined as I use it. Once it's a bit more battle-tested I'll do a post.

(It's named lob.py because it "lobs the code over to the server", where "lob" is UK slang for throw.)

Watch this space.


Thank you. FWIW I often find a write-up plus a script more useful than a script alone, because I usually want to modify things. E.g. I want to run GPU-only, and a script that only gets me part of the way there becomes adaptable once there's a textual description alongside it. So, much appreciated.


In the Qualcomm AI paper linked in this post it turns out they use a similar testing approach:

BERT 109M, testing perplexity

OPT 125M, testing perplexity

ViT 22M, testing on ImageNet top-1.
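If anyone wants to run that style of check without training anything, here's a rough sketch of a perplexity comparison on OPT-125M, using PyTorch's dynamic int8 quantization as a stand-in for whatever quantizer the paper actually uses. The dataset, sample count, and max length are arbitrary choices, and dynamic quantization only touches the Linear layers on CPU, so treat it as a smoke test rather than a reproduction:

    # Sketch: compare perplexity of OPT-125M before and after simple int8
    # dynamic quantization. Not the paper's setup; all choices here are arbitrary.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from datasets import load_dataset

    def perplexity(model, tokenizer, texts, max_length=512):
        model.eval()
        nll, n_tokens = 0.0, 0
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).input_ids
            if ids.shape[1] < 2:
                continue
            with torch.no_grad():
                out = model(ids, labels=ids)   # HF shifts labels internally
            nll += out.loss.item() * (ids.shape[1] - 1)
            n_tokens += ids.shape[1] - 1
        return math.exp(nll / n_tokens)

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1",
                                     split="test")["text"] if t.strip()][:200]

    print("fp32 ppl:", perplexity(model, tokenizer, texts))

    # Dynamic int8 quantization of the Linear layers (CPU only).
    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
    print("int8 ppl:", perplexity(qmodel, tokenizer, texts))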



