
You don't need to train a ChatGPT-sized LLM; a toy nanoGPT would have been enough. You can train one of those on a consumer GPU in an afternoon.

And yes, I do disregard his research effort. There are hundreds of well-justified and well-researched "clever tricks" for improving Transformers, and almost none of them work. I'll believe it when I see the results.




Outliers only begin to appear around 3B parameters (as per the original LLM.int8 paper), so unfortunately this isn't consumer-GPU-in-an-afternoon territory if you want to prove you've managed to suppress them.
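If anyone does want to check this empirically at whatever scale they can afford, here's a rough sketch (mine, not from the paper) of how you might count outlier features with forward hooks. The 6.0 threshold is the magnitude LLM.int8 uses to define an outlier feature; everything else (which layers to hook, what batch to feed in) is a placeholder to adapt to whatever model you train.

    # Sketch: count hidden dimensions whose activations ever exceed the
    # LLM.int8 "outlier" threshold (|x| > 6.0), using forward hooks on the
    # inputs to Linear layers. Model and batch are placeholders.
    import torch
    import torch.nn as nn

    def outlier_dim_fraction(model, batch, threshold=6.0):
        max_abs = {}  # layer name -> per-hidden-dim max |activation| seen

        def make_hook(name):
            def hook(module, inputs, output):
                acts = inputs[0].detach().float()        # (batch, seq, hidden)
                per_dim = acts.abs().reshape(-1, acts.shape[-1]).max(dim=0).values
                prev = max_abs.get(name)
                max_abs[name] = per_dim if prev is None else torch.maximum(prev, per_dim)
            return hook

        handles = [m.register_forward_hook(make_hook(n))
                   for n, m in model.named_modules() if isinstance(m, nn.Linear)]
        with torch.no_grad():
            model(batch)
        for h in handles:
            h.remove()

        # Fraction of input dims per Linear layer that ever crossed the threshold
        return {name: (vals > threshold).float().mean().item()
                for name, vals in max_abs.items()}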


I tried to test this with nanoGPT in an afternoon, since the code change is pretty minimal. It's hard to get conclusive results at that scale, though: to say anything with confidence you'd need to run multiple tests, figure out whether the 'outliers' mentioned only appear above a certain scale, and find good tests for quantization performance that work on models small enough to iterate on quickly... It's doable but still a lot of work, enough that putting the idea out there and hoping others with more time and compute will try it seems a valid strategy to me :) More generally, though, I definitely agree that the trend among 'improvements' to transformers has been things that don't turn out to work in practice.


Google has apparently used it in flaxformer since 2021.


Do you know of handy testing steps? I suppose I could ask ChatGPT, but if someone has a validated "here, this is how you do it", I have a 3090 I can run it on; I'm just not keen to debug anything here.


Testing steps (based on thinking about this for 30 seconds, so they can probably be improved):

Train a Transformer-based model with and without the modified softmax (suggestions: GPT-2 or nanoGPT); see the sketch after this list for what the change might look like.

Measure performance. I'd probably start with perplexity and see whether there is any difference (we'd expect little).

Quantize both models with different quantization strategies.

Measure the perplexity of the quantized models at the different quantization settings. If this is working, we'd expect performance to drop off more quickly for the unmodified model than for the modified one.
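For step 1, assuming the "modified softmax" under discussion is the extra-1-in-the-denominator variant (softmax1), the change in a nanoGPT-style attention block really is small. A minimal sketch, with the caveat that the attention lines in the comment below are only roughly what nanoGPT's non-flash path looks like:

    import torch
    import torch.nn.functional as F

    def softmax_one(x):
        # "Softmax plus one": exp(x_i) / (1 + sum_j exp(x_j)) over the last dim.
        # Implemented by appending a zero logit and dropping it afterwards,
        # which is mathematically identical and numerically stable.
        zeros = torch.zeros_like(x[..., :1])
        return F.softmax(torch.cat([x, zeros], dim=-1), dim=-1)[..., :-1]

    # In nanoGPT's CausalSelfAttention.forward (the manual, non-flash path)
    # the swap would be roughly:
    #
    #   att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
    #   att = softmax_one(att)          # instead of F.softmax(att, dim=-1)
    #   y = att @ v

Note that with flash attention enabled you can't just drop this in, since the softmax lives inside the fused kernel.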


I was thinking about a different problem as I was typing that and got some mental memory-aliasing bug: what I actually wanted was a set of steps for training a model. My apologies.

In any case, that was an lmgtfy-level question. Here's what I found: https://til.simonwillison.net/llms/training-nanogpt-on-my-bl...

I shall try that soon.


Shaaaaameless plug:

I did a write-up like this (not as nicely as Simon, though) where I used modal.com (cloud GPUs, containers, quick starts, $30/month of free spend) for the GPUs (e.g. T4, A100).

https://martincapodici.com/2023/07/15/no-local-gpu-no-proble...

A T4 was, I think, good enough for the job; not much need for the A100.

Since that post I've been working on an easy way to do this with a script called lob.py, which requires no code changes to the nanoGPT repo (or whatever repo you are using) and runs on modal.com. The script exists but keeps getting refined as I use it. Once it's a bit more battle-tested I'll do a post.

(It's named lob.py because it "lobs the code over to the server", where "lob" is UK slang for throw.)

Watch this space.


Thank you. FWIW I often find a write-up plus a script more useful than a script alone, because I usually want to modify things. E.g. I want to run GPU-only, and a script that only gets me part of the way there becomes adaptable once there's a textual description alongside it. So, much appreciated.


In the Qualcomm AI paper linked in this post it turns out they use a similar testing approach:

BERT 109M, testing perplexity

OPT 125M, testing perplexity

ViT 22M, testing on ImageNet top-1.
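If anyone wants to run that style of check without training anything, here's a rough sketch of a perplexity comparison on OPT-125M, using PyTorch's dynamic int8 quantization as a stand-in for whatever quantizer the paper actually uses. The dataset, sample count, and max length are arbitrary choices, and dynamic quantization only touches the Linear layers on CPU, so treat it as a smoke test rather than a reproduction:

    # Sketch: compare perplexity of OPT-125M before and after simple int8
    # dynamic quantization. Not the paper's setup; all choices here are arbitrary.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from datasets import load_dataset

    def perplexity(model, tokenizer, texts, max_length=512):
        model.eval()
        nll, n_tokens = 0.0, 0
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).input_ids
            if ids.shape[1] < 2:
                continue
            with torch.no_grad():
                out = model(ids, labels=ids)   # HF shifts labels internally
            nll += out.loss.item() * (ids.shape[1] - 1)
            n_tokens += ids.shape[1] - 1
        return math.exp(nll / n_tokens)

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1",
                                     split="test")["text"] if t.strip()][:200]

    print("fp32 ppl:", perplexity(model, tokenizer, texts))

    # Dynamic int8 quantization of the Linear layers (CPU only).
    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
    print("int8 ppl:", perplexity(qmodel, tokenizer, texts))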



