Towards 1-bit Machine Learning Models (mobiusml.github.io)
351 points by homarp 4 months ago | 157 comments



Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below.

> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

> Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)

those numbers would make sense if it was _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into lora? still unexplained) it'd be more like 3-bit.

From their HF page:

> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I also am amazed they got latency lower than quip if they pingpong to CPU.


Hello, I am the main author, would love to clarify a couple of things:

All the linear-quantization methods have meta-data, including the 1.58-bit paper. You can trade off quality vs. memory usage by changing the group-size. However, the meta-data is not the same thing as the quantized weights, for several reasons:

> The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.

> The meta-data size doesn't increase the actual compute: these are point-wise operations and even if you have 1 scalar you still need to multiply the same amount of weights.

> Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy one layer ahead and then synchronize, which makes it almost seamless (see the sketch after this list). I did some experiments with CUDA streams that worked very well on an older machine; then I tried a better machine and the transfer was much faster anyway. Obviously, if you are trying it on Google Colab it's very slow for this reason.

> Smaller models like Llama2-7B are very hard to directly quantize at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral), can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.

> There are many other ways to increase the group-size or even get rid of it altogether; many ideas are available, but they need lots of experimentation.

> Binary/ternary CUDA matmul kernels don't exist yet. The current code is implementing the dequantization step in CUDA but then uses torch.matmul as fp16. I tried doing matmul at low-bits with CUDA but it is very difficult to even beat cuBLAS with fp16, especially for a novice CUDA coder like me :)
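For the third point, here is a minimal sketch (not the actual HQQ code) of how a non-blocking meta-data copy can be overlapped with compute using pinned memory and a side stream; `metas` and `prefetch` are illustrative names, not part of the library:

  import torch

  # Toy sketch: per-layer meta-data sits in pinned CPU memory; the copy for
  # layer i+1 is issued while layer i is still computing.
  metas = [torch.randn(4096, 1).pin_memory() for _ in range(4)]  # e.g. per-group scales
  copy_stream = torch.cuda.Stream()

  def prefetch(i):
      # Non-blocking host-to-device copy, issued on a side stream.
      with torch.cuda.stream(copy_stream):
          return metas[i].to("cuda", non_blocking=True)

  gpu_meta = prefetch(0)
  for i in range(len(metas)):
      torch.cuda.current_stream().wait_stream(copy_stream)  # layer i's copy has finished
      current = gpu_meta
      gpu_meta = prefetch(i + 1) if i + 1 < len(metas) else None
      # ... dequantize layer i's binary weights with `current`, run the fp16 matmul ...
      # The copy for layer i+1 overlaps with this layer's compute.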

Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community first as we progress. There are still a lot of things to be done, and we are actively working on it despite the very limited resources we have.

Happy to answer any questions here!


Thanks for the reply. I’m quite familiar with subchannel quant, but still feel like my questions did not get addressed.

1. Could you post the full memory use of the methods? E.g. you include Quip# metadata in its GB but not HQQ metadata in its GB.

2. If you have to go to the CPU to shift and scale, how did you get latency lower than pure on-device? Was this bsz=1? No speculative decoding?

3. How can LoRA absorb the shifts while only increasing the rank by 1, if you have a shift per group?


Sure, I can give you detailed answers:

1- The answer is still ~1.7GB. You only need meta-data of a single nn.Linear at a time. There are 32x(4+3) = 224 layers quantized, so you need an additional (3GB - 1.7GB)/224 = 1.3GB/224 ~ 5.8MB, which is negligible.

2- As the table says, that's the forward pass (batch-size=1, context-size=1024). Forward pass means there's no caching and no decoding logic. The actual generation speed should be much faster with caching + decoding logic like speculative decoding, and using vLLM instead of HF. And even with all of that, a much larger model like Mixtral with the same group-size of 8 offloaded to the CPU works quite well on a 4090.

You mean why is it faster than Quip#, which is all on-device? Because dequantization with HQQ is a simple linear operation. HQQ isn't even using a fused kernel; only the dequantization part is done in CUDA.

3- LoRA absorbs -zero x shift, not the shift itself; the shift is still there, just as in the BitNet/1.58 work. As the paragraph explains, the math ignores reshaping to keep it simple and easy to read.

Let's say you have a matrix of 4096x4096, with grouping done channel-wise (but no reshaping), the -zero x shift part is a rank-1 matrix (4096x1 .dot 1x4096), the lora data will be (4096xr .dot rx4096), you can merge them exactly into (4096x(r+1) .dot (r+1)x4096).
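A quick numpy check of that merge (channel-wise grouping, no reshaping); `z`, `s`, `A`, `B` are made-up names for the zero-points, shifts, and LoRA factors:

  import numpy as np

  n, r = 4096, 8
  z = np.random.randn(n, 1)    # per-channel zero-points (column)
  s = np.random.randn(1, n)    # per-channel shifts (row)
  A = np.random.randn(n, r)    # LoRA factors
  B = np.random.randn(r, n)

  # -zero x shift is a rank-1 matrix, so stacking it next to the LoRA factors
  # gives a single rank-(r+1) factorization: exactly the "increase rank by 1" claim.
  A_merged = np.concatenate([A, -z], axis=1)   # n x (r+1)
  B_merged = np.concatenate([B,  s], axis=0)   # (r+1) x n
  assert np.allclose(A @ B + (-z) @ s, A_merged @ B_merged)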

The point of that paragraph is to show two things. First, compared to the BitNet formulation, the additional zero-point (which is necessary to get good quantization results on pre-trained models with minimal calibration) has a negligible overhead. More importantly, it explains how we even got the idea of adding low-rank adapters: it's not because LoRA is popular, it's because the zero-point alone results in a rank-1 error matrix, which is not enough to express the quantization error. As the rank tends to min(num_rows, num_cols), the error goes down. So if we increase the rank by r via low-rank adapters, we would expect better results.

Now, if we include the reshape with a group-size lower than num_rows, the -zero x shift part is a rank-n matrix (4096xn .dot nx4096). It's not possible to properly estimate the rank n, because that highly depends on the nature of the weight matrix, but in the end the LoRA part will be (4096x(n+r) .dot (n+r)x4096). We only use a LoRA rank of 8 for the MLPs, which are the larger matrices, so even if you double or even 4x that to, say, n+r=32, it's still just 1/128 = 0.78% of the original matrix.

Whether or not you merge -zero x scale with the low-rank adapters doesn't matter much; it depends on which fused kernel implementation performs best.


I see, so we’re still fetching the metadata to gpu, and rescaling on gpu, just on-demand and discarding metadata when we’re done with that layer?

Why not do the same optimization for layer weights themselves?


Yes, correct, and that fetching operation is a non-blocking operation, once we dequantize the weights we discard it before moving to the next layer.

Technically, you can do it for the weights as well. But that wouldn't work in many situations. For example, when training with FSDP: the quantized weights stay on the device but you can still offload the meta-data (https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html)

I would like to re-iterate that larger models, which would be more interesting to run at low bits, are much less sensitive to quantization compared to a 7B. So you could potentially use a larger group-size and just keep it on device, like what is done with 4-bit and 3-bit now using a group-size of 64. We just started running some experiments with a 13B Llama2 and it looks very good so far (outperforming some full-precision Llama2-13B-based models). Let's see how far we can push it; ideally, getting rid of the reshaping altogether would be great.


If you’re willing to pay for the latency cost of per layer cpu fetching/offloading, I don’t see what extreme quant buys you.

You could just do a layer-by-layer fetching scheme with 4 bit weights.

For training too, just fetch each layer twice per step as needed for fwd/bwd.

And all for hbm cost equal to one layer’s worth


The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.

You still have a group-size of 64 in 4-bit, fyi. And even if you keep the meta-data on-device, provided that the quality is high (which is the case for 2-bit, outperforming fp16 on certain tasks), that is a much better option compared to 4-bit even if the VRAM usage is the same.

Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.


> You still have a group-size of 64 in 4-bit fyi.

Results may vary :)

> Again, and I keep repeating this but it seems to be ignored every time: this is experimental work and it's still in progress. This story of small group-sizes on large models should not be an issue.

Apologies if something I said (or I guess did not say...) offended you! It's a hypothetical, and one that IME is not so easy to achieve, but maybe you have different results. So I didn't want to comment on this; maybe it's possible (but LLMs don't handle quantization as gracefully as other networks like image classifiers, in my experience).

> The extreme quant buys you potentially 70x more efficient matmul via binary/ternary operations.

To be clear, such hardware does not yet exist, and it's unclear if you really can have more efficient binary/ternary matmul if you need high-precision accumulators and more frequent broadcast shifts. It's again a complicated hardware question whether the total latency of doing many high-precision accumulations and many scales/shifts will be smaller than a 4-bit baseline (or even feasible to implement, chip-area-wise).


Thank you for your efforts on behalf of the GPU poor!

It's getting tougher to use older, cheaper GPUs (Pascal/Maxwell) with modern quantization schemes so anything you can do to keep kernels compatible with SM52 and SM61 would be greatly appreciated.


Thank you, very glad to hear that!


When one does quantization, it's done in blocks. Bitsandbytes uses a block size of 64, I think. W * scale + zero_point is needed for each group. So you need 2 fp16 numbers for each group of 64 numbers. For BnB, you get ~4.5 bits, since (64*4bit + 16bit + 16bit)/64 = 288/64 = 4.5. So 4-bit is actually 4.5-bit.

For HQQ 1-bit, a group size of 8 needs a scale and a shift (you mentioned a 16-bit scale and an 8-bit shift). So you need 8*1bit + 16bit + 8bit per group, i.e. 32 bits for each group of 8, or 4 bits per param.

I'm assuming the scale and zero_point are maybe both moved to 8-bit, so (8*1bit + 8bit + 8bit)/8 = 24/8 = 3 bits per param?
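For reference, a one-liner for the effective bits per parameter once the per-group meta-data is amortized (assuming the widths listed above):

  def effective_bits(weight_bits, group_size, scale_bits, zero_bits):
      # Average storage per weight, including amortized per-group meta-data.
      return weight_bits + (scale_bits + zero_bits) / group_size

  print(effective_bits(4, 64, 16, 16))  # BnB-style 4-bit, group 64 -> 4.5
  print(effective_bits(1, 8, 16, 8))    # 1-bit, group 8, fp16 scale + 8-bit zero -> 4.0
  print(effective_bits(1, 8, 8, 8))     # 1-bit, group 8, both 8-bit -> 3.0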

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!


> "This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?

* If no, then what's it for and when does it get used?


Oh yes, you need all the metadata, but because it's just 2 numbers (the scale and zero_point), I think moving individual scalars to the GPU registers is super fast - cannot confirm though.

It's like how in cuBLAS you do alpha*A*B + beta*C, and alpha and beta are both scalars which can be on the CPU and moved to the GPU in nanoseconds.

I'm unsure though since I haven't tested it


It still has to go through the entire memory system. It's hard for me to imagine that transferring a number from the CPU to the GPU is faster than transferring a byte, and if you have 2 CPU-resident numbers per GPU-resident byte that's a lot of transferring.


I don't disagree - fair point, there definitely is a transfer latency overhead. I would suspect one could prefetch it by calling `.to("cuda", non_blocking = True)` say 2 layers ahead, so you can in theory hide the movement.

I think somewhere the blog did mention HQQ for 1 bit is slower for now, maybe due to the transfer overhead, although I couldn't exactly remember where


My point is more that if it's that many bytes flowing around on demand, you're basically swapping layers in and out of the GPU as you use them (or x% of each layer).

Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen.

Is there actually something they gain from this particular method of splitting?


I guess it's mainly to reduce VRAM usage. Assuming we don't do this, then a 7b model with 1bit group size 8 will use 3GB or something of VRAM, whilst a 4bit group size of 64 will use 4GB approx.

Assume we have a 100b model - with 4bit, VRAM is around 57GB or so. With 1bit, 43GB VRAM, but by moving the scalars and zero point to RAM, VRAM use is like 15GB or something, at the cost of like 28GB RAM usage.

I guess maybe a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. Say you have a 40GB GPU, clearly move more of the scalars to GPU. But if you have a weak GPU, then you need to use more RAM.


I still don't think I'm getting my point across.

What if you store 40% of the data and 40% of the metadata in CPU memory. Is that the same for performance?

Is there a reason we want particular parts of the layer to stay on the GPU? Or is it just number of bytes.


Oh, very good question - tbh I'm not sure. A closely related technique is layer offloading: if your network can't fit and has layers 1, 2, ..., 32, we offload layers 16 to 32 to RAM, then load them into GPU memory on the fly.

I'm gonna guess the performance hit is similar - although I have not tried it myself to verify for benchmarking


Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair - oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - i know how subchannel quantization works. Are they hitting those reported latency numbers with per layer cpu pingpong to rescale?


Oh sorry - did not expect to restate what you said whoops - all train of thought!

You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model disk space is 3GB + 100MB of LoRA weights. I also uploaded a 4bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.

So because of the offloading trick, the GPU VRAM is less, but in actuality, still 3GB is needed.

Unsure on latency sadly


Hey Daniel! The VRAM is still the same as a pure n-bit model would take. Because we only need meta-data for a single nn.Linear at a time, you only need an additional (3GB-1.7GB)/224 = 5.8MB. If we compress the meta-data as well that would become much lower.


Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!


I believe the future is 1 bit models - for both training and inference.

When people make custom silicon for 1 bit models, they'll find that it is sooooo much more power and silicon-space efficient to do 1 bit math than 16 bit floating point - like 100x or more.

That extra model size will vastly overshadow any worse performance of the models.


I believe the future is 4*4 bit look up tables with output latches, with a bit to/from each Cartesian neighbor. Clock them like the colors of a chessboard, in 2 phases, and you don't have to worry about timing dependencies.

All of your code gets converted to a directed acyclic graph(DAG), executing at Ghz rates if you want.

Imagine a machine that can output a million parallel GPT-4 streams at 1000 tokens per second each.

If the architecture changes it's just a different DAG. Unlike with FPGAs and their custom blocks that have to be optimally used, you can compile a DAG almost instantly.


1. If you write FPGA code as a grid of lookup tables then I would expect it to be easy to compile instantly.

2. In what way is this "acyclic"?

3. Won't putting your code into this form be the hard part? Even if you start with a DAG, 99.99% of them won't fit this form without intense restructuring. So you just moved the hard step over by one.


Is this something from a research finding or is it your idea?


It's my idea... I've been writing about it for over a decade. I hope to get a small version of it made via tiny tapeout later this year once I'm out of the precariat.

The idea is that you can decompose any non-cyclic code (for example, a matrix multiply, FFT, etc.) into a directed graph of bitwise operations, then map those operations onto the grid of LUTs. Pipe inputs in one side of the grid and get the outputs out the other side, and all of the bitwise operations happen in lock-step parallelism (unlike an FPGA, which is asynchronous).

If you can decompose one stage of an LLM down into a bitwise graph of computation, you can easily map it into the grid. If there's a bad cell in the grid, you can map around it.

Because all of the high speed logic lines are short (to the neighbors) you don't have the long lines / high capacitance issues driving signals all the way across an FPGA or ASIC, thus you can really crank up the clock rate (and/or it's very power efficient).

It should be trivial, design wise, to scale from a 16x16 grid up through chips that could crank out Petaflops.
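For the curious, a toy Python sketch of the two-phase chessboard update (heavily simplified: here each cell reads just one bit from each of its four neighbours and looks up a 4-bit latched output):

  import random

  W, H = 8, 8
  luts = [[[random.randrange(16) for _ in range(16)] for _ in range(W)] for _ in range(H)]
  latch = [[0] * W for _ in range(H)]   # each cell latches a 4-bit output

  def inputs(x, y):
      # One bit from each Cartesian neighbour (N, E, S, W), wrapping at the edges.
      n = latch[(y - 1) % H][x] & 1
      e = latch[y][(x + 1) % W] & 1
      s = latch[(y + 1) % H][x] & 1
      w = latch[y][(x - 1) % W] & 1
      return (n << 3) | (e << 2) | (s << 1) | w

  def step(phase):
      # Only cells of one chessboard colour update per phase, so every cell's
      # inputs are stable while it switches: no timing dependencies.
      for y in range(H):
          for x in range(W):
              if (x + y) % 2 == phase:
                  latch[y][x] = luts[y][x][inputs(x, y)]

  for _ in range(10):
      step(0); step(1)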

Here's a simulator for the thing, written in Pascal.

https://github.com/mikewarot/Bitgrid


You can simulate this on a GPU now. This is like running a CA on a GPU with slightly different rules.

I think you would also be delighted to know that many of the non-GPU ai accelerators are in a mesh topology.

Cerebras, Tenstorrent, Esperanto

https://www.esperanto.ai/wp-content/uploads/2021/08/HC2021.E...


I've been intrigued by this approach. A more highly optimized (but harder to program) take is the GA144 [1] from Chuck Moore, the inventor of Forth: a Cartesian grid of 144 F18 Forth-based processors. These processors are far more limited, but then again they take far less power as well.

[1] https://www.greenarraychips.com/


That is a pretty cool chip. Chuck Moore is a fascinating character.

You would probably get a kick out of David Ackley's T2 Tile Project.

https://t2tile.com/

https://www.youtube.com/@T2TileProject/videos

https://www.youtube.com/@DaveAckley

https://en.wikipedia.org/wiki/Ackley_function

https://www.cs.unm.edu/~ackley/

Fun fact, Tenstorrent wanted to add instructions to enqueue data between processors connected in a mesh and Arm said no (they don't do architectural licenses anymore), so Tenstorrent used RISCV.

Implementing something like a GA144 but using small RISC-V processors would be an interesting hack. https://www.youtube.com/watch?v=6QRKpd28NEE


Probably more than 100x for inference. Not only are you drastically reducing the number of bits and replacing float math with integer math, you can do matrix multiplication with only addition (as pointed out in the BitNet b1.58 paper). Additions require a lot less hardware to implement than multiplication. Adding one-bit or two-bit numbers requires barely any hardware at all. A traditional two-bit adder without carry bit is three xor gates and an and gate.
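For instance, a gate-level sketch of that adder (two 2-bit inputs, final carry dropped):

  def add2(a0, a1, b0, b1):
      # (a1 a0) + (b1 b0) mod 4: three XOR gates and one AND gate.
      s0 = a0 ^ b0        # XOR
      c0 = a0 & b0        # AND
      s1 = a1 ^ b1 ^ c0   # two more XORs
      return s0, s1

  assert add2(1, 0, 1, 1) == (0, 0)   # 1 + 3 = 4 -> 0 (mod 4)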


To me the most exciting thing is that if it is training that is sped up on the order of 100x-1000x, a large cluster may be well suited to gradient-descent-style hyperparameter tuning, by running LLM training again and again at scale -- this is the first foot in the door towards an AI that may iteratively improve itself.


LoRA training should benefit from the same speed-up, because the 1-bit weights will be frozen and all you need for both the forward and backward pass is a binary matmul, then maybe cast after to get more stable gradients.


For training how do you get any kind of meaningful derivative with it?


You don't (you have to use real-valued inertial 'latent weights' during training): https://arxiv.org/abs/1906.02107

There is still a reduction in memory usage though, just not 24x:

> "Furthermore, Bop reduces the memory requirements during training: it requires only one real-valued variable per weight, while the latent-variable approach with Momentum and Adam require two and three respectively."


Maybe evolutionary algorithms instead? Hasn't proven super useful historically, but maybe at the scale of enormous LLMs it will be?


Nope, they're orders of magnitude less efficient because they don't leverage gradient descent.

Rule of thumb in optimization: real numbers are easy, integers are hard


This may be the status quo because of the so called "hardware lottery" which has historically been optimized for floating point. I'm speculating, but if hardware designers were instead only concerned about raw xnor density and throughput, we might end up with chips powerful enough that giant 1-bit nets could be trained purely through evolution.


No, it's a fact at the mathematical level that you can enshrine in big O terms if you want to


How do you optimize memory for floating point?


BF8 and other similar formats?


Evolutionary algorithms made you, didn’t they?


That does not prove that they can beat gradient descent.


It took a lot of human brain flops to get to this point in time though, I wonder how many orders of magnitude more than it took to train ChatGPT...


Gradient-directed evolutionary algorithm sounds kinda interesting.


Maybe something probabilistic?


The OP explicitly excludes training.


The one I replied to said 1-bit for both training and inference.


Doesn’t training need higher precision to avoid getting stuck at local minima, at least with back propagation style learning?

Maybe something alternate like evolutionary algorithms could work in a domain like this, but so far those have proven to be less efficient.


A recent paper used a single ternary "trit" ~1.5 bits per parameter for training. https://news.ycombinator.com/item?id=39535800 They said it would be difficult to apply this technique to pre-trained models and had to be trained in 1-trit from scratch.


But it's using larger weights during training, not just the trits.


Isn’t 1bit too low for optimal radix economy (Euler’s number) though?

I want to see somebody try “imbalanced quaternary” -,0,+,2


Haven't heard this argument before. But from the Wikipedia article it seems base 3 has the best asymptotic radix economy, though it isn't much better than base 2, and base 2 is seemingly easier to program and optimize.

Since this is a new argument I've not heard, would be curious if you had links or some more explanation.


people are indeed working on -1,0,1,2 Q2 models, I read something about it the other day but don't recall the title.

they mentioned decomposition of Q2 into ternary+binary i.e. [[1,2],[-1,0]] -> [[1,1],[0,0]] + [[0,1],[-1,0]]


I bet the optimal "large" value is bigger than 2.


Radix economy is all about which base is the most efficient to represent a given number. It is simple to show that, for large numbers, this is equivalent to how efficient a base can represent itself, b/ln(b). Simple calculus shows this is minimized at e (Euler's number) or 3 if integer (closely followed by 2).
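The calculus in question, spelled out:

  E(b) \propto \frac{b}{\ln b}, \qquad
  \frac{dE}{db} \propto \frac{\ln b - 1}{(\ln b)^2} = 0
  \;\Rightarrow\; \ln b = 1 \;\Rightarrow\; b = e \approx 2.718,
  \qquad \frac{3}{\ln 3} \approx 2.731 < \frac{2}{\ln 2} \approx 2.885

so base 3 edges out base 2 by only about 5%.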

It sounds like you have something to add but you are already dictating the base by saying "bit". Literally from "binary digit". Anyway, quantization is not about which number system is best - virtually all computer systems we use today represent numbers in base 2. Quantization, at its core, is lossy compression. How do you go from a large model trained to high precision to a smaller model without hindering performance? This can be studied without needing to know the base.

Suppose you are using a decimal computer. You can ask yourself: I have 128-decimal-precision numbers, do I need that much precision? What happens if I just use 1-decimal precision by chopping off the 127 digits after the first decimal? You realize that there are two parts of an operation: the numbers involved (the operands) and the operation itself. You then ask yourself, if I keep one of the operands fixed (the original input), can I represent my 128-decimal-precision neural network simply as a series of operations without the other operand? Perhaps only the most basic ones? Like: no-ops (add 0 or multiply by 1), increments (add 1), decrements (subtract 1), negations (multiply by -1), and clears (multiply by 0)? You count those numbers (-1, 0, and 1). There are 3, so you proudly proclaim you've made a neural network that only uses 0.477 dits. People get excited and confused because that is less than 1 dit, which seems like a fundamental point. You further surprise the scientific field by finding a clever trick for getting rid of negations. You beat your previous record and now you only need 0.301 dits to represent your network. You are about to accept your Turing award when the ghost of Claude Shannon appears and says "Why are you using a unit that measures entropy to mean how many symbols you have? If you insist, at least realize 0.301 dits is 1 bit." You are shocked when you realize 10^0.301 = 2^1. Reviewing Shannon's seminal paper [1], you are awestruck by his prescient comment: "Change from the base a to base b merely requires multiplication by log_b(a)." You humbly give your award to Shannon. You keep the $1M, since ghosts aren't as fast as a new NVidia DGX. No matter how quantized the ghost is.

[1] - https://people.math.harvard.edu/~ctm/home/text/others/shanno...


The ternary (1.58-bit) models only compress the weights. You still add and subtract using bfloat16 for better accuracy. Dedicated silicon is mostly a waste, because you are only processing two operations per parameter during inference. Loading the parameters from slow DDR, GDDR or HBM memory is the bottleneck in practice and the only solution is PIM. I was honestly disappointed by Nvidia's Blackwell since it is just barely competitive with GDDR PIM.


At least you can copy 16 times more data to the shared memory with binary weights.


> I believe the future is 1 bit models - for both training and inference.

1 bit's nothin'. The future is training directly on electronic/photonic circuits.


This! Maybe just integer, but not floating point. That's a ridiculous way to do computation when you don't really need the precision.


> That extra model size will vastly overshadow any worse performance of the models.

...What?


I think OP was referring to parameter size. You can make up for quantization by having more parameters.


I know. I was being a bit mean, but they seem to be implying that more parameters even with worse performance is preferable. That seems backward to me. There is no reason for parameter size to be intrinsically valuable, and I think people are a bit infatuated with it.


It seems the trick here is they first quantize it to 1- or 2-bit, and then they fine-tune the quantization bias parameters (the parameters that dequantize from 1- or 2-bit back to 16-bit) via LoRA. Then they have specialized kernels to do matrix multiplication at the bit level.

Also, the 2-bit model seems much better than the 1-bit model - they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of the 1.58b paper (which claims -1 is def. needed).


Interesting, and it kinda makes sense. You quantize, which invariably means you lose some precision, but then you can finetune post-quantization to recover at least some of it. Neat idea.


Which is itself a little counterintuitive as the arxiv papers they cite say models need to be pretrained from the ground up using 1- or 2-bit (or 1.58bit). It definitely adds some interesting data points for the open source community who are experimenting in every possible direction.


1-bit weights have been a thing since at least 2016:

https://arxiv.org/abs/1606.06160


XNOR-Net was the "first" in that generation, as I recall. It's not a new idea at all though.

Check out the dates on these papers -

https://ieeexplore.ieee.org/abstract/document/286901 < 1994

https://ieeexplore.ieee.org/abstract/document/344783 < 1993

https://link.springer.com/article/10.1007/BF00337115 < 1986


First version of BNN was submitted to ArXiv a month before XNOR-Net: https://arxiv.org/abs/1602.02830


Oh nice, good catch!


We're just speed-running the NN pendulum, at this point.



I don't think that paper's methods could be applied to LLMs


Don't forget Michael

https://arxiv.org/abs/1511.00363 < 2015


In both the '1-bit Model' and '2-bit Model' tables, the forward time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s, whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ (1-bit) / HQQ+ (1-bit) / Quip# (2-bit) meaning the FP16 model has ~3x lower inference time.

On the contrary, in the BitNet b1.58 paper [0] the authors report their 7B model has 2.9x lower inference latency.

It's not clear to me what's happening here. Can someone explain why the 1/2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf


GPUs aren't really designed for 1-bit math... they don't run it much faster than floating-point math.

Whereas a custom ASIC or updated design of GPU could give massive speedups with 1 bit math.


Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math. Pulling 1 or 2 bits out of a word isn't all that straightforward on CPU or GPU - lots of shifting and masking. I wonder how long it's going to be before we see custom hardware for bitnets? I suspect we'll see it on FPGAs first.


For 1 bit math, at least it should be possible to populate every other bit of an integer type, right? Surely one could do better with a dedicated type for this, but at least we could pack 16 single-bit weights into a 32 bit int for addition, right?
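A hedged sketch of that packing idea: put each 1-bit weight in every other bit position, so each 2-bit lane can accumulate a few additions before a carry could spill into its neighbour (plain Python ints standing in for a uint32):

  def pack16(bits):
      # 16 one-bit values go into the even bit positions; odd bits are headroom.
      word = 0
      for i, b in enumerate(bits):
          word |= (b & 1) << (2 * i)
      return word

  a = pack16([1, 0, 1, 1] + [0] * 12)
  b = pack16([1, 1, 0, 1] + [0] * 12)
  c = pack16([0, 1, 1, 1] + [0] * 12)
  total = a + b + c                                       # lane sums stay <= 3, so no carry-out
  print([(total >> (2 * i)) & 0b11 for i in range(4)])    # [2, 2, 2, 3]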


You're telling me GPUs aren't designed for additions and subtractions? Where did you hear that?


I think they are moreso saying that GPUs are not optimized for those operations. CPUs aren't "designed" for matrix multiplies, yet we can still run them, albeit at a slower rate than on a GPU.


A100 (> 5yo GPU) has a 1-bit tensor core engine


Real world GPU performance is hugely influenced by hand optimization of the CUDA kernels.


Sounds like these guys didn't use custom kernels, but BitNet did.


That's correct. Only the dequantization is done in CUDA; the matmul is done with PyTorch. If they open-source their kernels, we could re-use them!


Reduction to decision tree!

But I'm unclear how it actually runs: the article talks about the conversion and training but doesn't describe inference... I suppose because it's obvious to someone who has followed quantization.

Thinking out loud... if you have a model of just 1 and 0, my first thought is that the outputs are 1's and 0's but I think that's wrong. Instead it's a bunch of floats, and you multiply them by 1 or 0 (in a sense you are sampling the output of the previous layer?), add them up, and put the result through some activation function. And two-bit quantization sounds kind of similar, just with a _little_ scale, going from -1 to 2 instead of 0 to 1.

It seems kind of interesting that you now have a bunch of weights that are exactly 0, meaning you can assert something about what parameters and weights affect what parts of the output. Though in some sense the way they compress the weights down to one bit is also a heuristic you could use to interpret the original model... this just makes it clearer that in totality you are making a defensible simplification, because the end result is still a working model.

It also seems like you could make a lot of mathematical assertions about a one bit model that would be harder to make otherwise. Like maybe you could start thinking of a model as an equation, a _particular_ (though giant) equation, and look at its properties and consider symbolic transformations to that equation.


A comment I really liked on a previous post about ternary highlighted that what you are actually measuring with {-1, 0, 1} is inverse correlation, no correlation, and correlation.


I like the decision tree analogy


Does anyone know if this works on vanilla deep networks? These quantization articles always seem to target LLMs, which leads me to wonder if there's something special about the LLM architecture vs a vanilla deep architecture.


Transformer LLMs are just a bunch of MLPs (linear layers) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization but those are a tiny minority of the parameters)


So there's no performance gain for quantization enabled by the transformer architecture? It seems very strange that quantization works so well since in most of my experiments, the internal model weights of mlps look random.


Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?


Not my area of expertise but I'd assume it becomes a decision tree or something.

Edit: lol https://news.ycombinator.com/item?id=39868508


LLMs have been trending towards obscenely large number of parameters (314B for grok), which makes quantization crucial if you want to run them without a Meta-sized budget.


Certainly does, people have been doing this in computer vision for years.


The most exciting part about ternary or binary weights is the inevitable hardware revolution for AI dedicated chips that's going to result from it.


Your hardware already supports addition and subtraction and the tensor cores of NVIDIA GPUs are already fast enough to keep up. The only benefit is reducing memory capacity and bandwidth requirements.


You mean we'll have hardware accelerated ternary instructions? https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

(Okay probably those are not ready to be used as NN weights if the activations are not binary too, but... the gap to what CPUs already can do is getting smaller.)


I look forward to the inevitable probabilistic sub-bit machine learning models :)


Sub 1-bit has been done at least as far back as 2016 for VGG style networks (my work).

I was able to get 0.68 "effective" bits.

The idea is that in each forward pass you add noise to each weight, independently drawn from a normal distribution, and when you calculate the SNR, it's sub-1-bit. This points to the idea that a stochastic memory element can be used.

https://arxiv.org/abs/1606.01981


If noise is desirable in this way, why not just switch back to analog computing, which comes with free noise?


My 0-bit model can predict if you have cancer with 99.5% accuracy. It doesn't even need input data! Don't ask about the false negative rate though.


My 0-bit, no-input model can predict if you have cancer with 99.5% accuracy and 0.5% false negative rate. Don't ask about whether the cancer cell(s) are benign.


My computation-free model can give you cancer with 100% certainty.


I assume this is meant to be a joke, but isn't this actually being worked on (minus the probabilistic portion)? https://arxiv.org/abs/2310.16795


I like how it goes from 13.5 GB to 1.76 GB and gets comparable results. There is definitely some underlying reason why this works so well that we are missing. Are the bits selecting different subspaces? Can we improve this process by using better normalization layers / gating units?


I wonder what information theory can tell us here.


The entropy of English is ~1 bit per letter (Measured in a funny way by Shannon - https://cs.stanford.edu/people/eroberts/courses/soco/project...)

In general, it's a bit funny how we build ML models. You take a bucket of matrices, fill them with random numbers, and then start to introduce "biases" through backpropagation. A model's loss can converge, yet most of the weights are still filled with random noise.

Binary weights are somewhat obvious in retrospect. Weights indicate the strength of an association between two neurons. Intuitively, most of the value is probably just that an association between two neurons exists, and whether it's positively or negatively associated.


Wow, fascinating read. 1999

Would a monkey who knew the n-gram frequencies of the letters in English where n is large be able to produce credible English text? Furthermore, does this monkey "know" English? If the N-gram monkey is behind one door and a human is behind the other, could a third-party observer tell which was the monkey? This question rings of the Turing test for artificial intelligence, to which there is no easy answer.

No easy answer indeed!


Down to one bit but that’s taking the 2.62 bits and then applying the redundancy factor.

What's cool is that the differentiable activation function is important (to avoid the linear, perceptron limitation), but the weight scaling can be so simple, at least for LLMs.

It makes me wonder whether the extra layers are effectively compensating; in other words, can the number of layers or hidden neurons be trimmed down if we then add more bits to each weight and still see equivalent effectiveness?


You can just delete whole layers, because the point of residual layers is to make the model learn the hyperparameter "layer count" automatically. Compare this to the absence of residual layers where the model must use all layers. Then you will have to get the layer count perfect, but this isn't possible, since each data point might benefit from a different layer count. The extra layers therefore exist primarily so the model becomes robust to poorly chosen hyper parameters. You still need a minimum amount of layers, but that isn't the problem.

https://arxiv.org/abs/2403.17887


I'm surprised that 1-bit works. Trinary (-1, 0, 1) makes sense, but if you only have 0 and 1, the asymmetry ought to be a problem. Or is multiplying an XOR, so 1x1 = 0?


It seems like they have learned floating point parameters x and y so that dequantized(bit 0) = x and dequantized(bit 1) = y. Thus there is no built in asymmetry. Or more precisely they learned a zero point and a scale but it's equivalent to this simpler model in the 1 bit case.

It still seems like there would be a problem because either [x, y] looks like [0ish, 1ish] and you can't have negative weights, or [x, y] looks like [-1ish, 1ish] and you can't have "don't care" weights. But if you have some redundancy in your neurons I guess this is acceptable because you can just cancel out the positive contribution from a neuron you don't care about with a negative contribution from a very similar neuron that you also don't care about.
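In symbols, with one common affine convention (the exact placement of scale and zero may differ in HQQ's formula):

  # W_bit is 0 or 1; scale and zero are learned per group.
  def dequant(w_bit, scale, zero):
      return scale * (w_bit - zero)

  print(dequant(0, 2.0, 0.5), dequant(1, 2.0, 0.5))   # -1.0, 1.0  (symmetric, but no "don't care" value)
  print(dequant(0, 1.0, 0.0), dequant(1, 1.0, 0.0))   #  0.0, 1.0  (has a zero, but no negative weights)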


It's more like, addition is an XOR. (In fact, AND as mult and XOR as add are GF(2)). Throw in NOT (or, really, just a constant 1) and you can compute any circuit.

Biologically, inhibitory neurons are every bit as important as excitatory ones, so if you squint just right, XOR looks like a neuron's activation being inhibited by another presynaptic neuron.


There's a scale and bias afterwards, so it's not necessarily asymmetric.


Yeah basically. In binary, multiplication is an XNOR operation:

  0 XNOR 0 = 1
  0 XNOR 1 = 0
  1 XNOR 0 = 0
  1 XNOR 1 = 1


XNOR does not distribute over AND or any other binary operators (try 0 XNOR (0 AND 1)), nor does it have a multiplicative identity, so it's not really multiplication in a way that's useful.


0*0 = 1? I've always heard it as being the output of AND.


For balanced binary [-1, 1] yes. -1*-1=1


That makes sense, thanks!


Neural networks work, but why they work is not well known. Researchers continue to find new “free lunches” all the time.


idk that we need to think that hard. We have shown we can build arithmetic functions from networks of logic gates given sufficient connectivity, so in some sense this just drops the learning network below the level of words and asks it to make its own math.

since this generalizability thing seems to be real, this kind of directly folds in quantization-aware-training, doesn't it?


Stupid question: how can you do deep learning on 1 bit? DL works by calculating a gradient and following it to minimize a loss function, but if all your parameters are 1 bit, there's no such thing as a gradient as you don't have differentiation at all. So how does it work?


Not a stupid question at all.

Early quantization approaches just quantized a high-bit-precision pre-trained model - no need to calculate gradients on the quantized weights. BitNet[1] changed the game by training a low-precision model from scratch. It achieves this by keeping high precision in the gradients, optimizer state, and in "latent weights" which are then quantized on the fly. I don't really understand the finer details of how this works, so check out the paper if you're interested.
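The usual trick for that last part is a straight-through estimator: the forward pass uses the quantized weights, while the backward pass pretends the quantizer was the identity so gradients flow into the full-precision latent weights. A sketch of the idea (not BitNet's exact recipe):

  import torch

  class BinarizeSTE(torch.autograd.Function):
      @staticmethod
      def forward(ctx, w_latent):
          return torch.sign(w_latent)      # forward: 1-bit weights

      @staticmethod
      def backward(ctx, grad_out):
          return grad_out                  # backward: pass gradients straight through

  w_latent = torch.randn(256, 256, requires_grad=True)   # full-precision latent copy
  x = torch.randn(8, 256)
  y = x @ BinarizeSTE.apply(w_latent).t()
  y.sum().backward()                       # w_latent.grad is populated at full precision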

This article's approach is interesting. They start by quantizing a pre-trained high-precision model, and then they fine-tune the quantized model using LoRA (which dramatically improves the performance of the quantized model). They don't talk about the bit depth of the values in the LoRA matrices, so it may be that those are higher bit-depth.

[1] https://arxiv.org/pdf/2310.11453.pdf


Not sure what you mean, why wouldn't there be a gradient? There is a problem that small changes to weights may have no effect (eg ((0 + 0.3) + 0.3) + 0.4 = 0 if you have to round the result to 0 or 1 after every step), but to fix this I believe everyone maintains high precision weights for the backwards pass, from which the quantized weights are derived.


You could treat fractional addition as probabilistic operation.


I was thinking about this idea (using only additions for inference) before too. The article talks about using 1-bit values for weights, but there is also another option: use integers for weights, but allow the output of a neuron to take only 0 and 1 (for example, 1 if the output is > 0 and 0 otherwise). In this case we do not need multiplications either, but we cannot save memory for storing weights. Maybe it has some other advantages, who knows. Never had time to research this further.


You may be interested in the "binary step" activation function. This does what you're suggesting. In general, though, complex behaviour really takes a hit when using this as a neuron's activation function (I'm also not sure which papers show metrics on this being used for transformer models).


So will new AI chips be coming, optimised for a large-scale, bit-only (rather than byte-oriented) AI world?


Every CPU and GPU now has a popcount [1] instruction. You can thank US three-letter agencies for that [2].

  [1] https://en.wikipedia.org/wiki/Hamming_weight#Processor_support
  [2] https://vaibhavsagar.com/blog/2019/09/08/popcount/
Multiplication then is just bitwise AND and a popcount.
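For example, for {0,1} vectors packed into machine words, a dot product is just an AND followed by a popcount (for ±1 weights you'd use XNOR instead and rescale):

  def dot_01(a_bits, b_bits):
      # Dot product of two {0,1} vectors packed LSB-first into integers.
      return bin(a_bits & b_bits).count("1")

  a = 0b101101   # weights     1,0,1,1,0,1
  b = 0b100111   # activations 1,1,1,0,0,1
  print(dot_01(a, b))   # 3 positions where both are 1

  # For {-1,+1} vectors packed as {0,1}: dot = n - 2 * popcount(a ^ b)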


Makes you wonder what the agencies know that we don’t…


It's just cryptography and encoding. Popcount is useful for understanding entropy.


https://en.wikipedia.org/wiki/Adjacency_matrix

What would a graph look like when edges are represented with less than 1-bit?


It would have predictable structure. There would be group-wise, regular sparsity. You could exploit these patterns to infer properties like "there is no connection from this vertex to this entire span of vertices".

Just think about representing a compressed version of a matrix. The matrices you can give a short encoding are the ones that are more ordered in some real, exploitable way. With some deterministic computation on the compressed representation, you could produce the full adjacency matrix.


https://proceedings.mlr.press/v206/meller23a/meller23a.pdf

It seems to have a lot of similarities to the technique in OPs article.


I think the problem is that you are also trying to represent the biases of directionality in the connection e.g.: [-1,0,1]. In the adjacency matrix, if you store the sign (direction) of the edge, then I think that would map fairly isomorphically.


Basically factorized into a giant state machine. Would be logical.


Highly respect the HQQ team's work - same accuracy as GPTQ / AWQ but with no activation-aware calibration step, i.e. no more 3-hour calibration runs! A big fan of the 4-bit ones especially, and the 3-, 2-, and now 1-bit ones!

Also super cool idea of 1bit needing some calibration like AWQ - no calibration data shows very bad results, but with some LoRA adapters and finetuning, a great recovery of performance is possible.

Planning to add support for this inside Unsloth to make all low bit finetuning 2x faster and save tonnes of VRAM!


Is this similar to binary search or could it be a binary search where you're just excluding all possibilities that you know can't be the probable result that you're looking for?


Feed forward networks are effectively DAGs

With the smaller model size, quantized models have less accuracy and less stable training, but large models take advantage of increased parallelism.

Binary search is more linear; circuits are a better lens than Turing machines.

While it still needs to be verified, if you look into what a uniform constant-depth threshold circuit is, that will help.

Although I guess binary weights may be in AC0 and not TC0, but that may not hold with billions of parameters.


I wonder if there is a correspondence to Sony's 1-bit DSD audio stream? Could hardware developed for this purpose be used for some part of the processing?


What will be the next step after 1 bit? Could it be 0.5 bit? Any kind of quantization is not gonna last long.


Not using traditional numbers at all, I imagine. It's possible that trying to build neural nets out of vectors that are themselves built with quantized scalars is a suboptimal approach.


The way things are going we'll have -1 bit before long :)


As an AI layman, I am for some reason excited by the low-bit learning models.

Apple, now leaving behind the albatross that was the Apple car, could be the one to roll their own silicon and put an LLM in our back pockets.

Goodbye Siri, hello Seriously.


Or we could end up getting another Google integration, with iOS being too rigid and locked-down to feasibly support third-party private AI alternatives. Fate has a cruel sense of humor, sometimes.


Are you referring to the Swift version of TensorFlow?

https://www.infoworld.com/article/3608151/swift-for-tensorfl...


I believe they're referring to these rumors: https://www.macrumors.com/2024/03/21/apple-iphone-ai-talks-g...


Could be a way for Apple to buy time as well until they do have something competitive.


Actually they plan to put an LLM in your back pocket using flash memory, not silicon: https://arxiv.org/abs/2312.11514


The flash doesn't do the computations though, that's just a method of getting it to the processor


It would be better to have eeprom or some such directly attached as memory. No loading.


> could be the one to roll their own silicon and put an LLM in our back pockets.

Wouldn't they have to do a lot of catching up with everyone else already working on it?


They've been designing their own chips a while now, including with an NPU.

Also because of their unified memory design, they actually have insane bandwidth which is incredibly useful for LLMs. IMO they may have a head-start in that respect for on-device inference of large models (e.g. 1B+ params).


I don't think people are running 1B+ models on the Neural Engine these days. The high-performance models I've seen all rely on Metal Performance Shaders, which scales with how powerful your GPU is. It's not terribly slow on iPhone, but I think some people get the wrong idea and correlate an ambient processor like the Neural Engine with LLMs.

The bigger bottleneck seems like memory, to me. iPhones have traditionally skimped on RAM moreso than even cheap and midrange Android counterparts. I can imagine running an LLM in the background on my S10 - it's a bit harder to envision iOS swapping everything smoothly on a similarly-aged iPhone.


Sure, but we're discussing 1.8-bit models that, again I'm a layman, I assume are over an order of magnitude smaller in their memory overhead.



