
Great project! I’m amazed by the ingenuity of running a CNN on a TI-84 Plus CE.

Given the eZ80 CPU's limitations with floating-point operations, do you think fixed-point arithmetic could significantly speed up inference without adding too much complexity?




One funny topic is "why did it take us so long to develop advanced neural networks?" Part of the current advance is about improved hardware performance, but I'd imagine that if you went back to the 1980s with the knowledge we have now, you could have gotten much more out of neural nets than we did then.

As for fixed point vs floating, I got into an argument with a professor in grad school about it. I'd argue that floating point often gets used out of familiarity and convenience; if you look at fields such as signal processing it is very common to do algorithms like FFT in fixed point. It seems absurd that an activation in a neural net could range from (say) 10^-6 to 10^6 and have the first digit be meaningful the whole way. If everything was properly scaled and normalized you shouldn't need to waste bits on exponents but maybe when you have a complex network with a lot of layers that normalization is easier said than done.
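To make "properly scaled, no exponent bits" concrete, here's a minimal fixed-point sketch in a Q1.15 format; the format choice and helper names are just illustrative assumptions, not anything from the project or thread above:

    // Q1.15: a 16-bit value with 15 fractional bits, representing numbers in [-1, 1).
    // Every bit goes to the fraction; there is no exponent, so the caller is
    // responsible for keeping values scaled into range (the "normalization" above).
    const FRAC_BITS: u32 = 15;

    fn to_q15(x: f32) -> i16 {
        // Assumes |x| < 1 after whatever scaling/normalization has been applied.
        (x * (1 << FRAC_BITS) as f32).round() as i16
    }

    fn q15_mul(a: i16, b: i16) -> i16 {
        // Widen to i32 for the product, then shift back down to Q1.15.
        ((a as i32 * b as i32) >> FRAC_BITS) as i16
    }

    fn main() {
        let prod = q15_mul(to_q15(0.5), to_q15(-0.25));
        println!("{}", prod as f32 / (1 << FRAC_BITS) as f32); // prints -0.125
    }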


> why did it take us so long to develop advanced neural networks?

I have a pet theory (with some empirical support) that Marvin Minsky put something of a multi-year kibosh on "perceptron"/neural-net research due to his (true, but arguably irrelevant) assertion that trained neural nets were essentially "nondeterministic black boxes", which was basically heretical.


That's not just a pet theory. I'll tell you though that

https://en.wikipedia.org/wiki/Perceptrons_(book)

is a brilliant book from the early days of computer science that asks questions about the fundamentals of algorithms and what you can and can't do with different methods (like the result that it takes at least N log N comparisons to sort N items).


> I'd argue that floating point often gets used out of familiarity and convenience

You're probably right. I can't tell you how often I've seen floats used in places they really shouldn't have been, or where nobody bothered to even write an algorithm that got the dynamic range for _floats_ (much less fixed-points) correct.

> if you look at fields such as signal processing it is very common to do algorithms like FFT in fixed point

FFT is almost uniquely suited to a fixed-point implementation. All the multiplications are by powers of roots of unity (magnitude 1, so they can't grow the values), and a typical problem size in that domain only involves O(10k) additions, keeping the result firmly within an appropriately scaled fixed-point input's dynamic range.
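As a hedged sketch of that, here's one radix-2 butterfly in Q1.15 with the usual per-stage right shift so the additions stay in range too; the struct, names, and the 0.5-headroom convention are my own illustration rather than any particular library's API:

    // Complex Q1.15 value; assumes |re|, |im| < 0.5 on entry (one guard bit).
    #[derive(Clone, Copy)]
    struct Cq15 { re: i16, im: i16 }

    // Multiply by a twiddle factor w = e^{-2*pi*i*k/N}. Since |w| = 1, the
    // product's magnitude never exceeds the input's, so the multiplications
    // themselves need no extra headroom.
    fn cmul(a: Cq15, w: Cq15) -> Cq15 {
        let re = (a.re as i32 * w.re as i32 - a.im as i32 * w.im as i32) >> 15;
        let im = (a.re as i32 * w.im as i32 + a.im as i32 * w.re as i32) >> 15;
        Cq15 { re: re as i16, im: im as i16 }
    }

    // One butterfly: (a, b) -> (a + w*b, a - w*b), scaled down by 2 so that
    // log2(N) stages of additions stay inside the 16-bit range.
    fn butterfly(a: Cq15, b: Cq15, w: Cq15) -> (Cq15, Cq15) {
        let t = cmul(b, w);
        (
            Cq15 { re: (a.re + t.re) >> 1, im: (a.im + t.im) >> 1 },
            Cq15 { re: (a.re - t.re) >> 1, im: (a.im - t.im) >> 1 },
        )
    }

    fn main() {
        let a = Cq15 { re: 8192, im: 0 };  // 0.25
        let b = Cq15 { re: 4096, im: 0 };  // 0.125
        let w = Cq15 { re: 32767, im: 0 }; // ~1.0 (twiddle for k = 0)
        let (x, y) = butterfly(a, b, w);
        println!("{} {} / {} {}", x.re, x.im, y.re, y.im);
    }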

> If everything was properly scaled and normalized you shouldn't need to waste bits on exponents but maybe when you have a complex network with a lot of layers that normalization is easier said than done.

It's much harder to accomplish that even in vanilla neural networks (compared to FFT), much less if they have anything "interesting" (softmax, millions of inputs, ...) going on.

Take a look at a single matmul operation, or even just a dot product, with f32 vs i32 weights. Unless you do something in the architecture to prevent large weights from ever being lined up with large inputs (which would be an odd thing to attempt in the first place; the whole point of a dot-product in a lot of these architectures is to measure similarity between two pieces of information, and forcing inputs to be less similar to the weights in that way doesn't sound intuitively helpful), you'll eventually have to compute some product `x y` where both are roughly as large as you've allowed with whatever normalization scheme you've used up to that point. The resulting large activation will be one of the more important pieces of information from that dot product, so clipping the value is not a good option. With i32 inputs, you're left with one of:

1. Cap the inputs to 15 bits of precision (as opposed to 23 with f32)

2. Compute the outputs as i64, perhaps even larger (or still partially restricting input precision) when you factor in the additions and other things you still have to handle

3. Rework the implementation of your dot product to perhaps, somehow, avoid large multiplications (not actually possible in general though; only when the result fits in your target dynamic range)

The first option is just straight-up worse than an f32 implementation. The third isn't general-purpose enough (in that you again start having to design a network amenable to fixed points). The second might work, but to keep the bit count from exploding through the network you immediately require some form of normalization. That's probably fine if you have a layer-norm after every layer, but it still complicates the last layer, and it doesn't work at all if you have any fancy features at that particular matmul precluding layer normalization (like enforcing a left-unitary Jacobian at each composite layer (including any normalization and whatnot)).
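For what it's worth, here's a minimal sketch of what option 2 tends to look like; the Q16.16 format, the helper name, and the final shift are hypothetical choices made only to show the trade-off:

    // Q16.16 activations and weights: i32 values with 16 fractional bits.
    const FRAC_BITS: u32 = 16;

    // Dot product with i64 accumulation (option 2). The product of two Q16.16
    // values needs up to ~62 bits, and the running sum needs a few more on top,
    // which is exactly the "bit count exploding" problem.
    fn dot_q16(xs: &[i32], ws: &[i32]) -> i32 {
        let mut acc: i64 = 0;
        for (&x, &w) in xs.iter().zip(ws) {
            acc += x as i64 * w as i64;
        }
        // Renormalize back to Q16.16. The shift stands in for "some form of
        // normalization"; if the result still doesn't fit in i32, the cast
        // silently wraps, which is the dynamic-range problem in a nutshell.
        (acc >> FRAC_BITS) as i32
    }

    fn main() {
        let one = 1 << FRAC_BITS;             // 1.0 in Q16.16
        let xs = [one, 2 * one, 3 * one];
        let ws = [one / 2, one / 2, one / 2]; // 0.5 each
        println!("{}", dot_q16(&xs, &ws) as f64 / one as f64); // prints 3
    }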

You might, reasonably, counter that you don't need more than 15 bits of precision anyway. That's a fair counter (mostly; although the argument you're making is that by not wasting space on the exponent you have room for more real data), but the common sub-f32 floating-point formats still keep more precision than an equally sized fixed-point type has left after that kind of required reduction. Plus, you still need _some_ kind of normalization after the matmul to keep the dynamic range correct, and doing that by tweaking the model architecture precludes some features you might otherwise reasonably want to include. So you end up doing something like separately keeping track of a multiplicative scaling factor at each layer, which itself screws with how most activation functions behave, and so on.

Plus, at a hardware level, almost nobody provides operations like fmadd for integers/fixed-point yet. As a collective action problem, you're starting off at least twice as slow if you try to popularize fixed-point networks.

Not to mention, when you use fixed point you generally start having to build larger computational blocks rather than working in smaller, re-usable units. As a simple example, the order of operations when computing a linear projection almost doesn't matter for floats, but for fixed point you have all the same precision problems I mentioned for multiplications earlier, with the added anti-benefit of the internal division actively eating your precision. You can't have a component that just computes a similarity and then applies that similarity without also doubling the bits of the return type and pre-scaling by something like (1<<32) before the divide when the divisor is large, plus some other hackery when it's small. Since that component is unwieldy to write and use, you start having to code in larger units, like computing the entire linear projection at once.
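And a rough sketch of that "similarity component" problem, again in a hypothetical Q16.16; the numerator pre-scale below is a smaller version of the (1<<32)-style hackery being described:

    // Coefficient of projecting a onto b, i.e. (a . b) / (b . b), in Q16.16.
    // Assumes b is not the zero vector.
    const FRAC_BITS: u32 = 16;

    fn projection_coeff_q16(a: &[i32], b: &[i32]) -> i32 {
        let mut num: i64 = 0; // the "similarity" part
        let mut den: i64 = 0; // squared norm of b
        for (&x, &y) in a.iter().zip(b) {
            num += x as i64 * y as i64;
            den += y as i64 * y as i64;
        }
        // num and den carry the same scale, so a plain num / den would drop all
        // the fractional bits. Pre-scaling the numerator keeps the precision but
        // is itself a fresh overflow hazard when num is already large, and when
        // den is tiny the quotient blows past i32 anyway -- the "other hackery".
        ((num << FRAC_BITS) / den) as i32
    }

    fn main() {
        let one = 1 << FRAC_BITS;
        let a = [2 * one, 0];
        let b = [one, 0];
        println!("{}", projection_coeff_q16(&a, &b) as f64 / one as f64); // prints 2
    }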

I've probably gone on more than long enough. Floats are a nice middle ground when you're doing a mix of nearly unconstrained additions and multiplications, and they allow you to focus a bit more on the actual ML. For a fixed-point network to have advantages, I think it'd need to be built specifically to take advantage of fixed-point's benefits, especially since so many aspects of these networks instead explicitly play into its drawbacks.


Great!



