Implementing Neural Networks on a "10-cent" RISC-V MCU (cpldcpu.wordpress.com)
125 points by _Microft 14 days ago | 24 comments



>I felt there is no solution that I really felt comfortable with

I wish the author had elaborated on why they felt that way, even if it was just "existing solutions are too easy and I want to learn the hard way". They linked to a pretty big list of established microcontroller neural network frameworks. I still have my little SparkFun microcontroller that runs TensorFlow Lite neural networks powered by just a coin cell battery. They were free in the goodie bags at the TensorFlow Summit 2019. "Edge Computing" on the "Internet of Things" was the hype that year.

Edit: Ah, I see they do have elaboration linked - "By simplifying the model architecture and using a full-custom implementation, I bypassed the usual complexities and memory overhead associated with Edge-ML inference engines." Nice work!


How would one get these 16x16 images generated in a way that does not need a lot more compute power than the inference itself? Maybe by using a sensor from an optical mouse, which seems to have a similar resolution? [0] According to a quick web search, the CH32V003 supports SPI and I²C out of the box [1], which the mentioned sensor also uses.

What would one do with such a system?

[0] https://pickandplace.wordpress.com/2012/05/16/2d-positioning...

[1] https://www.wch-ic.com/products/CH32V003.html


IO does tend to take considerable resources/power. In fact, that is one of the reasons it is desirable to run ML as close to the sensor as possible: it allows extracting and transmitting onward just the information of interest (usually very low bitrate) instead of raw sensor data. This is especially important for wireless and battery-powered devices.

One area where very low resolution images are used is 3D and IR sensing, for example an 8x8 depth image from a time-of-flight sensor like the ST VL53L5CX. It could be mounted in a household setting and detect, for example, human vs pet vs static object. Though the sensor is the expensive part, so one could probably afford a larger microcontroller :D


Indeed, using a mouse sensor for data input would be quite interesting. Maybe another option would just be a row of phototransistors.


"ESP32 CAM" gets me a couple of hits for a camera excluding ESP32 for under $1.50 at 500 minimum quantity.


Especially considering the ESP32 has hardware multiply for both integer and floating point.


7 seconds is an eternity, even for microcontrollers.


Unless I'm misreading your comment, you may have misread the article.

Inference for this RISC-V implementation takes 13.7ms. 7 seconds was cited from an Arduino version as a reference.


Image classification is a good demo/test case. However, image sensors still cost multiple dollars, so one would likely spend a bit more on the microcontroller in that case. An accelerometer or microphone, on the other hand, adds just 30 cents to the BOM and can be processed on a similarly cheap microcontroller. That is at least what I have found so far, trying to build a sub-1-dollar ML-powered system https://hackaday.io/project/194511-1-dollar-tinyml


Great project! I used MNIST because it is easy to work with as a dataset. Audio classification would be quite interesting as a follow-up, but I assume one would need some kind of transform to get the data into a form that is easier to work with.


Thanks! Yeah, transforming into a time-frequency representation is the standard method. The Short-Time Fourier Transform (STFT) using an FFT is the most common, though one can also use FIR/IIR filterbanks. It is, however, quite challenging to do in just a few kB of RAM. It looks doable with 4 kB in total, miiight be possible with 2 kB.


Maybe something simpler, like a Haar wavelet, would also work? Or a DFT using the Goertzel algorithm?
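
For reference, a rough sketch of the Goertzel recurrence in Python (just an illustration of the algorithm mentioned above, not code from the article; on the MCU one would port it to fixed-point C):

    import math

    def goertzel_power(samples, sample_rate, target_freq):
        # Squared magnitude of a single frequency bin via the Goertzel algorithm.
        # Only two state variables are needed, so per-frequency RAM use is a
        # handful of bytes instead of a full FFT frame.
        n = len(samples)
        k = round(n * target_freq / sample_rate)      # nearest DFT bin
        coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
        s1 = s2 = 0.0
        for x in samples:
            s = x + coeff * s1 - s2
            s2, s1 = s1, s
        return s1 * s1 + s2 * s2 - coeff * s1 * s2

    # e.g. look for a 1 kHz tone in a 64-sample window sampled at 8 kHz
    samples = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
    print(goertzel_power(samples, 8000, 1000))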

Impressive numbers compared with the linked Arduino project. Makes me wonder, what’s the difference in approach?


The difference is in using quantization-aware training, where the quantization of the weights is already simulated during training. This helps restructure the network in a way where it can optimally store information in the allotted number of bits per weight.

When the NN is quantized only after training, a lot of information is lost, or you have to use less aggressive quantization, which leaves a lot of redundancy.


Does that mean you're training a lower bit-width(?) network, or you are training a full-precision network to 'know' it will eventually be running under quantization?

I'd imagine there are differences between the two approaches?


The latter one. The network is trained in full precision (this is required for the gradient calculation), but the weights are nudged towards the quantized values.
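
Concretely, the usual trick is a fake-quantization step with a straight-through estimator. A minimal PyTorch sketch (just an illustration of the general technique, not the author's training code; the 2-bit width and max-abs scaling here are assumptions for the example):

    import torch

    def fake_quantize(w, bits=2):
        # Forward: snap weights to a coarse symmetric grid.
        # Backward: gradients pass through as if no rounding happened
        # (straight-through estimator), so the full-precision weights keep
        # getting nudged toward values that survive quantization.
        levels = 2 ** (bits - 1)
        scale = w.detach().abs().max() / levels + 1e-8
        w_q = torch.clamp(torch.round(w / scale), -levels, levels - 1) * scale
        return w + (w_q - w).detach()

    # In the training loop a layer would use fake_quantize(layer.weight)
    # in place of layer.weight.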


Thanks, that explains the accuracy. But it doesn’t explain why it took 7 seconds to run inference on the Arduino, and milliseconds in this project…


The paper [0] regarding the Arduino implementation mentions their MCU runs at 16 MHz, and they are also running the inference on 28x28 images.

This project's MCU runs at 48 MHz and infers on 16x16 images.

So, roughly 3x fewer pixels (256 vs 784) at 3x the clock speed (48 MHz vs 16 MHz). 7000 ms / 9 ≈ 778 ms expected, versus the measured 13.7 ms, still roughly a 57x speed increase. That does seem high, depending maybe on differences between the AVR and RISC-V hardware and ISA. (E.g., might there be a RAM bottleneck on the AVR chip?)

[0] https://arxiv.org/ftp/arxiv/papers/2105/2105.02953.pdf


Looks like an SRAM load on AVR takes 3 cycles, EEPROM 4 cycles [0], with 1 cycle subtracted for consecutive reads. An SRAM store is 1-2 cycles. FMUL (fixed-point multiply) is 2 cycles. The CPU is neither pipelined nor cached.

3 cycles for the load + 2 for the multiplication + 1 for the store = 6 clocks for multiplying a value against an array in program ROM. I just couldn't find the corresponding document for the CH32V003/QingKe V2A/RV32EC, but some of the PDFs mention pipelines, so I suppose users are not expected to count clock cycles and it's just vastly more efficient. That could be it.

0: pp.70- https://ww1.microchip.com/downloads/en/devicedoc/atmel-0856-...


On the CH32V003, a load should be two cycles if the code is executed from SRAM; there are additional wait states for loads from flash. The V2A only caches a single 32-bit instruction word, so there is basically no cache.

This publication seems to describe more details on the arduino implementation:

https://arxiv.org/abs/2105.02953

It appears that the code is even using floats in some implementations, which have to be emulated. So I'd wager that there are differences on both the algorithmic level (QAT-NN) and the implementation level that lead to better performance on the CH32V003.
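
As an illustration of what avoiding floats buys you, a quantized dense layer reduces to pure integer multiply-accumulates. A Python reference sketch of the arithmetic (not the project's code; the actual kernel would be C over packed low-bit weights):

    import numpy as np

    def dense_int8(x_q, w_q, bias_q, out_shift):
        # int8 activations times low-bit integer weights, accumulated in int32,
        # then rescaled with a cheap power-of-two shift. No floating point
        # anywhere, so nothing needs soft-float emulation on an FPU-less core.
        acc = w_q.astype(np.int32) @ x_q.astype(np.int32) + bias_q
        return np.clip(acc >> out_shift, -128, 127).astype(np.int8)

    # e.g. 10 class scores from a 256-pixel (16x16) input
    x = np.random.randint(-128, 128, size=256, dtype=np.int8)
    W = np.random.randint(-2, 2, size=(10, 256), dtype=np.int8)
    b = np.zeros(10, dtype=np.int32)
    print(dense_int8(x, W, b, out_shift=7))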


I believe the Arduino project used only a single hidden layer, whereas the author's quantization scheme allowed them to use multiple.


How would Rust behave here? It'd be interesting to know whether it's flexible enough to work as efficiently on these machines.


Rust in these contexts can work with no_std https://docs.rust-embedded.org/book/intro/index.html


Apparently it has been done https://noxim.xyz/blog/rust-ch32v003/



