That's freaking awesome. For those unaware, the RP2040 microcontroller has two Programmable IO (PIO) blocks with four state machines each. They are essentially standalone state machines that shuttle data between FIFOs (fed by the CPU or DMA) and the GPIO pins. Unlike bitbanging the pins with the CPU, you can implement much more timing-sensitive protocols while reducing CPU usage at the same time. There are demos of people implementing VGA out, digital audio, and all sorts of other interesting protocols.
I love the PIOs, it's such a great idea. I've only used one to emulate a 4021 shift register being clocked at 100 kHz, which is light work compared to what they're capable of, but still something an Arduino struggles to keep up with.
I used them (PRU in Beaglebone) to implement e-fuse burning with JTAG (the chip stops working if you take a little too long) and MDIO (the interface for controlling a network PHY).
My interesting battle story around this is that I first implemented the MDIO as bit banging in the kernel. This used quite a bit of CPU, which I wanted to use for other things. I switched it over to the PRU and CPU usage dropped to 1%. Great! But the data rate was much slower, with huge latency spikes. It turned out that the CPU usage was so low that the CPU governor thought the system was idle, so it was throttling the CPU to just a few hundred MHz. I had to change the governor to pin the CPU clock at 100%, and then everything was about 10x faster.
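For anyone hitting the same thing, here is a minimal sketch of pinning the governor from userspace. It assumes the standard Linux cpufreq sysfs layout and that the `performance` governor is what you want; the comment above doesn't say which governor or mechanism was actually used, and `set_governor` is just a name made up for illustration.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location (assumption: your kernel exposes it here).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def set_governor(governor, root=CPUFREQ_ROOT):
    """Write `governor` to every cpuN/cpufreq/scaling_governor under `root`.

    Returns the list of files that were written. On a real system this
    needs root privileges.
    """
    written = []
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        gov_file.write_text(governor + "\n")
        written.append(gov_file)
    return written
```

Calling `set_governor("performance")` as root should keep the clock at its maximum instead of letting it scale down when the load looks idle. The `root` parameter only exists so the function can be tried against a scratch directory first.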
It is very humbling to know I’m not the only person to have had to implement bit banged MDIO recently.
If your PHY’s data sheet says clause 45 register access only and you don’t have GPIO access as the bus master, run away.
Unfortunately I was tasked with emulating a clause 45 PHY on a clause 22 bus. Without creative use of the microcontroller peripherals, it is next to impossible to implement a 2.5 MHz slave.
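To show what the slave is up against, here is a rough host-side sketch of a clause 22 write frame as a string of bits, plus the bit time at a given MDC rate. The function names are made up for illustration and this is only a model of the frame layout, not anyone's actual firmware.

```python
PREAMBLE = "1" * 32                       # 32 ones before every frame
ST_CLAUSE22 = "01"                        # clause 45 frames start with "00" instead
OPCODES = {"write": "01", "read": "10"}   # clause 22 opcodes
TA_WRITE = "10"                           # turnaround bits driven by the master on a write


def clause22_header(op, phy_addr, reg_addr):
    """Bits the bus master always drives: preamble, ST, OP, PHYAD, REGAD."""
    if not (0 <= phy_addr < 32 and 0 <= reg_addr < 32):
        raise ValueError("PHY and register addresses are 5 bits each")
    return PREAMBLE + ST_CLAUSE22 + OPCODES[op] + f"{phy_addr:05b}{reg_addr:05b}"


def clause22_write(phy_addr, reg_addr, data):
    """Full 64-bit write frame, MSB first, as a string of '0'/'1'."""
    if not 0 <= data <= 0xFFFF:
        raise ValueError("data is 16 bits")
    return clause22_header("write", phy_addr, reg_addr) + TA_WRITE + f"{data:016b}"


def bit_time_ns(mdc_hz):
    """Time available per bit at a given MDC clock rate."""
    return 1e9 / mdc_hz
```

At 2.5 MHz, `bit_time_ns(2_500_000)` works out to 400 ns per bit, and on a read the slave has to decode the 14 header bits after the preamble and start driving the line within the turnaround. That is very little time for an interrupt-driven design.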
I had to use an edge interrupt to jump into a while-loop state machine, whose state was determined by a timer counter driven by the MDIO bus, with all GPIO accessed via hard-coded bit-banded GPIO.
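For readers who haven't met bit-banding: on Cortex-M3/M4 parts that implement it, every bit in the peripheral region gets its own word-sized alias address, so a single store sets or clears one pin with no read-modify-write. Hard-coding those alias addresses is the trick being described. Below is a small sketch of the address arithmetic; which chip was used isn't stated above, so the register address in the usage note is purely an example value.

```python
PERIPH_BASE = 0x4000_0000        # start of the 1 MiB peripheral bit-band region
PERIPH_ALIAS_BASE = 0x4200_0000  # start of the matching 32 MiB alias region
REGION_SIZE = 0x10_0000          # 1 MiB


def bitband_alias(reg_addr, bit):
    """Return the alias address for `bit` (0..31) of the word at `reg_addr`."""
    offset = reg_addr - PERIPH_BASE
    if not 0 <= offset < REGION_SIZE:
        raise ValueError("address is outside the peripheral bit-band region")
    if reg_addr % 4 != 0 or not 0 <= bit < 32:
        raise ValueError("expected a word-aligned address and a bit in 0..31")
    # Each byte of the region expands to 32 alias bytes, each bit to one word.
    return PERIPH_ALIAS_BASE + offset * 32 + bit * 4
```

For example, `hex(bitband_alias(0x40020014, 5))` gives `0x42400294`. Writing 1 or 0 to that alias word changes only bit 5 of the register at the example address.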
The PRUs in the beaglebone are way more flexible to be honest. Most of my effort with the Pico PIOs is squeezing everything to fit inside the 32 instruction limit.
If you're short on PIO memory, the OUT EXEC instruction will execute an instruction directly from the FIFO. Feed the FIFO with a DMA channel and it'll keep up with the system clock. From the RP2040 datasheet section 3.4.5.2:
> OUT EXEC allows instructions to be included inline in the FIFO datastream. The OUT itself executes on one cycle, and the instruction from the OSR is executed on the next cycle. There are no restrictions on the types of instructions which can be executed by this mechanism. Delay cycles on the initial OUT are ignored, but the executee may insert delay cycles as normal.
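A rough host-side sketch of preparing such a stream: the state machine would run a single `out exec, 16` that wraps to itself with autopull enabled, and the DMA buffer would hold the real instructions packed two per 32-bit word. The encoder follows the OUT instruction layout in the datasheet. The packing order assumes the OSR is configured to shift right, and the helper names are made up for illustration.

```python
OUT_OPCODE = 0b011
OUT_DESTINATIONS = {
    "pins": 0, "x": 1, "y": 2, "null": 3,
    "pindirs": 4, "pc": 5, "isr": 6, "exec": 7,
}
NOP = 0xA042  # `mov y, y`, the usual PIO no-op encoding


def encode_out(dest, bit_count, delay_sideset=0):
    """Encode a PIO OUT instruction as a 16-bit word."""
    if not 1 <= bit_count <= 32:
        raise ValueError("bit count must be 1..32")
    if not 0 <= delay_sideset < 32:
        raise ValueError("delay/side-set field is 5 bits")
    # A bit count of 32 is stored as 0 in the 5-bit field.
    return ((OUT_OPCODE << 13) | (delay_sideset << 8)
            | (OUT_DESTINATIONS[dest] << 5) | (bit_count & 0x1F))


def pack_for_fifo(instructions):
    """Pack 16-bit instructions two per FIFO word, first one in the low half.

    Assumes the OSR shifts right, so the low 16 bits are executed first.
    An odd-length list is padded with a NOP.
    """
    padded = list(instructions)
    if len(padded) % 2:
        padded.append(NOP)
    return [(hi << 16) | lo for lo, hi in zip(padded[0::2], padded[1::2])]
```

Here `encode_out("exec", 16)` comes out as `0x60F0`. Going by the quoted text, each streamed instruction costs two cycles (one for the OUT, one for the executee), so the trade is roughly half the instruction rate in exchange for programs longer than 32 instructions.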
No, sorry. I also don't have documentation that I can give. When I left the company, the Beaglebone community was pretty well into a C compiler, which is probably complete by now. There was also some coprocessor library being developed, to provide a nice interface to the host. It was all asm and kernel modules back then. I'm sure things are much easier now.
Noob question: can it be said that the microcontroller contains a very simple FPGA? Or are there some crucial differences between the two technologies?
Much more limited capability-wise, but much tighter guarantees about latency. More like an 80 MHz 8-bit AVR than an ARM coprocessor.
The real shame of modern microcontrollers is the decoupling of peripherals and GPIO from the main processor. It prevents these sorts of hacks, as all access has to go through the memory bus, effectively capping bitbanging at sub-10 MHz speeds.
By that I mean how IO is coupled to the processor. In AVR8 (and other older processors I presume) GPIO access was via I/O registers wired straight into the CPU and reached with dedicated IN/OUT instructions, so a single instruction (one clock on a RISC core) can directly modify the GPIO state.
Every GPIO implementation I’ve seen on modern processors accesses it via a memory-mapped peripheral. The difference is that accessing a memory bus is not a single-cycle operation. You have to wait for the bus to be free, then wait to fetch or write the data.
The most extreme analogous example of this is that modern CPUs are effectively infinitely fast but bounded by cache misses that force a trip to memory.
This is fundamentally why every "toggle a GPIO pin" benchmark is flawed. What is really being measured is memory bus latency.
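A back-of-the-envelope way to see it: the best-case square wave from a bit-banged pin is set by how many core cycles each pin write costs, not by the core clock alone. The sketch below ignores loop overhead, and the cycle counts in the usage note are illustrative assumptions, not measurements from any particular chip.

```python
def max_square_wave_hz(cpu_hz, cycles_per_write):
    """Upper bound on a bit-banged square wave.

    One output period needs two pin writes (drive high, then drive low),
    so the bound is the core clock divided by twice the cost of a write.
    """
    if cycles_per_write < 1:
        raise ValueError("a write costs at least one cycle")
    return cpu_hz / (2 * cycles_per_write)
```

With a single-cycle write, a 16 MHz part tops out at `max_square_wave_hz(16_000_000, 1)`, which is 8 MHz. A hypothetical 1 GHz core that pays 50 cycles per GPIO write over the bus manages only `max_square_wave_hz(1_000_000_000, 50)`, which is 10 MHz, barely ahead despite a clock more than 60 times faster.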
This misunderstanding is why people have trouble reconciling why a multi-GHz processor cannot also bitbang GPIO at GHz speeds, although if such a processor existed it would be amazing.