From zero to main(): How to write a bootloader from scratch

kbumsik · on Sept 30, 2020

A good real-world example is the Arduino bootloader. [1]

Once you read the OP and learn basic idea, it is pretty simple.

[1] https://github.com/arduino/ArduinoCore-samd/blob/master/boot...

fra · on Sept 30, 2020

A great open source bootloader is MCUBoot[1], if you're looking for something a little more substantial.

[1] https://github.com/JuulLabs-OSS/mcuboot

giu · on Sept 30, 2020

Nice little trick I found in your link: A poor man's sleep routine :)

    for (uint32_t i=0; i<2500; i++) /* 10ms */
      /* force compiler to not optimize this... */
      __asm__ __volatile__("");

pizza234 · on Sept 30, 2020

For busy waiting in general, AFAIK one should signal the intention to the compiler, via intrinsics (`_mm_pause` in C, `std::sync::atomic::spin_loop_hint` in Rust, and so on). The compiler and processor will attempt to make the best out of the pause, e.g. send the processor to a lower-consumption state.

It may be overkill for such cases, but if one keeps the strategy as reference, it's best to know the optimal one :-)

I even think it's not overkill for this case, as they're actually working around the compiler, and a clean approach is preferrable regardless.

simias · on Sept 30, 2020

>e.g. send the processor to a lower-consumption state.

You have to be very careful in a bootloader, since you're initializing the clocks and other very low level subsystems you don't want to risk putting the CPU in a state you won't be able to get out of.

In general the standard method is to use a dummy busy delay like the "nop loop" shown above in the very early stages until you get to the point where you can enable interrupts and standby states and implement a proper sleep method that will effectively shut the CPU down waiting for an external event.

If you're talking about userland code (or even driver code) using busyloops you're right though.

saagarjha · on Sept 30, 2020

Isn’t _mm_pause an Intel intrinsic?

pizza234 · on Sept 30, 2020

Based on what I see on [Stack Overflow](https://stackoverflow.com/a/51908785), it's recognized by all the major compilers

The [Microsoft Compiler help](https://docs.microsoft.com/en-us/cpp/intrinsics/x86-intrinsi...) includes it in the "x86 intrinsics list".

I've just compiled with success, for testing purposes, on an AMD processor.

C, relevant section:

    #include <immintrin.h>

    void test() {
        _mm_pause();
    }

Output ASM, relevant section:

    test:
    .LFB4006:
      .cfi_startproc
      endbr64
      pushq %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset 6, -16
      movq  %rsp, %rbp
      .cfi_def_cfa_register 6
      rep nop
      nop
      nop
      popq  %rbp
      .cfi_def_cfa 7, 8
      ret
      .cfi_endproc
    .LFE4006:
      .size test, .-test
      .globl  main
      .type main, @function

It should be the `rep nop`.

Including `ammintrin.h` yields the same `rep nop`.

saagarjha · on Sept 30, 2020

By “Intel” I really meant “x86” :P Is there an ARM compiler that recognizes it?

secondcoming · on Sept 30, 2020

Unlikely, intrinsics are CPU, and sometimes CPU model, specific. Compiler builtins will instead use the intrinsic if it's available/suitable or generate equivalent code if it's not.

pizza234 · on Sept 30, 2020

Ah! Sorry for that; I have no idea.

It seems that at least some ARMs have [timers that can be used](https://electronics.stackexchange.com/q/450971), which is a different strategy, but still more appropriate (I suppose).

kbumsik · on Sept 30, 2020

To be fair, this is not a "sleep" routine, but a delay routine. The purpose of the function is to give a slight delay until other part of the hardware is ready. This is actually the most deterministic (so it means great) way to do it.

More sophisticated does not always means better.

simias · on Sept 30, 2020

These busy loops are not super deterministic in practice unless they're calibrated beforehand or if you know very well 1/ your CPU pipeline 2/ the frequency of your clock source 3/ the frequency of your CPU PLL (which can be dynamic on modern CPUs depending on temperature and other factors) and a few other things like whether the caches and branch prediction are enabled etc...

If you look at the Linux kernel code dealing with calibrating delay loops you can see that it's far from trivial in the general case (and not necessarily very accurate): https://elixir.bootlin.com/linux/latest/source/init/calibrat...

If you want a precise and deterministic delay you almost always want a hardware timer. The problem of course is that in these early boot stages you can't always use one easily, so a rough busy delay approximation does the job just fine.

kbumsik · on Sept 30, 2020

The OP delay loop is for MCU/Arduino where embedded devs typically have all control of the PLLs and know exact CPU timings. You can see that the other part of the bootloader example fixes all related clocks manually. [1]

These MCUs are designed for hard realtime systems so the silicon design is required to have all timings well-known in the datasheets and required to be deterministic. The core (Cortex-M0) design removes nondeterministic aspects of the CPU so there are neither caches nor branch predictions in it. So the busy loops have deterministic timings.

Of course it won't be trivial in MPU/Linux spaces as you mention.

Lots of programming practices are quite different between MPU and MCU worlds. I personally find it interesting that many people are talking about MPU/Linux/x86 here. It's good to learn about them :)

[1] https://github.com/arduino/ArduinoCore-samd/blob/master/boot...

friendzis · on Sept 30, 2020

Poor man's sleeps (various NOP loops) are just that. In all cases you should try and utilize on-chip sleep instruments, timers most probably.

On a Cortex-sized MCU that is most probably as simple as installing appropriate interrupt handler and launching the timer by writing appropriate data to some register(s).

yitchelle · on Sept 30, 2020

I have to support this advice. Poor man's sleep is not really sleep as it is still executing NOPs. The on-chip sleep mechanism puts various parts of the silicon into an off-state. The target of sleep is to reduce current consumption when the hardware is idle.

stjohnswarts · on Oct 1, 2020

Sleep is a generic term. In this context of an arduino the sleep term is perfectly applicable. On a bigger processor with pipelining/memorycontroller/tlbs I'd agree with you.

kbumsik · on Sept 30, 2020

But the example here is to simply delay, not to sleep for power consumption.

yitchelle · on Sept 30, 2020

It might be fine if it is just for delay, then you run into problem of portability to the system with different instruction cycle times.

Not saying it is bad, but technical debt is high from several angles.

Someone · on Sept 30, 2020

But a delay that uses less power still is preferred. Busy-looping should be avoided whenever possible (I wouldn’t know whether that’s the case here)

kbumsik · on Sept 30, 2020

The delay is just 10ms and this is a bootloader code that runs only once at bootup. As a bootloader for an MCU, a less error-prone way with less code size is more preferrable.

friendzis · on Oct 1, 2020

I can agree that justification to use nop loops somewhere along the lines of "core starts executing code when, judging by real-world testing, hardware peripherals do not guarantee stable state. Thorough testing suggests that 8ms+safety margin delay mitigates issues related to hardware readiness. 2500 cycle nop-loop guarantees required 10ms delay on fastest clock speeds and 50ms delay on slowest core speeds without causing observable issues. 50ms is hereby deemed acceptable." would acceptable in most commercial design documents.

rhn_mk1 · on Sept 30, 2020

> In all cases

Aren't you overgeneralizing a little? There is nothing wrong with using loops when execution time is known and constant (like an Arduino). It saves the effort of setting up interrupts, and they can be used with interrupts disabled.

They won't be so accurate: time spent in interrupts will not be counted, and loop setup/teardown won't as well, but it's rare to need such precision over short scales, e.g. when delaying a LED blink.

Never use those when execution time is not guaranteed though, like in emulation or binaries executing on different MCUs.

friendzis · on Oct 1, 2020

> > In all cases you should try

Yes, in all cases you should try to avoid nop loops, similarly like in high level applications you should try and avoid blocking IO in favor of non-blocking IO. And it does not only mean "avoid nop-loops in code". It also means design your thing in a way where random waits (nop loops are random waits) are unnecessary in the first place. Wait on readiness flag, use timers. Sometimes you have shitty hardware that either does not indicate readiness or indicates readiness falsely, happens. There definitely are situations where nop loops are the only sane available solution.

> using loops when execution time is known and constant

Um... Cortex and larger cores normally can run at multiple clock speeds (e.g. via PLL or clock multiplier config), so this speed is constant only as long as you do not change core clock configuration. Furthermore, if the core in question is even more advanced and employs dynamic PLL, nop-loop execution time is not constant.

> (like an Arduino)

Yes, if you run on small cores like ATmega this can be more or less true.

> It saves the effort of setting up interrupts, and they can be used with interrupts disabled.

Sure and that is one of the main advantages of nop-loops. However, unless you have all "userland" interrupts disabled (all except fault handlers), you run inherently concurrent code which by definition cannot have constant real world run time. To be fair, in the context of a bootloader application flow can be controlled and you can avoid most concurrency issues.

> Never use those when execution time is not guaranteed though

This is the core issue. Execution time is most probably not guaranteed. Nop loops can work during development, hobby project or when you need just a tiny little bit of delay and hardware sync is unavailable.

kbumsik · on Sept 30, 2020

I agree. Using timer is not as deterministic as precisely-calculated NOP loop.

And why even bother setting timer peripherals if it is to delay for 10ms once in the entire program?

cozzyd · on Sept 30, 2020

Well, this is the ArduinoCore on SAMD, which (unsurprisingly, since it's same chip) looks a lot like this tutorial.

steveklabnik · on Sept 30, 2020

This whole website is really great; it has been a fantastic resource for me. https://interrupt.memfault.com/blog/cortex-m-fault-debug in particular was really great.

cozzyd · on Sept 30, 2020

Completely agree. Way better than than the crappy documentation you often get from vendors like Microchip.

Cerium · on Sept 30, 2020

with Microchip you get more than crappy docs. You get plenty of bugs in the hardware. After spending weeks finding one you report it and they say "Yeah, we know.". I still use them. Once you patch and work around all the bugs you can get somewhere nice.

Some examples on pic24fj256gb108: The onboard BOR circuit cannot be trusted. One of the GPIO is listed in the data sheet as being a real GPIO, but the driver is missing because it was used for a USB feature.

numlock86 · on Sept 30, 2020

A good start. It's fun until you realize Cortex M0 doesn't have a VTOR. Then again, remapping the vector table by hand isn't much work.

fra · on Sept 30, 2020

> It's fun until you realize Cortex M0 doesn't have a VTOR.

One of the many annoying limitations of the M0. Thankfully ARM seems to agree, and they've added the VTOR(s) to the M23 (the ARMv8M successor to the M0).

mmoskal · on Sept 30, 2020

The tutorial is about SAMD21 which is M0+ with VTOR. Simple bootloaders often don't need interrupts so on M0 one can get away with placing the bootloader at the end of flash and only rewriting the reset vector to lead to bootloader. Regular interrupts are then handled without any delays.

dmitrygr · on Sept 30, 2020

C-M0+ has VTOR as an option and many implementations have it.

numlock86 · on Sept 30, 2020

Yes, there are probably some. I've worked with quite some M0 chips from Renesas, STM and Freescale and none of them had a VTOR. I heard (some?) M0s from NXP have it, though. For portability I always include manual remapping anyway.

peterkelly · on Sept 30, 2020

Note to the author: On the latest Firefox for mac, the code samples for some reason show up as black on grey text and are almost impossible to read.

I've noticed that if I remove 'Source Code Pro' from the font-family in the inspector for the code blocks, it looks fine.

fra · on Sept 30, 2020

Thanks for letting me know! Let me dig into this.

the_spacebyte · on Sept 30, 2020

Not adding much to the discussion, just want to say how great this website is and its articles! They have been a fun read and fantastic resource!

2rsf · on Sept 30, 2020

I wrote my last bootloader more than a decade ago, debugging problems is fun !

dgellow · on Sept 30, 2020

How do you debug a bootloader?

mras0 · on Sept 30, 2020

Depends on the target, but sometimes you can use your regular debugger. Otherwise good old printf debugging/blinking with LEDs/toggling output pins/staring at the code while crying depending on the circumstances.

meekrohprocess · on Oct 1, 2020

Like the article says, it's just a C program.

So you can compile with -g and use a debugger like GDB to step through it, set breakpoints, probe memory and registers, etc.

jlokier · on Sept 30, 2020

On physical hardware, it helps a lot to have at least one simple GPIO which you can control by writing to a memory location or I/O port, or even better a UART output.

Then you can toggle the GPIO output to see if some code was reached and/or a condition was met.

If you have subroutines (or macros if you don't have a stack, or use a register to hold the return address), you can toggle a GPIO repeatedly to output characters, doing a UART in software if you don't have one already on the chip.

It's a lot like "printf" debugging. You repeatedly compile and run the boot code, testing how it went using this basic output.

To run the boot code might need a tool that lets you load it into memory beforehand.

Some boards will only have flash EEPROM to boot from. There you run the boot code by repeatedly flashing your code into it. Until your CPU is running useful software, the flashing tool runs outside the chip you're coding for.

Typically it doesn't take a large number of efforts to get something useful running.

But to boot up fully, there might be tricky or subtle issues around initialising clocks, PLLs, even voltages, enabling a cache, switching from cache-as-RAM to using external RAM, initialising DRAM, bus signal training, negotiating protocols with other devices, that sort of thing. Some CPUs boot into a special state where some features don't work normally yet, a state that only exists during boot. Some are executing direct from flash, slowly, and need RAM initialised before they can start running from it.

That kind of stuff is very similar to writing device drivers: There may be "mystical incantations" with magic numbers and particular instruction sequences, that you just have to find out from somewhere because they aren't obvious, or even because trial-and-error doesn't guarantee you reliable behaviour. For example some PLLs require a set of magic values to be loaded into registers. You could find the values experimentally, but you wouldn't have the same behaviour guarantees in all environmental conditions that you get from using values from the manufacturer.

In practice, most boards come with a "BSP" (board support package) already from the manufacturer, so the tricky essentials are done already. Also, many of them have a boot loader which can "chain" to a second bootloader, which you can write or change as you wish without having to get the chip initialisation right, and sometimes you can load the second bootloader into RAM with an external tool, or over a UART, so no flashing required.

There is also simulation. Many CPUs have simulators, so you can debug the boot loader step by step, stop it and inspect the program state, much like any other program. However, some simulators only simulate a device after the tricky stuff mentioned above is done and the CPU+memory is running "normally". The early boot state might not be simulated accurately or at all.

cozzyd · on Sept 30, 2020

gdb along with a debug server from OpenOCD or JLink or whatever. Assuming the MCU has the proper debugging support. This works well on the Cortex-M0 at least (the annoying thing is you often have to switch between the bootloader .elf and the application .elf).

devthrowawy · on Sept 30, 2020

Often a pentagram is required.

minipci1321 · on Sept 30, 2020

Conveniently omits any mention of hardware initialization and related fallouts in HVM.

Find it easy? Try to push all components -- interconnect, DDR and flash controllers, accelerators, etc -- to max performance and get the phone ready to hear from the factory about how the boards fail testing.

Not all of the SoC blocks may be reconfigured later after the boot completed, some have to be done right in the very early stages of it.