Hacker News new | past | comments | ask | show | jobs | submit login
From zero to main(): How to write a bootloader from scratch (memfault.com)
332 points by tigerlily on Sept 30, 2020 | hide | past | favorite | 44 comments



A good real-world example is the Arduino bootloader. [1]

Once you read the OP and learn basic idea, it is pretty simple.

[1] https://github.com/arduino/ArduinoCore-samd/blob/master/boot...


A great open source bootloader is MCUBoot[1], if you're looking for something a little more substantial.

[1] https://github.com/JuulLabs-OSS/mcuboot


Nice little trick I found in your link: A poor man's sleep routine :)

    for (uint32_t i=0; i<2500; i++) /* 10ms */
      /* force compiler to not optimize this... */
      __asm__ __volatile__("");


For busy waiting in general, AFAIK one should signal the intention to the compiler, via intrinsics (`_mm_pause` in C, `std::sync::atomic::spin_loop_hint` in Rust, and so on). The compiler and processor will attempt to make the best out of the pause, e.g. send the processor to a lower-consumption state.

It may be overkill for such cases, but if one keeps the strategy as reference, it's best to know the optimal one :-)

I even think it's not overkill for this case, as they're actually working around the compiler, and a clean approach is preferrable regardless.


>e.g. send the processor to a lower-consumption state.

You have to be very careful in a bootloader, since you're initializing the clocks and other very low level subsystems you don't want to risk putting the CPU in a state you won't be able to get out of.

In general the standard method is to use a dummy busy delay like the "nop loop" shown above in the very early stages until you get to the point where you can enable interrupts and standby states and implement a proper sleep method that will effectively shut the CPU down waiting for an external event.

If you're talking about userland code (or even driver code) using busyloops you're right though.


Isn’t _mm_pause an Intel intrinsic?


Based on what I see on [Stack Overflow](https://stackoverflow.com/a/51908785), it's recognized by all the major compilers

The [Microsoft Compiler help](https://docs.microsoft.com/en-us/cpp/intrinsics/x86-intrinsi...) includes it in the "x86 intrinsics list".

I've just compiled with success, for testing purposes, on an AMD processor.

C, relevant section:

    #include <immintrin.h>

    void test() {
        _mm_pause();
    }
Output ASM, relevant section:

    test:
    .LFB4006:
      .cfi_startproc
      endbr64
      pushq %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset 6, -16
      movq  %rsp, %rbp
      .cfi_def_cfa_register 6
      rep nop
      nop
      nop
      popq  %rbp
      .cfi_def_cfa 7, 8
      ret
      .cfi_endproc
    .LFE4006:
      .size test, .-test
      .globl  main
      .type main, @function

It should be the `rep nop`.

Including `ammintrin.h` yields the same `rep nop`.


By “Intel” I really meant “x86” :P Is there an ARM compiler that recognizes it?


Unlikely, intrinsics are CPU, and sometimes CPU model, specific. Compiler builtins will instead use the intrinsic if it's available/suitable or generate equivalent code if it's not.


Ah! Sorry for that; I have no idea.

It seems that at least some ARMs have [timers that can be used](https://electronics.stackexchange.com/q/450971), which is a different strategy, but still more appropriate (I suppose).


To be fair, this is not a "sleep" routine, but a delay routine. The purpose of the function is to give a slight delay until other part of the hardware is ready. This is actually the most deterministic (so it means great) way to do it.

More sophisticated does not always means better.


These busy loops are not super deterministic in practice unless they're calibrated beforehand or if you know very well 1/ your CPU pipeline 2/ the frequency of your clock source 3/ the frequency of your CPU PLL (which can be dynamic on modern CPUs depending on temperature and other factors) and a few other things like whether the caches and branch prediction are enabled etc...

If you look at the Linux kernel code dealing with calibrating delay loops you can see that it's far from trivial in the general case (and not necessarily very accurate): https://elixir.bootlin.com/linux/latest/source/init/calibrat...

If you want a precise and deterministic delay you almost always want a hardware timer. The problem of course is that in these early boot stages you can't always use one easily, so a rough busy delay approximation does the job just fine.


The OP delay loop is for MCU/Arduino where embedded devs typically have all control of the PLLs and know exact CPU timings. You can see that the other part of the bootloader example fixes all related clocks manually. [1]

These MCUs are designed for hard realtime systems so the silicon design is required to have all timings well-known in the datasheets and required to be deterministic. The core (Cortex-M0) design removes nondeterministic aspects of the CPU so there are neither caches nor branch predictions in it. So the busy loops have deterministic timings.

Of course it won't be trivial in MPU/Linux spaces as you mention.

Lots of programming practices are quite different between MPU and MCU worlds. I personally find it interesting that many people are talking about MPU/Linux/x86 here. It's good to learn about them :)

[1] https://github.com/arduino/ArduinoCore-samd/blob/master/boot...


Poor man's sleeps (various NOP loops) are just that. In all cases you should try and utilize on-chip sleep instruments, timers most probably.

On a Cortex-sized MCU that is most probably as simple as installing appropriate interrupt handler and launching the timer by writing appropriate data to some register(s).


I have to support this advice. Poor man's sleep is not really sleep as it is still executing NOPs. The on-chip sleep mechanism puts various parts of the silicon into an off-state. The target of sleep is to reduce current consumption when the hardware is idle.


Sleep is a generic term. In this context of an arduino the sleep term is perfectly applicable. On a bigger processor with pipelining/memorycontroller/tlbs I'd agree with you.


But the example here is to simply delay, not to sleep for power consumption.


It might be fine if it is just for delay, then you run into problem of portability to the system with different instruction cycle times.

Not saying it is bad, but technical debt is high from several angles.


But a delay that uses less power still is preferred. Busy-looping should be avoided whenever possible (I wouldn’t know whether that’s the case here)


The delay is just 10ms and this is a bootloader code that runs only once at bootup. As a bootloader for an MCU, a less error-prone way with less code size is more preferrable.


I can agree that justification to use nop loops somewhere along the lines of "core starts executing code when, judging by real-world testing, hardware peripherals do not guarantee stable state. Thorough testing suggests that 8ms+safety margin delay mitigates issues related to hardware readiness. 2500 cycle nop-loop guarantees required 10ms delay on fastest clock speeds and 50ms delay on slowest core speeds without causing observable issues. 50ms is hereby deemed acceptable." would acceptable in most commercial design documents.


> In all cases

Aren't you overgeneralizing a little? There is nothing wrong with using loops when execution time is known and constant (like an Arduino). It saves the effort of setting up interrupts, and they can be used with interrupts disabled.

They won't be so accurate: time spent in interrupts will not be counted, and loop setup/teardown won't as well, but it's rare to need such precision over short scales, e.g. when delaying a LED blink.

Never use those when execution time is not guaranteed though, like in emulation or binaries executing on different MCUs.


> > In all cases you should try

Yes, in all cases you should try to avoid nop loops, similarly like in high level applications you should try and avoid blocking IO in favor of non-blocking IO. And it does not only mean "avoid nop-loops in code". It also means design your thing in a way where random waits (nop loops are random waits) are unnecessary in the first place. Wait on readiness flag, use timers. Sometimes you have shitty hardware that either does not indicate readiness or indicates readiness falsely, happens. There definitely are situations where nop loops are the only sane available solution.

> using loops when execution time is known and constant

Um... Cortex and larger cores normally can run at multiple clock speeds (e.g. via PLL or clock multiplier config), so this speed is constant only as long as you do not change core clock configuration. Furthermore, if the core in question is even more advanced and employs dynamic PLL, nop-loop execution time is not constant.

> (like an Arduino)

Yes, if you run on small cores like ATmega this can be more or less true.

> It saves the effort of setting up interrupts, and they can be used with interrupts disabled.

Sure and that is one of the main advantages of nop-loops. However, unless you have all "userland" interrupts disabled (all except fault handlers), you run inherently concurrent code which by definition cannot have constant real world run time. To be fair, in the context of a bootloader application flow can be controlled and you can avoid most concurrency issues.

> Never use those when execution time is not guaranteed though

This is the core issue. Execution time is most probably not guaranteed. Nop loops can work during development, hobby project or when you need just a tiny little bit of delay and hardware sync is unavailable.


I agree. Using timer is not as deterministic as precisely-calculated NOP loop.

And why even bother setting timer peripherals if it is to delay for 10ms once in the entire program?


Well, this is the ArduinoCore on SAMD, which (unsurprisingly, since it's same chip) looks a lot like this tutorial.


This whole website is really great; it has been a fantastic resource for me. https://interrupt.memfault.com/blog/cortex-m-fault-debug in particular was really great.


Completely agree. Way better than than the crappy documentation you often get from vendors like Microchip.


with Microchip you get more than crappy docs. You get plenty of bugs in the hardware. After spending weeks finding one you report it and they say "Yeah, we know.". I still use them. Once you patch and work around all the bugs you can get somewhere nice.

Some examples on pic24fj256gb108: The onboard BOR circuit cannot be trusted. One of the GPIO is listed in the data sheet as being a real GPIO, but the driver is missing because it was used for a USB feature.


A good start. It's fun until you realize Cortex M0 doesn't have a VTOR. Then again, remapping the vector table by hand isn't much work.


> It's fun until you realize Cortex M0 doesn't have a VTOR.

One of the many annoying limitations of the M0. Thankfully ARM seems to agree, and they've added the VTOR(s) to the M23 (the ARMv8M successor to the M0).


The tutorial is about SAMD21 which is M0+ with VTOR. Simple bootloaders often don't need interrupts so on M0 one can get away with placing the bootloader at the end of flash and only rewriting the reset vector to lead to bootloader. Regular interrupts are then handled without any delays.


C-M0+ has VTOR as an option and many implementations have it.


Yes, there are probably some. I've worked with quite some M0 chips from Renesas, STM and Freescale and none of them had a VTOR. I heard (some?) M0s from NXP have it, though. For portability I always include manual remapping anyway.


Note to the author: On the latest Firefox for mac, the code samples for some reason show up as black on grey text and are almost impossible to read.

I've noticed that if I remove 'Source Code Pro' from the font-family in the inspector for the code blocks, it looks fine.


Thanks for letting me know! Let me dig into this.


Not adding much to the discussion, just want to say how great this website is and its articles! They have been a fun read and fantastic resource!


I wrote my last bootloader more than a decade ago, debugging problems is fun !


How do you debug a bootloader?


Depends on the target, but sometimes you can use your regular debugger. Otherwise good old printf debugging/blinking with LEDs/toggling output pins/staring at the code while crying depending on the circumstances.


Like the article says, it's just a C program.

So you can compile with -g and use a debugger like GDB to step through it, set breakpoints, probe memory and registers, etc.


On physical hardware, it helps a lot to have at least one simple GPIO which you can control by writing to a memory location or I/O port, or even better a UART output.

Then you can toggle the GPIO output to see if some code was reached and/or a condition was met.

If you have subroutines (or macros if you don't have a stack, or use a register to hold the return address), you can toggle a GPIO repeatedly to output characters, doing a UART in software if you don't have one already on the chip.

It's a lot like "printf" debugging. You repeatedly compile and run the boot code, testing how it went using this basic output.

To run the boot code might need a tool that lets you load it into memory beforehand.

Some boards will only have flash EEPROM to boot from. There you run the boot code by repeatedly flashing your code into it. Until your CPU is running useful software, the flashing tool runs outside the chip you're coding for.

Typically it doesn't take a large number of efforts to get something useful running.

But to boot up fully, there might be tricky or subtle issues around initialising clocks, PLLs, even voltages, enabling a cache, switching from cache-as-RAM to using external RAM, initialising DRAM, bus signal training, negotiating protocols with other devices, that sort of thing. Some CPUs boot into a special state where some features don't work normally yet, a state that only exists during boot. Some are executing direct from flash, slowly, and need RAM initialised before they can start running from it.

That kind of stuff is very similar to writing device drivers: There may be "mystical incantations" with magic numbers and particular instruction sequences, that you just have to find out from somewhere because they aren't obvious, or even because trial-and-error doesn't guarantee you reliable behaviour. For example some PLLs require a set of magic values to be loaded into registers. You could find the values experimentally, but you wouldn't have the same behaviour guarantees in all environmental conditions that you get from using values from the manufacturer.

In practice, most boards come with a "BSP" (board support package) already from the manufacturer, so the tricky essentials are done already. Also, many of them have a boot loader which can "chain" to a second bootloader, which you can write or change as you wish without having to get the chip initialisation right, and sometimes you can load the second bootloader into RAM with an external tool, or over a UART, so no flashing required.

There is also simulation. Many CPUs have simulators, so you can debug the boot loader step by step, stop it and inspect the program state, much like any other program. However, some simulators only simulate a device after the tricky stuff mentioned above is done and the CPU+memory is running "normally". The early boot state might not be simulated accurately or at all.


gdb along with a debug server from OpenOCD or JLink or whatever. Assuming the MCU has the proper debugging support. This works well on the Cortex-M0 at least (the annoying thing is you often have to switch between the bootloader .elf and the application .elf).


Often a pentagram is required.


Conveniently omits any mention of hardware initialization and related fallouts in HVM.

Find it easy? Try to push all components -- interconnect, DDR and flash controllers, accelerators, etc -- to max performance and get the phone ready to hear from the factory about how the boards fail testing.

Not all of the SoC blocks may be reconfigured later after the boot completed, some have to be done right in the very early stages of it.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: