(n.b. The project looks awesome, and it’s awesome that it’s written in Rust and working well. Great job folks!)
I’ve been the managing functional safety engineer for several safety-critical real-time embedded systems and safety-related components/SEooCs (including an RTOS), and something stood out to me after reading through the repository: This is not a safety-critical operating system, despite the project’s claim.
There was clearly requirements engineering and verification work done, and the authors are under absolutely no obligation to publish the requirements and methodologies. But there’s no safety manual / integrator’s guide provided, no evidence of any qualitative or quantitative safety analysis or modelling at any abstraction level having been performed, no evidence of a safety concept, and no evidence of a safety case. I recognize that this is likely intended as a PoC from the description of it being exploratory to produce a “Lessons Learned” report, but the website for the ESA initiative that’s driving this development claims that “The design of the system will be guided to support potential future qualification activities”. If that was done, there should be evidence of some or all of those work products, because that’s how you support future qualification. You cannot retroactively sprinkle safety on top of a system that hasn’t been designed with safety in mind.
Hey, one of the devs of Aerugo here! Thanks for feedback!
You are absolutely right - at current state, this is not a safety-critical OS, however the project doesn't claim that explicitly - it's "safety-critical applications oriented". It's a small detail, but you're right to point it out.
The lack of mentioned documents is due to the fact that this RTOS was not qualified for any criticality. And this is due to our resource constraints for this project - ESA provided us with a year of time and funds for ~2 full-time developers. It would be physically impossible to create this project from scratch (as we did) and qualify it, even for crit C, in that timeline.
We would love to do a follow-up activity on Aerugo, and one of our ideas was the qualification (maybe not for Crit A, but B would be nice for an RTOS). However, that's a thing for the future, and we don't know what exactly will happen next with Aerugo yet - we're working on it.
I'd also like to point out that we have designed this system with safety in mind - we've been regularly analyzing potentially problematic design choices and code with that in mind (especially unsafe code and functions). There is a ground for criticality qualification, it just needs a lot of work to make it a fact.
PS: We do intend to release our "Lessons Learned" report to the public in the near future!
I’d love to learn more about this - can you recommend any resources for building something from the ground up with the relevant safety-critical considerations? Appreciate any pointers.
Late reply, I know, but I would highly recommend skimming through “Embedded Software Development for Safety-Critical Systems” by Chris Hobbs. I think it does a good job at getting people thinking about how they can start adopting a safety mindset for their projects, what techniques are applicable, and what common pitfalls are.
> Its design is inspired by purely functional programming paradigm and transputers architecture.
> RTOS is implemented in a form of an executor instead of classic scheduler and doesn't support preemption. Executor runs tasklets, which are fine-grained units of computation, that execute a processing step in a finite amount of time.
Transputers seem to be a leftover technology from the 80s, and lack of preemption seems hard to reconcile with real time guarantees?
As far as my very limited understanding goes, the transputer reference is simply referring to a highly parallel message-passing system based on message queues, not actual hardware implementations.
As for lack of preemption, that's explained by the system not implementing a scheduler. The system "just" implements an executor.
An Executor is a component that handles subscriptions (e.g. to interrupts or events/messages), services (both clients and servers), timers, and QoS events (deadline missed, invalid QoS request, task died/unresponsive, etc.), which are implemented as tasks (Aerugo calls them "tasklets") and registered with the executor.
• Subscriptions are tasks that subscribe to a topic (of a message).
• Service servers are tasks that are invoked when a request from a client is received.
• Service clients are tasks that are invoked whenever a response from a server is received.
• Timers trigger when a timer has expired.
• QoS events are fired when a deadline has been missed, a task died/became unresponsive, and other stuff like a QoS request couldn't be met, etc.
Aerugo supports a subset of these capabilities, namely message queues (for subscribers/clients and producers/servers), events (IRQs, h/w signals, etc.), and cyclic execution (execute the task n-times or forever).
Scheduling - i.e. the priority of things - is handled by the user or built into the Executor itself. For Aerugo the former is the case, so if you need control over priorities and order of task execution, you'd have to implement it on top of it.
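To make the executor-vs-scheduler distinction concrete, here is a minimal sketch of a run-to-completion executor. It is not Aerugo's API: `Executor`, `Tasklet`, `Step` and `cyclic` are names I made up for illustration, and the FIFO ready queue is an assumption on my part. The point is only that each step runs until it returns, and nothing ever interrupts it.

```rust
use std::collections::VecDeque;

/// Outcome of one processing step (hypothetical type, not Aerugo's API).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step {
    /// The tasklet has more work and should be queued again.
    Again,
    /// The tasklet is finished and can be dropped.
    Done,
}

/// A tasklet: a named unit of work whose `step` must return in bounded time.
struct Tasklet {
    name: &'static str,
    step: Box<dyn FnMut() -> Step>,
}

/// A FIFO executor with no priorities and no preemption.
struct Executor {
    ready: VecDeque<Tasklet>,
}

impl Executor {
    fn new() -> Self {
        Executor { ready: VecDeque::new() }
    }

    /// Registers a tasklet at the back of the ready queue.
    fn spawn(&mut self, name: &'static str, step: impl FnMut() -> Step + 'static) {
        self.ready.push_back(Tasklet { name, step: Box::new(step) });
    }

    /// Runs until the queue is empty and returns the order in which steps ran.
    /// A step is never interrupted: the executor only regains control on return.
    fn run(&mut self) -> Vec<&'static str> {
        let mut trace = Vec::new();
        while let Some(mut tasklet) = self.ready.pop_front() {
            trace.push(tasklet.name);
            if (tasklet.step)() == Step::Again {
                self.ready.push_back(tasklet);
            }
        }
        trace
    }
}

/// Builds a cyclic tasklet body that runs `n` steps (at least one) and then finishes.
fn cyclic(n: u32) -> impl FnMut() -> Step {
    let mut left = n;
    move || {
        left = left.saturating_sub(1);
        if left == 0 { Step::Done } else { Step::Again }
    }
}
```

Spawning `cyclic(2)` as "a" and `cyclic(1)` as "b" and calling `run()` interleaves them in queue order. If you wanted priorities, you would have to replace the `VecDeque` with your own ordering, which is the "implement it on top" part.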
Now the realtime-part isn't really affected by any of this. It still allows for weaker RT systems. RT just means there's a time constraint on executed tasks. Weaker systems have average runtime guarantees (e.g. tasks have to finish within a limit on average, i.e. outliers are allowed). Stronger systems are stricter in the sense that tasks always have to finish within the limit. The strongest systems even enforce an exact limit (that is, tasks aren't even allowed to finish sooner - they have to take exactly a given amount of time).
Preemption is required if tasks have to be scheduled w.r.t. different priorities, especially in multi-threaded scenarios. Lack of it means that the executor will never interrupt a running task to execute tasks that take priority - at least from what I understand, which might be wrong.
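The three strengths of time constraint described above can be written as checks over a list of measured runtimes. This is only a sketch of the definitions as I stated them; the function names are mine, and real systems would of course rely on worst-case analysis rather than on a handful of measurements.

```rust
/// Weaker guarantee: the mean runtime must not exceed the limit; outliers are allowed.
fn meets_average(runtimes_us: &[u32], limit_us: u32) -> bool {
    if runtimes_us.is_empty() {
        return true;
    }
    // Compare sums instead of dividing, to avoid rounding the mean.
    let total: u64 = runtimes_us.iter().map(|&r| u64::from(r)).sum();
    total <= u64::from(limit_us) * runtimes_us.len() as u64
}

/// Stronger guarantee: every single run must finish within the limit.
fn meets_always(runtimes_us: &[u32], limit_us: u32) -> bool {
    runtimes_us.iter().all(|&r| r <= limit_us)
}

/// Strongest guarantee: every run must take exactly the given time.
fn meets_exact(runtimes_us: &[u32], limit_us: u32) -> bool {
    runtimes_us.iter().all(|&r| r == limit_us)
}
```

For example, the runtimes 80, 90, 130 and 100 µs against a 100 µs limit pass the average check but fail the other two because of the 130 µs outlier.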
Run-to-completion scheduling is a thing in the real-time world. See for example Miro Samek's book Practical UML Statecharts in C++.
XMOS makes real-time embedded CPUs supporting task-per-core architecture and hardware message passing. Commonly used in consumer real-time audio hardware, to pick one example. I believe some of the original Transputer people are involved.
No preemption means that any realtime guarantees are handled by user space, as the kernel relies on user-space code yielding in time to hit its timing guarantees. It means the kernel doesn't have to do much of the "difficult" work for a realtime system.
On a preemptive OS, all multitasking with (non-CPU) contended resources is, to a certain extent, cooperative. If one task acquires a mutex and then never releases it, no amount of priority-inheritance, highest-locker semantics &c. will fix that. Making everything cooperative increases the analysis needed because the CPU becomes a manually managed contended resource.
In addition, while not being preemptive, it sounds like executors are expected to run in bounded time, and the system triggers an event when the time is exceeded.
Complete systems, not operating-systems, are realtime. Operating systems can be non-realtime (most general purpose schedulers are completely unsuited for hard realtime systems), but at their best only make the design of realtime systems tractable.
You are confusing deadlock with scheduling. Sure, deadlock can happen in the absence of "cooperation". And lack of on-time scheduling can happen in a cooperatively scheduled system in the absence of the appropriate cooperation.
But scheduling has been considered by most kernel designers to be the responsibility of the kernel, not the participants (i.e. preemptively scheduled threads not cooperatively).
Even if a thread is going to deadlock as soon as it starts running (again), there is a huge difference between that thread being scheduled at the right time and it not being scheduled. You can fix the former (deadlock) with better coding. You cannot fix the latter without fixing the kernel.
> You are confusing deadlock with scheduling. Sure, deadlock can happen in the absence of "cooperation". And lack of on-time scheduling can happen in a cooperatively scheduled system in the absence of the appropriate cooperation.
I was describing situations much broader than deadlock. The following pseudocode is not deadlock, but nevertheless a failure of cooperation:
WaitForMutex(m)
DoSomeReallyLongComputation()
My point was that the above code in a preemptively scheduled system is as damaging to all tasks that will contend for "m" as this code is in a cooperatively scheduled system:
DoSomeReallyLongComputation()
> But scheduling has been considered by most kernel designers to be the responsibility of the kernel, not the participants (i.e. preemptively scheduled threads not cooperatively).
Yes and no. Schedulers tend to have parameters. Realtime systems will rely heavily on those parameters. Those parameters will sometimes even include promises for thread T to not run for more than X amount of time in a period of Y time.
> Even if a thread is going to deadlock as soon as it starts running (again), there is a huge difference between that thread being scheduled at the right time and it not being scheduled. You can fix the former (deadlock) with better coding. You cannot fix the latter without fixing the kernel.
This is true in a preemptively scheduled kernel. It's kind of tautological that fixing issues with resource X needs to fix the kernel IFF X is managed by the kernel. See also my above paragraph about kernel scheduler parameters.
[edit]
Just saw who I was replying to. I suspect that you and I have different visions of what a "Real Time System" is, given that I'm thinking industrial control and you're probably thinking audio. There's definitely overlap in theory and discipline, but the hardware and software stacks are rather different.
A kernel (because that's where interrupt handlers are located) can ensure that a thread is scheduled within N usecs of when it "ought to be" (which could be based on some sort of time allocation algorithm, or simple priorities or whatever other scheme may be in use). The kernel can say "oh look, it's been N usecs, let's check who is running and who is ready to run ... hey, time for Thread 2 to run". This is preemptive scheduling.
No cooperative scheduling system can ensure this.
Your example involves poorly designed code, which is not the responsibility of the scheduler. Its job is just to make sure that threads run when they "ought to" - it cannot protect against priority inversions in user space and ensure that RT guarantees are met (pick 1, and even then, you lose).
They are saying that a real time application requires a transitive closure analysis of all cooperating partners to verify if service can be guaranteed.
A cooperative scheduler requires you to extend the transitive closure to all code in the system.
For a critical appliance where the appliance as a whole needs to guarantee service, this means your transitive closure already encompasses all code in the system, so you are not losing too much.
The advantage of preemptive scheduling is that it allows you to subdivide your applications so that you do not need to analyze everything in the system. You can restrict yourself to only considering direct and intentional interaction. This provides modularity advantages and allows you to provide guarantees to sub-components even in the presence of errors or malicious behavior in other components.
However, if you are doing whole system analysis anyways and the systems are simple enough to be tractable to analyze without decomposition, then a cooperative scheduler is adequate to ensure real time performance.
“with contended resources” is the part you are misinterpreting. They are saying that two applications that contend on a resource must cooperate to operate properly.
As they clearly state later: “Making everything cooperative increases the analysis needed because the CPU becomes a manually managed contended resource.”
i.e. a cooperative scheduler makes all code contend on the CPU, thus requiring global cooperation.
Given that they are providing that case as establishing a new requirement (the CPU becomes a contended resource), they are clearly stating that in the preemptive scheduler case the CPU is not a contended resource and thus no global cooperation is required. Only if they contend on a resource do they need cooperation amongst the contending parties.
But again, if you are already doing a whole system analysis anyways, then the benefits are less pronounced.
Veserv has correctly restated what I meant by my original comment (which I still stand by). In my first reply, I used a mutex as an example for a contended resource. A cooperative multitasking system just extends "contended resource" to include the CPU.
Just like you need to make sure you don't hold a critical lock for too long in a preemptive multitasking system, you need to make sure you don't run on the CPU for too long in a cooperative multitasking system.
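A toy model of that comparison, as a sketch: the `Task` struct and both functions are illustrative names, and the model ignores priorities, nested locks, and the possibility that several tasks are queued ahead of you. It only computes the longest single interval during which one other task can hold up `tasks[me]`.

```rust
/// A task in a deliberately simplified timing model (all names are illustrative).
struct Task {
    /// Longest stretch of CPU time the task uses without yielding, in microseconds.
    longest_run_us: u32,
    /// (resource id, longest hold time in microseconds) for each shared resource.
    holds: Vec<(u32, u32)>,
}

/// Preemptive case: only tasks sharing a resource with `tasks[me]` can block it,
/// and only for as long as they hold that resource.
fn worst_blocking_preemptive(tasks: &[Task], me: usize) -> u32 {
    let mine: Vec<u32> = tasks[me].holds.iter().map(|&(r, _)| r).collect();
    tasks
        .iter()
        .enumerate()
        .filter(|&(i, _)| i != me)
        .flat_map(|(_, t)| t.holds.iter())
        .filter(|&&(r, _)| mine.contains(&r))
        .map(|&(_, held)| held)
        .max()
        .unwrap_or(0)
}

/// Cooperative case: the CPU itself is the shared resource, so every other task
/// can delay `tasks[me]` for its longest non-yielding run.
fn worst_blocking_cooperative(tasks: &[Task], me: usize) -> u32 {
    tasks
        .iter()
        .enumerate()
        .filter(|&(i, _)| i != me)
        .map(|(_, t)| t.longest_run_us)
        .max()
        .unwrap_or(0)
}
```

With three tasks where only the first two share a mutex, the preemptive bound for task 0 depends only on task 1's hold time. The cooperative bound depends on the longest run of any task, including the unrelated third one. That is the "extend the analysis to all code in the system" cost.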
[replying to GP so as to not fork this thread too much]
> A kernel (because that's where interrupt handlers are located) can ensure that a thread is scheduled within N usecs of when it "ought to be" (which could be based on some sort of time allocation algorithm, or simple priorities or whatever other scheme may be in use). The kernel can say "oh look, it's been N usecs, let's check who is running and who is ready to run ... hey, time for Thread 2 to run". This is preemptive scheduling.
> No cooperative scheduling system can ensure this.
Sure it can: just poll for preemption every N usecs. You can even statically analyze the assembly code to calculate the maximum number of clock cycles between two points where you poll for preemption (in the event that your microcontroller has caches, you will want to use writethrough caching to help with this analysis).
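A rough sketch of what I mean by polling, with made-up names (`ChunkedSum`, `step`, `budget`): the long computation is split so that it does a bounded amount of work and then returns to whoever called it. In a real system the bound would come from cycle counts derived from the generated assembly, not from an item count as assumed here.

```rust
/// A long computation (summing a slice) split into bounded steps.
struct ChunkedSum<'a> {
    data: &'a [u64],
    pos: usize,
    acc: u64,
}

impl<'a> ChunkedSum<'a> {
    fn new(data: &'a [u64]) -> Self {
        ChunkedSum { data, pos: 0, acc: 0 }
    }

    /// Processes at most `budget` items (at least one), then returns.
    /// Returning is the polling point: the caller decides what runs next.
    /// Yields `Some(total)` once all items have been summed, `None` otherwise.
    fn step(&mut self, budget: usize) -> Option<u64> {
        let end = usize::min(self.pos + budget.max(1), self.data.len());
        for &x in &self.data[self.pos..end] {
            self.acc += x;
        }
        self.pos = end;
        if self.pos == self.data.len() {
            Some(self.acc)
        } else {
            None
        }
    }
}
```

Summing ten items with a budget of four takes three calls to `step`, and between any two calls the caller is free to run a more urgent task. The worst-case time between polling points is then the time for `budget` iterations of the loop, which is a quantity you can bound statically.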
> Your example involves poorly designed code, which is not the responsibility of the scheduler. Its job is just to make sure that threads run when they "ought to" - it cannot protect against priority inversions in user space and ensure that RT guarantees are met (pick 1, and even then, you lose).
This was the whole point of that example. A realtime scheduler only guarantees that a thread gets the CPU time it is promised. CPU time is one of (potentially many) contended resources in a multitasking system. It is so helpful because all tasks will be contending for CPU time and there are few other resources for which this is true (memory bandwidth, particularly on SMP systems, immediately comes to mind).
To be clear: I like realtime schedulers. They are great and simplify many things; the analyses of tasks that only contend for CPU are greatly simplified by them.
> A kernel (because that's where interrupt handlers are located)..
Thought I would pull this one phrase out to demonstrate how we are clearly talking about different systems. For many embedded control systems, I would not say "the interrupt handlers are located in the kernel." On processors that have the distinction, interrupt handlers typically run in Supervisor mode (What the x86 calls "Ring 0"), but it's entirely possible that 100% of the code will run in Supervisor mode. When the kernel provides preemptive scheduling, it may handle a single interrupt (a timer); alternatively the kernel doesn't handle any interrupts, but instead provides an API call that application developer(s) must invoke for entering the scheduler.
Many things that are called an "RTOS" in the embedded world would be called a "threading library" in the desktop world.
I haven't poked into the source code, but usually a QoS event is fired, e.g. "deadline missed". What happens exactly in that case is very application specific.
The same thing that happens when 1+1 == 3, or when a task tries to write to memory that it doesn't have permissions for. The static analysis that your system relies on for correct behavior is no longer valid, so a hardware belt-and-suspender mechanism (a schedule overrun timer interrupt, a lockstep core check failure, or an MPU fault, respectively) resets or otherwise safe-states the failed ECU and safety is assured higher up in the system analysis.
>> Executor runs tasklets, [...] that execute a processing step in a finite amount of time.
> lack of preemption seems hard to reconcile with real time guarantees?
Depends on the requirements and the bounds of the finite time allowed and the enforcement of it. For example, Erlang claims to be soft real-time; it doesn't have preemption, but any function call can result in yielding, and as a functional language there are no loops without calling functions.
And that works because it is a VM that is executing the instructions. The higher level of the VM can switch threads at will. The measure used is the number of reductions: once a set limit is used up, or something more urgent crops up, a context switch will occur. This is functionally indistinguishable from a hardware interrupt for the task that was executing, but because it is all done in software you don't actually have that interrupt. It's more as though every reduction has the potential to interrupt a task, just like you can expect an interrupt anywhere in a regular stream of machine code for some processor.
The nice benefit of the Erlang method is that you can graft this onto a user process without a need for special privilege or exception handling, nothing ever really has to deal with interrupts or the messy aftermath of it, it's much cleaner than an interrupt driven system.
This is so, so often the case. Especially in recent decades, when it is profitable to set up a company that basically does nothing more than write specifications.
> The proposed activity is to evaluate the usage of Rust programming language in space applications,
> The design of the system will be guided to support potential future qualification activities.
> This application will showcase the viability of the developed RTOS and provide input to a Lessons Learned report, describing the encountered issues, potential problem and improvement areas, usage recommendations and proposed way forward.
Looks like it's not intended for real applications, but instead to gain some experience. What better way to ensure that you don't ship the prototype than by doing it on hardware that is similar but different enough to ensure that it won't be used in production.
The rationale is rather simple - SAMV71 is used in critical applications (it has some functional safety certificates), and I have past experience of writing critical software for it at N7.
The choice of microcontroller matters. Functional safety qualified MCUs might have the same CPU core but are built in a way that minimizes interference and common-cause failures between peripherals. The software needs to be written closely following the safety manual of such an MCU to make maximum use of those safety guarantees.
There are several safety critical Cortex implementations around, with features like multi-core lock-step operation that is largely transparent to the RTOS (or whatever) beyond fault handling. There isn't some vast gulf between the ATSAMV71Q21 they've piloted this on and whatever space rated device and requirements you imagine.
While our RTOS in fact doesn't care about it, because it's up to the user to implement the abstraction for the platform, the bigger issue is Rust support for more exotic architectures used in space - for example, SPARC (LEON3), which we'd like to consider for a Rust project, but its compiler support seems relatively bad.
That seems like an insoluble problem: for every fault tolerant SPARC device someone slices off a wafer the ARM crowd ships a truck load of chips. SPARC will never have the mindshare necessary to polish the LLVM/Rust stack to parity.
If by "no standard" you mean that there is no language specification for rust, then there is no standard. However, a language specification is not sufficient to verify program correctness, nor is it required.
A standard may (and the C standard for example does) leave parts of the behavior as "implementation specific" and there's quite a few edge cases - and that's not even talking about "undefined behavior", of which there is plenty. And even in the behavior that is neither implementation specific nor undefined you'll find enough rope to hang yourself (all the beautiful pointers). There's a reason things such as MISRA C exist - effectively a standard on top of a standard.
On the other hand, the rust language - while having no formal spec - is fairly well described, in the form of its RFCs and testsuite. We (the ferrocene team) were able to derive a descriptive specification from the existing description fairly easily. So while there is no ISO standard, and no spec that would be sufficient to write a competing implementation, there is a description of what the language behaves like. You can read up on it at https://spec.ferrocene.dev/
As for verification of correct behavior of such a program, you can employ a host of different techniques depending on what your requirements are - down to verification of the produced bytecode by means of blackbox testing or other.
Because verification does not require a standard. rustc has already been qualified (though not for any aerospace-specific things yet that I'm aware of, but in my understanding the shape is the same) (via Ferrocene[1]), even though there is no Rust Standard. No issues here.
The behavior of the generated binary can be verified against the requirements. Yeah, the most common way to do this is to verify certain properties at the source code level, and then rely on various ways to show equivalence between the source code and the generated assembly, the generated assembly and the generated binary, and the formal semantics of the generated binary and the as-executed semantics on the chosen hardware; but it's perfectly reasonable, and not even particularly unusual, to skip the first equivalence and verify the assembly against the requirements directly.