1 Introduction

Software for unmanned spacecraft differs from most common embedded software in one crucial point: as soon as the mission launches, direct maintenance of the embedded device is no longer possible. Spacecraft software therefore has to be more reliable than common software, not only because of this maintenance problem, but also due to the high costs of a spacecraft mission and the risk of catastrophic consequences following a failure. Furthermore, the operational environment of a spacecraft poses additional challenges to integrated circuits. Due to the lack of a protective magnetic field, the circuits are directly exposed to cosmic rays. This radiation, entering the device, can lead to a series of events in the electronics, called soft errors. These in turn can trigger a multitude of problems in the operative software of the spacecraft, up to the complete loss of the system. The difficult environment on the one hand and the potential consequences of a system failure on the other motivate the implementation of reliable systems for spacecraft.

To achieve a reliable system, system engineers often use space-qualified hardware. This hardware (e.g., the RAD750 from BAE Systems [3]) is radiation hardened and often comes with built-in low-level fault-tolerance mechanisms. Unfortunately, the space qualification comes at a price: the performance is subpar in comparison to modern desktop architectures and standard embedded platforms (e.g., Xilinx Zynq-7000 [28]). This lack of performance in space-qualified hardware is a problem when it comes to processing-intensive applications (e.g., high-resolution observation), multi-application scenarios, or autonomous spacecraft (e.g., rover missions on distant celestial bodies). Therefore, system engineers are increasingly using commercial off-the-shelf (COTS) components to build their spacecraft. This increases the performance while reducing the costs, but is accompanied by a decrease in reliability. Overcoming this trade-off between reliability and performance is a current research interest in the field of space systems engineering.

The German Aerospace Center (DLR) started researching a solution for this topic in 2012, which became the On-Board Computer—Next Generation (OBC-NG) project [19] in 2013 and continued in 2017 with the Scalable On-Board Computing for Space Avionics (ScOSA) project [26]. The result of those two projects was an on-board computer architecture offering reliability combined with performance by mixing radiation-hardened hardware with COTS components and abstracting them via software into a monolithic system. The system architecture consists of independent, interconnected nodes (either reliable components or high-performance COTS components), each running its own operating system, abstracted into a monolithic execution platform by a middleware. The system supports multiple applications running concurrently. The developer divides each application into tasks and channels (containing the states of the application) using the provided interface to the execution platform (called Distributed Tasking). The system designer then maps the tasks and channels to the nodes of the system. The middleware offers features for adjusting the task mapping for different mission phases as well as in case of a node failure; this is called reconfiguration of the system. Additionally, the middleware provides common services well known in the research area of fault detection, isolation and recovery (FDIR). These services include a voter service, which implements Triple Modular Redundancy [18], and a checkpointing service for distributed state restoration after a task or node failure. The middleware is capable of operating on a heterogeneous set of processing nodes (either reliable or high-performance COTS components). This also includes a heterogeneous network architecture consisting of Ethernet, SpaceWire [24], or a combination of both. Furthermore, the middleware supports different operating systems (currently RTEMS and Linux) on the different nodes.

Following the ScOSA project, a successor project called ScOSA Flight Experiment began in January 2020. As the name suggests, this project aims to achieve a higher Technology Readiness Level (TRL) by preparing the in-orbit demonstration of the developed ScOSA on-board computer.

Alongside further improvements of the entire on-board computer, the middleware shall be enhanced for the demonstration. This includes enhancements to increase the robustness of the network stack, further FDIR techniques and services, and a restructuring of the developer API along with the integration of a new version of the tasking execution platform.

Additionally, the project shall contribute to the scientific area of COTS components in space. The focus in that context lies on the fault-tolerance mechanisms of the middleware as well as on the usage of COTS components and their induced risks to the mission in general.

In this paper, we present the ScOSA Flight Experiment, the ScOSA middleware, our technical and scientific objectives for the project, and first results on the overhead of the reconfiguration mechanism of ScOSA when scaled.

The remainder of this paper is structured as follows: In Sect. 2, we present the preliminary work together with related work on distributed middleware. Section 3 gives a brief overview of the ScOSA Flight Experiment project, including its duration, milestones, and goals. Afterwards, in Sect. 3.2, we explain in detail each software component that builds up the middleware. This is followed by an overview of the technical and scientific objectives for the ScOSA middleware (see Sect. 4). The paper ends with the presentation of first experimental results on scalability in Sect. 5 and a conclusion, including an outlook on future work, in Sect. 6.

2 Related work

The ScOSA Flight Experiment presented in this work builds on its predecessor projects, ScOSA [23, 26] and OBC-NG [5, 19]. In addition to the preliminary work of DLR, some related work in the field of fault-tolerance for spacecraft OBCs exists.

The research for fault-tolerant and distributed OBCs for spacecraft goes back several decades [21]. Often, a reliable OBC is implemented by cost-intensive hardware redundancy concepts, like triple modular redundancy [18].

More recently, system designers tend to implement fault-tolerance more and more in software, especially in the middleware. Many of these middlewares rely on the CORBA standard [20] for distributed systems, for example FLARe [4], which was developed in the context of the Lw-FT-RT-CORBA standardization effort. FLARe aims to manage (soft) real-time tasks and fault-tolerance together by means of a decentralized resource monitor.

Another CORBA-based middleware enabling fault-tolerance is MicroQoSCORBA (MQC) [8]. It provides typical fault-tolerance mechanisms, like checksums, redundancy, and logical time stamping, at the application level.

Besides the general interest in fault-tolerant middleware for distributed embedded systems, this topic became increasingly relevant for the space domain in recent years. Afonso et al., for example, developed a framework upon a real-time embedded system for spacecraft [1]. The goal of the framework is to provide applications with fault-tolerance mechanisms like Recovery Blocks (RB), Distributed Recovery Blocks (DRB), TMR, and N-version programming (NVP).

In [10], Fayyaz et al. also present a distributed and fault-tolerant middleware for spacecraft. The authors implement their middleware by instantiating a specially programmed logic block (called Adaptive Middleware for Fault-Tolerance, AMFT, block) on each computing unit. The AMFT block constantly monitors its computing unit and communicates with all other AMFT blocks in the system. When a computing unit failure is recognized, the AMFT block informs all other blocks of this failure. A designated master computing unit then redistributes all tasks to the remaining processors in the system.

Similar to the work above and to our own, NASA developed their own architecture, called High Performance Dependable Multiprocessor [30]. The architecture consists of one or two reliable processors, acting as system controllers, and up to N independent COTS Field Programmable Gate Arrays (FPGAs). Similar to our approach, the authors implemented a fault-tolerant middleware. Their middleware differs from our approach in how it reconfigures after a node failure: its Job Manager distributes the tasks among the available nodes online according to a load-balancing strategy, whereas our approach features pre-compiled, static configurations for node failures.

In [9], Dubey et al. designed a custom full-stack solution platform for fractionated spacecraft. This includes the operating system, a communication middleware, and a distributed fault manager.

Hecht et al. developed an adaptive fault-tolerant middleware for deep space missions in [12]. The adaptivity is derived from the ability to switch the fault tolerance mechanisms depending on the current environmental conditions and the available resources.

The aforementioned work on fault-tolerant middleware for spacecraft provides many important features, such as fault-tolerance mechanisms for the application developer, a heterogeneous architecture (with reliable and programmable nodes), a reconfiguration mechanism, a distributed fault manager, and adaptivity. Nevertheless, to the best of our knowledge, there is no work yet that combines these features and supports a system consisting of reliable and high-performance COTS components.

3 The ScOSA flight experiment

Fig. 1: DLR's last CompactSat: Eu:CROPIS [7, 16] (Image: DLR, CC-BY 3.0)

The ScOSA Flight Experiment project aims to increase the TRL of the ScOSA OBC by means of an in-orbit demonstration. The in-orbit demonstration is planned to take place on DLR's next compact satellite mission (see Fig. 1 for an image of DLR's last compact satellite). The ScOSA OBC will be integrated into the compact satellite as a non-mission-critical secondary payload. The launch of the mission is planned for 2024.

3.1 ScOSA architecture

ScOSA is a new OBC architecture providing fault-tolerance, high performance, scalability, heterogeneity, and distributed execution of applications. It consists of a combination of reliable computing nodes (RCNs) and high-performance nodes (HPNs; in our case based on Zynq-7000 devices [28]), connected by SpaceWire [24] or Ethernet (see Fig. 2). Furthermore, sensors and actuators are connected to the system by so-called interface nodes, which act as interfaces to the peripherals for the RCNs and HPNs. Usually, several SpaceWire routers coordinate the communication between all nodes, but directly linked nodes are also possible. To recognize node failures, the system implements a hierarchy between nodes and assigns each of them one of three roles: coordinator, observer, or worker. At each point in time, the system has exactly one coordinator node, which is responsible for monitoring all other nodes (by requesting heartbeat messages). All other nodes become workers. Additionally, one or two nodes become observer nodes, monitoring the coordinator.

Fig. 2: Example of ScOSA's system architecture: Two reliable nodes (RCNs) connect the system to the telecommand and telemetry units. The RCNs are connected by SpaceWire routers to N high-performance nodes and to the interface nodes, which connect the sensors and actuators

3.2 Middleware architecture

In the following, we explain the middleware and its features by means of the components that comprise it. Additionally, we explain the underlying thread model of the middleware. The middleware consists of three main components:

  • a distributed execution platform, called Distributed Tasking

  • a set of management services enabling the main FDIR features of the middleware

  • a network stack for reliable messaging via SpaceWire or Ethernet

These components are arranged in the layered architecture of the software stack (see Fig. 3) and are explained further in the following subsections, beginning at the bottom with the network protocol.

Fig. 3: ScOSA's layered software stack. The middleware consists of the three blue modules: System Management Services, Distributed Tasking Framework, and the SpaceWire-IPC protocol. The system management in turn consists of several services providing fault-tolerance features to the system and the applications

3.2.1 SpaceWire-IPC

SpaceWire-IPC can be described as part of the transport layer (Layer 4 in the ISO/OSI model) of the ScOSA system. The main purpose of this layer is to extend the SpaceWire communication standard with means of inter-process communication (IPC) among nodes and with management information to and from the coordinator node. The central paradigms of SpaceWire-IPC are reliable messages, error detection, and error handling. It provides guaranteed delivery services and a timeout mechanism for reliable message transmissions. Even though its name suggests SpaceWire as the transport medium, it is designed to support Ethernet, too.
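To illustrate the retransmission behavior, the following C++ sketch outlines the send-with-acknowledgment pattern described above. All types, names, and constants are illustrative assumptions, not the actual SpaceWire-IPC API.

```cpp
// Hypothetical sketch of reliable delivery with timeout and retransmission,
// in the spirit of SpaceWire-IPC (all identifiers are illustrative).
#include <chrono>
#include <cstdint>
#include <vector>

struct Message {
    std::uint16_t sequenceNumber;        // matches acknowledgments to messages
    std::uint8_t destinationNode;
    std::vector<std::uint8_t> payload;
};

class ReliableLink {
public:
    // Returns true once the peer has acknowledged the message, false if all
    // retries timed out; the caller can then escalate to error handling.
    bool sendReliable(const Message& msg) {
        for (int attempt = 0; attempt < kMaxRetries; ++attempt) {
            transmit(msg);                                    // SpaceWire or Ethernet
            if (waitForAck(msg.sequenceNumber, kAckTimeout))  // blocking wait
                return true;                                  // delivery confirmed
        }
        return false;                                         // report link failure
    }

private:
    static constexpr int kMaxRetries = 3;
    static constexpr std::chrono::milliseconds kAckTimeout{50};

    void transmit(const Message&) { /* hand the frame to the link driver */ }
    bool waitForAck(std::uint16_t, std::chrono::milliseconds) {
        /* block on the receive queue until the matching ack or the timeout */
        return false;
    }
};
```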

3.2.2 Distributed tasking

The ScOSA middleware builds on the Tasking Framework [11], an open-source multi-threading execution platform and software development framework.

Using the Tasking Framework, applications are implemented as a graph of tasks connected via channels; each task has one or more inputs that connect it with the channels. However, the Tasking Framework has no direct support for distributed systems, in which some tasks and channels may be mapped to one node while others are mapped to another. Therefore, the ScOSA middleware extends the Tasking Framework to the Distributed Tasking Framework.
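As a brief illustration of this programming model, consider the following C++ sketch. It mimics the style of the Tasking Framework but uses simplified, hypothetical class names rather than the framework's verbatim API.

```cpp
// Simplified sketch of a task graph: a task connected to two channels.
// The real Tasking Framework API differs in detail; names are illustrative.
#include <cstdio>

struct FrameChannel {          // channels hold the application state
    int frameId = 0;
};

struct ProcessingTask {        // tasks implement the processing steps
    FrameChannel* input;       // inputs connect a task with its channels
    FrameChannel* output;

    // Invoked by an executor thread once data was pushed into the input.
    void execute() {
        output->frameId = input->frameId + 1;   // placeholder processing
        std::printf("processed frame %d\n", output->frameId);
    }
};
```

In the Distributed Tasking Framework, tasks and channels additionally carry a node mapping, so that, conceptually, an activation crossing a node boundary travels as a SpaceWire-IPC message instead of a local call.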

3.2.3 System management services

The system management services implement the FDIR techniques of the middleware. It is important to understand that these services act as internal providers of functionality for the middleware itself (e.g., the reconfiguration service) or for the application developer (e.g., the voter service). They are not meant to provide services to the outside; such in-orbit services can be developed using the ScOSA middleware. The services are instantiated as threads and connected to the network stack in order to communicate with their counterpart instances on other nodes. Additionally, this software component is responsible for storing the current configuration parameters and settings of the system, as well as the nodes' states and roles (coordinator, observer, worker). In the following, we introduce the most important services in more detail.

Reconfiguration service This service is responsible for the reconfiguration of the task-to-node mapping (a.k.a. configuration) in case of a planned mission phase transition or a node failure. It maintains all configurations, which were created at design time by the system designer. Figure 4 shows a reconfiguration tree for processing node failures. Each node of the tree represents a configuration. The edges of the tree point to the configurations to be selected when a specific processing node \(P_i\) fails. Such configuration transitions are reported to the monitoring service on each node, which immediately shuts down all tasks and channels on its node. Afterwards, it starts the tasks and channels listed for its node in the new configuration.

Fig. 4: This example shows several configurations (mappings of tasks to processing nodes), denoted C1–C10. At the beginning, three processing nodes P1–P3 are available. If one of these processing nodes fails, a different configuration (without the failed node) has to be selected. The edges from a configuration show which configuration has to be chosen. \({\backsim }P_i\) depicts the failure of processing node \(P_i\)
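A minimal C++ sketch of such a reconfiguration tree lookup is shown below; the data layout is an assumption for illustration, not the actual ScOSA implementation.

```cpp
// Each configuration stores which configuration succeeds it when a given
// processing node fails (cf. the edges in Fig. 4). Names are illustrative.
#include <map>
#include <optional>

using NodeId = int;
using ConfigId = int;

struct Configuration {
    ConfigId id;
    std::map<NodeId, ConfigId> onFailure;  // edge: failed node -> successor
};

// Returns the configuration to apply after `failedNode` failed, or nothing
// if no pre-compiled configuration exists (the system then enters safe
// mode, see Sect. 4.2).
std::optional<ConfigId> nextConfiguration(const Configuration& current,
                                          NodeId failedNode) {
    const auto it = current.onFailure.find(failedNode);
    if (it == current.onFailure.end())
        return std::nullopt;               // unforeseen node state scenario
    return it->second;
}
```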

Monitoring service The monitoring service is one of the key services for the reconfiguration feature of the middleware. It uses a heartbeat mechanism to monitor the state of the nodes in a distributed way. Its exact function depends on the role of the node: coordinator, observer, or worker. Coordinator and observer nodes have additional responsibilities as part of the monitoring service, while worker nodes are only responsible for processing application tasks. Multiple nodes can have the observer and worker roles, but exactly one coordinator node exists at any time. Additionally, the observer nodes maintain a hierarchical order among themselves, such that there is an observer 1, an observer 2, and so on. The monitoring service of the coordinator node periodically sends messages, called heartbeat requests, to the monitoring services on all other nodes. A receiving monitoring service on a worker or observer node reacts with an acknowledgment message to this heartbeat request. When receiving the acknowledgment message, the coordinator node knows that the particular node is still alive and responding. If it does not receive an acknowledgment from a node after a certain timeout and a resend of the heartbeat request, the coordinator node assumes that this node has failed. The monitoring service then reports that failure to the reconfiguration service, which initiates a reconfiguration of the tasks to maintain the operational state of the system. To avoid the coordinator node becoming the single point of failure of the system, the first observer periodically requests a heartbeat from the coordinator node, too. The second observer node in turn sends heartbeat requests to the coordinator and the first observer. This scheme continues with every further observer in the system. How many observers a system implements is up to the system designer.
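The coordinator-side logic can be summarized by the following C++ sketch; the identifiers and the bookkeeping are assumptions made for illustration, not the actual monitoring service.

```cpp
// One heartbeat round of the coordinator: a node that answered neither the
// request nor its resend is reported as failed. Names are illustrative.
#include <cstdio>
#include <vector>

struct MonitoredNode {
    int id;
    bool ackReceived = false;  // set asynchronously when the ack arrives
    int missedRequests = 0;    // counts unanswered heartbeat requests
};

void heartbeatRound(std::vector<MonitoredNode>& nodes) {
    for (auto& node : nodes) {
        if (!node.ackReceived && ++node.missedRequests > 1) {
            // timeout plus one resend expired: assume a node failure
            std::printf("node %d failed\n", node.id);
            // reconfigurationService.reportFailure(node.id);
        } else if (node.ackReceived) {
            node.missedRequests = 0;          // node is alive and responding
        }
        node.ackReceived = false;
        // monitoring.sendHeartbeatRequest(node.id);
    }
}
```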

Voting service The voting service implements the concept of triple modular redundancy (TMR), which is often used in the aerospace domain. With TMR, the same processing operation is either executed three times sequentially on a single node or three times concurrently on separate nodes. Afterwards, the results of the three operations are compared and the majority result is forwarded. The voting service implements the comparison and forwarding part, with the option to inform another service or application in case of any disagreement between the three results.
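A majority vote over three results can be expressed compactly; the following C++ template is an illustrative sketch in the spirit of the voting service, not its actual interface.

```cpp
// Forwards the majority result of three redundant computations; returns
// nothing if all three disagree, so that an alert can be raised.
#include <optional>

template <typename T>
std::optional<T> voteTMR(const T& a, const T& b, const T& c,
                         bool* disagreement = nullptr) {
    if (disagreement)
        *disagreement = !(a == b && b == c);  // any mismatch can be reported
    if (a == b || a == c) return a;           // a is part of the majority
    if (b == c) return b;                     // a is the outlier
    return std::nullopt;                      // total disagreement
}
```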

3.3 Thread model

The ScOSA middleware is implemented with a multi-threaded approach to utilize the capabilities of modern processor architectures. Additionally, extending the middleware with more services is easier when the functionality can be encapsulated into threads. The middleware in its current state invokes eight threads (see Fig. 5). While most of the threads are activated sporadically, the checkpointing thread and the monitoring thread can be configured with a specific period.

Note that the number of Tasking Framework executor threads can be specified by the application developers and, therefore, varies from scenario to scenario.

Fig. 5: The ScOSA middleware threads on one node. SMS: System Management Services, TF: Tasking Framework. All nodes have the same number of threads; however, the functionalities executed by the threads depend on the node role, i.e., coordinator, observer, or worker. The Tasking Framework has at least one executor thread

4 Technical and scientific objectives

In the context of the in-orbit demonstration of the ScOSA OBC, we intend to enhance the fault-tolerant middleware and increase its stability by updating its components. The launch of the targeted compact satellite mission is planned for 2024. The ScOSA Flight Experiment project started in January 2020 and is planned to finish at the end of 2022. During those three years, we plan to pursue technical objectives for the ScOSA system software stack and to contribute to the scientific areas of fault-tolerance and COTS systems in space. In the following, we present an excerpt of the technical and scientific objectives:

4.1 New FDIR technique

As an enhancement, we plan to design, implement, and evaluate a new service for application developers, which borrows from the transaction management of databases and from transactional memory in processors (e.g., Intel's Transactional Synchronization Extensions [13]). The idea is to process an operation and store its changes to the data in a special buffer, while a duplicate of the operation is executed time-shifted to the first one, operating on the same original data and storing its results only temporarily. When the duplicate reaches a defined control value, it compares its calculated value with the one from the first process. If the values are the same, the buffered changes are applied to the real data set. If the values differ, a rollback is initiated, which means that the data change buffer is emptied and the operation is rewound to the beginning. If the values still differ after several attempts, an alert is triggered. This approach is similar to the TMR concept and the voting service, with the difference that it leaves the actual data set untouched until the operation has been confirmed.
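A possible shape of this mechanism is sketched below in C++. The write buffering, the control-value comparison, and the retry loop follow the description above; all names are hypothetical, since the service is still being designed.

```cpp
// Sketch: an operation runs twice on the same original data; its writes are
// staged in a buffer and committed only if both runs agree on the control
// value. Otherwise, the buffer is discarded and the operation is retried.
#include <functional>
#include <map>

struct Transaction {
    const std::map<int, double>& data;   // the untouched real data set
    std::map<int, double> writeBuffer;   // staged changes

    double read(int key) const {
        const auto it = writeBuffer.find(key);
        return it != writeBuffer.end() ? it->second : data.at(key);
    }
    void write(int key, double value) { writeBuffer[key] = value; }
};

bool runChecked(std::map<int, double>& data,
                const std::function<double(Transaction&)>& operation,
                int maxAttempts = 3) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        Transaction primary{data}, duplicate{data};  // duplicate runs time-shifted
        if (operation(primary) == operation(duplicate)) {  // control values match
            for (const auto& [key, value] : primary.writeBuffer)
                data[key] = value;                   // commit staged changes
            return true;
        }                                            // mismatch: rollback and retry
    }
    return false;                                    // persistent mismatch: alert
}
```

Here, `operation` returns its control value, e.g., `runChecked(state, [](Transaction& t) { t.write(0, t.read(0) + 1.0); return t.read(0); });`.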

4.2 Configuration modeling

One of the key features of ScOSA is the ability to apply new task configurations in case of a node failure. ScOSA uses a static approach for the reconfiguration, which means that all configurations are pre-compiled by the designer at design time. In case of a node failure during runtime, ScOSA searches for the appropriate configuration and applies it. If it cannot find a pre-compiled configuration for this particular combination of available nodes (node state scenario), it switches into the safe mode. The safe mode is a special operating mode in which the spacecraft suspends all payloads and waits for ground interaction to resolve the failure. To avoid excessive ground interaction, it is therefore reasonable to provide a configuration for every possible node state scenario. Unfortunately, the number of node state scenarios grows exponentially with the number of processors |P| in the system and is equal to \(2^{|P|}\). For example, 8 processors already lead to 256 node state scenarios and necessary configurations. In that case, creating the configurations by hand is no longer an option. Therefore, we are providing a modeling tool which generates the configurations automatically. To provide good task-to-node mappings for the scenarios, the tool shall optimize two criteria: the utilization of every processor and the network traffic between the processors. Finding optimized configurations for the possible node state scenarios is an NP-complete problem; therefore, two different solvers are developed: an exact optimization solver for small-sized scenarios and a genetic algorithm for larger scenarios. These solvers will then be used to find optimal configurations.
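The exponential growth of scenarios can be made concrete with a small C++ sketch that enumerates all node state scenarios for which the tool must generate a configuration; the solver call is a placeholder.

```cpp
// Enumerates the 2^|P| node state scenarios for |P| = 8 processors.
#include <bitset>
#include <cstdio>

int main() {
    constexpr unsigned P = 8;                    // number of processors
    unsigned generated = 0;
    for (unsigned s = 0; s < (1u << P); ++s) {   // 2^8 = 256 scenarios
        const std::bitset<P> alive(s);           // bit i set: P_i available
        if (alive.none()) continue;              // no node left: safe mode only
        // configurations[s] = solve(alive);     // exact solver or GA (see text)
        ++generated;
    }
    std::printf("%u configurations to generate\n", generated);  // prints 255
    return 0;
}
```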

4.3 Overhead of the reconfiguration system

Reconfigurable systems contribute to space systems in two ways. On the one hand, a reconfigurable system is more fault-tolerant than a statically configured system. On the other hand, a reconfigurable system can support several mission phases by reutilizing the hardware resources. This benefit comes at a cost: the additional processing time for reacting to a failure or phase transition and the processing time for coordinating the reconfiguration. Due to the higher complexity of these algorithms, the validation and verification of such space software becomes more complex, too. To broaden the acceptance of reconfigurable systems in space, we strive to provide a reconfiguration module that generates acceptable overhead and scales with the system within a reasonable factor. For that reason, we will conduct a methodological runtime analysis of the failure recognition and the reconfiguration procedure of the ScOSA system.

4.4 COTS in space

One of the driving arguments for space-qualified processors in missions is their resistance against radiation-induced soft errors and failures, whereas COTS components are known to be vulnerable. For this reason, the use of COTS components is often safeguarded by redundancy. To determine the redundancy needed for a system, it is important to know the in-space failure rate of the COTS components used. The ScOSA Flight Experiment will use the Zynq-7000 SoC [28] for its high-performance nodes. This COTS component is frequently used in space missions nowadays [14, 15, 22], but only a few evaluations of its failure rates in space have been conducted so far [27]. Instead, there exist several studies of failure rates in ground-based particle accelerators [2, 6, 17, 25, 29]. We would like to contribute to this research field by providing failure rates for the Zynq-7000 SoC from the flight experiment in space. For this purpose, we will design and implement an application which measures the failure rates and sends them via telemetry to ground. The measurements of the failure rates should correspond to those used in the former study [2] in order to gain additional insights into the difference from the results obtained in particle accelerators.

5 Measurements of scalability

One of the key features of ScOSA is its scalability. A scalable system can operate on a small amount of resources as well as on a large amount. In the context of a middleware for spacecraft OBCs, this means that the middleware can operate small spacecraft OBCs with, e.g., only two computing nodes, as well as systems consisting of a large number of interconnected computing nodes.

When introducing more and more computing nodes into a system operated by a middleware, the middleware needs to deal with more participants. This in turn might lead to an increased run-time of the key components of the middleware, such as the task mapping or the reconfiguration. In addition to the increased run-time, the network experiences a higher load caused by the additional participants. More messages are sent and processed by the recipients, which leads to an increased execution time in the network stack. These effects on the execution time of the middleware inevitably affect the execution time of the payload applications. As a result, some of those applications may no longer be able to meet their deadlines. This limits the scalability of the system.

5.1 Experiment setup

Fig. 6: A self-sustaining task cycle. The experiment application is composed of 20 such task cycles

To estimate the effects on the system when scaling ScOSA, we conducted experiments with different numbers of nodes, executed virtually on a desktop computer. The scenarios ranged from two to 20 interconnected nodes. We expected the scaling to affect features of the middleware, for example the reconfiguration, which in turn affects the tasks executed by the middleware on the nodes. Therefore, we triggered a reconfiguration by causing a single node-stop failure in each of the scenarios and measured the time needed for reconfiguring all tasks on the remaining nodes as well as the amount of network traffic caused by the reconfiguration mechanism.

We also configured the experiments to execute a simple payload application inside the ScOSA middleware. In all experimental runs, the payload application consists of 20 independent pairs of tasks (one called "Ping" and one called "Pong"). Together with two channels, each pair of tasks forms a self-sustaining trigger cycle (see Fig. 6). To put a heavy load on the network, the tasks of each ping-pong pair were assigned to different nodes such that the workload is balanced (see Fig. 7); a sketch of one such cycle follows after the figure.

Fig. 7: The tasks are evenly distributed among the nodes of the five-node experiment
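The structure of one ping-pong cycle is sketched below in the style of the earlier Distributed Tasking example; this is a condensed illustration, not the verbatim experiment code.

```cpp
// One ping-pong pair: each task increments a counter and reactivates its
// partner through the opposite channel, forming a self-sustaining cycle.
struct CounterChannel { int value = 0; };

struct PingTask {
    CounterChannel* in;    // written by Pong, triggers Ping
    CounterChannel* out;   // written by Ping, triggers Pong
    void execute() { out->value = in->value + 1; /* push(out) activates Pong */ }
};

struct PongTask {
    CounterChannel* in;
    CounterChannel* out;
    void execute() { out->value = in->value + 1; /* push(out) activates Ping */ }
};
// Mapping Ping and Pong of a pair to different nodes forces every activation
// across the network, producing the intended load.
```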

To compute the network traffic, the data size of each packet sent by the reconfiguration mechanism was measured. We define the reconfiguration time as the time from the recognition of a node-stop failure by the coordinator node until all remaining nodes have applied the new configuration. To mitigate timing errors caused by the operating system hosting our virtual nodes, we repeated each experiment five times and calculated the arithmetic mean of those runs. The variances of the different runs were small enough to be neglected; thus, we only present the mean values here.

5.2 Network traffic

The network traffic of the reconfiguration of the 19 different experiment runs (see Fig. 8) shows a clear linear growth.

Fig. 8: The network traffic caused by the reconfiguration mechanism in the different experiment scenarios (node count = 2...20)

This is an expected result as the coordinator node informs all the other nodes in the system about the change in the configuration. Subsequently, the nodes inform the coordinator node about the applied change. Hence, as the number of nodes increases, the network traffic rises as well.
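This observation can be captured by a simple first-order message-count model (an illustrative approximation, not a fitted result):

\[ m(N) = 2\,(N-1), \qquad t(N) \approx 2\,(N-1)\,\bar{s}, \]

where \(m(N)\) is the number of reconfiguration messages for \(N\) nodes and \(\bar{s}\) the mean message size, yielding the observed linear growth in \(N\).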

5.3 Reconfiguration time

Running all the experiments shows a linear but unsteady growth of the measured reconfiguration time (see Fig. 9), with a three times higher reconfiguration time in the largest network (20 nodes) compared to the smallest (2 nodes). The anomalies might be attributable to scheduling effects of the desktop operating system on which the virtual nodes were executed. The experiments show that the ScOSA middleware is capable of scaling from 2 to 20 nodes with a linear growth in network traffic as well as in reconfiguration time. This shows, on the one hand, that the reconfiguration mechanism is generally capable of handling a scaled number of nodes and, on the other hand, that its overhead scales at a modest, linear rate. The next step is to repeat the experiments on a distributed embedded system.

Fig. 9: The reconfiguration time of the 19 different experiments (node count = 2...20)

6 Summary and outlook

Developing a reliable OBC for spacecraft that is also able to process complex algorithms fast remains a major challenge. In this paper, we presented the middleware of ScOSA, which meets this challenge by abstracting a distributed and heterogeneous architecture consisting of high-performance nodes combined with reliable nodes.

We gave an overview of the middleware and its abilities and explained the different software components comprising it. The middleware originates from two DLR projects concerning high-performance, reliable OBCs: OBC-NG and ScOSA. In the successor project ScOSA Flight Experiment, which we briefly presented, the middleware will be further extended and finally demonstrated in orbit. Before the middleware becomes flight-ready, we intend to pursue several technical and scientific goals. We introduced these goals and provided first ideas for their realization. Besides that, we presented the results of a first virtual scaling experiment, which showed that the overhead of one of the middleware's features scales with a linear factor. The in-orbit demonstration, planned as a secondary payload on the next compact satellite in 2024, will show that future OBCs can consist of a combination of reliable nodes and high-performance nodes to cope with the challenging requirements of future spacecraft missions.