FPGA implementation of hardware accelerated RTOS based on real-time event handling

Current trends in the real-time systems field consist of a migration towards complex central processing unit (CPU) architectures with enhanced execution predictability and rapid CPU context switching, thus yielding high-performance control systems. The main objective of this paper is to present the results obtained from implementing real-time operating system (RTOS) functions in hardware. Based on the CPU resource multiplication concept, current research has focused on synthesizing and implementing innovative solutions in field-programmable gate arrays (FPGAs) to improve RTOS performance. The results are materialized by validating an efficient hardware scheduler micro-architecture, which yields gains in efficiency, performance and predictability. The experimental results, the FPGA resource requirements for implementing the processor in different configurations, and comparisons with similar processor architectures are presented in order to verify the theoretical aspects proposed in this paper.


Introduction
Due to the complexity of real-time applications and the very short response times required in the industrial, medical and automotive fields, the design and development of hardware systems with increased computing power became necessary; more convenient management of time was thus enabled. Nevertheless, computing power is not a fundamental feature of real-time systems (RTS), as it is an abstract term that depends on the characteristics of the process for which the real-time system is used. Therefore, RTS are those systems that provide a valid response within the worst-case execution time (WCET) limits imposed by the deadlines of the tasks associated with the controlled process. In traditional approaches, a processing system can be seen as a sequential machine for controlling an industrial process. Most programming languages require the programmer to specify algorithms as instruction sequences. Processors run programs by fetching machine instructions one at a time in a given sequence. Each instruction is executed as a sequence of operations, such as fetching the instruction, fetching the operands, performing the arithmetic, logical or load/store operation, and storing the result. In this context, hazard situations and the operations for saving and restoring task contexts must be taken into account.
Affordable FPGA devices [1], with a large number of logic gates [2,3], can be used as hardware support for implementing and testing new CPU architectures. To serve as a flexible platform for development and implementation, FPGA circuit resources are varied and powerful. However, an FPGA design cannot be clocked at speeds comparable to those of a dedicated circuit. By comparison, an application-specific integrated circuit (ASIC) can reach speeds of more than 4 GHz, while an FPGA runs in very good conditions at only 450 MHz. Although FPGAs consume much more power than ASICs, they have a major advantage: they are suitable for small and medium-sized low-cost implementations, with theoretically unlimited reconfiguration.
In a real-time application, two types of processes are usually executed, namely tasks and interrupt service routines (ISRs). Execution allocation refers to the distribution of the processes over the available active resources (processing units), while scheduling establishes the execution sequence on each active resource. The main advantage of using an RTOS is that it provides task synchronisation mechanisms and scheduling. The ISRs are served through the interrupt system and have priorities, managed by the interrupt controller, that are usually higher than the priorities of the tasks. It should be noted that, although the scheduler does not manage ISRs directly, any ISR execution requires the assignment of an active computing resource and affects the task execution sequence.
In this paper we aim to describe and validate the proposed hardware accelerated RTOS (HW_nMPRA_RTOS) based on multiple pipeline registers and a custom scheduler implementation. The main contribution of this paper is the novel HW_nMPRA_RTOS concept, which includes a multiple event handling module as hardware support for real-time scheduling techniques. Experimental results show the feasibility of the proposed hardware real-time scheduler integrated into the MIPS32 coprocessor 2 (COP2), using Verilog HDL. The MIPS32 architecture is based on the MIPS II ISA, enhanced with selected instructions from MIPS III, MIPS IV, and MIPS V. Thus, there is additional support for adding user-defined instructions (UDIs), custom coprocessors specific to particular applications, and application-specific extensions (ASEs). Regarding the features of MIPS, ARM, RISC-V and other processors, a distinction must be made between the architecture and the hardware implementation of that architecture. Architecture refers to the instruction set, registers, exceptions, memory management, virtual or physical addressing mode, as well as other features that the hardware implements and executes. Implementation refers to the way and the techniques by which the processor implements a specific architecture.
For the implementation and validation of the theoretical elements proposed in this paper, we used the FPGA platform based on the Virtex-7 development kit, the MIPS32 architecture, Verilog HDL and the integrated simulator of the Vivado Design Suite by Xilinx, Inc. This paper is structured as follows: the first section contains the introduction, and Sect. 2 describes similar papers in the field of RTS with RTOS implemented in hardware-software. Section 3 describes the hardware accelerated RTOS based on the real-time event handling module, and Sect. 4 presents the validation results. Sections 5 and 6 contain the discussions and application areas of the validated architecture. Section 7 concludes the paper with a brief outlook on future work.

Related work
Current trends in RTS design with hardware schedulers are based on custom coprocessors for running scheduling algorithms. Generically, in a processor-coprocessor architecture, the scheduling operation sequence is executed on the external coprocessor, with context switching performed on the main processor after the ID of the next task has been transmitted in advance via the data bus. The context remapping sequence is triggered by interrupts signalled on the input pins of the processor. The major advantage of such an approach is that it offers some flexibility in choosing a general-purpose processor, and the scheduler's hardware primitives on the coprocessor can be accessed through an application programming interface (API). The idea of using an external hardware scheduler is widespread. For example, priority queues can be implemented in hardware while the software layer is responsible for executing the scheduling algorithm. Priority queues have likewise been used to implement hardware schedulers in multiprocessor systems.
The concept of hybrid operating systems refers to the idea that the RTOS is not fully implemented in hardware. It is partially implemented in software, and only certain modules are designed in hardware to increase overall OS performance. This offers great versatility, because the hardware part of an RTOS can easily be added to a processor architecture in the form of an IP core, and the entire architecture is then included in an FPGA. This model uses software to switch task contexts, so it optimizes only the time allocated to the scheduler.
Within the current research on task scheduling in single-core and multi-core processors, the published articles present the following task schedulers implemented in ASIC or FPGA: High-Performance Real-Time Hardware Scheduler (HRHS) [4,5]; Earliest Deadline First (EDF) scheduler for quad-core CPUs [6] and EDF scheduler with support for periodic tasks and inter-task synchronization [7,8]; Guaranteed Earliest Deadline (GED) scheduler for soft real-time tasks [9]; Robust Earliest Deadline (RED) scheduler for mixed-criticality systems [10]; Simple and Effective hardware based Real-Time Operating System (SEOS) [11]. In [4] the authors propose a distributed, online and time-predictable hardware scheduling solution. This concept is suitable for multi-core systems, as it splits the main scheduler into uniform partial schedulers to achieve a significant gain in performance and scalability. Paper [5] presents an efficient hardware scheduler for scheduling dependent tasks in real-time multi-core systems, in which the operating system selects the tasks that can be scheduled with the EDF algorithm. In [6], an efficient hardware architecture of an EDF-based task scheduler is described, which is suitable for hard real-time systems due to the constant response time of the scheduler. The results of ASIC (28 nm) and FPGA synthesis are presented and compared. More than 86% of the chip area and 93% of the total power consumption can be saved if the Heap Queue architecture is used in hardware implementations of the EDF algorithm.
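The EDF policy referred to above can be illustrated with a short software sketch. It is not the hardware Heap Queue of the cited works; it is only a behavioural model written in Python, assuming absolute deadlines expressed as integers. The class name EdfQueue and its methods are hypothetical.

```python
import heapq

class EdfQueue:
    """Behavioural model of an EDF ready queue (min-heap keyed on deadline)."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: equal deadlines leave in insertion order

    def add_task(self, task_id, deadline):
        # Insert a ready task together with its absolute deadline.
        heapq.heappush(self._heap, (deadline, self._seq, task_id))
        self._seq += 1

    def pop_next(self):
        # Return the ID of the task with the earliest deadline, or None.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = EdfQueue()
q.add_task("A", 30)
q.add_task("B", 10)
q.add_task("C", 20)
order = [q.pop_next() for _ in range(3)]
```

In this sketch the task with the smallest deadline value always leaves the queue first, which is the ordering that a hardware priority queue for EDF has to provide.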
In [7-10], Lukáš Kohútka presents the results of several studies proposing an FPGA-embedded task scheduler model supporting periodic and aperiodic hard real-time tasks. Paper [8] contains results obtained by FPGA synthesis performed for various parameters using the Intel FPGA Cyclone V device. In [9] a coprocessor design is presented that performs task scheduling for soft real-time systems, based on the GED algorithm. In [10] the authors describe a new ASIC design of a coprocessor that performs process scheduling for real-time embedded systems with mixed criticality. In [11], experimental results show that SEOS has large performance advantages over a software-based RTOS. This is because the proposed design provides high adaptability, which eases the hardware adaptation of the RTOS.
Existing architectures of priority queues that can be used for sorting tasks in schedulers have been presented and validated in practice. A priority encoder with multiple first-in first-out lists, a shift register, and a systolic array are presented in [12], and [13] proposes the systolic priority queue architecture. In [14] the authors validate the Rocket Queue concept based on shift registers, systolic arrays, and the heapsort algorithm, and paper [15] proposes the Heap Queue architecture using dual-port RAMs optimized for low chip area costs. In [16] the authors propose MultiQueue, a set of multiple priority queues that can be implemented in FPGA or ASIC. The major sources of unpredictability in multi-core processors and the memory hierarchy are analysed in [17], and [18] presents how predictable asynchronous inter/intra-core communication between tasks can be realized. As noted in the real-time research focused on the scratchpad operating system [17], multi-core processors are widely used in industry. It should be noted that their increased complexity often causes a loss of timing predictability, which is a key requirement for hard real-time systems. Major sources of unpredictability are shared resources such as the memory hierarchy, the I/O subsystem, asynchronous event handling, and the impact of ISRs on the scheduling algorithm. To validate the design, the OS proposed in [18] was implemented on the commercial MPC5777M platform. Paper [19] investigates scheduling methods for executing aperiodic tasks with firm and soft deadlines in RTS, and [20] shows that the proposed scheduler provides a worst-case usage that makes it effective for dealing with both hard and soft real-time tasks.
The experimental results validate predictable timing behaviour for hard real-time tasks and provide a performance gain of up to 2.1 compared to traditional approaches. Paper [19] investigates scheduling methods for two-stage real-time systems executing aperiodic tasks with firm and soft deadlines. Such systems are often used to capture task execution in reactive systems, where the first stage is responsible for detecting and pre-processing external events that occur irregularly, and the second stage takes certain actions to react to the identified situations. Practical results highlight that the algorithm in [20] provides a worst-case usage similar to partitioned EDF for hard real-time tasks and an empirically related delay (such as global EDF) for soft real-time tasks. It can be verified that the proposed scheduler is effective for dealing with both hard and soft real-time tasks. The vector processor design proposed in [21], Vicuna, has been specifically tailored to meet the needs of real-time applications and exhibits predictable and repeatable timing behaviour. This allows for efficient and accurate worst-case runtime analysis while preserving the performance and efficiency typically observed in other vector processors. The authors demonstrate the predictability, scalability, and performance of the proposed architecture by running a set of benchmark applications on multiple Vicuna configurations synthesized on a Xilinx 7 Series FPGA with a peak performance of over 10 billion 8-bit operations per second.
In [22] the authors propose a mixed hardware-software scheduler architecture in which the hardware scheduler was designed to manage timer-system events and to execute a given scheduling algorithm. The software is responsible for switching task contexts, so the time required to perform this operation is directly influenced by the number of registers that need to be saved and restored. The response time also depends on the number of cache misses, because the module responsible for this operation is a software application running on a commercial processor architecture. In the architecture described in [22], the hardware scheduler is an external device connected to the processor via a data and address bus and an interrupt line. Whenever the scheduler decides that a task is ready to run, it signals this to the processor with an interrupt. The processor reads from the data bus the task ID sent by the external scheduler and switches the appropriate contexts. The authors state that hardware schedulers have a clear advantage over software schedulers in terms of CPU overhead.
The uRV core presented in [23] is a minimal, robust and open-source core that addresses embedded FPGA applications. The authors used the RISC-V ISA, developed at the University of California, Berkeley, and also implemented a multiplication/division module, according to the RV32IM architecture. The main features of the uRV core are simplicity (27 integer core instructions with clearly defined extensions) and a Harvard architecture: code and data are accessed through a 32-bit shared memory space with separate memory interfaces, and instructions are executed through a four-stage, single-issue pipeline. The project passes the official RV32IM test suite as well as the Coremark 1.0 benchmark. Given the small FPGA footprint and the GCC tools (version 5.2), the uRV core supports the development of distributed real-time control and data acquisition systems, reducing WCET risks and development lead times.
Efficient use of limited memory resources is very important in the design of heterogeneous multiprocessor systems on chip (HMPSoC) for memory-intensive applications. State-of-the-art high-level synthesis (HLS) tools rely on system programmers to manually determine the placement of data within a complex memory hierarchy. In [24], an automatic data placement framework is proposed that can be seamlessly integrated with Vivado HLS. Experimental results demonstrate that the traditional frequency-based and locality-based data placement strategy designed for CPU architectures leads to low system performance in a CPU-FPGA HMPSoC. Validation data using the Zedboard platform show an average performance speedup of 1.39 times compared to greedy allocation strategies. Each FPGA core running on the Programmable Logic (PL) side is able to access data from on-chip FPGA BRAMs, from off-chip DDR and the CPU's shared Level 2 cache through the accelerator coherence port (ACP), or from the off-chip DDR memory directly through the high-performance port (HP), which bypasses the CPU caches. Therefore, a decision to place the data manually can lead to poor system performance due to the complex design space. Regarding the CPUs used in mobile applications, such as multimedia or medical systems, increasing the working frequency is not an efficient solution, due to the increased energy consumption. Thus, when a number of similar cores are integrated on the same silicon chip, or when new multi-threading and hyper-threading architectures with increasingly deep pipelines are designed, rigorous management of the CPU clock cycles is required. In this respect, to guarantee the robustness of the RTS and to complete all system tasks before their deadlines, increased task execution speed and an enhanced method for real-time scheduling are needed.
In [25], a heterogeneous multiprocessor system-on-chip (MPSoC) is controlled by a dynamic task scheduling unit called CoreManager (CM), applying both very long instruction word (VLIW) and single instruction multiple data (SIMD) architectures. The MPSoC consists of several blocks connected via a Network-on-Chip (NoC) and is targeted at embedded applications. In total, four digital signal processors (DSP), five general-purpose cores (GP) and two processors with application-specific instruction sets (ASIP) are integrated, and the CM controls the MPSoC datapath. The CM is responsible for dynamic data dependency verification, task scheduling, processing element (PE) allocation, and data transfer management. In this context, the instruction set architecture is expanded to improve overall MPSoC performance. The results obtained show an improvement of up to 97% for the dynamic verification stage of data dependencies. The dynamic task scheduler can be implemented in hardware as an accelerator or in software, running on a general-purpose core. The hardware implementation is characterized by a very short execution time of less than 100 cycles per task. If we consider processors in the area of mobile applications, the i.MX 6SoloX implementation proposed by Freescale is a reliable solution that increases the security of Internet of Things (IoT) applications. The i.MX 6SoloX is the industry's first application processor that integrates an ARM Cortex-A9 core and an ARM Cortex-M4 core into a single chip. This processor has been designed to bring exceptional performance and energy efficiency to real-time devices. The processor offers the ability to run a user interface on the Cortex-A9 core, while guaranteeing the real-time determinism characteristic of the Cortex-M4 core. These features are fundamental and mandatory for a wide range of industrial, automotive or medical applications, as they require a modern user interface. Above all, the cores that control RTS must be reliable, secure and deterministic when communicating with other devices on the network.

Hardware accelerated RTOS based on real-time event handling module
The present paper gives an overview of the qualitative research on task switching time based on the HW_nMPRA_RTOS architectural model. Developing this concept in Verilog HDL improves performance related to the handling of multiple events, which is used intensively in real-time environments. A task running at a given time on the CPU is represented by a hardware instance of its associated thread (instPi for task i). The hardware resources for instPi, referred to as HW_thread_i with i = 0, …, n − 1, consist of the Program Counter (PC) register, pipeline registers, register file (RF), and datapath control signals, and represent the instruction execution context for task i. Figure 1 shows the HW_nMPRA_RTOS (HW-RTOS) block diagram using a system-on-chip (SoC) design. During the validation process, a block diagram illustrating all modules and the labelled wires that interconnect the CPU components made it easy to locate the signal sources, the combinational elements and the CPU datapath registers. The real-time scheduler (nHSE) is the central element for minimizing the negative impact of the OS overhead on RTS performance. With a jitter of at most three clock cycles for task context switching, the proposed CPU architecture provides a deterministic hardware implementation, due to the integrated hardware scheduler [26], shown in Fig. 1. instPi scheduling, asynchronous handling of multiple events, and context switching time can all affect the WCET. These times are critical for systems with a high number of interrupts and a high frequency of task switching when the CPU is loaded at its upper limit.
The real-time event handling module (Fig. 2) includes the following functional blocks:
1. The n Events Block, with the role of arbitrating and signalling the instPi with the event attached, either directly or through the hardware scheduler (static or dynamic). The events, representing the input signals for the n Events Block module, are the following: timer generated events, the event generated by the deadline 1 limit (alarm), the event generated by the deadline 2 limit (fault), the event generated by the watchdog timer, interrupts, mutex generated events and events generated by the communication mechanism. This logic validates the command signals for each instPi. The scheduler registers validate, store, and prioritize the events expected by each instPi.
2. The static scheduler, which performs real-time management of tasks (instPi) based on fixed priorities. Thus, instP0, running on HW_thread_0, has the highest priority, and instPn-1, running on HW_thread_n-1, has the lowest priority. In the case of the static scheduler, the assigned task priorities cannot be changed. The static scheduler is activated at system reset and can be deactivated only by the instP0 CPU instance, executed by HW_thread_0.
3. The dynamic scheduler module, which is the hardware block for dynamic scheduling and enables the instPi priorities to be changed. This scheduler is disabled at reset and can be activated or deactivated only by instP0.
4. The ID register block, which contains registers with the identifiers (IDs) of each instPi, a register with the priority of each instPi (mrPRIinstPi) used only by the dynamic scheduler, and a global register containing the active instPi ID that can be inhibited only once during the execution of the atomic instructions. instP0 always has the highest priority (0x0).
5. The decoder, which generates the activation signals (oi) for all instPi and can be inhibited only under certain conditions by the logic of the n events [27].
Whenever an event is scheduled and its source is cleared, the current instPi may lose control of the CPU. The events listed in Fig. 2 can be validated using the following signals: lr_enInti, lr_enTi, lr_enWDi, lr_enD1i, lr_enD2i, lr_enMutexi and lr_enSyni, which are grouped into a special register called the control Task Register (crTRi). The lr_run_instPi bit, which is used to avoid instPi dispatching, is also added to crTRi. The resulting signals lr_IntEvi, lr_TEvi, lr_WDEvi, lr_D1Evi, lr_D2Evi, lr_MutexEvi, lr_SynEvi and lr_run_instPi are grouped into a register called the control Event Register (crEVi), which can be accessed to determine which of the events expected by an instPi have occurred. The static scheduler is task-oriented, and the priority of each instPi is i, the same as the instPi ID. This means that the priorities are constant during task execution; the static scheduler is enabled when the processor is connected to a power supply. The dynamic scheduler, on the other hand, is provided with a priority register for each instPi, i = 1, …, n-1. Using these registers, each instPi can have a priority between 1 and n-1, where 1 is the highest and n-1 is the lowest priority. instP0 is always assigned the highest priority, i.e. the value 0, and this cannot be changed in any way. The priority of an instPi can be changed by a dynamic programming algorithm implemented either in software, at the instP0 level, or in hardware.
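The relation between the enable bits in crTRi and the resulting bits in crEVi can be sketched behaviourally as follows. This is only an illustrative Python model, not the Verilog implementation: the bit positions, the helper names cr_ev and inst_ready, and the assumption that lr_run_instPi passes into crEVi without an enable bit are ours.

```python
# Assumed bit order of the seven maskable events (not taken from the paper).
EVENTS = ("Int", "T", "WD", "D1", "D2", "Mutex", "Syn")
BIT = {name: 1 << pos for pos, name in enumerate(EVENTS)}
RUN = 1 << len(EVENTS)   # models the lr_run_instPi bit
EVENT_MASK = RUN - 1     # the seven maskable event bits

def cr_ev(cr_tr, raised):
    # An event reaches crEVi only if it is raised and enabled in crTRi;
    # the run bit is copied through unmasked (assumption).
    return (cr_tr & raised & EVENT_MASK) | (cr_tr & RUN)

def inst_ready(cr_tr, raised):
    # instPi competes for the CPU when at least one crEVi bit is set.
    return cr_ev(cr_tr, raised) != 0

cr_tr = BIT["T"] | BIT["Int"]    # timer and interrupt events enabled
raised = BIT["T"] | BIT["WD"]    # timer and watchdog events raised
pending = cr_ev(cr_tr, raised)   # only the timer event is validated
```

Here the watchdog event is raised but not enabled, so it does not appear in the modelled crEVi value.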
The system becomes active when an event occurs, but if the attached instPi clears the event, then the task will auto-suspend. When the processor is connected to a power supply, the dynamic scheduler is disabled. If execution is to continue, the instPi must activate the self-supported execution event (lr_run_instPi) before it clears the occurring event, as shown in Fig. 3. Thus, only instP0 remains active, and the event logic attached to instPi is divided among the n instances of the processor according to the assignment. If instPi has the highest priority, the nHSE scheduler will validate its execution through the oi signals, based on the scheme in Fig. 4a and Fig. 4b. The validation and occurrence of an event is signalled at the level of an instPi. In this case, either the system will be taken out of the idle state or another instPi with lower priority will be stopped. In the following we present details about the event generation and the static and dynamic nHSE design. Figure 4b shows the general design of the static scheduler for activating an instPi based on validated and active events. The scheduler schematic shown in Fig. 4b contains the instPi_ready functional blocks, the register that stores the ID of the instPi with the highest priority, and a decoder that activates the instPi with the highest priority. The en_CPU signal can mainly be used for the low power mode. The AND logic gate and the D flip-flop in the scheme are activated when no other instPi is active. The resources specific to any instPi can be enabled or disabled with the en_pipe_instP0, …, en_pipe_instPn-1 signals. In this case, static priorities are identified by task IDs. Thus, the proposed scheme can be used for static scheduling if each task runs on its own instPi.
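The selection performed by the static scheduler can be summarised with a small behavioural sketch, assuming that the ready state of each instPi is available as a Boolean value. The Python functions static_select and decode are hypothetical names; they only mirror the roles of the ID register and of the decoder, and they omit clocking and the en_CPU signal.

```python
def static_select(ready):
    """Return the ID of the ready instPi with the lowest index, or None.

    ready[i] is True when instPi has a validated, active event.
    With static scheduling the priority of instPi equals its ID i,
    so the lowest ready index wins.
    """
    for inst_id, is_ready in enumerate(ready):
        if is_ready:
            return inst_id
    return None  # no instPi is ready: the CPU stays idle

def decode(active_id, n):
    # One-hot activation lines, one per instPi (models en_pipe_instP0..n-1).
    return [inst_id == active_id for inst_id in range(n)]

ready = [False, True, True, False]   # instP1 and instP2 are ready
active = static_select(ready)        # instP1 wins (lower ID)
en_pipe = decode(active, len(ready))
```

If instP0 later becomes ready, the same selection returns 0, and the lower priority instP1 is stopped.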
The dynamic scheduler shown in Fig. 4a and Fig. 4c provides the possibility to set the priority for the instPi scheduling units, but does not implement any specific scheduling algorithm. Under certain conditions, such as when using a dynamic scheduling policy, some scheduling algorithms may bring performance improvements. We should specify that instP0 always has priority zero, making it the highest priority processor instance in the system. For the logic of the associated instPi priorities, we use a special register called the PRIinstPi register. It stores the corresponding instPi priority for i = 1 through n-1. The priority is decoded as can be seen in Fig. 4c. As mentioned, priority zero is reserved for instP0. The logic implemented in the FPGA generates one of the signals en_pri_instPi_1, …, en_pri_instPi_n-1. The output of the same register is used for selecting the output of the MUX multiplexer in Fig. 4c. This signal collects the result of the prioritization scheme from the inputs on the right side of the figure. The AND logic gate validates a particular priority, and the D flip-flop is used for synchronization with the HW_nMPRA_RTOS clock. The AND logic gates in Fig. 4a allow priority selection for each instPi, i = 1, …, n. The multiplexer output validates the instPi_ID_TS value, which is the instPi ID at the ID register input. The priority validation is activated by the instPi_Evi signal only if instPi is waiting for the event. The ID register is the same one described previously for Fig. 4b, and the hardware structure in Fig. 4a is used in the same configuration for the dynamic scheduler. The en_CPU signal can be used as a global signal, part of a monitoring register, that disables all instPi except instP0.
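A behavioural view of the dynamic selection is sketched below in Python, assuming that the contents of the PRIinstPi registers are held in a list and that all stored priorities are distinct. The function name dynamic_select is hypothetical, and the sketch ignores the clocked elements of Fig. 4.

```python
def dynamic_select(ready, pri):
    """Return the ID of the ready instPi with the best priority, or None.

    ready[i] is True when instPi has a validated, active event.
    pri[i] models the PRIinstPi register; a lower value means a higher
    priority, and pri[0] is fixed at 0 for instP0.
    Priorities are assumed to be distinct.
    """
    best = None
    for inst_id, is_ready in enumerate(ready):
        if is_ready and (best is None or pri[inst_id] < pri[best]):
            best = inst_id
    return best

ready = [False, True, True, True]
pri = [0, 3, 1, 2]                     # instP2 holds priority 1
first = dynamic_select(ready, pri)     # instP2 is selected
pri[1], pri[2] = 1, 3                  # instP0 reprograms the priorities
second = dynamic_select(ready, pri)    # instP1 is selected now
```

The second call shows the effect of changing the priority registers at run time: the same set of ready instPi leads to a different selection.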
The signals pri_1, …, pri_n-1 represent all n-1 possible priorities of instPi. The PRIinstPi registers storing the priorities (Fig. 4c) can be accessed as local registers by any of the instPi units, i ≠ 0. Access can be direct, via instructions, or instP0 can read and write the registers in supervisor mode.
The PRIinstPi registers, abbreviated Priority Registers (PR), are shown in Fig. 4c. To use these registers, we propose the control instruction "wait Rj", which waits for the occurrence of any event marked in the Rj register with a bit set to 1. The content of Rj is automatically transferred to the Task Register (TR), so the events activated by the "wait Rj" instruction are loaded into the TR register. Whenever an instPi resumes execution after a "wait" instruction, Rj stores the occurring events associated with instPi. These registers can be found in any instPi except instP0. A more efficient and faster method involves using a dedicated mnemonic that stores the events as an immediate value, such as "wait Rj, events".
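The intended effect of "wait Rj" can be illustrated with a minimal Python model. It is a sketch under our own assumptions: events are encoded as bits of an integer, the class InstP and its methods are hypothetical, and the pipeline behaviour of the real instruction is not modelled.

```python
class InstP:
    """Behavioural model of one instPi executing 'wait Rj'."""

    def __init__(self):
        self.tr = 0           # models the Task Register (TR)
        self.waiting = False

    def wait(self, rj_mask):
        # 'wait Rj': the event mask in Rj is copied into TR and the
        # instPi suspends until one of the marked events occurs.
        self.tr = rj_mask
        self.waiting = True

    def signal(self, raised):
        # Called when events occur. If an expected event is among them,
        # the instPi resumes and the returned value is what Rj holds
        # after the resume; otherwise None is returned.
        hit = self.tr & raised
        if self.waiting and hit:
            self.waiting = False
            return hit
        return None

TIMER, MUTEX, SYNC = 0b001, 0b010, 0b100   # hypothetical bit assignment
p = InstP()
p.wait(TIMER | MUTEX)         # wait for a timer or a mutex event
ignored = p.signal(SYNC)      # not expected: the instPi keeps waiting
rj = p.signal(MUTEX | SYNC)   # mutex event resumes the instPi
```

After the resume, the modelled Rj contains only the mutex bit, since the synchronization event was not marked in the original mask.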
The scheduler constantly monitors the events that are associated with instPi. The hardware structure of the scheduler belonging to each instPi, embedded in the logic block of nHSE, is shown in detail in Fig. 3. The possible instPi events are: timer interrupts (TEvi), two interrupts used for preemptive deadline signalling (D1Evi and D2Evi), watchdog timer (WDEvi), attached interrupts (IntEvi), inter-task synchronization (SynEvi), mutexes (MutexEvi), and self-supported execution for the current instPi (lr_run_instPi). Whenever a source that generates an event/interrupt is cleared, the current instPi may lose CPU control. These signals must be stored in the special TR register. The above events can be validated with the lr_enTi, lr_enWDi, lr_enD1i, lr_enD2i, lr_enInti, lr_enMutexi and lr_enSyni signals. The only exception is lr_run_instPi, as can be seen in Fig. 3. The instP_evi signal, which is used to signal the occurrence of an expected event, is activated by the mr_stopinstPi signal. This signal is part of a monitoring register that is accessible only to instP0.
For synchronization, we use a D flip-flop that stores information about a pending event on the rising edge of the processor clock. instP0 is the only execution unit capable of stopping the other instPi, i ≠ 0. The pending instPi signals hand over the current instPi identifier (instPi_ID). The action is marked by the signals /instP_Ev0, …, /instP_Evi-1, and is performed by writing the value to the scheduler's arbitration bus if no task with a higher priority than instPi is running. A simplified block representation of the local scheduler described above is shown in Fig. 3.
The model proposed in this paper and described in Fig. 5 is similar to the "interrupts as threads" approach. The system has p interrupts, and for each of them there is a global register, called the INT_IDi register, with n useful bits storing the ID of the task to which the interrupt is associated. In this new design, interrupts are treated as events that are attached to tasks (instPi) and therefore inherit their priority. Activating the INTi interrupt validates the decoder, which activates one of the INT_i0, …, INT_in-1 signals, as can be seen in Fig. 5. The OR logic gate can collect all interrupts in the system: they are all attached to instPi if all p INT_IDi registers, i = 0, …, p-1, are written with the value i. The role of the D flip-flop is to synchronize the asynchronous INTi interrupt event that produces IntEvi. Correspondingly, no interrupt is attached to instPi if none of the p INT_IDi registers, i = 0, …, p-1, is written with the value i. This is evaluated on the falling edge of the system clock.
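The attachment of interrupts to tasks through the INT_ID registers can be sketched as follows. This Python fragment is a purely combinational illustration: the function name int_events and the list-based register model are our assumptions, and the synchronizing flip-flop is left out.

```python
def int_events(int_id, int_lines, n):
    """Compute the IntEv signal of every instPi.

    int_id[k] models the INT_ID register of interrupt k and holds the ID
    of the instPi to which that interrupt is attached.
    int_lines[k] is True while interrupt k is active.
    Returns a list with one IntEv flag per instPi (n entries).
    """
    int_ev = [False] * n
    for k, active in enumerate(int_lines):
        if active:
            # The decoder routes interrupt k to its target instPi, and
            # the OR gate merges all interrupts routed to the same target.
            int_ev[int_id[k]] = True
    return int_ev

int_id = [2, 2, 0]   # interrupts 0 and 1 attached to instP2, 2 to instP0
only_first = int_events(int_id, [True, False, False], n=4)
last_two = int_events(int_id, [False, True, True], n=4)
```

Since interrupts 0 and 1 are both attached to instP2 in this example, either of them raises the same IntEv flag, and the software running on instP2 has to identify the source.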
The main features of the proposed design are as follows:
• a task can attach one, several, or even all p interrupts in the system;
• HW_nMPRA_RTOS does not contain a specialized interrupt controller, because interrupts inherit the priority of the tasks (instPi) to which they are attached;
• the scheduler is able to set the priority of interrupts attached to the same task (instPi);
• an interrupt can be a task or can be attached to a single task;
• an interrupt attached to a task may suspend a lower priority task;
• an interrupt may not suspend execution of the task to which it is attached, or of a higher priority task;
• all interrupts may be attached to a single task;
• interrupts do not affect the execution of other instPi units;
• interrupts may be nested, and interrupt priorities may be dynamic, which is possible by reattaching the interrupt to another task or by changing the priority of the tasks to which the interrupts are attached;
• the architecture does not require saving or restoring any context, owing to the CPU datapath resources multiplied for each HW_thread_i.
The proposed solution also has some disadvantages: if multiple interrupts are attached to an instPi, the handling order is assigned by software, which may lead to additional delays; the number of possible nesting levels is limited to the number of instPi; and there are no interrupt handling vectors.
In the nHSE scheduler architecture there are four types of registers:
• Control (cr) registers, specific to each instPi;
• Local (lr) registers, that are part of the private space of each instPi;
• Global (gr) registers, that can be accessed by all instPi;
• Monitoring (mr) registers, that can only be accessed by instP0 and possibly by the monitored instPi.
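The visibility rules of the four register classes can be summarized as a small access check. This is a hedged sketch rather than the actual nHSE logic: the function can_access, its parameters, and the monitor_shared flag (which stands in for the "possibly by the monitored instPi" case) are hypothetical names, and treating cr registers as accessible only by their own instPi is an assumption drawn from the phrase "specific to each instPi".

```python
def can_access(reg_class, owner, requester, monitor_shared=False):
    """Return True if instP<requester> may access a register.

    reg_class      -- "cr", "lr", "gr" or "mr"
    owner          -- index i of the instPi the register belongs to
                      (ignored for global registers)
    requester      -- index of the instPi issuing the access
    monitor_shared -- assumption: models the case where the monitored
                      instPi may also access its monitoring registers
    """
    if reg_class == "gr":
        return True                      # global: visible to every instPi
    if reg_class in ("cr", "lr"):
        return requester == owner        # specific/private to one instPi
    if reg_class == "mr":
        # instP0 always; the monitored instPi only if sharing is enabled
        return requester == 0 or (monitor_shared and requester == owner)
    raise ValueError("unknown register class: " + reg_class)


print(can_access("mr", owner=3, requester=0))   # True
print(can_access("lr", owner=1, requester=2))   # False
```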
Regarding the use of specific HW_thread_i private resources, if every instPi runs task i, the context switching from one task to another is also accomplished very rapidly, therefore minimizing the jitter effect produced by the handling of interrupts and asynchronous events [28]. The real-time event handling module constantly monitors the events associated with instPi CPU instances. The interrupts are handled individually and they may have the highest priorities among the events of an instPi, the speed gained from the context switching operation being used to satisfy demanding RTS constraints. Thus, interrupts are converted into threads, generating a limited kernel overhead. An instPi can have attached one, several, none, or all p interrupts in the system. In this manner, interrupts are handled separately, eliminating the time required for filtering and identifying the interrupt source, the handler code associated with these critical events being executed directly. When using low priority interrupts, the application must first perform software filtering to detect the source or sources that generated the events. This technique does not affect the system predictability because handling asynchronous events in hardware requires very low response times, and their handling is based on the real-time event handling module and the prioritized instPi.
Algorithm 1 shows nHSE operation with details on the generation of time-related events. For this purpose, the crTRi control register of the preemptive dynamic scheduler is used. The generated signals are used for the validation logic of various types of events at the level of each HW_thread_i. Thus, the crEVi register is tested in the algorithm to signal the occurrence of an event that is enabled in crTRi, considering the mrWDEVi, mrTEVi, mrD1EVi and mrD2EVi registers as predefined limits.
There are three types of time-related events: periodic time events (TEvi), Watch Dog Timer events (WDEvi) and deadline events (D1Evi and D2Evi). Thus, D1Evi is equivalent to an alarm and D2Evi is equivalent to a fault. For implementation, each instPi has two dedicated timers. One of them has three comparators for TEvi, D1Evi and D2Evi, while the other has a single comparator used for WDEvi. If the watchdog is not periodically refreshed, the WDEvi event can reset, if enabled, instPi (Fig. 3). For each of the two timers, the architecture has local registers. These registers are implemented in the local memory of each instPi and accessed with normal memory access instructions (such as lw and sw). Furthermore, these registers can be seen as monitoring registers that can be accessed by instP0 with normal memory access instructions. The deadline values can be calculated either with a local algorithm executed on instPi, with a global one that is executed on instP0, or even with a combination of the two. The architecture also includes two timers for counting CPU cycles when a task is executed or suspended. Access to these counters can be done in the same way as for the timers shown above. Therefore, a software function can closely monitor the execution of a task on instPi.
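The two-timer arrangement can be illustrated with a short behavioral sketch. It is not the authors' implementation: the names timer_events and Watchdog are hypothetical, the comparison is assumed to be "counter has reached the limit", and the limits stand in for the values held in the mrTEVi, mrD1EVi, mrD2EVi and mrWDEVi registers.

```python
def timer_events(count, t_limit, d1_limit, d2_limit):
    """Compare one timer value against the three programmed limits.

    Returns a dict with the TEvi, D1Evi and D2Evi flags. A flag is set
    when the counter has reached its limit (assumed >= comparison).
    """
    return {
        "TEv": count >= t_limit,    # periodic time event
        "D1Ev": count >= d1_limit,  # first deadline: alarm
        "D2Ev": count >= d2_limit,  # second deadline: fault
    }


class Watchdog:
    """Second timer with a single comparator for WDEvi."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def refresh(self):
        self.count = 0              # software refresh restarts the count

    def tick(self):
        """Advance one cycle; return True when WDEvi fires."""
        self.count += 1
        return self.count >= self.limit


wd = Watchdog(limit=3)
fired = [wd.tick(), wd.tick(), wd.tick()]
print(fired)                              # [False, False, True]
print(timer_events(120, 100, 150, 200))   # TEv set, deadlines not reached
```

In this model a task that calls refresh() often enough never sees WDEvi, while D1Ev is raised before D2Ev whenever the alarm limit is programmed below the fault limit.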

A simultaneous multithreading (SMT) processor will outperform a scalar pipeline provided that the program exposes instruction-level parallelism (ILP) during the execution of the application [29,30]. Otherwise, the architecture will generate additional overhead and will introduce an additional source of indeterminism. Unlike the SMT architecture, HW_nMPRA_RTOS proposes a multiplexing of all pipeline registers to achieve dedicated instPi contexts. In the proposed CPU implementation, the "hardware contexts" that contain the internal signals of the datapath are also separated, according to the instructions and the executed task.
Using Verilog HDL for hardware modelling is particularly productive because it provides a formal system description and allows the use of specific description styles to cover different levels of abstraction (architectural, register transfer, and logic level). In embedded systems with software schedulers, total elimination of jitter is not possible, but there are a number of mechanisms by which it can be reduced. Techniques have been proposed that can be used to improve the determinism of code execution, namely a proprietary prioritization algorithm that can be easily implemented in commercial schedulers, and the cache locking mechanism, which has given excellent determinism in the execution of critical regions.
Algorithm 2 presents the CPU cycle management logic for active/inactive instPi, with immediate effect on the monitoring registers mrCntRuni, mrCntSleepi and mr0CntSleep. In the case of HW_nMPRA_RTOS, the hardware scheduler is controlled directly with COP2 instructions transmitted via the 5-stage pipeline. The task context switching sequence is achieved very fast based on the datapath remapping technique. The architecture provides basic mechanisms for inter-task communication and implements a wait Rj instruction that allows waiting on multiple events simultaneously. Usually, this is not possible in basic RTOSs, where there are individual functions for each event type. Implementing scheduling algorithms in hardware eliminates the overhead due to the operating system, thereby improving the task set scheduling limit, WCET, and overall system performance.
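The cycle accounting attributed to Algorithm 2 can be sketched as follows. This is a simplified model under stated assumptions, not a transcription of the algorithm: the function count_cycles and the per-cycle trace format are hypothetical, exactly one instPi is assumed to be selected in every clock cycle, and the mr0CntSleep register is not modelled.

```python
def count_cycles(trace, n_tasks):
    """Accumulate run/sleep cycles from a per-cycle schedule trace.

    trace -- sequence with one entry per clock cycle: the index of the
             instPi selected by the scheduler in that cycle
    Returns (run, sleep): run[i] models mrCntRuni and sleep[i] models
    mrCntSleepi for each instPi.
    """
    run = [0] * n_tasks
    sleep = [0] * n_tasks
    for active in trace:
        for i in range(n_tasks):
            if i == active:
                run[i] += 1      # instPi owns the pipeline this cycle
            else:
                sleep[i] += 1    # instPi is inactive this cycle
    return run, sleep


run, sleep = count_cycles([3, 3, 0, 0, 0, 3], n_tasks=4)
print(run)    # [3, 0, 0, 3]
print(sleep)  # [3, 6, 6, 3]
```

For every instPi the two counters add up to the length of the trace, which is what lets a monitoring function derive the CPU share of a task from these registers.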
Since the HW_nMPRA_RTOS processor architecture with integrated hardware scheduler is based on multiplexing of datapath resources, the memory consumption for the FPGA implementation varies almost proportionally with the number of instPi. It is important to specify that this architecture is intended for embedded industrial and automotive applications, where the number of tasks is typically 8, 16, or 32. Generally, in RTS the number of tasks is around 16, which is sufficient for most applications of this type.

Timing performance
The RTOS is a fundamental section of software executed on the CPU, providing common services for RTS. In other words, the RTOS and the application software share the CPU in order to manage the hardware and software resources. Consequently, experimental tests show that if the RTOS requires more CPU execution time, the application performance is lower. The negative impact that the RTOS has on the controlled system depends on the application type. For example, network protocol control requires frequent use of RTOS functions. This is because the implementation of multitasking network protocol control requires frequent use of the RTOS inter-task communication and synchronization functions. As a result, the processor overhead due to the RTOS during network protocol execution is high. Designers have reported that network traffic appeared to remain low when using a low-end processor with low performance; this was mainly due to the excessive overhead of RTOS functions.
The scheduler is responsible for designating the task for execution, taking into account the priorities and status of all tasks in the READY state. Table 1 describes the main states in which a task can exist, to which other states such as STOPPED or IDLE can be added depending on the scheduler version. The HW_nMPRA_RTOS real-time event handling module realizes in hardware several advanced scheduling schemes suitable for real-time applications. In the proposed hardware accelerated RTOS, there may be p interrupt type events, and for each interrupt, the grINT_IDi register allows its attachment to any of the n instPi hardware instances (ExtIntEv[0] is the interrupt with the highest priority, and ExtIntEv[p-1] is the interrupt with the lowest priority). The hardware scheduler treats interrupts as threads and uses a preemptive scheduling algorithm whereby a high-priority task cannot be interrupted by interrupts assigned to low-priority tasks. This algorithm guarantees deadlines for tasks, for which a WCET and a real-time response to external stimuli must be provided. Figure 6 illustrates the MIPS32 [31] instructions executed by instP3 and instP0, as captured by the Vivado Design Suite simulator. The multiplication of each ID_Instruction_reg[0:3][31:0] pipeline register, one for each HW_thread_i, can be seen as the basic idea of the nMPRA architecture, patented in Germany, Munich [32].
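The preemption rule stated above, whereby interrupts inherit the priority of the task they are attached to and cannot interrupt a higher-priority task, can be sketched in a few lines. The function next_running is a hypothetical name, and the sketch assumes that a smaller instPi index means a higher priority (instP0 being the highest), which ignores the possibility of changing task priorities at run time.

```python
def next_running(current, int_id, pending):
    """Pick the instPi that runs after the pending interrupts are seen.

    current -- index of the instPi currently executing
    int_id  -- int_id[k] is the task ID stored in grINT_IDk
    pending -- pending[k] is True if ExtIntEv[k] is active
    Assumption: a smaller instPi index means a higher task priority.
    """
    candidates = [task for task, active in zip(int_id, pending) if active]
    if not candidates:
        return current
    best = min(candidates)           # highest-priority task with an event
    # Preempt only if that task outranks the running one.
    return best if best < current else current


print(next_running(3, [0, 2], [True, False]))   # 0: instP0 preempts instP3
print(next_running(1, [0, 2], [False, True]))   # 1: instP2's interrupt waits
```

An interrupt attached to the running task itself also leaves the selection unchanged, matching the rule that an interrupt may not suspend the task to which it is attached.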
To avoid unpredictable situations caused by task priority inversion, an optimal and robust scheduling scheme was realized by assigning appropriate priorities to the tasks through a correct system evaluation. The hardware implementation of this hardware-accelerated RTOS concept includes a static and dynamic hardware scheduler based on real-time event handling. The hardware scheduler is based on a finite state machine (FSM) that reacts to external events and schedules the instPi on the HW_thread_i CPU hardware resources. The contributions consist of validating the real-time hardware-implemented methods for enhancing the performance of the hardware RTOS concept, and minimizing the jitter effect because the proposed hardware scheduler implements a unified space for tasks and events. At time T1 (Fig. 6), we can see the occurrence of a time event attached to the hardware instance instP0. This event triggers the context switch between instP3 and instP0 at time T2, because instP0 has a higher priority. The time required to change the task context is only one clock cycle, i.e. 30.303 ns (1 machine cycle), when the processor runs at a frequency of 33 MHz. It can be noticed that the task context switching time is minimal because the proposed architecture is based on the multiplication of CPU resources (HW_thread_i) for each instPi. The processor frequency must be chosen to accommodate the signal propagation time through the processor logic in the FPGA. Figure 7 shows a code section written in Verilog HDL for the scheduler FSM implementation. Depending on the current state (nHSE_FSM_state) of the scheduler, for example FSM_sCPU0, it is checked whether instP0 is enabled for execution (cr0MSTOP & Mask1_bit0). Then, the logic of the algorithm selects for execution the event with the highest priority (crEPRi) attached to instPi. In addition, the scheduled event must be validated by crTRi and active (crEVi).
The crEPRi register stores, on 3 bits, the priority of each individual event for each instPi.
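The selection step can be illustrated with a compact model of the three registers involved. This sketch is not taken from the Verilog HDL in Fig. 7: the function select_event, the dictionary representation of crTRi, crEVi and crEPRi, and the short event names are hypothetical, and it is assumed that the numerically smallest crEPRi value denotes the highest priority.

```python
EVENTS = ("T", "WD", "D1", "D2", "Int", "Mutex", "Syn")  # 7 event types


def select_event(cr_tr, cr_ev, cr_epr):
    """Return the event to handle next for one instPi, or None.

    cr_tr  -- dict event -> bool, event enabled (validated) in crTRi
    cr_ev  -- dict event -> bool, event currently active in crEVi
    cr_epr -- dict event -> int in 0..6, 3-bit priority from crEPRi
    Assumption: the numerically smallest value is the highest priority.
    """
    ready = [e for e in EVENTS if cr_tr.get(e, False) and cr_ev.get(e, False)]
    if not ready:
        return None
    for e in ready:
        if not 0 <= cr_epr[e] <= 6:
            raise ValueError("crEPRi priorities must fit the 0..6 range")
    return min(ready, key=lambda e: cr_epr[e])


enabled = {e: True for e in EVENTS}
enabled["WD"] = False                       # watchdog event masked in crTRi
active = {"T": True, "WD": True, "Int": True}
priority = {"T": 3, "WD": 0, "D1": 1, "D2": 2, "Int": 4, "Mutex": 5, "Syn": 6}
print(select_event(enabled, active, priority))  # T
```

In the usage example the watchdog event is active and has the best priority value, but it is not selected because it is not validated in crTRi, so the time event is handled before the interrupt.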
The command, control and status registers of HW_nMPRA_RTOS with direct or indirect effect on nHSE are presented and described in the nHSE real-time scheduler specifications and the patent proposal [32]. These registers are also defined in the nHSE scheduler implementation, using Verilog HDL and the Vivado 2016.2 design environment. The HW_nMPRA_RTOS scheduler constantly monitors all events validated and attached to any instPi. If several events are associated with an active task running on instPi, a CPU hardware instance must be scheduled. Consequently, there must be an algorithm to select the order in which these events are handled. To do this, each HW_thread_i has attached a control register called Event Priority Register (crEPRi) that contains the priority level for each type of event. These priorities are different, ranging from 0 to 6 because there are 7 types of events. Figure 8a shows the handling of time events by the preemptive priority-based scheduler, instP0 being the highest-priority task. External interrupts must be attached to different tasks, thus inheriting the priority of the instPi. If several interrupts attached to the same task occur simultaneously, the highest priority for interrupts can be set through the crEPRi register. Figure 8b indicates the execution time for instP1, and also the preemption moments of instP0, instP1 and instP3. Due to the existence of private resources of threads, referred to as HW_thread_i, the time required to change the task contexts is from 1 to 3 machine cycles, to which 1 cycle is added for the FSM jitter. Thus, nHSE implements a strict rule of prioritization, allowing all events to be captured and handled according to the instPi priority. The advantage of interrupts that are not attached to tasks is that they are executed in their own ISR, without the need to switch contexts and test additional registers.
Because the task code is executed in the shortest possible time, this method lends itself to high priority interrupts that require a minimum response time, while being very resource-intensive. Figure 9 illustrates the jitter measured to verify the performance of the hardware scheduler implemented at COP2 MIPS32 level. In this test, a PicoScope 2205MSO oscilloscope was used to measure the jitter of the preemptive scheduler when handling external asynchronous interrupts generated from the Virtex-7 development kit. As can be seen in Fig. 9a, following the practical measurements, a response time of only 280.9 ns was obtained (cursor 1 is placed at time moment −9.541 ns and cursor 2 at 271.3 ns). The first signal from the oscilloscope represents the triggered interrupt assigned to instP0, whereas the second one represents the processor response, given by switching the LED [0] signal. Thus, the practical measurements validate both the waveforms obtained through the Vivado simulator and the performance of the innovative hardware scheduler. Considering software RTOSs (Real-Time Kernel) and the proposed hardware accelerated RTOS implementation, Fig. 9b validates the efficiency and performance of the message synchronization mechanism (520.1 ns). The implementation of these mechanisms in hardware guarantees minimum jitter for real-time task execution, thus satisfying the deadlines imposed by the RTS.
Fig. 8 a Event handling in Verilog HDL based on individual priorities related to instP0 (cursor 1), instP1 (cursor 2) and instP2 (cursor 3) thread hardware instances (after hardware implementation in the FPGA circuit), b instP1 (cursor 2) time event execution (1,848us)
Connections to peripheral components such as GPIO, UART, I2C and SPI are an integral part of the Top.v Verilog HDL module. A read/write register may also contain some read-only bits, in which case the operation of writing to read-only bits is ignored.
Owing to its reliability and flexibility, the HW_nMPRA_RTOS architecture is able to ensure a predictable execution of tasks while implementing the above-mentioned mechanisms in hardware, thus increasing the efficiency of the implementation of the static or dynamic nHSE scheduler. From an architectural point of view, the implementation of synchronization and communication mechanisms is based on atomic instructions, guaranteeing outstanding performance.

Synthesis and implementation results
When embedded real-time systems include an RTOS, the overhead introduced by the RTOS manifests as jitter. Usually, this aspect does not affect WCET in systems that tolerate general delays of over a millisecond. When the tolerated delay is a few microseconds, this jitter is no longer acceptable, and designers choose to modify the system or even replace the RTOS. Following the implementation in the FPGA, several studies could be performed on the architecture complexity, resource requirements, the impact on the working frequency and power consumption, as well as on the execution of test programs meant to highlight the performance of this new technology (its technology readiness level). The data presented in Fig. 10 correspond to an implementation of the SoC project containing the FPGA synthesized and implemented processor with 4, 8 and 16 HW_thread_i, mutexes, communication message registers and external interrupts. The architecture described in this paper does not use the stack concept as in existing processors; however, it uses the functionality of nested function calls and their order, based on the XUM project described in [33]. This study argues that the hardware accelerated RTOS architecture is scalable and flexible and can be successfully used in small-scale RTS with a 4, 8 or 16 HW_thread_i implementation. In the HW_nMPRA_RTOS architecture, the nHSE hardware scheduler is included in the processor and therefore does not require additional time for arbitrating the CPU buses, nor does it incur delays due to data transfer between the scheduler and the processor, and it can be directly controlled by the instructions transmitted to the pipeline. Because the pipeline registers are multiplied, the architecture can guarantee the isolation of contexts in hardware. The architecture has been designed taking into account the possibilities of practical applicability, so that it can be easily integrated into microcontrollers.
Interrupts are considered events that can be attached to tasks and are treated as threads, rather than as interrupts in the classic sense. The development of a new application was necessary because the proposed architecture improves the context switch time without extending the instruction set of the MIPS processor [34]. The implementation uses the COP2 support, in accordance with MIPS CorExtend User Defined Instructions (UDI). This option allowed us to extend the MIPS instruction set with user defined extensions (the pre-emptive scheduler) which execute in parallel with the MIPS integer pipeline, instructions being executed sequentially.
After testing the functionalities of this processor, traditional MIPS compiler tools can be used to develop real-time applications. To achieve maximum performance, i.e. instructions per cycle (IPC) close to 1.0, it is necessary to change the instruction and data memory handshake. The frequency chosen for the implementation and validation of the nMPRA project is 33 MHz and the maximum working frequency is 100 MHz. It should be noted that the 33 MHz frequency was chosen because in the design and debugging phase of the processor we used the Integrated Logic Analyzer (ILA), a module that requires a clock with a frequency three times higher to be able to capture the monitored signals. In the proposed architecture, the multiplication of processor units per processor instance (instPi) has a direct effect on the critical path. Thus, switching from one instPi to another is done in one clock cycle due to the multiplication of flip-flops in the CPU datapath. The Vivado synthesizer inserts the corresponding sequential elements so that only one instPi is selected by the scheduler state machine at a given time.
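The relation between the clock frequencies quoted above and the duration of a one-cycle context switch can be checked with a short calculation. The helper cycles_to_ns is a hypothetical name introduced only for this illustration; the frequencies are the 33 MHz and 100 MHz values given in the text.

```python
def cycles_to_ns(cycles, freq_mhz):
    """Convert a number of clock cycles to nanoseconds."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return cycles * 1000.0 / freq_mhz


print(round(cycles_to_ns(1, 33), 3))    # 30.303
print(round(cycles_to_ns(1, 100), 3))   # 10.0
```

At 33 MHz one clock cycle lasts about 30.303 ns, so a one-cycle switch between instPi takes that long; at the 100 MHz maximum working frequency the same switch would take 10 ns.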
In the proposed HW_nMPRA_RTOS processor, the jump and branch instructions have a so-called Delay Slot. This means that the instruction following a jump or branch is executed before the jump or branch takes effect. Besides, there is a set of conditional branch instructions called Branch Likely for which the following instruction, which is in the Delay Slot, is executed only if the branch is taken. MIPS processors execute the jump or branch instruction and the one in the Delay Slot as an indivisible unit, which adds time to the nHSE context switching. If there is an exception as a result of the Delay Slot instruction, the jump or branch instruction is not executed, and the exception appears to be caused by the jump or branch instruction. Table 2 illustrates a report on the power consumed by the HW_nMPRA_RTOS SoC project that includes the nHSE hardware scheduler with 4 HW_thread_i. Total on-chip power represents the sum of the dynamic power and the device static power, the latter also referred to as leakage. Dynamic power includes the following elements: Clocks, Signals, Logic, DSP, MMCM, and I/O. Design dynamic power represents the additional power consumption resulting from the use of logic resources by the design, clocking, routing, switching activity, nHSE logic, and instPi load. Important factors in the dynamic power calculation are the activity and the load capacitance that needs to be switched by each physical driver in the design. Some of the factors that determine the load capacitance are fanout and interconnect distance. Design dynamic power is constant and does not change with device temperature. Design static power represents additional power consumption for power-gated blocks when the device is configured to function as a dynamic scheduler but there is no switching activity. With HW-RTOS, software application designers can define real-time performance at the design stage.
Using HW-RTOS can greatly shorten the software development stage and enable easy installation of highly reliable real-time systems. The overhead from the RTOS corresponds to the execution time of its functions. RTOS execution time refers to the time the RTOS runs after a system call and between the occurrence of an interrupt and the start of its handling, all of which add to the general RTOS overhead. At the same time, most periods when the RTOS is running result in RTOS periods being interrupted. However, because interrupts are allowed relatively infrequently, many RTOSs incorporate new concepts for reducing downtime. Table 3 shows the resources used by multiple scheduler implementations with various ISAs and softcore CPU pipeline stages. As can be seen from the analysis, the advantage of the HW_nMPRA_RTOS implementation lies in the dedicated hardware context (HW_thread_i) of each thread, which ensures a context change in only one clock cycle. The disadvantage of this implementation compared to [5,9,23,33] and [36] is that more hardware LUTs and FFs are used for the implementation of the nHSE scheduler. At the HW_nMPRA_RTOS design stage, each task is assigned a priority stored in the mrPRIinstPi register, which may be changed during execution. When the pipeline does not contain store word (sw) or load word (lw) MIPS32 atomic operation instructions, context switching can take place right from the next clock cycle. When it is desired to ensure the consistency of the data modified by means of transfer mechanisms between tasks (sw and lw instructions), the scheduler must allow the instruction to complete its memory access, the context switch taking place after 2 clock cycles.
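The dependence of the context switch latency on the instructions in flight, as stated above, can be written as a one-line rule. The sketch below is a simplification: context_switch_cycles is a hypothetical name, the pipeline content is reduced to a list of mnemonics, and only the 1-cycle and 2-cycle cases named in the text are modelled.

```python
MEMORY_OPS = {"lw", "sw"}


def context_switch_cycles(pipeline_ops):
    """Cycles until the scheduler may switch to another instPi.

    pipeline_ops -- mnemonics of the instructions currently in flight
    Returns 1 when no load/store is in the pipeline and 2 when one must
    first complete its memory access, following the rule in the text.
    """
    return 2 if any(op in MEMORY_OPS for op in pipeline_ops) else 1


print(context_switch_cycles(["addu", "or", "beq"]))  # 1
print(context_switch_cycles(["addu", "sw", "nop"]))  # 2
```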
CPU overhead is taken into account by examining the times for booting and configuration, reading/writing task parameters, task creation/destruction, periodic task activation and prioritization, task context switching, task dispatch and preemption, asynchronous event handling, mutex/semaphore enable/disable and lock/unlock.
In the implementation stage of the HW_nMPRA_RTOS concept, we multiplied the most important signals contained in the ID (Instruction Decode)/EX (Execute), EX/MEM (Memory) and MEM/WB (Write Back) pipeline registers, ID/EX consuming the most resources. Thus, the following items are involved: Operand_A and Operand_B, provided to the arithmetic and logic unit, the Operation register, and the control signals needed in the execution stage. These signals are transmitted through the datapath simultaneously with the data necessary for the operation dictated by the instruction's opcode, thus ensuring the consistency of contexts for a possible switch of the selected instPi. A data hazard occurs when there is a conflict between instructions that use as source operands values not yet provided by the preceding instructions. To test the data hazard, in the case of instructions dedicated to the hardware scheduler, a program written in machine code was used; therefore, the datapath, the hazard detection unit and the data redirection unit were also tested. In the case of ID_Instruction[31:0] instruction execution, the result provided by the first instruction is redirected to the EX stage, at the inputs of the arithmetic and logic unit, because the second instruction uses the r2 register from the RF, corresponding to HW_thread_0. In the case of normal execution, the first instruction is considered completed when the content of the RF is modified, after the result of the operation has been previously memorized by the pipeline registers from the datapath and selected through the multiplexer from the WB stage. Without redirecting the result of the first instruction, the program execution was delayed by two clock cycles; this implies a decrease in the performance of the proposed processor.
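The data redirection described above follows the forwarding scheme commonly used in 5-stage MIPS pipelines, which can be sketched as below. This is the textbook forwarding condition, not the authors' data redirection unit: the function forward_select and its parameter names are hypothetical, and the per-HW_thread_i selection of pipeline registers is deliberately left out.

```python
def forward_select(id_ex_rs, ex_mem_rd, ex_mem_regwrite,
                   mem_wb_rd, mem_wb_regwrite):
    """Choose the source of one ALU operand in the EX stage.

    Returns "EX/MEM" or "MEM/WB" when the operand must be redirected
    from a pipeline register, or "RF" when the register-file value read
    in ID is already correct. Register 0 is never forwarded because it
    is hard-wired to zero in MIPS.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"      # newest result wins
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "RF"


# add r2, r1, r3 followed by sub r4, r2, r5: r2 is forwarded from EX/MEM.
print(forward_select(2, 2, True, 0, False))   # EX/MEM
```

Without such a path the dependent instruction would have to wait until the first result reaches the register file, which corresponds to the two-cycle delay mentioned in the text.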
Particular attention has been paid to this development stage, because the integrity of these signals must be guaranteed even when, in the presence of a hazard, a context switch occurs, an instruction dedicated to the pre-emptive scheduler is executed, or an interrupt is handled. Thus, when returning to the execution of the interrupted task, the redirection of data must be resumed from the point where it stopped, guaranteeing the integrity of the data and control signals. The purpose of this project is not to describe a complete solution for the CPU datapath, but to validate the practical implementation based on the HW_nMPRA_RTOS architecture and the integrated hardware scheduler using an FPGA circuit. The tests performed and presented in these chapters justify the use of the processor in embedded systems where a higher computing power is required to run real-time applications. At the same time, very short response times must be ensured, so as to guarantee the real-time characteristic of task execution and the calculation of the WCET coefficient.
The graphs in Figs. 11 and 12a show the resources used for the implementation of different processors or their versions. The number of slice registers (FF) and Look-Up Tables (LUTs) for each processor is provided by the respective authors; this number may vary depending on the FPGA implementation, the hardware RTOS scheduler, the functions implemented at the level of the processor and the task context multiplication. Figure 12b shows the area results of implementing the proposed HW_nMPRA_RTOS system with both the nHSE and 4 HW_thread_i enabled, for the Verilog HDL instances corresponding to modules. Once all the modules had been created and tested, the Top module was created, where all the blocks in the project are connected with the corresponding logic. Once all modules were connected and tested together, in order to verify the functionality of the HW_nMPRA_RTOS processor it was necessary to run a testbench program. This testbench program uses instructions from all categories, so it was necessary to implement all MIPS instructions, including the dedicated instructions for the nHSE scheduler. The branch on not equal instruction was added later because initially it was simpler to test only the branch on equal instruction. Once all nHSE scheduler instructions were introduced, their functionality was tested using testbench programs written directly in machine code to ensure that adding support for one instruction did not cause any problems for the other instructions already implemented.
Fig. 11 The FPGA FFs resource requirements for implementing different CPU cores (ARPA-MT [35], ARM Cortex-M3 microcontroller [36], Amber 23 [37], MIPS32 core with five pipeline stages [33], Flex-Pret [38], MicroBlaze [39], uRV based on RISC-V core with four-stage pipeline [23])

Discussion
Considering these practical data, we can say that even if HW_nMPRA_RTOS uses pipeline register multiplication and implements the nHSE registers in COP2, the implementation is advantageous because the tests performed validate minimum response times to events such as lr_enTi, lr_enWDi, lr_enD1i, lr_enD2i, lr_enInti, lr_enMutexi and lr_enSyni.
In [41] the causes of real-time performance degradation due to conventional RTOS are identified. Test results show that the impact on real-time performance is much greater than most software engineers indicate, making it extremely difficult to guarantee the worst-case execution metrics, namely WCET. On the other hand, performing the same measurements on an RTOS implemented in hardware (HW-RTOS) shows that the impact on real-time performance is minimal. Since HW-RTOS allows the worst-case execution values to be defined at the design stage, it reduces some design-related tasks for software developers, thus facilitating the real-time development of the embedded system and at the same time ensuring guaranteed performance.
The major advantage of HW_nMPRA_RTOS is that, although it was designed for the single thread mode, the task contexts are not affected by the remapping operation and, implicitly, the CPU performance is not degraded either. Unlike other implementations that use an external coprocessor to run the scheduling algorithm, HW_nMPRA_RTOS uses an internal scheduler that does not induce overhead due to an interprocessor communication mechanism, nor does it require additional time for arbitrating interconnection buses. Most modern RTOSs have implemented separate software blocks for resource sharing, synchronization and communication between tasks, which can only be evaluated sequentially, but not simultaneously. The solution proposed in this study allows simultaneous evaluation through implementation in hardware, which enables a parallel evaluation of the events in the system.

Application fields
This chapter presents some examples of applications using the new processor architecture implemented for 4, 8 or 16 tasks, with a pre-emptive embedded scheduler based on priorities. Because of its real-time nature, the proposed concept is easy to use in the automotive industry for managing the steering system of four-wheel drive cars, ABS (Antilock Braking System) or ESC (Electronic Stability Control).
To successfully use the HW-RTOS in the implementation of a SAFE & CONTROL module in the industrial field, it is necessary to ensure the determinism of the control system by organizing tasks and interrupts in a unified priority space. Thus, the implemented processor can successfully use a static scheduling scheme for managing linear and rotational measurement systems. Two tasks perform the reading and processing of signals that correspond to a single axis, and two are used to provide the system's safety function. HW_nMPRA_RTOS can even be integrated into a set of Building Internet of Things (BIoT)-based smart switches.
Fig. 12 a FPGA utilization statistics for implementing different CPU architectures (NIOS core [40]), b Slice percentage occupation of the proposed HW_nMPRA_RTOS systems measured on the Xilinx Virtex-7 based on MIPS32 datapath [33]
Another sector where the proposed processor can be used in a 16 HW_thread_i configuration is the monitoring and control of industrial processes. As an example of a general-purpose application that can be implemented in practice, task grouping and interrupts make it convenient to organize the 16 instPi to statically or dynamically schedule a reasonable number of jobs. The interrupts are handled individually because they have the highest priority, the speed gained from the context switching operation being used to the maximum.

Conclusions and future work
Although it is a resource multiplexing architecture, the HW_nMPRA_RTOS processor can be ported to various hardware platforms with minor changes to the SoC project. To do this, the clock module, BRAM, UART, I/O, and the .xdc constraint file must be changed. The proposed hardware RTOS implementation can be improved by optimizing the real-time event handler module and the CPU datapath.
That being said, we need to mention that the standards used in the automotive industry, such as ISO26262, impose important requirements regarding the safety of embedded control systems. The replicated resource architecture for reconfigurable systems can be improved by designing the local layer of the scheduler as a coprocessor, to take advantage of professional compiler facilities. With respect to the novelty of the paper, we believe that it makes the following contributions:
1. A pre-emptive scheduler based on an interrupt system, mutexes, message events and deadlines has been implemented at the level of coprocessor 2 (COP2).
2. The authors have taken into account that any pipeline storage element had to be multiplied in the same way as the other multiplied resources.
3. The SoC project and the hardware accelerated RTOS were modelled and tested using the Virtex-7 development kit, which consisted of the individual validation of the real-time event handling modules, including the multiplexed resources on which the processor is based.
In the case of safety-critical applications, the implementation of a Memory Protection Unit (MPU) could be a necessary extension for the proposed hardware RTOS support. This module should protect memory, ensure stability and safety, and guarantee CPU performance in real-time applications. This ideal component for embedded applications must meet rigorous safety-critical standards and requires certain certifications necessary in fields such as automotive, medical electronics, or industrial applications. In these fields, cost-effectiveness is a substantial advantage for saving resources used in the design process. The project allows future researchers to improve the datapath or to implement a quad-core version of the proposed processor.
Author contributions ZI contributed to software, data curation, writing original draft preparation, writing review and editing. GVG contributed to conceptualization, software, data curation, writing original draft preparation, writing review and editing.
Funding This research was funded by the project "119722/Centru pentru transferul de cunoștințe către întreprinderi din domeniul ICT-CENTRIC-Contract subsidiar 21773/04.10.2022/DIGI-TOUCH/ Fragar Trading", contract no. 5/AXA 1/1.2.3/G/13.06.2018, cod SMIS 2014+ 119722 (ID P_40_305), using the infrastructure from the project "Integrated Center for research, development, and innovation in Advanced Materials, Nanotechnologies, and Distributed Systems for fabrication and control", contract no. 671/09.04.2015, Sectoral Operational Program for Increase of the Economic Competitiveness cofunded from the European Regional Development Fund.

Availability of data and materials
The data and material used to support the findings of this study are available from the corresponding author upon request.