LExCI: A Framework for Reinforcement Learning with Embedded Systems

Advances in artificial intelligence (AI) have led to its application in many areas of everyday life. In the context of control engineering, reinforcement learning (RL) represents a particularly promising approach as it is centred around the idea of allowing an agent to freely interact with its environment to find an optimal strategy. One of the challenges professionals face when training and deploying RL agents is that the latter often have to run on dedicated embedded devices. This could be to integrate them into an existing toolchain or to satisfy certain performance criteria like real-time constraints. Conventional RL libraries, however, cannot be easily utilised in conjunction with that kind of hardware. In this paper, we present a framework named LExCI, the Learning and Experiencing Cycle Interface, which bridges this gap and provides end-users with a free and open-source tool for training agents on embedded systems using the open-source library RLlib. Its operability is demonstrated with two state-of-the-art RL-algorithms and a rapid control prototyping system.


Introduction

RL, Control Tasks, and Embedded Systems
In recent years, artificial intelligence (AI) has evolved into a scientific discipline with tangible effects on the lives of ordinary people. Not only does it allow for convenience features such as speech recognition or auto-completion when writing [1], but it is also increasingly being utilised to control complex devices and even safety-critical systems [2,3]. Modern advanced driver-assistance systems (ADAS), not to mention autonomous driving, would be unimaginable without it [4].
Reinforcement learning (RL) is an especially useful area of AI when it comes to control tasks. Since it is based on agents that learn through their own interactions with the environment (i.e. they generate their own training data), RL has the potential to find optimal solutions to non-trivial problems with minimal input from experts. One problem engineers have to address, though, is the integration of the RL agent into the system it shall control. Industrial applications often come with a long list of strict requirements regarding their information technology (IT) ecosystems: physical space, power, and cooling capacity are usually limited [5]. At the same time, devices need to be rugged enough to withstand vibrations or extreme fluctuations in temperature. Beyond such hardware-related matters, a great number of use-cases necessitate a real-time operating system (OS) which guarantees that computations are performed within a fixed time window [5]. Then, there is the cost factor. High-performance components needlessly drive up the prices of commercial products if their potential is not fully harnessed. A cheaper device is therefore more favourable so long as it is adequate for its task [6].
As a consequence of these boundary conditions, traditional personal computers (PCs) are not suitable for a wide range of applications. Professionals choose embedded systems instead: dedicated computers that are integrated into a larger system for the purpose of controlling the same [6]. Embedded systems are designed from the ground up to meet the requirements outlined above. Nonetheless, they can be incapable of running programs intended for conventional computers due to their inherent limitations. Established RL libraries like Ray/RLlib [7,8] or Stable-Baselines3 [9] further rely on third-party software (e.g. Python) which might take up too much data storage space or simply not be available on the target platform/OS. Part of that list of dependencies are libraries for machine learning (ML) models, i.e. the mathematical structures (most notably neural networks (NNs)) which, among other things, represent the behaviour of the agent. Prominent exemplars, for instance TensorFlow (TF) [10] or PyTorch [11], suffer from the same problems, meaning that merely executing a trained agent on an embedded system may not be a straightforward endeavour [12].

Model Execution
Even if an ML library cannot be installed on an embedded device, there are still ways to put its agents to use. The simplest is to run them on external machines that are then contacted by embedded devices in order to retrieve actions for their observations. Due to the latency associated with this option, it is likely to be sub-optimal. Another detracting factor is that the agents are not executed on the actual controllers. A more fitting solution is to convert the models to a format that is suitable for the target hardware, possibly by translating them into a generic representation like Open Neural Network Exchange (ONNX) or some other intermediate format first.
TensorFlow Lite Micro (TFLM) [13], for example, condenses TF to its core functionality, optimises its code for micro-controllers [12], and reduces the number of third-party dependencies. TFLM can be thought of as a subset of TensorFlow Lite (TF Lite), a lean version of the full library geared towards mobile and edge devices. It is hence capable of reading TF Lite models as long as they are composed of common operations. Conveniently, TF can natively convert full models to TF Lite.
cONNXr [14], on the other hand, is agnostic to the model's original framework as it is written to work with models defined in the ONNX format. Other libraries take a more purist approach and implement their own model formats in C or C++ using either nothing but the respective standard library or just a handful of header-only libraries. Projects in that category are Genann [15], KANN [16], tiny-dnn [17], or MiniDNN [18]. End-users have to manually re-write and configure their models with those solutions, though, because they typically lack converters.
Besides the above, there are solutions that transpile existing model formats to pure C/C++ code which is then compiled for the target hardware [5]. frugally-deep [19], keras2cpp [20], or onnx2c [21] follow that philosophy. Likewise, MATLAB is capable of generating code from imported ONNX models when using its Reinforcement Learning Toolbox [22].

Training the Model
Training, that is, the act of updating an agent's model, is performed using RL libraries like the aforementioned Ray/RLlib. Given the limitations of most embedded systems, this step is usually outsourced to a powerful workstation or a cluster so as to merely deploy the agent on the target hardware [5]. If not automated, this TinyML [23] strategy becomes tedious when learning with on-policy RL algorithms (cf. Sec. 2.1) due to the fact that the deployment process must be repeated after each and every modification of the agent. To make matters worse, the generated training data usually cannot be passed directly to the algorithm either and requires post-processing.

Proposed Solution
Motivated by the shortcomings of RL software in this area, we developed the Learning and Experiencing Cycle Interface or LExCI for short. This general-purpose framework allows experts to easily train RL agents with Ray/RLlib when model execution happens on an embedded system and training takes place on another, conventional machine. All models are implemented in TFLM/TF Lite and TF, respectively. LExCI is open-source and freely available to the public through its official GitHub repository (see Sec. 5 for the link). Our contributions are:
Earlier versions of the software have already proven themselves in academic research. In [24] and [25], an agent was trained to control the high-pressure exhaust gas recirculation (EGR) valve of a Euro 6d Diesel engine on different X-in-the-loop (XiL) virtualisation levels, in part by utilising LExCI's transfer learning (TL) capabilities. The resulting strategy led to lower NOx and soot emissions while maintaining the same performance as a virtual and a real engine control unit (ECU). Similarly, [26] applied the framework to learn a control strategy for the variable-geometry turbocharger (VGT) in the same setup which likewise achieved reductions in emissions and better performance than the reference. In [27], LExCI was embedded into a cloud-based service in order to train an agent to control the longitudinal acceleration of an electric vehicle.
To the best of our knowledge, there are no comparable solutions for bringing RL and embedded devices together. The only close contribution is [28] where the authors present a conceptually similar toolchain that employs a modified version of keras-rl's [29] Deep Deterministic Policy Gradient (DDPG) implementation to train an agent that is executed on a rapid control prototyping (RCP) system. Their program is designed such that it could interface with various algorithm implementations from different libraries and it requires the third-party tool ControlDesk to access the embedded system. The NNs on the embedded side were hand-coded by the authors in MATLAB/Simulink and are limited to fully-connected feed-forward networks. In comparison, LExCI offers more flexibility regarding the control software, design of NNs, and the choice of RL algorithms.
The remainder of this paper is structured as follows: First, the foundations of RL and two state-of-the-art RL algorithms are expounded in Sec. 2. After describing LExCI and its inner workings in Sec. 3, Sec. 4 summarises the experiments that were conducted to showcase the viability of the framework and discusses the results. Finally, Sec. 5 recapitulates LExCI's performance, its strengths, and how it can be extended in the future.

Theoretical Background
In order to understand the manner in which LExCI operates, it is crucial to cover the theory behind RL. Along with the general concepts, this section delineates two state-of-the-art algorithms and their distinct requirements regarding the framework.

Reinforcement Learning
RL is an ML paradigm based on the concept of training an agent by letting it freely interact with its environment. The experiences that are generated in the process are collected and utilised to update the agent's policy such that the cumulated reward it receives for its behaviour is maximised. [30]
The mathematical foundation of the environment is a time-discrete Markov decision process (MDP) defined by the four-tuple $(S, A, P, R)$, that is
• the set of all possible states $S$,
• the action space $A$,
• the transition probability function $P : S \times A \times S \to [0, 1]$, and
• the reward function $R : S \times A \times S \to \mathbb{R}$.
During an interaction, the agent observes the current state $s_t \in S$ and chooses an action $A \ni a_t \sim \pi_\theta(\cdot|s_t)$. This causes the environment to transition into the next state $s'_t = s_{t+1} \in S$ with a probability of $P(s'_t|s_t, a_t)$ and the reward $r_t = R(s_t, a_t, s'_t)$ is given. The flag $d$ indicates whether $s'$ is a terminal state ($d = 1$) or not ($d = 0$). The action distribution $\pi_\theta : S \times A \to [0, 1]$ with configurable parameters $\theta$ is the agent's policy and determines its strategy. An episode or trajectory is a sequence of such interactions. The goal of RL is to tweak $\theta$ in order to maximise the discounted return with a discount factor $\gamma$ (Eq. 1) or the expected return (Eq. 2).
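For orientation, the two objectives named above are commonly written as shown below. This is a sketch in the standard textbook form and not a reproduction of Eq. 1 and Eq. 2: the trajectory symbol $\tau$, the horizon $T$, and the names $G$ and $J$ are assumptions introduced here for illustration.

```latex
% Discounted return of a trajectory \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)
G(\tau) = \sum_{t=0}^{T} \gamma^{t} \, r_t , \qquad \gamma \in (0, 1]

% Expected return when trajectories are generated with the policy \pi_\theta
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ G(\tau) \right]
```

With $\gamma < 1$, rewards that lie further in the future contribute less to $G(\tau)$, which is why the discount factor is often read as the agent's planning horizon.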
One prominent optimisation method is gradient ascent which performs iterative update steps with a learning rate $\eta \in \mathbb{R}$. The policy is typically implemented as an NN in which case the parameter set $\theta$ consists of its weights and biases. [30]
There are three key metrics to quantify how well an agent fares in a certain situation: The value function (VF) $V^{\pi_\theta}$ (Eq. 4) estimates the return at a state $s \in S$ when acting on-policy (i.e. when choosing actions according to the current policy) from there on. Similarly, the action-value function or Q-function $Q^{\pi_\theta}$ (Eq. 5) estimates the return when taking an action $a \in A$ at a state $s \in S$ on the assumption that all following actions are on-policy. The advantage function $A^{\pi_\theta}$ (Eq. 6) is the difference of the two and measures how much better it is to take an action compared to what the policy would do. [30]
Approximations of the above are denoted as $\hat{V}^{\pi_\theta}$, $\hat{Q}^{\pi_\theta}$, and $\hat{A}^{\pi_\theta}$, respectively. An important property that distinguishes RL algorithms is whether they insist that the actions in their training data be sampled using the current policy. Those that do are called on-policy, the rest off-policy. Furthermore, if the algorithm has access to a model of the environment or learns one for the purpose of predicting the outcome of actions, it is called model-based, otherwise model-free. It has to be noted that this model is distinct from the agent's behaviour model or any of its value function approximators. [30,31]

Algorithms

Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a state-of-the-art model-free, on-policy RL algorithm for discrete and continuous action spaces. It features a surrogate loss function whose scaled advantages are clipped to avoid excessively large update steps that could destabilise the training. To that end, PPO trains a VF approximator in addition to the policy. [32,33]
The algorithm first garners a train batch, i.e. a defined number of experiences, using its current parameters $\theta$. When updating the agent, subsets known as mini-batches are drawn therefrom to perform multiple steps of stochastic gradient descent (SGD) to minimise the loss function given in Eq. 7. There, $\xi$ denotes the policy's parameter set after an SGD step. The individual components of Eq. 7 are the clipped surrogate objective with an adaptive coefficient $\beta \in \mathbb{R}$, the squared error of the VF approximator and its coefficient $c_{VF} \in \mathbb{R}$, and an optional entropy bonus $S(\chi, \xi)$ with its coefficient $c_S \in \mathbb{R}$ to encourage exploration. [32,33]
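The clipping mechanism described above can be illustrated for a single experience. The following is a minimal sketch of the per-sample clipped surrogate term only; the VF error, the entropy bonus, and the $\beta$-weighted part of Eq. 7 are left out. The function name and the default clip range of 0.2 are assumptions made for this example and are not taken from LExCI or RLlib.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (a quantity to be maximised).

    logp_new / logp_old are the log-probabilities of the taken action under
    the updated and the data-generating policy, respectively.
    """
    # Probability ratio between the updated and the old policy.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to move the policy too far.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`; for a negative advantage, it stops improving once the ratio falls below `1 - clip_eps`.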

Deep Deterministic Policy Gradient
The DDPG algorithm is a modern model-free, off-policy RL method which extends the idea of the Deep Q-Network (DQN) algorithm to continuous action spaces. Since its policy is deterministic, exploration is achieved by adding random noise, e.g. from a Gaussian distribution or an Ornstein-Uhlenbeck (OU) process, to its output. [34,35]
DDPG trains an NN with parameters $\theta_Q$ as an approximation $\hat{Q}_{\theta_Q}$ of the Q-function when acting greedily and another NN with parameters $\theta_\mu$ for the deterministic policy $\mu_{\theta_\mu}$ that seeks to maximise $\hat{Q}_{\theta_Q}$. To stabilise training, target networks $\hat{Q}_{\theta'_Q}$ and $\mu_{\theta'_\mu}$ with parameters $\theta'_Q$ and $\theta'_\mu$ are employed. The Q-network is trained by minimising the mean-squared Bellman error computed with the target networks, and the policy is updated by performing gradient ascent using $\nabla_{\theta_\mu} \hat{Q}_{\theta_Q}(s, \mu_{\theta_\mu}(s))$. [34,35]
The target networks are updated via Polyak averaging, i.e.
$$\theta'_Q \leftarrow \rho \, \theta_Q + (1 - \rho) \, \theta'_Q, \qquad \theta'_\mu \leftarrow \rho \, \theta_\mu + (1 - \rho) \, \theta'_\mu$$
for $\rho \ll 1$. Also, batches are sampled from a replay memory buffer which can be supplemented with off-policy experiences. [34,35]

Software
This section describes LExCI's components, how it operates, and the steps one has to take in order to set it up for a new RL problem. Furthermore, the RL Block, a plug-and-play Simulink model that encapsulates all necessary parts to execute an agent's policy model and to store experiences, is presented.

Architecture and General Workflow
LExCI is logically divided into two domains as illustrated in Fig. 1: The first is the learning side of the framework with the LExCI Master at its head. Its counterpart is the data generation side where the LExCI Minion is located. To understand their roles and how they work together, it is best to have a look at the framework's modus operandi. As an aid, Fig. 2 complements Fig. 1 with the chronological order of the steps. The sections highlighted there shall be used as a guide.
Section I The LExCI Master makes use of a slightly modified version of Ray/RLlib 1.13.0 via the library's Python application programming interface (API). At program startup, it loads a JSON-formatted configuration file containing the parameters of the training. These include the characteristics of the problem (the dimensions of the observation and action space, whether actions are continuous or discrete, etc.), general settings (networking details, where to store logs and results, and the like) as well as the algorithm's hyperparameters (the architecture of the agent's NN(s), the learning rate (LR)/schedule, or batch sizes to name a few). The Master initialises the agent based on the settings above before proceeding to its main loop for training. In addition to being the gateway to the RL library, the Master acts as a server and listens for incoming TCP/IP connections from LExCI Minions. Established connections are constantly monitored for their status and closed if the opposite side stops sending heartbeats (e.g. after a program crash) or takes too long to finish its task. Thus, the system is able to cope with unforeseen events.
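To make the three groups of settings more tangible, the sketch below parses a small JSON configuration with Python's standard library. The key names and values are purely hypothetical and do not reflect LExCI's actual configuration schema; only the grouping into problem characteristics, general settings, and hyperparameters follows the description above.

```python
import json

# Hypothetical configuration; all keys and values are illustrative assumptions.
CONFIG_TEXT = """
{
  "problem": {"obs_size": 3, "action_size": 1, "continuous_actions": true},
  "general": {"master_port": 5555, "output_dir": "./results"},
  "algorithm": {"name": "PPO", "hidden_layers": [64, 64], "lr": 0.0003}
}
"""

REQUIRED_SECTIONS = ("problem", "general", "algorithm")


def load_config(text):
    """Parse the JSON text and check that all expected sections exist."""
    config = json.loads(text)
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise KeyError(f"missing configuration sections: {missing}")
    return config


config = load_config(CONFIG_TEXT)
```

Validating the file once at startup lets a misconfigured training fail early instead of in the middle of a cycle.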
Sections II & III Training is carried out by completing so-called cycles. At the beginning of a cycle, the LExCI Master retrieves the agent's current policy from RLlib and converts it from its original TF format to TF Lite. This model, along with all relevant training parameters (e.g. the number of experiences to generate), is broadcast to the connected LExCI Minions using a custom JSON-based protocol. Upon receipt, each Minion utilises the API of its control software to overwrite the policy on the embedded device which is then prompted to generate experiences. Additional pieces of hardware can be part of the data generation domain and interact with the embedded device. Besides the closed-loop control system, those include physical actuators or sensors.
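Since JSON cannot carry binary data directly, a model file has to be encoded as text before it can travel in such a message. The sketch below shows one common way of doing so with Base64. The message layout, the field names, and both function names are assumptions for illustration; the text above does not specify LExCI's actual protocol.

```python
import base64
import json


def encode_cycle_message(model_bytes, num_experiences, is_validation=False):
    """Pack a binary model and the cycle parameters into a JSON message."""
    payload = {
        "type": "cycle_start",  # hypothetical message type
        "model_b64": base64.b64encode(model_bytes).decode("ascii"),
        "num_experiences": num_experiences,
        "is_validation": is_validation,
    }
    return json.dumps(payload).encode("utf-8")


def decode_cycle_message(raw):
    """Inverse of encode_cycle_message; restores the model as bytes."""
    payload = json.loads(raw.decode("utf-8"))
    payload["model_bytes"] = base64.b64decode(payload.pop("model_b64"))
    return payload
```

On the receiving side, the recovered `model_bytes` would be written to the embedded device through the control software.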
Section IV Once enough data has been collected, the Minion uses the control software again to get the raw experiences and post-processes them. For instance, a domain expert could define auxiliary penalties that are added to the reward in situations where the agent's actions were clearly nonsensical.
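Such a post-processing step might look as follows. This is a minimal sketch under the assumption that experiences are dictionaries with scalar `action` and `reward` entries and that an action is deemed nonsensical when its magnitude exceeds a limit; the data layout, the criterion, and the function name are all hypothetical.

```python
def add_auxiliary_penalty(experiences, action_limit, penalty=1.0):
    """Return copies of the experiences with a penalty on implausible actions."""
    processed = []
    for exp in experiences:
        exp = dict(exp)  # copy so that the raw data stays untouched
        if abs(exp["action"]) > action_limit:
            exp["reward"] -= penalty
        processed.append(exp)
    return processed
```

Working on copies keeps the raw experiences available, e.g. for logging or for comparing shaped and unshaped returns.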
Section V The experiences are sent to the LExCI Master and arranged into training batches, i.e. the data format RLlib expects for training. During that process, experiences are supplemented with additional information if the algorithm calls for it. For example, PPO requires the predicted value of the VF approximator (see Eq. 4), the action distribution, and the probability of the action on top of the standard quantities. After the training batch has been assembled, it is given to RLlib for training the agent and the cycle starts anew.
Section VI When learning with off-policy algorithms, the LExCI Master does not remain idle while the Minions are doing their part. Instead, the Master continues training with experiences drawn from its replay memory buffer. The size of the buffer, the number of replay training steps per cycle, and the extent to which the buffer must be filled before replay training starts are set in the configuration file.
Section VII Apart from training runs, LExCI can be configured to conduct validation episodes with a defined frequency. They differ in that actions are always set to the mean of the action distribution and are hence deterministic during validations rather than being sampled stochastically. Thus, the results are more comparable and lend themselves better to assessing the agent's performance. Further, validations are conducted by a single Minion.
The master-minion architecture has the added benefit that it enables easy parallelisation. In light of the fact that embedded devices usually operate in real-time, this feature can speed up the data generation process dramatically. When there are multiple LExCI Minions available, the Master splits the workload between them so each only has to generate a fraction of the required number of experiences.
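The interplay of buffer size and minimum fill level can be sketched with a small replay memory. The class below is an illustrative stand-in and not RLlib's or LExCI's implementation; its name, its methods, and the uniform sampling strategy are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory that discards the oldest experiences first."""

    def __init__(self, capacity, min_fill):
        self._data = deque(maxlen=capacity)
        self.min_fill = min_fill

    def add(self, experience):
        self._data.append(experience)

    def ready(self):
        """True once enough experiences are stored to start replay training."""
        return len(self._data) >= self.min_fill

    def sample(self, batch_size, rng=random):
        """Draw a uniformly random batch without replacement."""
        size = min(batch_size, len(self._data))
        return rng.sample(list(self._data), size)

    def __len__(self):
        return len(self._data)
```

A training loop would call `ready()` before every replay step and skip the step while the buffer is still too empty.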
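The difference between the two modes fits into a few lines. The sketch assumes a one-dimensional Gaussian action distribution described by its mean and standard deviation, which is a common choice for continuous actions but is not stated above; the function name is hypothetical.

```python
import random


def select_action(mean, std, validation, rng=random):
    """Pick an action from a Gaussian action distribution.

    During validations the mean is returned, so the policy is deterministic;
    during training the action is sampled to allow for exploration.
    """
    if validation:
        return mean
    return rng.gauss(mean, std)
```

Because the validation branch involves no randomness, two validation episodes that start from the same state produce identical action sequences.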
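One straightforward way to divide the workload is an even split in which any remainder is spread over the first Minions. How LExCI handles the remainder is not stated above, so this rule and the function name are assumptions.

```python
def split_workload(num_experiences, num_minions):
    """Distribute the required experiences as evenly as possible."""
    if num_minions < 1:
        raise ValueError("at least one Minion is required")
    base, remainder = divmod(num_experiences, num_minions)
    # The first `remainder` Minions generate one additional experience.
    return [base + (1 if i < remainder else 0) for i in range(num_minions)]
```

With a real-time environment, the wall-clock time for data generation then scales roughly with the largest share, i.e. close to `1 / num_minions` of the single-Minion duration.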

Setup
When employing the framework for a new use-case, the Master and the Minion must first be set up. LExCI is shipped with what is called a universal Master for each RL algorithm. Those are ready-to-use Python programs that function as described in Sec. 3.1, so one merely has to select the right algorithm and adjust the parameters in the configuration file. Alternatively, users can write their own custom Master programs which create an instance of the Master class and call its main loop. The Minion is always tailor-made for the problem by writing a program that instantiates the Minion class and invokes its main loop. There, the logic for preparing the embedded system, overwriting the agent's model, running episodes, post-processing experiences, etc. is programmed. The class expects callback functions for generating training and validation data. To this effect, LExCI offers helper classes that facilitate interacting with the embedded system via a control software. At the time of writing, there are helpers for ControlDesk, MATLAB/Simulink, and ECU-TEST.
Another significant facet of the setup process involves the software that shall be running on the embedded system itself. After all, it is responsible for executing the policy NN of the agent. Users are free to implement the inference of actions in whatever way they deem fit. Having said that, it is of paramount importance that they distinguish between what are called normalised and denormalised spaces. Normalised observations and actions are the raw quantities passed to and received from NNs. Denormalised quantities, on the other hand, are the ones that the environment provides or expects. It is standard practice to, for example, min-max normalise observations (from the environment) to the range [−1, +1] (which would then be the agent's normalised observation space) to stabilise and expedite training [30]. By the same token, the normalised actions of the agent must be mapped to the allowed (denormalised) range in the environment, e.g. via a hyperbolic tangent and scaling or simply by clipping. The data that the Minion retrieves from the embedded system must always be normalised. To aid users, LExCI comes with software modules that can be used to execute the agent (neural_network_module) and to transform quantities between the spaces.
Considering how widely used MATLAB/Simulink are in the engineering domain, especially in control prototyping, LExCI's RL Block (Fig. 3) plays a prominent role in that regard. It is a ready-to-use Simulink subsystem that houses the RL-based controller such that employing it becomes as simple as copying it into the plant model, connecting its ports, and setting some basic parameters. Inside, the RL Block min-max normalises observations, feeds them to the policy NN of the agent, samples an action from the inferred action distribution, and denormalises the same before returning it. Its centrepieces are the S-Function containing the C++ code to execute the agent using TFLM and the internal experience buffer which can be accessed via the control software. The RL Block is externally triggered so that the agent can be executed at a different (i.e. slower) sample rate than the surrounding model.
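The transformations between the two spaces can be sketched as follows. The function names are chosen for this example and are not those of LExCI's modules; the sketch covers min-max normalisation of observations as well as both action mappings mentioned above (scaled hyperbolic tangent and clipping).

```python
import math


def min_max_normalise(value, low, high):
    """Map a denormalised value from [low, high] to [-1, +1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def denormalise_action_tanh(raw_action, low, high):
    """Squash a real-valued NN output with tanh and scale it to [low, high]."""
    squashed = math.tanh(raw_action)  # lies in (-1, +1)
    return low + 0.5 * (squashed + 1.0) * (high - low)


def denormalise_action_clip(raw_action, low, high):
    """Clip the NN output to [-1, +1] and scale it to [low, high]."""
    clipped = max(-1.0, min(1.0, raw_action))
    return low + 0.5 * (clipped + 1.0) * (high - low)
```

The hyperbolic tangent keeps the mapping smooth and differentiable, whereas clipping is cheaper to compute but has zero gradient outside the allowed range.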

Experiments
For this paper, LExCI was applied to the inverted pendulum swing-up problem which is a standard benchmark for continuous control. To highlight its versatility, multiple trainings were performed with the framework, each with a different RL algorithm and target system.

Pendulum Environment and Setup
In the inverted pendulum swing-up environment, a rod of length $l = 1$ m and mass $m = 1$ kg is mounted to a wall on one end with a single rotational degree of freedom (cf. Fig. 4). The objective is to apply a torque $M$ at the pivot point in every time step such that it stands upright, i.e. the angle $\phi \in (-\pi, +\pi]$ between the rod and the vertical axis as well as its angular velocity $\dot{\phi}$ become 0. The time step length is $\Delta t = 0.05$ s. Using $x = l \cdot \cos(\phi)$ and $y = l \cdot \sin(\phi)$, Tab. 1 summarises the environment's observation and action space while Eq. 14 describes its reward function. Episodes are 200 time steps long and start at a random position $\phi_0 \in (-\pi, +\pi]$ and with a random angular velocity $\dot{\phi}_0 \in [-1\,\mathrm{rad\,s^{-1}}, +1\,\mathrm{rad\,s^{-1}}]$. [36][37][38]
Tab. 1 The observation and action space of the pendulum swing-up problem.
The pendulum problem was tackled three times with LExCI:
• Python: First, purely in Python using the gym implementation of the environment [36] and LExCI's neural_network_modules (cf. Sec. 3.2) to execute the agent's policy.
• Simulink: Second, with the pendulum environment running in Simulink using the RL Block (see Sec. 3.2) and a custom model that is identical in behaviour to gym's implementation.
• MABX III: Third, with the environment running on a dSPACE MicroAutoBox (MABX) III, an RCP system commonly used for embedded control by the automotive industry. This run, too, used a custom model of the pendulum environment and the RL Block (cf. Sec. 3.2).
With each target system, one agent was trained with PPO and one with DDPG for a total of six training runs. The choice of algorithms was motivated by their wide-spread use in engineering and the fact that one is on-policy while the other is not. Observations were min-max normalised and the real-valued actions were mapped via a scaled hyperbolic tangent to the boundaries of the environment (see Sec. 3.2). The hyperparameters were chosen based on RLlib's pre-tuned configurations for the respective algorithms and extended by LExCI's custom ones. App. A lists the most important parameters.
Validations were performed every five cycles so that the agent's performance was tested frequently enough without creating too much overhead. For that purpose, the pendulum environment was initialised with $\phi_0 = \pi$ and $\dot{\phi}_0 = 0\,\mathrm{rad\,s^{-1}}$, i.e. with the rod hanging still at the six o'clock position.
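As a point of reference for the reward function, the sketch below reproduces the reward of gym's pendulum implementation [36], to which the custom models are stated to be identical in behaviour. The coefficients 0.1 and 0.001 stem from that implementation and are an assumption with respect to Eq. 14; the function names are chosen for this example.

```python
import math


def wrap_angle(phi):
    """Map an arbitrary angle to the interval [-pi, +pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def pendulum_reward(phi, phi_dot, torque):
    """Negative quadratic cost on angle, angular velocity, and torque."""
    return -(wrap_angle(phi) ** 2 + 0.1 * phi_dot ** 2 + 0.001 * torque ** 2)
```

The reward is never positive and reaches its maximum of 0 only when the rod stands upright, is at rest, and no torque is applied, which is why all episode returns reported for this environment are negative.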

Results
Given the definition of the pendulum environment and the hyperparameters that were chosen, three episodes were generated in every cycle. Fig. 5 and Fig. 6 plot their smoothed average returns over the cycle number while the unfiltered quantities can be found in Fig. B.1 and Fig. B.2 of App. B. The plots show some noteworthy characteristics of the trainings:
First, every combination of RL algorithm and target system converged towards the optimum where the agent exhibits good performance. To substantiate this claim, Fig. 8 shows the best validation run of the DDPG training on the MABX III where the agent swings the pendulum to the 12 o'clock position ($x = 1$ m and $y = 0$ m) within the first 50 time steps (i.e. in just 2.5 s) and holds it there for the remainder of the episode ($\dot{\phi} = 0\,\mathrm{rad\,s^{-1}}$). The same is true for the best PPO validation on that platform (cf. Fig. 7). Other combinations performed analogously once the training had converged (see Fig. B.3, Fig. B.4, Fig. B.5, and Fig. B.6 in App. B). Please note that the variations in maximum return are merely a result of the stochastic nature of exploration paired with the random initialisation of the environment at the beginning of every episode. They do not mean that one target system performs better than the others.
Second, all target systems display a similar course of training progression for each algorithm which proves that i) our Simulink and MABX III models of the pendulum environment are equal to gym's implementation in terms of behaviour and ii) LExCI is able to train well on various platforms. Furthermore, Fig. 5 and Fig. 6 are in accord with the results of [39] though the author used a different set of hyperparameters.
Third, all agents remain stable after convergence. The oscillations in the average cycle returns are mainly caused by the random initialisation of the pendulum which sometimes starts in more and sometimes in less advantageous states.
To further validate the results, agents were also trained on the pendulum environment without LExCI, i.e. using Ray/RLlib only. These trainings shall be referred to as native. For the sake of comparability, the environment was configured such that observations are min-max normalised and actions are mapped with a scaled hyperbolic tangent. When analysing the results in Fig. 9 and Fig. 10 and comparing them to the ones above, one has to consider two things: 1) Ray/RLlib's iterations do not directly correspond to LExCI's cycles. Because of that, the hyperparameters from App. A had to be slightly varied so as to best replicate LExCI's behaviour. This mainly affected the DDPG settings that govern how many samples are generated and how often replay data is used for training. 2) Native Ray/RLlib utilises an OU process for exploration and not Gaussian noise for DDPG. With that in mind, the average training returns display the same general progress and, more importantly, have the same minima and maxima. This proves that LExCI interfaces RLlib correctly and that the framework is able to train agents to the same level of quality as the original library setup.

Conclusion
This paper explained the importance of RL for developing today's and tomorrow's control functions and highlighted the difficulties engineers face during training and deployment of RL agents with/on embedded devices. The LExCI framework was presented as an open-source solution and its performance has been demonstrated across various target systems, including a state-of-the-art RCP system, for a classic control task. Not only did LExCI succeed in integrating those platforms into the process, the results were also on a par with what the underlying RL library can produce natively on a conventional PC. The framework enables users to apply RL to real-world engineering problems on professional hardware as has been shown in prior works. Considering that one had to resort to specialised solutions to do so in the past, LExCI facilitates the process many times over because of its generic interface to embedded devices. Moreover, the fact that it relies on free, established libraries means that end-users are not forced to content themselves with proprietary implementations. Instead, they can leverage the full expertise of the open-source communities behind said libraries and thus obtain better results.
In the future, LExCI will be updated to the latest RLlib release as the latter has since undergone a major version change. Additionally, support for more algorithms will be implemented as well as features that aid in exercising advanced techniques. For instance, the framework shall have a more extensive repertoire of TL functionalities.

Fig. 1 Software architecture of the LExCI framework with the eponymous cycle as a light green arrow. There are multiple independent instances of the data generation domain when the process is parallelised.

Fig. 2 Simplified flowchart of the LExCI framework. The grey, dashed arrows indicate communication/data exchange between the Minion and the Master. The blue areas tagged with Roman numerals serve as references for the textual description of the figure.

Fig. 3 LExCI's RL Block in Simulink. The ports observation and action are in the denormalised space of the environment.

Fig. 5 Average LExCI PPO training returns with three episodes per cycle. The data has been smoothed with a moving average filter of size 11.

Fig. 8 Best validation at cycle 45 of the LExCI DDPG training with the MABX III. The return of the episode was −367.32.

Fig. B.3 Best validation at cycle 745 of the LExCI PPO training with Python. The return of the episode was −560.49.

Fig. B.4 Best validation at cycle 45 of the LExCI DDPG training with Python. The return of the episode was −348.05.

Fig. B.5 Best validation at cycle 710 of the LExCI PPO training with Simulink. The return of the episode was −397.86.

Fig. B.6 Best validation at cycle 45 of the LExCI DDPG training with Simulink. The return of the episode was −382.50.