Learning-based Parameter Optimization for a Class of Orbital Tracking Control Laws

This paper presents a machine learning approach for tuning the parameters of a family of stabilizing controllers for orbital tracking. An augmented random search algorithm is deployed, which aims at minimizing a cost function combining convergence time and fuel consumption. The main feature of the proposed learning strategy is that closed-loop stability is always guaranteed during the exploration of the parameter space, a property that allows one to streamline the training process by restricting the search domain to well-behaved control policies. The proposed approach is tested on two case studies: an orbital transfer and a rendezvous and docking mission. It is shown that in both cases the learned control parameters lead to a significant improvement of the considered performance measure.


I. INTRODUCTION
Learning control laws from data has always been a primary objective of the control research community, leading to a vast body of results in areas such as adaptive control [1], [2], iterative learning control [3], [4], direct control estimation [5], [6] and reinforcement learning [7], [8]. Recently, this research field has seen renewed interest due to the impressive progress of reinforcement learning techniques [9], [10]. The application of such techniques to continuous control problems has indeed proven successful in addressing complex tasks [11], with specific contributions in different areas including robotic manipulation [12], [13], mobile robotics [14], locomotion [15], power systems [16] and many others.
Learning-based control approaches have recently been applied to guidance and control problems in the aerospace field, see, e.g., [17], [18]. In particular, rendezvous and docking (RVD) problems have been tackled by machine learning techniques in combination with model-based methods in [19], [20], [21], as well as by reinforcement learning [22], [23], [24]. These methods allow one to optimize a variety of performance indexes, related to different aspects of the mission to be accomplished.
The widespread application of learning-based approaches has also raised a number of fundamental questions related to issues such as stability, performance guarantees, and robustness. In fact, safety-critical applications like those involving RVD maneuvers require special care to ensure robust stability and performance of the designed guidance and control schemes. In this respect, two alternative approaches can be taken. One is the recent research effort towards results that guarantee stability of feedback schemes including neural controllers (see, e.g., [25], [26], [27], [28]). The other main line of research exploits machine learning tools to learn controllers belonging to pre-specified families, whose structure is designed to guarantee the desired properties. Among the large number of contributions, [29] is one of the first works enforcing specific parameterizations of the controller (including the Youla-Kucera one) and learning its parameters using the REINFORCE algorithm [30]. The Youla parameterization is also adopted in [31], while PID controllers are considered in [32], [33]. Learning within a family of robustly stabilizing controllers has been addressed in [34].
Herein, we follow the latter approach to address the problem of learning the parameters of a control law for orbital tracking. A major source of complexity for this type of nonlinear control problem is the requirement to ensure closed-loop stability while minimizing a nonsmooth performance index describing a trade-off between fuel consumption and maneuver completion time. To address this challenge, in this paper we restrict our attention to the family of nonlinear stabilizing controllers introduced in [35] and propose a learning strategy for tuning the controller parameters. The learning algorithm can be seen as a specialized version of REINFORCE and requires only the computation of the cost value associated with a simulation of the closed-loop control system (episode). The main benefit of the proposed approach is that closed-loop stability is guaranteed for each episode during the exploration of the parameter space. Besides ensuring that the learned control policy is stabilizing, this feature also significantly speeds up the learning process. Simulated case studies show that the resulting controller provides the desired trade-off between settling time and fuel consumption in two relevant RVD missions.
The paper is organized as follows. Section II reviews the orbital tracking model, and introduces the class of stabilizing controllers along with the associated optimal control problem. The learning algorithm is presented in Section III. The case studies concerning an orbital transfer and a rendezvous mission are reported and discussed in Section IV, while Section V contains conclusions and future developments.

Notation
$\mathbb{R}^n$ is the real $n$-space, and $\mathbb{Z}$ denotes the set of integer numbers; for a real vector or matrix $x$, $x^T$ denotes its transpose and $\|x\|$ its Euclidean norm. The symbol $0_{n \times m}$ denotes a null $n \times m$ matrix, while the identity matrix of order $n$ is denoted by $I_n$. The partial derivative $\partial f / \partial x$ is expressed as a row vector. To save space, $\cos(\cdot)$ and $\sin(\cdot)$ are abbreviated as $c(\cdot)$ and $s(\cdot)$, respectively. Moreover, we define the rotation matrix

II. PROBLEM FORMULATION
In this paper, the dynamics of an orbiting spacecraft are described in terms of the six equinoctial orbital elements $\psi = [\psi_1 \ldots \psi_6]^T = [L, p, e_X, e_Y, h_X, h_Y]^T$, where $L$ is the true longitude, $p$ is the orbit semi-parameter, $e_X$, $e_Y$ are the components of the eccentricity vector, and $h_X$, $h_Y$ are the components of the inclination vector [36]. The dynamics are given by $\dot{\psi} = f(\psi) + g(\psi)u$, where $u \in \mathbb{R}^3$ is the control vector (radial, transverse and normal forcing accelerations, respectively), and $\mu$ is the gravitational parameter of the central body. On any unforced orbit, only the true longitude $\psi_1$ varies in time.
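As an illustration of the state vector, the following minimal sketch maps classical orbital elements to $[L, p, e_X, e_Y, h_X, h_Y]$. It assumes the standard modified equinoctial convention, in which the inclination vector is scaled by $\tan(i/2)$; the convention of [36] may differ. The function name and the numerical values are hypothetical.

```python
import math


def classical_to_equinoctial(a, e, inc, raan, argp, nu):
    """Return [L, p, e_X, e_Y, h_X, h_Y] from classical elements.

    Angles are in radians; p inherits the length unit of a.
    Assumed convention: modified equinoctial elements, with the
    inclination vector scaled by tan(i/2).
    """
    p = a * (1.0 - e ** 2)                 # semi-parameter
    e_x = e * math.cos(raan + argp)        # eccentricity vector components
    e_y = e * math.sin(raan + argp)
    h_x = math.tan(inc / 2.0) * math.cos(raan)  # inclination vector components
    h_y = math.tan(inc / 2.0) * math.sin(raan)
    L = raan + argp + nu                   # true longitude
    return [L, p, e_x, e_y, h_x, h_y]


# Hypothetical orbit: a = 7000 km, e = 0.1, i = 30 deg, RAAN = 40 deg,
# argument of periapsis = 20 deg, true anomaly = 0.
psi = classical_to_equinoctial(
    7000.0, 0.1, math.radians(30.0), math.radians(40.0), math.radians(20.0), 0.0
)
```

For a circular equatorial orbit ($e = 0$, $i = 0$) this mapping gives $e_X = e_Y = h_X = h_Y = 0$ and $p = a$, which is one reason the equinoctial set is convenient for targets such as a geostationary orbit.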
The considered control task is to track a target reference trajectory $\psi_r(t)$, where $\psi_r(t)$ satisfies the unforced periodic dynamics $\dot{\psi}_r = f(\psi_r)$ with given initial conditions $\psi_r(0)$. The dynamics of the tracking error $\tilde{\psi} = \psi - \psi_r$ are modeled as in [35] using the transformed variables $x$ defined by the transformation (1), which is such that $x = 0$ if and only if $\tilde{\psi} = 0$. The corresponding dynamic model is given by system (2), whose vector fields are periodically time-varying with the same period as the reference trajectory.
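The unforced reference dynamics can be sketched numerically as follows. The sketch assumes the standard two-body expression for the true longitude rate, $\dot{L} = \sqrt{\mu p}\,\big((1 + e_X c(L) + e_Y s(L))/p\big)^2$, with the other five elements held constant; the function names, the Earth value of $\mu$ and the sample orbit are assumptions made for illustration.

```python
import math

MU_EARTH = 398600.4418  # km^3/s^2, Earth assumed as the central body


def longitude_rate(L, p, e_x, e_y, mu=MU_EARTH):
    """True longitude rate on an unforced orbit (two-body motion)."""
    w = 1.0 + e_x * math.cos(L) + e_y * math.sin(L)
    return math.sqrt(mu * p) * (w / p) ** 2


def propagate_reference(psi_r0, t_end, steps, mu=MU_EARTH):
    """Integrate the reference with RK4; only L changes, as in the text."""
    L, p, e_x, e_y, h_x, h_y = psi_r0
    dt = t_end / steps
    for _ in range(steps):
        k1 = longitude_rate(L, p, e_x, e_y, mu)
        k2 = longitude_rate(L + 0.5 * dt * k1, p, e_x, e_y, mu)
        k3 = longitude_rate(L + 0.5 * dt * k2, p, e_x, e_y, mu)
        k4 = longitude_rate(L + dt * k3, p, e_x, e_y, mu)
        L += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return [L, p, e_x, e_y, h_x, h_y]


def orbital_period(p, e_x, e_y, mu=MU_EARTH):
    """Keplerian period from the semi-parameter and eccentricity vector."""
    a = p / (1.0 - (e_x ** 2 + e_y ** 2))
    return 2.0 * math.pi * math.sqrt(a ** 3 / mu)


# Hypothetical reference orbit with eccentricity 0.1.
psi_r0 = [0.0, 7000.0, 0.1, 0.0, 0.0, 0.0]
T = orbital_period(psi_r0[1], psi_r0[2], psi_r0[3])
psi_r_T = propagate_reference(psi_r0, T, 2000)
```

After one orbital period the true longitude has advanced by $2\pi$ while the remaining elements are unchanged, which is the periodicity inherited by the time-varying vector fields of the error model.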
In [35], a class of stabilizing control laws for system (2) is proposed. In this paper, we consider a parametric family of controllers that falls within the class introduced in [35], given by the control law (3) with gains $K_i$, $i = 1, \ldots, 5$, and by the associated function $V$ in (4). The expressions of $\xi$ and $\frac{\partial V}{\partial x} H$ in (3) are omitted for brevity. By using (4) as a Lyapunov function, it has been proved that the control law (3) ensures almost global asymptotic stability of the origin of the closed-loop system (2), (3) for all $K_i > 0$, $i = 1, \ldots, 5$ (see [35] for details). However, assessing the performance of such a design is not trivial.
The goal of this paper is to tune the parameters $K$ of the stabilizing controller (3) so as to optimize the performance of the control system in terms of a trade-off between the settling time and the fuel consumption. To this purpose, let us denote by $y$ the distance between the actual and reference spacecraft position, expressed in Cartesian coordinates. This can be seen as an output signal of system (2), defined as in (5), where the mapping $Y$ is obtained from (1) and the transformation which relates the satellite equinoctial elements to the corresponding inertial Cartesian states [37]. System (2), (5) with control law (3) is simulated over a horizon of length $T_e$ (each simulation is called an episode). The input and output values collected at sampling times $kT_s$, $k = 0, \ldots, H$, with $T_e = HT_s$, are denoted as $u(k)$ and $y(k)$, respectively. The performance index to be minimized is then designed as in (6), where the quantity defined in (7) is the number of samples necessary to achieve practical convergence and $\epsilon$ is a suitable threshold depending on the mission objective. The parameter $\rho$ is used to trade off the two conflicting requirements of minimizing completion time and fuel.
In order to optimize (6) with respect to the controller parameter vector K, a learning-based approach is pursued, as detailed in the next section.
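To make the structure of such a performance index concrete, the sketch below evaluates one episode from sampled signals. The functional form is an assumption made for illustration only, not the index (6)-(7) of this paper: the convergence index is taken as the first sample after which $y(k) \le \epsilon$ holds until the end of the episode, fuel is taken as the sum of the input norms, and the two terms are combined as (convergence samples) $+\ \rho\,\cdot$ (fuel). All names are hypothetical.

```python
import math


def settling_index(y, eps):
    """First index k_c such that y[k] <= eps for every k >= k_c.

    Returns len(y) when the episode never settles below eps
    (assumed convention).
    """
    k_c = len(y)
    for k in range(len(y) - 1, -1, -1):
        if y[k] <= eps:
            k_c = k
        else:
            break
    return k_c


def episode_cost(y, u, eps, rho):
    """Assumed cost: convergence samples plus rho times summed input norms."""
    fuel = sum(math.sqrt(sum(c * c for c in u_k)) for u_k in u)
    return settling_index(y, eps) + rho * fuel


# Hypothetical episode: the distance re-crosses the threshold at k = 3,
# so convergence is only declared from k = 4 onwards.
y = [10.0, 5.0, 0.5, 2.0, 0.8, 0.3]
u = [[3.0, 4.0, 0.0]] + [[0.0, 0.0, 0.0]] * 5
J = episode_cost(y, u, eps=1.0, rho=2.0)
```

The threshold crossing makes a cost of this kind piecewise constant in the convergence term, which illustrates why the index is nonsmooth in the controller parameters and why a derivative-free search is a natural choice.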

III. LEARNING ALGORITHM
Learning the parameter vector $K \in \mathbb{R}^q$ that optimizes the cost $J$ in (6) can be cast as a random exploration of the parameter space. A classical approach is the so-called random search, which is based on the finite difference approximation (8) of the cost function along a random search direction $\delta \in \mathbb{R}^q$. In (8), $J(K^{(i)})$ denotes the value of the cost (6) in an episode with parameter vector $K^{(i)}$, and $\sigma$ is a positive constant. Then, the parameter vector $K$ is updated by taking a step along the direction $\delta$, proportional to the finite difference (8).
An improved version of this approach is the Augmented Random Search (ARS) algorithm proposed in [38], in which multiple random search directions $\delta_j \in \mathbb{R}^q$, $j = 1, \ldots, N$, are selected in order to enhance the exploration of the parameter space. In this paper, vectors $\delta_j$ are drawn from a normal distribution with zero mean and covariance matrix $\Sigma_\delta$. For each direction, two perturbed parameter vectors $K_+^{(j)}$, $K_-^{(j)}$, $j = 1, \ldots, N$, are generated (in opposite directions). Then, the system is simulated for $2N$ episodes, one for each perturbed parameter, and the corresponding costs $J_+^{(j)}$, $J_-^{(j)}$, $j = 1, \ldots, N$, are computed according to (6). Finally, the parameter vector is updated along a direction which is a weighted average of the random search vectors, according to the cost variation along each $\delta_j$. The update step is scaled by the standard deviation $\sigma_J$ of the cost values associated with the $2N$ episodes. The entire procedure is summarized in Algorithm 1. The outcome of the learning procedure is the value of the parameter vector at the last iteration, denoted by $K^*$.

Algorithm 1 Augmented Random Search (ARS)
1: Hyperparameters: number $M$ of iterations, stepsize $\alpha$, number $N$ of sampled directions per iteration, perturbation step $\sigma$, maximum length $H$ of each episode, covariance matrix $\Sigma_\delta$ of the normal distribution for sampling vectors $\delta_j$.
2: Initialize: parameter vector $K^{(1)} \in \mathbb{R}^q$.
3: for $i = 1, 2, \ldots, M$ do
4:   Sample independent vectors $\delta_j \in \mathbb{R}^q$, $j = 1, \ldots, N$, from a normal distribution with zero mean and covariance matrix $\Sigma_\delta$
5:   for $j = 1, \ldots, N$ do
6:     Define perturbed parameter vectors $K_+^{(j)} = K^{(i)} + \sigma \delta_j$, $K_-^{(j)} = K^{(i)} - \sigma \delta_j$
7:     Simulate system with parameters $K_+^{(j)}$ and $K_-^{(j)}$
8:     For each simulation, compute costs $J_+^{(j)}$, $J_-^{(j)}$
9:   end for
10:   Compute the standard deviation $\sigma_J$ of the $2N$ costs
11:   Update control parameters as $K^{(i+1)} = K^{(i)} - \frac{\alpha}{N \sigma_J} \sum_{j=1}^{N} \big(J_+^{(j)} - J_-^{(j)}\big) \delta_j$
12: end for
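The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the C++ implementation used for the experiments: the covariance matrix is taken as diagonal, the closed-loop simulation is replaced by a hypothetical toy cost, and positivity of the parameters is enforced by clipping at a small bound `k_min` (the text does not state how positivity is maintained).

```python
import random
import statistics


def ars(cost, K0, M, N, alpha, sigma, delta_std, k_min=1e-6, seed=0):
    """Augmented random search over positive parameter vectors.

    cost      : callable mapping a parameter list to a scalar episode cost
    delta_std : per-component standard deviations of the search directions
                (assumption: diagonal covariance matrix)
    k_min     : lower clipping bound keeping every gain positive (assumption)
    """
    rng = random.Random(seed)

    def clip(v):
        return [max(x, k_min) for x in v]

    K = clip(K0)
    for _ in range(M):
        # N random search directions, each explored in both senses.
        deltas = [[rng.gauss(0.0, s) for s in delta_std] for _ in range(N)]
        J_plus, J_minus = [], []
        for d in deltas:
            J_plus.append(cost(clip([k + sigma * dj for k, dj in zip(K, d)])))
            J_minus.append(cost(clip([k - sigma * dj for k, dj in zip(K, d)])))
        # Scale the step by the standard deviation of the 2N costs.
        sigma_J = statistics.pstdev(J_plus + J_minus)
        if sigma_J == 0.0:
            continue  # all costs equal: no descent information, skip update
        scale = alpha / (N * sigma_J)
        K = clip([
            k - scale * sum(
                (jp - jm) * d[c] for jp, jm, d in zip(J_plus, J_minus, deltas)
            )
            for c, k in enumerate(K)
        ])
    return K


def toy_cost(K):
    """Hypothetical stand-in for one simulated episode."""
    target = [2.0, 0.5]
    return sum((k - t) ** 2 for k, t in zip(K, target))


K_init = [1.0, 1.0]
K_learned = ars(toy_cost, K_init, M=300, N=4, alpha=0.02, sigma=0.1,
                delta_std=[1.0, 1.0])
```

Because every candidate passed to `cost` has strictly positive entries, each simulated episode would use a controller from the stabilizing family of Section II, which is the property that keeps the exploration restricted to well-behaved policies.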

IV. NUMERICAL SIMULATIONS
To demonstrate the benefit of the proposed methodology, in this section Algorithm 1 is employed for tuning the parameter vector $K = [K_1, \ldots, K_5]^T$ of the control law proposed in Section II, for two different case studies. In particular, an orbital transfer from a geostationary transfer orbit (GTO) to a geostationary Earth orbit (GEO), and a rendezvous mission performed in a low Earth orbit (LEO), are considered. Algorithm 1 is implemented in C++ on a 3.10 GHz CPU with 16 cores, exploiting parallel computing. The hyperparameters utilized in the learning algorithm are reported for each scenario in the corresponding subsection.

A. Orbital transfer
In this case study, the objective is to steer the satellite from an initial GTO (semi-major axis $a = 24364$ km, eccentricity $e = 0.7306$, inclination $i = 63$ deg, right ascension of the ascending node $\Omega = 75$ deg, argument of periapsis $\omega = 52$ deg, initial true longitude $L = \pi/6$) to a circular equatorial GEO (semi-major axis $a_r = 42165$ km). The sampling time is $T_s = 45$ min. The parameters in (6)-(7) and the covariance matrix $\Sigma_\delta$ for the search directions are fixed for this scenario. The evolution of the parameter vector $K^{(i)}$ is depicted in Fig. 1, while Fig. 2 displays the corresponding cost $J$ defined by (6). The algorithm converges in fewer than 1000 iterations and leads to a cost reduction of about 82% with respect to the initial maneuver cost. Fig. 3 shows the distance $y$ in (5) as a function of time, for all the iterations generated by the learning algorithm, where the black and red lines correspond to the first and last iteration, respectively. It can be noticed that optimizing the control parameters leads to a significant reduction of the flight time and that all the closed-loop system trajectories achieve asymptotic convergence, as expected.
In order to further assess the effectiveness of the learning procedure, a more comprehensive analysis is performed in which the satellite starts from 50 random initial conditions. In particular, the chaser orbit shape is left unaltered. In Fig. 4, the trajectories optimized through Algorithm 1 (red) are compared with those corresponding to the initial choice of the controller parameters (black), for the considered set of initial conditions. It can be seen that the optimized trajectories achieve a much better convergence time as well as a smaller dispersion than the initial ones. The performance improvements in terms of total cost, convergence time, and fuel efficiency are reported in Table I. These results show the effectiveness of the proposed learning technique in improving the performance of the closed-loop control system.

B. Rendezvous
As a second case study, we consider a terminal rendezvous scenario, in which the target initially lies on a near-circular LEO with an altitude of 1000 km above the Earth, inclination of 81 deg and initial true longitude of 45 deg. The chaser initial state is assumed to lie in the neighborhood of the target one. In practice, this situation arises after a preliminary orbit injection maneuver. A set of 50 initial conditions is generated through a normal distribution centered at the target equinoctial elements $\psi_r$, using a covariance matrix that corresponds to an initial inter-satellite distance on the order of 60 km. The sampling time is $T_s = 3$ min, the parameters in (6)-(7) are set to $\rho = 400$ and $\epsilon = 1$ km, and the number of iterations is $M = 5000$. The other hyperparameters are set equal to those of the previous case study.
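The generation of the initial conditions can be sketched as below. The covariance matrix is assumed to be diagonal, and both the target state and the standard deviations are hypothetical placeholders rather than the values used in this study.

```python
import math
import random


def sample_initial_conditions(psi_r0, std_devs, count, seed=0):
    """Draw chaser states from a normal distribution centered at psi_r0.

    std_devs holds one standard deviation per equinoctial element
    (assumption: diagonal covariance matrix).
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(mean, std) for mean, std in zip(psi_r0, std_devs)]
        for _ in range(count)
    ]


# Hypothetical target state [L, p, e_X, e_Y, h_X, h_Y] and dispersions.
psi_r0 = [math.pi / 4.0, 7378.0, 0.0, 0.0, 0.85, 0.0]
std_devs = [1e-3, 5.0, 1e-4, 1e-4, 1e-4, 1e-4]
samples = sample_initial_conditions(psi_r0, std_devs, count=50)
```

Fixing the seed makes the set of initial conditions reproducible, so that different parameter choices can be compared on the same 50 scenarios.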
The purpose of this setup is to assess the performance of a mean parameter vector $\bar{K}$ obtained by the learning process over all random initial conditions. This study is motivated by the fact that the initial condition of a terminal rendezvous mission is unknown beforehand and is the result of a previous transfer mission. In such a scenario, the controller tuning should ideally be performed on-board. However, the limited computational resources typically available on a spacecraft may preclude this possibility. To overcome this issue while still achieving acceptable performance, precomputing a mean parameter vector over a wide range of initial conditions turns out to be an effective alternative. A performance analysis of this approach is discussed in the following.
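A minimal sketch of the averaging step follows, assuming that the mean parameter vector is the component-wise average of the vectors learned for the individual initial conditions. The function name and the numerical values are hypothetical.

```python
def mean_parameter_vector(learned):
    """Component-wise mean of a list of learned parameter vectors."""
    n = len(learned)
    return [sum(column) / n for column in zip(*learned)]


# Hypothetical learned vectors for three initial conditions.
learned = [
    [1.0, 5.0, 0.5, 5.0, 0.25],
    [1.5, 6.0, 1.0, 5.5, 0.5],
    [2.0, 7.0, 1.5, 6.0, 0.75],
]
K_mean = mean_parameter_vector(learned)
```

Since every learned vector has strictly positive entries, so does their mean; the averaged controller therefore still satisfies the condition $K_i > 0$ under which closed-loop stability is guaranteed in Section II.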
Fig. 5 shows the evolution of the parameter vector $K^{(i)}$ during the learning. The resulting final mean parameter vector is equal to $\bar{K} = [1.22, 5.41, 0.72, 5.29, 0.40]^T$. Fig. 6 reports a comparison among the trajectories $y$ obtained by the control law (3) with initial parameter $K^{(1)}$, optimal parameter $K^*$ and mean parameter $\bar{K}$, respectively, for one of the considered initial conditions. It can be seen that the mean controller leads to a reduction of the convergence time which is comparable to the one obtained by the optimal parameters $K^*$ for that initial condition. The statistics of this simulation campaign are summarized in Table II, indicating that the two cost reductions achieved are comparable. This finding suggests that, despite the large variability exhibited by the learned parameters (see Fig. 5), utilizing the mean parameter vector $\bar{K}$ in the controller (3) for terminal rendezvous maneuvers is a reasonable option, allowing for very good performance without demanding on-board computations. Conversely, it is evident that a coarse parameter tuning entails a severe performance degradation. Therefore, the use of pre-computed mean parameter vectors can be an effective strategy for achieving reliable performance in scenarios where the initial conditions are not known a priori and the computational resources are limited.

V. CONCLUSIONS
Optimization of performance measures in orbital tracking is a challenging task due to the complexity of the dynamic models and the necessity to guarantee fundamental requirements such as stability, robustness and constraint satisfaction. This work has shown that machine learning techniques can be successfully employed to tune the parameters of a family of stabilizing controllers for orbital tracking, in order to optimize a cost function accounting for both settling time and fuel consumption. Looking ahead, this approach can also be useful to analyze the sensitivity of the performance metrics with respect to the control parameters. Besides the considered augmented random search algorithm, future investigation may concern other learning approaches (e.g., policy gradient) and address the inclusion of state/input constraints or parametric uncertainties in the optimization problem.

Fig. 1. Scenario A. Evolution of the parameter vector $K^{(i)}$ during the learning phase.

Fig. 5. Scenario B. Parameter vectors during the learning for all the initial conditions.