1 Introduction

Recent years have witnessed increasing interest in the use of learning techniques in aerospace applications. The steadily growing research activity in this area is documented by several surveys, classifying a variety of solutions for guidance [1, 2], navigation [3] and control [4]. In particular, rendezvous and docking (RVD) problems have been tackled by machine learning techniques in combination with model-based methods [5,6,7], as well as by reinforcement learning approaches [8,9,10,11]. A common feature of these works is that the control scheme includes an artificial neural network, possibly coupled with other types of controllers, which is trained using experimental or simulation data. Among the motivations behind these techniques are the ability of neural networks to approximate complex maps and the possibility of designing the controller even without an explicit model of the physical system. Moreover, they allow one to optimize meaningful cost functions involving state and input variables. Unfortunately, providing a rigorous stability analysis of Neural Feedback Loops (NFLs), i.e., closed-loop schemes including neural networks as feedback controllers, is a hard task. In most works, stability and performance are evaluated only a posteriori, by means of simulation campaigns. Furthermore, the training of the neural controller may not cover some relevant points of the flight envelope, thus leading to unexpected behaviors of the control scheme or even instability.

Current research trends are tackling the above problem from different perspectives. Several works apply Lyapunov analysis to guarantee closed-loop stability of a control scheme combining a nonlinear feedback controller (based, e.g., on sliding mode or backstepping) and a neural network. This type of approach has been explored for aerospace control problems like attitude control [12], formation flying [13] and rendezvous and docking [14]. However, in these works the neural network is used only to adapt the controller parameters against model uncertainty and external disturbances. More recently, a remarkable effort has been devoted to studying closed-loop stability of NFLs, by resorting to classical control analysis paradigms (see, e.g., [15,16,17]). A limitation of these approaches is that the involved computational burden tends to grow considerably with the number of neurons and layers of the neural controller. A third line of research exploits learning tools to select the parameters of a controller belonging to a pre-specified class, whose structure is designed in order to guarantee the desired stability properties. In this context, [18] is one of the first works enforcing specific parameterizations of the controller (including the Youla-Kucera one) and then estimating the control parameters using the REINFORCE algorithm [19], a classical tool in machine learning. The Youla parameterization is also adopted in [20], while PID controllers are considered in [21]. Learning within a family of robustly stabilizing controllers has been addressed in [22].

A key feature of spacecraft control systems is that well-established and reliable models of the orbital dynamics are available [23, 24]. Therefore, a large body of literature is focused on the design of model-based control schemes for such problems (see, e.g., [25,26,27,28,29,30,31] and references therein). A common challenge these techniques have to face is that it is by no means trivial to tune the controller parameters in order to optimize specific performance indexes, such as fuel consumption and maneuver completion time. This motivates the adoption of a two-step design procedure, along the lines suggested in works such as [18] and [21]: first, a class of control laws guaranteeing closed-loop stability is chosen; then, learning techniques are employed to tune the parameters of the control law so as to optimize performance. This type of strategy has already been adopted in the aerospace field, either to optimize the parameters of feedback control laws for powered descent landing [32] or to tune a Lyapunov-based Q-law for trajectory design [33]. Such works adopt actor-critic reinforcement learning algorithms, whose training process is usually computationally demanding.

In this paper, the approach outlined above is adopted in the context of orbital tracking. The objective is to design an optimal control law that achieves closed-loop stability while minimizing a mixed time-fuel performance index. This is a challenging problem, since the orbital tracking dynamics are nonlinear and the cost function is nonsmooth. To this aim, the family of almost globally stabilizing feedback controllers proposed in [34] is considered. A specialized version of the REINFORCE algorithm, known as Augmented Random Search (ARS) [35], is employed to learn the values of the controller parameters which minimize the desired cost function. The learning procedure requires only the computation of the cost value associated with each episode within a batch of simulations of the closed-loop control system. The novelty of the proposed approach with respect to control schemes based on NFLs is that closed-loop stability is always guaranteed during the exploration of the parameter space, and hence also for the optimized controller. This allows one to speed up the training process remarkably, by avoiding parameter combinations that would lead to system instability. Numerical simulations on three different missions, involving orbital transfer and rendezvous maneuvers, confirm that the learning algorithm converges to a control law that optimizes a trade-off between settling time and fuel consumption. In particular, it is shown that the proposed technique can be exploited to tune the control system performance with respect to a set of initial mission configurations. Moreover, thanks to the simplicity of the ARS learning algorithm, the parameter tuning process takes seconds for a single mission, thus making the proposed approach computationally attractive with respect to other learning techniques proposed in the literature.

The paper is organized as follows. Section 2 reviews the dynamic model used for orbital tracking. The considered class of stabilizing controllers along with the performance optimization problem is introduced in Sect. 3, and the learning algorithm is presented in Sect. 4. The results of numerical simulations are discussed in Sect. 5, while Sect. 6 contains conclusions and future developments.

1.1 Notation

The symbol \({0}_{n\times m}\) denotes a null \(n\times m\) matrix, while the identity matrix of order n is denoted by \({I}_n\). The partial derivative \({\partial f}/{\partial x}\) is expressed as a row vector. To save space, \(\textrm{cos}(\cdot )\) and \(\textrm{sin}(\cdot )\) are abbreviated as \(\text {c}(\cdot )\) and \(\text {s}(\cdot )\), respectively. Moreover, we define the rotation matrix

$$\begin{aligned} R(\phi )= \left[ \begin{array}{lr} \text{c}(\phi ) & -\text{s}(\phi )\\ \text{s}(\phi ) & \text{c}(\phi ) \end{array} \right]. \end{aligned}$$

Finally, for \(v \in \mathbb {R}^n\), \(\epsilon \in \mathbb {R}\), \(\max \{v,\epsilon \}\) denotes the vector whose components are the maximum between the components of v and \(\epsilon\).

2 Orbital Tracking

In this paper, the dynamics of an orbiting spacecraft are described in terms of the six Equinoctial Orbital Elements

$$\begin{aligned} \psi = \left[ \psi _1\,\ldots \, \psi _6\right] ^T = \left[ L,\,p,\,e_X,\,e_Y,\, h_X,\,h_Y\right] ^T, \end{aligned}$$

where L is the true longitude, p is the orbit semi-parameter, \(e_X\), \(e_Y\) are the components of the eccentricity vector, and \(h_X\), \(h_Y\) are the components of the inclination vector [36]. The dynamics are given by

$$\begin{aligned} \dot{\psi } = f(\psi ) + g(\psi )u, \end{aligned}$$

where \(u = \left[ u_{r},\,u_{\theta },\,u_{h}\right] ^T\) is the control vector (radial, transverse and normal forcing accelerations, respectively),

$$\begin{aligned} f(\psi )&=\sqrt{\frac{\mu }{\psi _2^3}}\left[ \begin{array}{cccccc} (1+\zeta _X)^2 & 0 & 0 & 0 & 0 & 0 \end{array} \right] ^T,\\ g(\psi )&=\frac{\sqrt{\psi _2}}{\sqrt{\mu }\,(1 + \zeta _X)} \left[ \begin{array}{ccc} 0 & 0 & \eta \\ 0 & 2\,\psi _2 & 0 \\ (1 + \zeta _X)\,\text{s}(\psi _1) & q_X & -\eta \,\psi _4 \\ -(1 + \zeta _X)\,\text{c}(\psi _1) & q_Y & \eta \,\psi _3 \\ 0 & 0 & \dfrac{1 + h^{2}}{2}\,\text{c}(\psi _1) \\ 0 & 0 & \dfrac{1 + h^{2}}{2}\,\text{s}(\psi _1) \end{array}\right] , \end{aligned}$$

with

$$\begin{aligned} \begin{array}{lcl} \zeta _X &=& \psi _3\,\text{c}(\psi _1)+\psi _4\,\text{s}(\psi _1),\\ q_X &=& \psi _3 +(2+\zeta _X)\,\text{c}(\psi _1),\\ q_Y &=& \psi _4 + (2+\zeta _X)\,\text{s}(\psi _1),\\ \eta &=& \psi _5\,\text{s}(\psi _1)-\psi _6\,\text{c}(\psi _1),\\ h^2 &=& \psi _5^2+ \psi _6^2, \end{array} \end{aligned}$$

and \(\mu\) is the gravitational parameter of the central body. On any unforced orbit, only the true longitude \(\psi _1\) varies in time.
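For illustration, the dynamics above translate directly into code. The following C++ sketch evaluates \(\dot{\psi } = f(\psi ) + g(\psi )u\) from the expressions reported above; the type aliases and function name are illustrative choices, not taken from the implementation discussed later in the paper.

```cpp
#include <array>
#include <cmath>

// Equinoctial state psi = [L, p, eX, eY, hX, hY] and control u = [u_r, u_theta, u_h].
using State   = std::array<double, 6>;
using Control = std::array<double, 3>;

State orbital_dynamics(const State& psi, const Control& u, double mu) {
    const double L = psi[0], p = psi[1], eX = psi[2], eY = psi[3], hX = psi[4], hY = psi[5];
    const double cL = std::cos(L), sL = std::sin(L);
    const double zetaX = eX * cL + eY * sL;                   // zeta_X
    const double qX    = eX + (2.0 + zetaX) * cL;             // q_X
    const double qY    = eY + (2.0 + zetaX) * sL;             // q_Y
    const double eta   = hX * sL - hY * cL;                   // eta
    const double h2    = hX * hX + hY * hY;                   // h^2
    const double k     = std::sqrt(p / mu) / (1.0 + zetaX);   // common factor of g(psi)

    // Unforced drift f(psi): only the true longitude varies.
    State dpsi{std::sqrt(mu / (p * p * p)) * (1.0 + zetaX) * (1.0 + zetaX), 0, 0, 0, 0, 0};

    // Forced part g(psi) u, row by row.
    dpsi[0] += k * eta * u[2];
    dpsi[1] += k * 2.0 * p * u[1];
    dpsi[2] += k * ((1.0 + zetaX) * sL * u[0] + qX * u[1] - eta * eY * u[2]);
    dpsi[3] += k * (-(1.0 + zetaX) * cL * u[0] + qY * u[1] + eta * eX * u[2]);
    dpsi[4] += k * 0.5 * (1.0 + h2) * cL * u[2];
    dpsi[5] += k * 0.5 * (1.0 + h2) * sL * u[2];
    return dpsi;
}
```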

The considered control task is to track a target reference trajectory \(\psi ^r(t)=[\psi _1^r(t),\,\psi _2^r,\,\psi _3^r,\,\psi _4^r,\,\psi _5^r,\,\psi _6^r]^T\) where \(\psi ^r(t)\) satisfies the unforced periodic dynamics \(\dot{\psi }^r= f(\psi ^r)\) with given initial conditions \(\psi ^r(0)\). In order to ease the control design, the dynamics of the tracking error \(\tilde{\psi }=\psi -\psi ^r\) are modeled as in [34] using the transformed variables:

$$\begin{aligned} \begin{array}{rcl} x_1 &=& \tilde{\psi }_1 \\ x_2 &=& \sqrt{1+\frac{\tilde{\psi }_2}{\psi _2^r}}-1 \\ \left[ \begin{array}{l}x_3\\ x_4\end{array}\right] &=& \left[ \begin{array}{cc} \frac{\psi _2^r}{\tilde{\psi }_2+\psi _2^r} & 0\\ 0 & \sqrt{\frac{\psi _2^r}{\tilde{\psi }_2+\psi _2^r}} \end{array}\right] R(\tilde{\psi }_1 + \psi _1^r) \left[ \begin{array}{r} \tilde{\psi }_3 + \psi _3^r \\ -\tilde{\psi }_4 - \psi _4^r \end{array}\right] +\left[ \begin{array}{c} -\frac{\tilde{\psi }_2}{\tilde{\psi }_2+\psi _2^r} \\ 0\end{array}\right] - \left[ \begin{array}{l} \zeta ^r_X\\ \zeta ^r_Y \end{array} \right] \\ x_5 &=& \tilde{\psi }_5 \\ x_6 &=& \tilde{\psi }_6, \end{array} \end{aligned}$$
(1)

where \([\zeta _X^r,\ \zeta _Y^r]^T=R(\psi _1^r)[\psi _3^r,\ -\psi _4^r]^T\). The transformation (1) is such that \(x=0\) if and only if \(\tilde{\psi }=0\). The corresponding dynamic model is given by:

$$\begin{aligned} \dot{x}= \left[ \begin{array}{c} F(\chi ,\psi ^r) \\ 0_{2\times 1} \end{array} \right] + \left[ \begin{array}{c} G (\chi ,\psi ^r)\\ 0_{2\times 2} \end{array}\right] \left[ \begin{array}{c} u_r\\ u_\theta \end{array}\right] + H(x,\psi ^r)\, u_h, \end{aligned}$$
(2)

where \(\chi =[x_1 \dots x_4]^T\),

$$\begin{aligned} F(\chi ,\psi ^r)&= \left[ \begin{array}{cccc} 0 & F_{12} & F_{13} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & -F_{33} & -F_{12} \\ 0 & F_{42} & F_{12} + F_{43} & 0 \end{array}\right] \chi ,\qquad G(\chi ,\psi ^r) = \left[ \begin{array}{cc} 0 & 0 \\ 0 & G_{22} \\ 0 & 0 \\ G_{41} & 0 \end{array}\right] ,\\ H(x,\psi ^r)&= \dfrac{G_{22}}{x_2+1} \left[ \begin{array}{c} (x_5+\psi _5^r)\,\text{s}(x_1+\psi _1^r)-(x_6+\psi _6^r)\,\text{c}(x_1+\psi _1^r)\\ 0_{3\times 1}\\ \dfrac{1+(x_5+\psi _5^r)^2+(x_6+\psi _6^r)^2}{2}\,\text{c}(x_1+\psi _1^r)\\ \dfrac{1+(x_5+\psi _5^r)^2+(x_6+\psi _6^r)^2}{2}\,\text{s}(x_1+\psi _1^r) \end{array} \right] , \end{aligned}$$

with

$$\begin{aligned} \begin{array}{lcl} F_{12} &=& \sqrt{\frac{\mu }{(\psi _2^r)^3}}\left( x_3+1+\zeta _X^r\right) ^2,\\ F_{13} &=& \sqrt{\frac{\mu }{(\psi _2^r)^3}}\left( x_3+2+2\zeta _X^r\right) ,\\ F_{42} &=& \sqrt{\frac{\mu }{(\psi _2^r)^3}}\left( x_2+2\right) \left( x_3+1+\zeta _X^r\right) ^3,\\ F_{33} &=& F_{13}\,\zeta ^r_Y,\\ F_{43} &=& F_{13}\,\zeta ^r_X,\\ G_{22} &=& \sqrt{\frac{\psi _2^r}{\mu }}\,\frac{1}{x_3+1+\zeta _X^r},\\ G_{41} &=& \sqrt{\frac{\psi _2^r}{\mu }}. \end{array} \end{aligned}$$

It is worth noticing that the vector fields in (2) are periodically time-varying, with the same period as the reference trajectory.

3 Controller Class and Performance Assessment

By using the results in [34], we define a parametric family of stabilizing controllers for system (2), as follows:

$$\begin{aligned} \begin{array}{rcl} u_r(x,\psi ^r;K) &=& \displaystyle -\frac{1}{G_{41}} \left( F_{43}\, x_3- \dot{\xi } \right) - K_4(x_4-\xi ) \\ u_\theta (x,\psi ^r;K) &=& \displaystyle -K_1\frac{G_{41}F_{12}}{G_{22}}\,\text{s}(x_1) -\frac{F_{42}}{G_{22}}(x_4 - \xi ) -K_2 \frac{G_{41}}{G_{22}}\, x_2 \\ u_h(x,\psi ^r;K) &=& \displaystyle - K_5 \frac{1}{G_{41}}\frac{\partial V}{\partial x} H , \end{array} \end{aligned}$$
(3)

where \(K=[K_1,\dots ,K_5]^T\) is a vector of constant parameters,

$$\begin{aligned} V(x)=K_1 G_{41}(1- {\text{c}}(x_1))+\frac{1}{2}({x}_2^2+{x}_3^2+(x_4-\xi )^2+x_5^2 + x_6^2), \end{aligned}$$
(4)

and

$$\begin{aligned} \xi =\frac{1}{F_{12}}\left( K_1 G_{41}{F_{13} {\text{s}} (x_1)} -F_{33} x_3+{K_3 G_{41} x_3}\right) . \end{aligned}$$
(5)

The explicit expressions of \(\dot{\xi }\) and \(\tfrac{\partial V}{\partial x} H\) in (3) are not reported for brevity. The control law introduced above exploits backstepping and damping control techniques. In particular, \(\xi\) in (5) plays the role of a virtual input to the dynamics of \(\chi\) in (2), while the last equation in (3) represents a damping term. The following result states the stability properties of the control law (3), which can be proved by adopting V(x) in (4) as a Lyapunov function (see [34] for details).
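As an illustration of how (3)-(5) can be evaluated in practice, the following C++ sketch computes the three control components from the transformed state x, the reference \(\psi ^r\) and the gains K, using the coefficients of Sect. 2. The helpers xi_dot() and dVdx_times_H() are placeholders for the expressions omitted above; all names are illustrative assumptions and not the authors' implementation.

```cpp
#include <array>
#include <cmath>

struct Gains { double K1, K2, K3, K4, K5; };   // K_1,...,K_5 > 0 (Proposition 1)

// Placeholders for the terms whose explicit expressions are not reported in the paper.
double xi_dot(const std::array<double,6>& x, const std::array<double,6>& psi_r, const Gains& K);
double dVdx_times_H(const std::array<double,6>& x, const std::array<double,6>& psi_r, const Gains& K);

std::array<double,3> control_law(const std::array<double,6>& x,
                                 const std::array<double,6>& psi_r,
                                 const Gains& K, double mu) {
    const double c1 = std::cos(psi_r[0]), s1 = std::sin(psi_r[0]);
    const double zetaX_r = psi_r[2] * c1 + psi_r[3] * s1;   // [zeta_X^r; zeta_Y^r] = R(psi_1^r)[psi_3^r; -psi_4^r]
    const double zetaY_r = psi_r[2] * s1 - psi_r[3] * c1;

    // Coefficients F_ij, G_ij from Sect. 2 (x[0..5] = x_1,...,x_6).
    const double w   = std::sqrt(mu / std::pow(psi_r[1], 3));
    const double F12 = w * std::pow(x[2] + 1.0 + zetaX_r, 2);
    const double F13 = w * (x[2] + 2.0 + 2.0 * zetaX_r);
    const double F42 = w * (x[1] + 2.0) * std::pow(x[2] + 1.0 + zetaX_r, 3);
    const double F33 = F13 * zetaY_r;
    const double F43 = F13 * zetaX_r;
    const double G41 = std::sqrt(psi_r[1] / mu);
    const double G22 = G41 / (x[2] + 1.0 + zetaX_r);

    // Virtual input (5) and control components (3).
    const double xi = (K.K1 * G41 * F13 * std::sin(x[0]) - F33 * x[2] + K.K3 * G41 * x[2]) / F12;
    const double u_r     = -(F43 * x[2] - xi_dot(x, psi_r, K)) / G41 - K.K4 * (x[3] - xi);
    const double u_theta = -K.K1 * G41 * F12 / G22 * std::sin(x[0])
                           - F42 / G22 * (x[3] - xi) - K.K2 * G41 / G22 * x[1];
    const double u_h     = -K.K5 / G41 * dVdx_times_H(x, psi_r, K);
    return {u_r, u_theta, u_h};
}
```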

Proposition 1

Let \(K_i>0,~i=1,\dots ,5\). Then, the origin of the closed-loop system (2)-(5) is almost globally asymptotically stable.

This result defines the set of parameters K guaranteeing that the proposed control law stabilizes the tracking error system. However, it is well known that tuning the performance of nonlinear control laws is far from trivial. Indeed, a misguided choice of the control parameters of the closed-loop system (2)-(5) may lead, for example, to extremely slow tracking of the reference trajectory or to an excessive control effort. The goal of this paper is to tune the parameters K of the stabilizing control law (3) so as to optimize the performance of the closed-loop system in terms of a trade-off between the settling time and the fuel consumption. To this purpose, we denote by y the distance between the actual and reference spacecraft position, expressed in Cartesian coordinates. This can be seen as an output signal of system (2), i.e.,

$$\begin{aligned} y=Y(x,\psi ^r), \end{aligned}$$
(6)

where the mapping Y is obtained from (1) and the transformation which relates the satellite equinoctial elements to the corresponding inertial Cartesian states [37]. In order to learn the controller parameters K from the input–output behavior, system (2), (6) with control law (3) is simulated over a horizon of length \(T_e\) (each simulation is called an episode). The input and output values collected at sampling times \(kT_s\), \(k=0,\dots ,H\), with \(T_e=H T_s\), are denoted as u(k) and y(k), respectively. Then, the performance index to be minimized is specified as

$$\begin{aligned} J(x(0);\,K)=H_{conv} + \rho \sum _{k=0}^{H_{conv}-1} ||u(k)||, \end{aligned}$$
(7)

where x(0) denotes the initial state vector,

$$\begin{aligned} H_{conv}= \min \{\bar{k}:~y(k) \le \epsilon ,~~\forall k \ge \bar{k}\}, \end{aligned}$$
(8)

and \(\epsilon\) is a threshold assessing practical convergence. The parameter \(\rho\) is used to trade off the two conflicting requirements of minimizing the maneuver completion time \(H_{conv}\) and the fuel consumption.
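A minimal C++ sketch of how the cost (7)-(8) can be computed from sampled episode data is given below; the function and variable names are illustrative assumptions, with y holding the sampled distance output and u the sampled control vectors.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Cost (7): J = H_conv + rho * sum_{k < H_conv} ||u(k)||, with H_conv defined by (8).
double episode_cost(const std::vector<double>& y,
                    const std::vector<std::array<double, 3>>& u,
                    double eps, double rho) {
    // H_conv: smallest index after which y(k) stays below the threshold eps (Eq. (8)).
    std::size_t H_conv = y.size();            // equals the horizon if the threshold is never met
    for (std::size_t k = y.size(); k > 0; --k) {
        if (y[k - 1] > eps) break;            // last sample violating the threshold
        H_conv = k - 1;
    }

    // Fuel term: accumulated control norm up to H_conv - 1.
    double fuel = 0.0;
    for (std::size_t k = 0; k < H_conv; ++k)
        fuel += std::sqrt(u[k][0] * u[k][0] + u[k][1] * u[k][1] + u[k][2] * u[k][2]);

    return static_cast<double>(H_conv) + rho * fuel;
}
```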

In the following, the problem of minimizing the cost (7) with respect to the controller parameter vector K is addressed. Since (7) is a discontinuous function of K, a gradient-free optimization method is required. In the next section, a learning-based approach is proposed.

4 Learning Procedure

A classical approach to minimize a function J(K) with respect to \(K\in \mathbb {R}^q\) is the so-called random search, which amounts to computing a numerical approximation of the function gradient along a random search direction.

Recently, an enhanced version of this approach, namely the Augmented Random Search (ARS) method, has been proposed in [35]. It is a derivative-free stochastic optimization method which explores the parameter space of a family of deterministic control policies by simulating episodes with randomly perturbed parameter vectors K. The ARS algorithm improves on the basic random search by adopting several heuristics which have proven effective in speeding up the learning process. First, multiple random search directions \(\delta _j \in \mathbb {R}^q\), \(j=1,\dots ,N\), are selected in order to enhance the exploration of the parameter space. This is done by generating N random vectors \(\delta _j\) sampled from a normal distribution with zero mean and covariance matrix \(\Sigma _{\delta }\). Notice that the latter plays a significant role in scaling the exploration appropriately for each component of the parameter vector. Then, the parameter vector is updated along a direction which is a weighted average of the random search vectors, according to the cost variation along each \(\delta _j\). The update step is scaled by the standard deviation \(\sigma _J\) of the 2N cost values \(J_+^{(j)}=J(K_+^{(j)})\), \(J_-^{(j)}=J(K_-^{(j)})\), \(j=1,\dots ,N\), evaluated by simulating the closed-loop system with the corresponding control law parameter values

$$\begin{aligned} K_+^{(j)} = \max \{K+\sigma \delta _j,\,\epsilon _K\} \end{aligned}$$
(9)
$$\begin{aligned} K_-^{(j)} = \max \{K-\sigma \delta _j,\,\epsilon _K\}, \end{aligned}$$
(10)

until practical convergence of the trajectory y is achieved. In (9)-(10), \(\sigma\) is a positive scaling constant and \(\epsilon _K>0\) is a small quantity, instrumental in guaranteeing the positivity of the controller parameters required by Proposition 1. The scaling by \(\sigma _J\) adapts the step size to the local sensitivity of the cost with respect to perturbations of the control parameters [35]. Then, the parameter update step is performed as

$$\begin{aligned} K^{(i+1)}= \max \{ K^{(i)} - \frac{\alpha }{N \sigma _J} \sum _{j=1}^N (J_+^{(j)} - J_-^{(j)}) \delta _j,\,\epsilon _K\}. \end{aligned}$$
(11)

The update (11) is repeated iteratively for \(i=1,\dots , M\), where M is the total number of iterations (note that each iteration requires 2N episodes and cost evaluations). The outcome of the learning procedure is the final parameter vector \(K^* = K^{(M)}\). The overall procedure is summarized in Algorithm 1. Note that, rather than employing a predefined maximum number of iterations, alternative stopping criteria can be adopted. For instance, the learning procedure can be terminated when the cost J no longer decreases significantly. This is typically done by smoothing the cost value with a moving average and checking whether its decrease falls below a given threshold. In the simulations presented in Sect. 5, we let the learning procedure evolve over a predefined number of iterations in order to test the numerical stability of the method.

It is worth stressing that asymptotic convergence of the closed-loop system trajectories is guaranteed for all learning episodes by the almost global stability property of the control law (3) and by the fact that all the generated \(K_+^{(j)}\), \(K_-^{(j)}\) and updated \(K^{(i)}\) in Algorithm 1 are strictly positive. This feature is crucial to streamline the learning procedure. Indeed, the occurrence of divergence or other unstable behaviors would prevent a meaningful computation of the costs \(J_+^{(j)}\), \(J_-^{(j)}\), thus leading to high variance of the local cost values and, in turn, of the parameter updates.

Algorithm 1: Augmented Random Search (ARS)
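The following C++ sketch mirrors the structure of Algorithm 1, with the closed-loop episode simulation abstracted by a user-supplied cost function; the names and the fixed random seed are illustrative assumptions rather than details of the implementation described in Sect. 5.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

// Sketch of Algorithm 1 (ARS). `cost` evaluates J(x(0); K) by simulating one
// closed-loop episode with gains K.
std::vector<double> ars(const std::function<double(const std::vector<double>&)>& cost,
                        std::vector<double> K,                   // initial gains K^(1)
                        const std::vector<double>& sigma_delta,  // diagonal of Sigma_delta
                        double alpha, double sigma, double eps_K,
                        int N, int M) {
    std::mt19937 gen(0);
    std::normal_distribution<double> gauss(0.0, 1.0);
    const std::size_t q = K.size();

    for (int i = 0; i < M; ++i) {
        std::vector<std::vector<double>> delta(N, std::vector<double>(q));
        std::vector<double> Jp(N), Jm(N);

        // Explore N random directions with the perturbed gains (9)-(10).
        for (int j = 0; j < N; ++j) {
            std::vector<double> Kp(q), Km(q);
            for (std::size_t l = 0; l < q; ++l) {
                delta[j][l] = sigma_delta[l] * gauss(gen);            // delta_j ~ N(0, Sigma_delta)
                Kp[l] = std::max(K[l] + sigma * delta[j][l], eps_K);  // Eq. (9)
                Km[l] = std::max(K[l] - sigma * delta[j][l], eps_K);  // Eq. (10)
            }
            Jp[j] = cost(Kp);
            Jm[j] = cost(Km);
        }

        // Standard deviation sigma_J of the 2N collected costs, used to scale the step.
        std::vector<double> J_all(Jp);
        J_all.insert(J_all.end(), Jm.begin(), Jm.end());
        const double mean = std::accumulate(J_all.begin(), J_all.end(), 0.0) / J_all.size();
        double var = 0.0;
        for (double Jv : J_all) var += (Jv - mean) * (Jv - mean);
        double sigma_J = std::sqrt(var / J_all.size());
        if (sigma_J < 1e-12) sigma_J = 1.0;                           // guard against a flat cost batch

        // Parameter update (11), projected onto K >= eps_K to preserve stability.
        for (std::size_t l = 0; l < q; ++l) {
            double step = 0.0;
            for (int j = 0; j < N; ++j) step += (Jp[j] - Jm[j]) * delta[j][l];
            K[l] = std::max(K[l] - alpha / (N * sigma_J) * step, eps_K);
        }
    }
    return K;  // K* = K^(M)
}
```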

5 Numerical Simulations

In this section, Algorithm 1 is exploited to tune the parameter vector \(K=\left[ K_1,\ldots ,K_5\right] ^T\) of the control law (3) in different case studies, in order to demonstrate its suitability for performance optimization in the context of space applications. In particular, three different scenarios are considered: (A) an orbital transfer from a low Earth orbit (LEO) to a geostationary transfer orbit (GTO); (B) an orbital transfer from a GTO to a geostationary Earth orbit (GEO); (C) a rendezvous mission in LEO. The first two case studies are representative of orbit control problems characterized by strong nonlinearities, which raise the challenge of optimizing a complex transient response. The latter application focuses on a scenario in which feedback control is essential to achieve a sufficient level of mission autonomy.

The proposed algorithm is implemented in the C++ programming language and runs on a 3.10 GHz CPU with 16 cores, using OpenMP constructs to enable parallel computing. Parallelization is applied both to the episode exploration directions and to the initial conditions, so as to improve computational efficiency, as sketched in the example below. The hyperparameters chosen for the learning algorithm are reported for each scenario in the corresponding subsection.
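The parallelization of one ARS iteration can be sketched as follows, where the 2N episode simulations associated with the perturbed gain vectors (9)-(10) are distributed over the available cores; simulate_episode() is a hypothetical closed-loop simulator and all names are illustrative assumptions.

```cpp
#include <vector>
#include <omp.h>

// Hypothetical closed-loop simulator returning the episode cost J for a given gain vector.
double simulate_episode(const std::vector<double>& K);

// Evaluate the costs of the N "+" and N "-" perturbed gain vectors in parallel.
void evaluate_batch(const std::vector<std::vector<double>>& K_plus,
                    const std::vector<std::vector<double>>& K_minus,
                    std::vector<double>& J_plus,
                    std::vector<double>& J_minus) {
    const int N = static_cast<int>(K_plus.size());
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < N; ++j) {
        J_plus[j]  = simulate_episode(K_plus[j]);
        J_minus[j] = simulate_episode(K_minus[j]);
    }
}
```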

Fig. 1 Scenario A. Evolution of the parameter vector \(K^{(i)}\) during the learning process

5.1 Orbital Transfer: LEO-GTO

In this transfer mission, which is inspired by [38], the initial orbit is an equatorial circular orbit with a semi-major axis equal to 6778 km, while the reference orbit is a higher altitude elliptic orbit. The initial and reference orbital elements are reported in Table 1.

Table 1 Scenario A: orbital elements of the initial and reference orbits

The sampling time is \(T_s=16\) min and the parameters characterizing the performance index in (7)-(8) are set to \(\rho = 50\) and \(\epsilon = 10\) km. The algorithm hyperparameters are chosen as follows: \(M=2000\), \(\alpha = 5\cdot 10^{-3}\), \(\sigma = 2\cdot 10^{-3}\), \(N=16\), \(H = 320\), corresponding to 10 consecutive target orbits and \(T_e = 86.4\) hours. The initial parameter vector is selected as \(K^{(1)} = \left[ 0.1,\,1,\,1,\,1,\,10\right] ^T\). The covariance matrix is specified as \(\Sigma _{\delta }= \text {diag}\{0.1,\,1,\,1,\,1,\,10\}\), which ensures an appropriate scaling of the perturbation directions for the vector K. The choice of the scaling values is the outcome of a trial-and-error selection process based on experimentation on different datasets. In this setting, the computation time required for tuning the controller parameters amounts to 12 s, corresponding on average to 6 ms per iteration.
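For reference, the Scenario A settings listed above can be gathered in a single configuration object; the struct layout and field names below are illustrative assumptions, not part of the authors' implementation.

```cpp
// Scenario A hyperparameters (Sect. 5.1); field names are illustrative.
struct ArsSettings {
    int    M;        // number of iterations
    double alpha;    // step size
    double sigma;    // perturbation scale
    int    N;        // search directions per iteration
    int    H;        // episode length in samples, T_e = H * T_s
    double rho;      // fuel weight in the cost (7)
    double eps_km;   // convergence threshold in (8) [km]
    double Ts_min;   // sampling time [min]
};
const ArsSettings scenario_A{2000, 5e-3, 2e-3, 16, 320, 50.0, 10.0, 16.0};
```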

Fig. 2 Scenario A. Evolution of the maneuver cost J

The evolution of the parameter vector \(K^{(i)}\) is depicted in Fig. 1, while Fig. 2 displays the overall cost J defined by (7). A cost reduction of about 18% with respect to the initial non-optimized cost is achieved in less than 1000 iterations. It can be observed that the cost does not decrease monotonically during the learning process, due to the stochasticity of the search algorithm. In fact, the finite number of explored directions in the parameter space may lead to a local increase in the cost at some iterations. Fig. 3 shows the output y(t) defined by (6) for all the iterations of the learning algorithm. The black and red lines denote the trajectories corresponding to the initial and the final parameter vector of the controller, respectively. It can be seen that optimizing these parameters leads to a remarkable reduction of the flight time and that all the trajectories generated during the learning phase converge towards the origin. To qualitatively illustrate these results, Fig. 4 shows a three-dimensional plot of the resulting spacecraft trajectories in the Earth-centered inertial (ECI) frame. It can be seen that the learned trajectory accomplishes notably fewer revolutions than the initial one, confirming the aforementioned cost reduction. The control input signals obtained in the first and last iterations of the learning process are reported in Fig. 5. The optimization of the selected cost function allows for a reduction of the peak value of the normal acceleration \(u_h\) and a shorter activation of the radial one \(u_r\).

Fig. 3 Scenario A. Evolution of the output signal y(t) resulting from the application of Algorithm 1: first iteration (black line) and final iteration (red line)

Fig. 4 Scenario A. Three-dimensional trajectories corresponding to the initial (black) and optimized (red) parameters of the control law (3). The black circle marks the initial condition, while the target orbit is colored green

Fig. 5 Scenario A. Radial, transverse and normal control signals: first iteration (black line) and final iteration (red line)

5.2 Orbital Transfer: GTO-GEO

In this case study, the proposed approach is tested on a GTO to GEO transfer. The target orbit is an equatorial GEO with a semi-major axis of 42,165 km, while the initial GTOs are characterized by a semi-major axis of 24,364 km, an eccentricity of 0.7306 and an initial true longitude of \(\pi /6\). The inclination i, right ascension of the ascending node (RAAN) \(\Omega\), and argument of periapsis \(\omega\) of the GTO are randomly drawn from a uniform distribution on the interval \(\left[ \pi /4,\,\pi /2\right]\). In particular, a set of 50 different initial GTO configurations has been considered. The hyperparameters of the learning algorithm are chosen as in the previous case study, except for \(T_s = 45\) min, \(H = 1280\) (corresponding to 40 target orbits), and \(T_e = 957.4\) hours. The computation time required by the proposed approach to optimize the controller parameters for the entire set of initial configurations amounts to about 40 min.

Fig. 6 shows the trajectories obtained for the considered set of initial configurations, before and after the optimization performed by Algorithm 1. In particular, the black and red trajectories represent the evolution of the output y(t) resulting from the application of the control law (3) with parameters \(K^{(1)}\) and \(K^{*}\), respectively. It can be seen that the optimized trajectories display a much better envelope profile than the initial ones, especially in terms of convergence time. Table 2 summarizes the results obtained by applying the proposed learning algorithm (Algorithm 1), with respect to the total cost, the convergence time and the fuel efficiency. These show a remarkable improvement in the cost-related metrics.

Fig. 6 Scenario B. Trajectories of the output y(t) for the considered set of initial conditions, before (black) and after (red) the optimization performed using Algorithm 1

Table 2 Results for Scenario B

5.3 Rendezvous

In this case study, we consider a terminal rendezvous scenario, in which a controlled spacecraft (referred to as the chaser) must intercept an uncontrolled target. The purpose of this study is to assess the performance obtained by using a mean parameter vector \({\hat{K}}\), computed by averaging the results of the learning process over a sufficiently representative set of initial conditions. The motivation is the potential application to rendezvous missions. In these scenarios, the initial condition is not known accurately beforehand, since it is the result of a previous transfer mission. Moreover, online learning of the best controller tuning for a specific initial condition may not be possible due to computational constraints. To overcome this limitation while still achieving an acceptable performance, pre-computing mean tuning parameters turns out to be a viable option. A performance analysis of the controller tuned in this way is presented hereafter.

In the considered setting, the target moves along a near-circular LEO with an altitude of 1000 km above the Earth, an inclination of 81 deg and an initial true longitude of 45 deg. The chaser is assumed to initially lie in the neighborhood of the target, following a preliminary, coarser orbit injection maneuver. To account for this feature, a set of 50 random initial conditions x(0) is generated from a normal distribution centered at the target equinoctial elements \(\psi ^r\), using the covariance matrix \(\sigma _{\psi }=\text {diag}\,\{0.5 \text { deg},\,20 \text { km},\,3\cdot 10^{-5},\,3\cdot 10^{-5},\,2\cdot 10^{-3},\,2\cdot 10^{-3}\}\). The resulting initial inter-satellite separation is about 60 km on average. The hyperparameters used in the learning procedure are as follows: \(T_s=3\) min, \(\epsilon = 1\) km, \(\rho = 400\) and \(M = 5000\). The overall computation time required to apply Algorithm 1 to the entire set of initial conditions is about 90 min.
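A possible way to generate the randomized initial conditions described above is sketched below, perturbing each equinoctial element of the target by an independent zero-mean Gaussian term. The diagonal entries of \(\sigma _{\psi }\) are interpreted here as per-element standard deviations (an assumption), with the angular entry converted to radians; all names are illustrative.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sample randomized chaser initial elements around the target elements psi_r.
std::vector<std::array<double, 6>> sample_initial_conditions(
        const std::array<double, 6>& psi_r,
        std::array<double, 6> sigma_psi,   // assumed per-element standard deviations
        int n_samples, unsigned seed) {
    const double deg2rad = std::acos(-1.0) / 180.0;
    sigma_psi[0] *= deg2rad;                                   // true-longitude spread: deg -> rad
    std::mt19937 gen(seed);
    std::vector<std::array<double, 6>> samples(n_samples);
    for (auto& psi0 : samples)
        for (std::size_t i = 0; i < 6; ++i) {
            std::normal_distribution<double> gauss(psi_r[i], sigma_psi[i]);
            psi0[i] = gauss(gen);                              // perturbed i-th equinoctial element
        }
    return samples;
}
```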

Fig. 7 Scenario C. Evolution of the parameter vector during learning for a subset of the considered initial conditions

Fig. 7 shows the evolution of the parameter vector \(K^{(i)}\) during the iterations of the learning process, for a selected subset of the considered initial conditions. The final mean parameter vector resulting from the optimization is \({\hat{K}}=\left[ 1.22,\,5.41,\,0.72,\,5.29,\,0.40\right] ^T\).

Figure 8 depicts the trajectories of the output y provided by the control law (3) with the parameters \(K^{(1)}\), \(K^{*}\) and \({\hat{K}}\), for a single realization of the initial conditions. It can be seen that the controller employing the mean tuning parameters achieves a considerable reduction of the convergence time, comparable to the one provided by the dedicated tuning \(K^*\). An equally good behavior is observed for the entire set of initial states x(0). Table 3 presents statistics on this experiment and confirms that the performance achieved by the parameters \(K^{*}\) and \({\hat{K}}\) is on a similar level. It is concluded that tuning the controller (3) with the learned mean parameter vector \({\hat{K}}\) is an advantageous strategy for terminal rendezvous maneuvers, as it achieves near-optimal performance whenever on-board optimization is not viable.

In Fig. 9, the cost evolutions, smoothed by a 50-sample moving average, are reported for the considered initial conditions. It can be seen that the cost converges in all the learning tests, even in the cases in which some parameter values do not reach a steady-state value (thus suggesting a low sensitivity of the cost with respect to such parameters). By using the stopping criterion discussed in Sect. 4, i.e., terminating the learning procedure when the smoothed cost decreases by less than a predefined threshold (here set to \(10^{-5}\)), the procedure converges on average after approximately 1800 iterations.
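The stopping rule mentioned above can be implemented as a simple check on the moving-average cost, as in the following C++ sketch; the window length handling, names and threshold semantics are illustrative assumptions.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Moving-average stopping criterion: stop when the smoothed cost decreases by less
// than `threshold` between consecutive iterations (window of W samples, e.g. W = 50).
bool should_stop(const std::vector<double>& J_history, std::size_t W, double threshold) {
    if (J_history.size() < W + 1) return false;       // not enough samples yet
    auto window_mean = [&](std::size_t end) {         // mean of J over [end - W, end)
        return std::accumulate(J_history.begin() + (end - W),
                               J_history.begin() + end, 0.0) / W;
    };
    const double prev = window_mean(J_history.size() - 1);
    const double curr = window_mean(J_history.size());
    return prev - curr < threshold;                    // negligible (or no) decrease
}
```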

Fig. 8 Scenario C. Distance output y obtained with the control parameter vectors \(K^{(1)}\) (black line), \(K^{*}\) (red line) and \({\hat{K}}\) (yellow line)

Fig. 9 Scenario C. Evolution of the smoothed cost J for the considered initial conditions

Table 3 Results for Scenario C

6 Conclusions

Optimization of performance measures in orbital tracking is a challenging task, due to the complexity of the dynamic models and the necessity to guarantee fundamental requirements such as stability, robustness and constraint satisfaction. This work has shown that a simple learning technique, based on the ARS algorithm, can be successfully employed to tune the parameters of a family of stabilizing controllers for orbital tracking, in order to optimize a cost function accounting for both settling time and fuel consumption. The approach combines the benefits of model-based control design with those of simulation-based learning techniques. A major advantage of the proposed approach lies in its computational efficiency, which makes it compatible with on-board implementation. It is believed that the proposed learning procedure can be successfully employed to optimize the parameters of other families of control laws, while guaranteeing specific stability/performance properties during the parameter exploration phase. In perspective, the proposed methodology can also be useful for analyzing the sensitivity of the performance metrics with respect to the control parameters. Future research may concern the comparison of the proposed algorithm with other learning approaches (e.g., policy optimization) and the inclusion of state/input constraints or parametric uncertainties in the control synthesis problem.