1 Introduction

Within the next decade, we will witness an exponential increase in the use of artificial intelligence to support the operations of all sorts of flying machines. The interest will range from military applications to the growing civil aviation market, which will also include the new category of personal flying vehicles. This research makes use of consolidated results in artificial intelligence and deep reinforcement learning research to demonstrate the AI-based flight control of a high-performance fixed-wing aircraft, taken as an example of an aerial platform exhibiting nonlinear behaviour.

1.1 Nonlinear dynamics in fixed-wing aircraft and established engineering approach to flight control

Many current fixed-wing aerial vehicles display some nonlinear dynamics, such as inertial coupling, or aerodynamic and propulsive nonlinearities. Future aircraft are likely to have even more significant nonlinearities in their flight dynamics. For example, to enhance their agility, some unmanned aerial vehicle (UAV) designs will feature bio-inspired morphing wings that significantly change their shape. High-altitude long-endurance UAVs, as well as future commercial airliner designs, will have high-aspect-ratio, highly flexible lifting surfaces to improve their efficiency. Airbreathing hypersonic vehicles (AHV) are yet another example of plants exhibiting significant cross-coupling, nonlinearity, non-minimum-phase behaviour, and model uncertainty to an even greater extent than traditional aircraft. All these designs make flight control a challenging problem.

An established engineering approach to flight controller design relies on gain scheduling, i.e. switching between different linear controllers tuned for a known set of specific operating points [1]. Because of their dependence on linearized plant dynamics, gain-scheduling techniques must be carefully designed to accomplish their tasks across the whole mission envelope. In practice, classical linear controllers are generally used in a conservative fashion, constraining the flight envelope and limiting the operability of the aerial vehicle. On the other hand, nonlinear control algorithms can be designed to exploit the full flight envelope, especially for unconventional and disruptive aircraft configurations. Model-free adaptive control as well as intelligent control techniques offer the possibility of replacing the classical controller structure with a more general approach, which also turns out to be fault-tolerant to changes in model dynamics [2].

1.2 Intelligent control approaches

Many authors propose intelligent control approaches that combine artificial intelligence and automatic control methodologies, such as adaptive neural control [3] and fuzzy control [4]. Examples of application to flying vehicles include adaptive neural control of AHVs, integral sliding mode control, and stochastic adaptive attitude control [5], as well as prescribed performance dynamic neural network control of flexible hypersonic vehicles [6]. Yet these methods assume a simplified motion model that considers only the longitudinal dynamics and an affine form of the control input. An alternative adaptive neural control method was proposed in [7] to better reflect the characteristics of the actual plant model and to ensure the reliability of the designed control laws, although it still considers a reduced 3-DoF motion model.

Reinforcement learning (RL) is a branch of machine learning that offers an approach to intelligent control inspired by the way biological organisms learn their tasks. RL is quite a large subject and provides the means to design a wide range of control strategies. In essence, an RL-based control task is performed by an agent (a piece of software) able to learn a control policy through trial and error. The agent learns by interacting with its environment, which includes the plant under study [8]. In its early implementations, RL was based on tabular methods with discrete action and state spaces and gained considerable popularity for its applications to gaming environments. One notable early example of RL in the aviation domain is the control of glider soaring [9]. That study demonstrated that a flight control policy to effectively gain lift from ascending thermal plumes could be successfully learnt.

1.3 Deep reinforcement learning as a promising field for nonlinear control

To address the curse of dimensionality encountered in RL approaches as the action and state spaces grow larger, function approximators (typically neural networks, NNs) have been used in the past decade in various fields to enable continuous control, with agents implemented as so-called actor-critic structures. The controller agent has a policy function and a value function to interact with its environment. By observing the state of the environment, the agent performs an action based on its current policy to change the environment state and, consequently, receives a reward. Both the policy function and the value function are represented by neural networks that have to be made optimal by the end of the learning process. The policy function is updated in a gradient-ascent manner according to the estimated advantage of performing the action; the value function is updated in a supervised fashion to estimate the advantage based on the collected rewards. The nonlinear activation functions in neural networks allow them to represent a highly nonlinear transfer from inputs to outputs, which makes deep reinforcement learning (DRL) a promising field for nonlinear control. Through the actor-critic scheme, an agent can learn to control an aircraft model by maximizing the expected reward, and the reward calculation function is a crucial aspect of the agent training process. Particular forms of possible reward functions for the aircraft control scenarios studied in this work are discussed in the article.

Several novel implementations of the actor-critic paradigm have been recently introduced, and most of them fall under the category of DRL approaches. Their ability to perform end-to-end offline learning has made it possible to design agents that surpass human performance, as demonstrated with different Atari games [10]. In that study, the agent observes an image of the game, a high-dimensional space, and learns a policy using a deep Q-network (DQN) for its discrete action space. DRL was extended to continuous control applications with the off-policy deep deterministic policy gradient (DDPG) algorithm [11]. For its actor-network policy update, the DDPG algorithm uses a critic network that provides a value function approximator. This method also estimates the temporal difference of the state-action value function as well as of the state value function during learning and is capable of reducing the variance of the gradient estimation.
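For concreteness, the following sketch shows what the two function approximators of a DDPG agent for continuous control may look like. It is a minimal, illustrative example written with PyTorch; the layer sizes, names, and the choice of library are assumptions made here for clarity and do not reproduce the networks used in this work.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network: maps an observed state to a continuous action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # normalized commands in [-1, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Action-value approximator Q(s, a) used to update the actor."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```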

DDPG flight control applications have been limited to small-scale flying wings [12] and quadcopters [13]. An on-policy RL algorithm known as proximal policy optimization (PPO) [14] has been introduced to reduce DDPG’s learning instability. PPO showed improved policy convergence when applied to the flight control of an unmanned flying-wing aircraft [15]. A refined algorithm was proposed in [16] to control underwater glider attitude, based on a combination of an active disturbance rejection control (ADRC) algorithm and a natural actor-critic (NAC) algorithm. A recently developed actor-critic agent structure for closed-loop attitude control of a UAV is discussed in [17], which addresses the optimal tracking problem of continuous-time nonlinear systems with input constraints. That work introduces a novel tuning law for the critic’s neural network parameters that can improve the learning rate.

1.4 Applications to unmanned combat aerial vehicles (UCAVs)

DRL has a distinct advantage over other approaches to flight control in its ability to perceive information and learn effectively. This makes it an ideal method for solving complex, high-dimensional sequential decision-making problems, such as those encountered in air combat [18]. In modern aerial combat with unmanned combat aerial vehicles (UCAVs), short-range aerial combat (the dogfight) is still considered an important topic [19, 20]. The USA has laid out a roadmap targeting formation flight of UCAVs by 2030 [21]. In the roadmap, such a UAV mission is expected to be complex and multifaceted, requiring precise waypoint-line tracking and advanced formation flight algorithms to enable UCAVs to execute it effectively. In complex formation flights, such as aerial combat situations, the ability of a UCAV to quickly adjust its flight attitude and path is crucial. This ability is often referred to as flight agility, which involves the rapid and seamless transition between different flight states.

To simulate aerial combat scenarios involving unmanned vehicles, Wang et al. [22] extended their basic manoeuvre library to account for the enhanced manoeuvring capabilities of UCAVs. They also proposed a robust decision-making method for manoeuvring under incomplete target information. Additionally, reinforcement learning [19, 23] was explored as a machine learning approach to generate strategies, and McGrew [24] utilized the approximate dynamic programming (ADP) approach to solve a two-dimensional aerial combat problem involving a single UCAV acting as an agent and learning near-optimal actions through interaction with the combat environment.

Researchers have also been exploring the potential of combining DRL with UCAVs’ air combat decision-making [18, 20, 25]. For instance, Li [20] proposed a model for UCAV autonomous manoeuvre decision-making in short-range air combat, based on the multi-step double deep Q-network (MS-DDQN) algorithm. To address the limitations of traditional methods, such as poor flexibility and weak decision-making ability, some researchers have proposed the use of deep learning for manoeuvring [18]. Liu et al. [25] proposed a multi-UCAV cooperative decision-making method based on a multi-agent proximal policy optimization (MAPPO) algorithm. DRL has been used by Hu et al. [26] to plan beyond-visual-range air combat tactics and by Wang et al. [27] to quantify the relationship between the flight agility of a UCAV and its short-range aerial combat effectiveness. A similar approach has also been presented by Yang et al. [28].

While the aforementioned approaches have demonstrated some success in simulating aerial combat scenarios, they have primarily focused on generating action strategies, and only a few have applied a six-degree-of-freedom (6-DoF) nonlinear aircraft model to the simulations. The mass-point model used in some studies fails to accurately represent the flight performance of a high-order UCAV. Additionally, directly applying these complex algorithms to a 6-DoF nonlinear model has often been discarded because of its demanding computational requirements. To address this limitation, Shin et al. [29] proposed a basic flight manoeuvre (BFM)-based approach to guide the attacker and simulate a 6-DoF nonlinear UCAV model in an aerial combat environment. Such approaches combine a 3D point-mass model with a control law design based on nonlinear dynamic inversion (NDI) to control the three-dimensional flight path of a 6-DoF nonlinear UCAV model [22].

1.5 Research contribution

This research introduces an effective nonlinear control application based on the DDPG actor-critic approach to the full-blown nonlinear, high-fidelity flight dynamics model (FDM) of a fighter jet. The use of this type of controller has been documented only for small-scale UAVs, and the present article addresses its application to the three-dimensional navigation of a high-performance UCAV. For this kind of application, the research does not assume a reduced-order model to train the agent, unlike the methods presented in [22] and [29]. In addition, unlike the BFM-based approaches, the proposed AI-based strategy generates the piloting commands to control the nonlinear 6-DoF UCAV model directly. Finally, extensive validation is carried out with a well-known, realistic simulator. A high-fidelity FDM of the General Dynamics F-16 Fighting Falcon has been used, paving the way for future applications to other similar aircraft. The chosen FDM is publicly available and implemented for the JSBSim flight dynamics software library [30]. The simulation results presented in this work were obtained using JSBSim within the MATLAB/Simulink computational environment.

Moreover, the present effort deals with some modelling aspects in the aircraft state equations that are not commonly found in the literature (when DRL is applied to flight control). The underlying mathematical formulation to the 6-DoF aircraft-controlled dynamics presented here is based on the attitude quaternion; hence, it is general and suitable for all kinds of flight task simulations.

This work validates an AI-based flight control approach in a complex scenario with high performance requirements, namely target following. The control is accomplished by an agent trained with a DRL algorithm, whose applications have been documented in the literature so far only for simpler models and less dynamic operational scenarios. In fact, the FDM implementation of the chosen aeroplane is highly nonlinear and realistic. The application examples shown at the end of the article demonstrate simulation scenarios where significant variations occur in the state space (for instance, altitude and flight Mach number changes) and multiple nonlinear effects come into play.

Finally, an additional contribution to the works available in the literature is the use of an established engineering simulation environment to validate the proposed approach. This, combined with the fact that the FDM includes realistic aerodynamics, propulsion, and FCS models, makes the agent validation closer to those learning applications occurring in experimental environments.

1.6 Article organization

The following sections summarize the main concepts regarding deep reinforcement learning and discuss how this approach has been employed to control an aircraft in atmospheric flight.

The mathematical background of the article is summarized in Sect. 2. The foundations of DRL applied to flight control and the chosen control approach are explained in Sect. 3, where the details about the reward shaping are also presented. Simulation test scenarios that validate the proposed control strategy are presented in Sect. 4. A discussion of the presented results is given in Sect. 5. Finally, the conclusions are presented in Sect. 6. The article also includes three appendices, A, B, and C, with further details on the aerodynamic, propulsive, and flight control system models of the F-16 aircraft.

Fig. 1  Earth-based NED frame \({\mathcal {F}}_\textrm{E}\) and aircraft body-fixed frame \({\mathcal {F}}_\textrm{B}\). Velocity vector \({\varvec{V}}\) of aircraft gravity centre G with respect to the inertial observer. Ground track point \(P_\textrm{GT}\) of the instantaneous gravity centre position \((x_{\textrm{E},G}, y_{\textrm{E},G}, z_{\textrm{E},G})\). Standard definitions of aircraft Euler angles \((\psi ,\theta ,\phi )\) and flight path angles \((\gamma , \chi )\)

2 Mathematical background

2.1 Rigid aeroplane nonlinear 6-DoF flight dynamics model

The 6-DoF atmospheric motion of a rigid aeroplane is governed by a set of nonlinear ordinary differential equations. These equations are standard; therefore, the complete derivation is herein omitted for sake of brevity—deferring the interested reader to the related references [1]. The derivation of such a system assumes a clear identification of a ground-based reference frame and an aircraft-based reference frame. The former is treated as an inertial frame, neglecting the effects of the rotational velocity of the Earth, which is a good approximation when describing the motion of aeroplanes in the lower regions of the Earth’s atmosphere. This approximation is acceptable for subsonic as well as supersonic flight and fits the fighter jet control example considered in this research. These two main reference frames, besides other auxiliary frames, are shown in Fig. 1. The Earth-based inertial frame is named here \({\mathcal {F}}_\textrm{E} = \left\{ O_\textrm{E}; x_\textrm{E}, y_\textrm{E}, z_\textrm{E}\right\} \), having its origin fixed to a convenient point \(O_\textrm{E}\) on the ground and its plane \(x_\textrm{E} y_\textrm{E}\) tangent to the assumed Earth geometric model; the axis \(x_\textrm{E}\) points towards the geographic North, the axis \(y_\textrm{E}\) points towards the East; the axis \(z_\textrm{E}\) points downwards, towards the centre of the Earth. For this reason, the frame \({\mathcal {F}}_\textrm{E}\) is also called tangent NED frame (North-East-Down). The aircraft body-fixed frame \({\mathcal {F}}_\textrm{B} = \left\{ G; x_\textrm{B}, y_\textrm{B}, z_\textrm{B}\right\} \), has its origin located at the centre of gravity G of the aircraft; the roll axis \(x_\textrm{B}\) runs along the fuselage and points out of the nose; the pitch axis \(y_\textrm{B}\) points towards the right wing tip; the yaw axis \(z_\textrm{B}\) points towards the belly of the fuselage.

The motion equations are derived from Newton’s second law applied to the flight of an air vehicle, leading to six core scalar equations (the conservation laws of the linear and angular momentum projected onto the moving frame \({\mathcal {F}}_\textrm{B}\)), followed by the flight path equations (used for navigation purposes, for tracking the aircraft flight in terms of centre-of-gravity position with respect to the Earth-based frame \({\mathcal {F}}_\textrm{E}\)), and by the rigid-body motion kinematic equations (providing a relationship for the aircraft attitude quaternion, used for expressing the orientation of the body axes with respect to the inertial ground frame).

The full set of equations in closed form is introduced in this section. The JSBSim flight dynamics library implements them in a more general form, which is beyond the scope of this article. However, the following presentation serves to highlight the intricacies and the nonlinearities inherent to the control problem explored by the present research.

2.1.1 Conservation of the linear momentum equations (CLMEs)

The conservation of the linear momentum equations (CLMEs) for a rigid aeroplane of constant mass provide the following three core scalar equations [1]:

$$\begin{aligned} {\dot{u}}= & {} \,r\,v - q\,w + \dfrac{1}{m}\left( W_x + F_x^{\mathrm {(A)}} + F_x^{\mathrm {(T)}} \right) \end{aligned}$$
(1a)
$$\begin{aligned} {\dot{v}}= & {} -\,r\,u + p\,w + \dfrac{1}{m}\left( W_y + F_y^{\mathrm {(A)}} + F_y^{\mathrm {(T)}} \right) \end{aligned}$$
(1b)
$$\begin{aligned} {\dot{w}}= & {} \,q\,u - p\,v\, + \dfrac{1}{m}\left( W_z + F_z^{\mathrm {(A)}} + F_z^{\mathrm {(T)}} \right) \end{aligned}$$
(1c)

where \({\varvec{W}}\) is the aircraft weight force, \({\varvec{F}}^{\mathrm {(A)}}\) is the resultant aerodynamic force, and \({\varvec{F}}^{\mathrm {(T)}}\) is the resultant thrust force. Their components in the body frame \({\mathcal {F}}_\textrm{B}\) are conveniently expressed to obtain a closed form of Eqs. (1a)–(1b)–(1c).

The aircraft weight is a vertical force vector, i.e. always aligned to the inertial axis \(z_\textrm{E}\), of constant magnitude mg, that is expressed in terms of its components in body axes as follows:

$$\begin{aligned} \left\{ \begin{array}{c} W_x \\ W_y\\ W_z \end{array}\right\} {=} \left[ T_{\textrm{BE}}\right] \left\{ \begin{array}{c} 0 \\ 0\\ mg \end{array}\right\} {=} \left\{ \begin{array}{c} 2\big (q_z\, q_x{-}q_0\, q_y\big ) \\ 2\big (q_y\, q_z{+}q_0\, q_x\big ) \\ q_0^2{-}q_x^2{-}q_y^2{+}q_z^2 \end{array} \right\} mg\nonumber \\ \end{aligned}$$
(2)

where \([T_{\textrm{BE}}]\) is the direction cosine matrix representing the instantaneous attitude of frame \({\mathcal {F}}_\textrm{B}\) with respect to frame \({\mathcal {F}}_\textrm{E}\). The entries of \([T_{\textrm{BE}}]\) are functions of the aircraft attitude quaternion components \((q_0,q_x,q_y,q_z)\) [1]:

$$\begin{aligned} \left[ T_{\textrm{BE}}\right] {=} \left[ \begin{array}{ccc} q_0^2{+}q_x^2{-}q_y^2{-}q_z^2 &{} 2\big (q_x\, q_y{+}q_0\, q_z\big ) &{} 2\big (q_z\, q_x{-}q_0\, q_y\big ) \\ 2\big (q_x\, q_y{-}q_0\, q_z\big ) &{} q_0^2{-}q_x^2{+}q_y^2{-}q_z^2 &{} 2\big (q_y\, q_z{+}q_0\, q_x\big ) \\ 2\big (q_z\, q_x{+}q_0\, q_y\big ) &{} 2\big (q_y\, q_z{-}q_0\, q_x\big ) &{} q_0^2{-}q_x^2{-}q_y^2{+}q_z^2 \end{array} \right] \nonumber \\ \end{aligned}$$
(3)
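For illustration, Eqs. (2) and (3) translate directly into a few lines of NumPy. The sketch below assumes a unit quaternion ordered as \((q_0, q_x, q_y, q_z)\); it is not part of the JSBSim implementation.

```python
import numpy as np

def dcm_body_from_ned(q: np.ndarray) -> np.ndarray:
    """Direction cosine matrix [T_BE] of Eq. (3) from the attitude
    quaternion q = (q0, qx, qy, qz), assumed to have unit norm."""
    q0, qx, qy, qz = q
    return np.array([
        [q0**2 + qx**2 - qy**2 - qz**2, 2*(qx*qy + q0*qz),             2*(qz*qx - q0*qy)],
        [2*(qx*qy - q0*qz),             q0**2 - qx**2 + qy**2 - qz**2, 2*(qy*qz + q0*qx)],
        [2*(qz*qx + q0*qy),             2*(qy*qz - q0*qx),             q0**2 - qx**2 - qy**2 + qz**2],
    ])

def weight_in_body_axes(q: np.ndarray, m: float, g: float = 9.80665) -> np.ndarray:
    """Body-axis components (W_x, W_y, W_z) of the weight force, Eq. (2):
    the NED vector (0, 0, m*g) rotated into the body frame."""
    return dcm_body_from_ned(q) @ np.array([0.0, 0.0, m * g])
```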

The instantaneous resultant aerodynamic force \({\varvec{F}}^{\mathrm {(A)}}\) acting on the air vehicle, when projected onto \({\mathcal {F}}_\textrm{B}\), is commonly expressed as follows [1]:

$$\begin{aligned} \left\{ \begin{array}{c} F_x^{\mathrm {(A)}} \\ F_y^{\mathrm {(A)}}\\ F_z^{\mathrm {(A)}} \end{array}\right\}= & {} \left[ T_{\textrm{BW}}\right] \left\{ \begin{array}{c} -D \\ -C\\ -L \end{array}\right\} \nonumber \\= & {} \left\{ \begin{array}{c} {-} D \cos \alpha \cos \beta {+} L \sin \alpha {+} C \cos \alpha \sin \beta \\ {-} C \cos \beta {-} D \sin \beta \\ {-} D \sin \alpha \cos \beta {-} L \cos \alpha {+} C \sin \alpha \sin \beta \end{array}\right\} \end{aligned}$$
(4)

where the aerodynamic drag D, the aerodynamic cross force C and the aerodynamic lift L conveniently model the effect of the external airflow, and where

$$\begin{aligned} \left[ T_{\textrm{BW}}\right] = \left[ \; \begin{matrix} \cos \alpha &{} 0 &{} -\sin \alpha \\ 0 &{} 1 &{} 0 \\ \sin \alpha &{} 0 &{} \cos \alpha \end{matrix} \;\right] \left[ \; \begin{matrix} \cos \beta &{} \sin (-\beta ) &{} 0 \\ -\sin (-\beta ) &{} \cos \beta &{} 0 \\ 0 &{} 0 &{} 1 \end{matrix} \;\right] \nonumber \\ \end{aligned}$$
(5)

is the coordinate transformation matrix from the standard wind frame \({\mathcal {F}}_\textrm{W}=\left\{ G; x_\textrm{W}, y_\textrm{W}, z_\textrm{W}\right\} \), commonly used by aerodynamicists (see Fig. 2), to \({\mathcal {F}}_\textrm{B}\).

Fig. 2  (Left) Aerodynamic angles, aerodynamic (or stability) frame \({\mathcal {F}}_\textrm{A}\). (Right) Wind frame \({\mathcal {F}}_\textrm{W}\) and aerodynamic forces (D, C, L)

Equations (1a)–(1b)–(1c) are actually set in closed form because the aerodynamic angles \((\alpha , \beta )\) and the aerodynamic force components (D, C, L) appearing in expressions (4) are modelled as functions of aircraft state variables and of external inputs. According to Fig. 2, the state variables (u, v, w), being defined as components in \({\mathcal {F}}_\textrm{B}\) of the aircraft gravity centre velocity vector \({\varvec{V}}\), are expressed in terms of \((\alpha , \beta )\) as follows:

$$\begin{aligned} u= & {} V \cos \beta \cos \alpha , \quad v = V \sin \beta , \nonumber \\ w= & {} V \cos \beta \sin \alpha \end{aligned}$$
(6)

with

$$\begin{aligned} V= \sqrt{u^2+v^2+w^2} \end{aligned}$$
(7)

Consequently, the instantaneous angles of attack and of sideslip are given by

$$\begin{aligned} \alpha = \arctan \dfrac{w}{u}, \qquad \beta = \arcsin \dfrac{v}{V} \end{aligned}$$
(8)
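As an illustration, Eqs. (6)–(8) can be inverted numerically as in the short sketch below, which recovers the airspeed and the aerodynamic angles from the body-axis velocity components (assuming ordinary flight with a forward velocity component).

```python
import numpy as np

def airspeed_and_aero_angles(u: float, v: float, w: float):
    """Invert Eq. (6): airspeed V (Eq. (7)), angle of attack alpha and
    sideslip angle beta (Eq. (8)), angles in radians."""
    V = np.sqrt(u**2 + v**2 + w**2)
    alpha = np.arctan2(w, u)
    beta = np.arcsin(v / V) if V > 0.0 else 0.0
    return V, alpha, beta
```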

The aerodynamic force components are expressed in terms of their aerodynamic coefficients according to the conventional formulas:

$$\begin{aligned} D= & {} \dfrac{1}{2}\rho V^2 \, S \, C_D,\quad C = \dfrac{1}{2}\rho V^2 \, S \, C_C,\nonumber \\ L= & {} \dfrac{1}{2}\rho V^2 \, S \, C_L \end{aligned}$$
(9)

where the external air density \(\rho \) is a known function of the flight altitude \(h=-z_{\textrm{E},G}\) (along with other gas properties, such as the sound speed a) [31], S is a constant reference area, and coefficients \((C_D, C_C, C_L)\) are modelled as functions of aircraft state variables and external inputs. Appendix A presents the details of a nonlinear aerodynamic model for the F-16 fighter jet.

Fig. 3  Thrust vector, thrust magnitude T, thrust line angle \(\mu _T\), thrust line eccentricity \(e_T\)

Finally, according to Fig. 3, the resultant thrust force \({\varvec{F}}^{\mathrm {(T)}}\), which is a vector of magnitude T, can be expressed in terms of its body-frame components, in the case of symmetric propulsion (zero y-component), as follows:

$$\begin{aligned} \left\{ \begin{array}{c} F_x^{\mathrm {(T)}} \\ F_y^{\mathrm {(T)}}\\ F_z^{\mathrm {(T)}} \end{array}\right\} = \delta _T \, T_{\max }(h, M) \left\{ \begin{array}{c} \cos \mu _\textrm{T} \\ 0\\ \sin \mu _\textrm{T} \end{array}\right\} \end{aligned}$$
(10)

where \(\mu _T\) is a known constant angle formed by the thrust line in the aircraft symmetry plane with the reference axis \(x_\textrm{B}\), \(T= \delta _T \, T_{\max }(h, M)\), \(\delta _T\) is the throttle setting (an external input to the system), and \(T_{\max }(h, M)\) is the maximum available thrust, i.e. a known function of altitude and flight Mach number \(M=V/a\). Appendix B presents the details of a nonlinear thrust model for the F-16 fighter jet.
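A minimal sketch of Eq. (10) follows. The lookup function `t_max(h, mach)` is a placeholder standing in for the tabulated engine model of Appendix B, not an actual JSBSim interface.

```python
import numpy as np

def thrust_in_body_axes(delta_t: float, h: float, mach: float,
                        mu_t: float, t_max) -> np.ndarray:
    """Body-axis thrust components of Eq. (10): the throttle setting delta_t
    scales the maximum available thrust t_max(h, mach), and the constant
    thrust-line angle mu_t (rad) distributes it between the x_B and z_B axes."""
    thrust = delta_t * t_max(h, mach)
    return thrust * np.array([np.cos(mu_t), 0.0, np.sin(mu_t)])
```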

2.1.2 Conservation of the angular momentum equations (CAMEs)

The conservation of the angular momentum equations (CAMEs) for a rigid aeroplane of constant mass are the following [1]:

$$\begin{aligned} {\dot{p}}= & {} \big (C_{1}\,r + C_{2}\,p\big )q + C_{3}\,{\mathcal {L}} + C_{4}\,{\mathcal {N}} \end{aligned}$$
(11a)
$$\begin{aligned} {\dot{q}}= & {} C_{5}\,p\,r - C_{6}\,\big ( p^{2} - r^{2} \big ) + C_{7}\,{\mathcal {M}} \end{aligned}$$
(11b)
$$\begin{aligned} {\dot{r}}= & {} \big (C_{8}\,p - C_{2}\,r \big )\,q + C_{4}\,{\mathcal {L}} + C_{9}\,{\mathcal {N}} \end{aligned}$$
(11c)

where

$$\begin{aligned} C_{1}= & {} \dfrac{1}{\Gamma }\big [ (I_{yy} - I_{zz})I_{zz} - I^2_{xz} \big ],\nonumber \\ C_{2}= & {} \dfrac{1}{\Gamma }\big [ (I_{xx} - I_{yy} + I_{zz})I_{xz}\big ] \end{aligned}$$
(12a)
$$\begin{aligned} C_{3}= & {} \dfrac{I_{zz}}{\Gamma },\quad C_{4} = \dfrac{I_{xz}}{\Gamma },\quad C_{5} = \dfrac{I_{zz}-I_{xx}}{I_{yy}} \end{aligned}$$
(12b)
$$\begin{aligned} C_{6}= & {} \dfrac{I_{xz}}{I_{yy}},\quad C_{7} = \dfrac{1}{I_{yy}} ,\nonumber \\ C_{8}= & {} \dfrac{1}{\Gamma } \big [(I_{xx}-I_{yy})I_{xx} + I^2_{xz} \big ],\quad C_{9} = \dfrac{I_{xx}}{\Gamma } \end{aligned}$$
(12c)

and \(\Gamma = I_{xx}I_{zz} - I^2_{xz}\) are constants of the model known from the rigid aeroplane inertia matrix calculated with respect to the axes of \({\mathcal {F}}_\textrm{B}\).
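Since the constants of Eqs. (12a)–(12c) depend only on the inertia properties, they can be precomputed once, as in the following sketch (the inertia values are generic inputs; no specific F-16 data are implied).

```python
def inertia_constants(Ixx: float, Iyy: float, Izz: float, Ixz: float):
    """Constants C1..C9 of Eqs. (12a)-(12c) entering the CAMEs (11a)-(11c)."""
    gamma = Ixx * Izz - Ixz**2
    c1 = ((Iyy - Izz) * Izz - Ixz**2) / gamma
    c2 = (Ixx - Iyy + Izz) * Ixz / gamma
    c3 = Izz / gamma
    c4 = Ixz / gamma
    c5 = (Izz - Ixx) / Iyy
    c6 = Ixz / Iyy
    c7 = 1.0 / Iyy
    c8 = ((Ixx - Iyy) * Ixx + Ixz**2) / gamma
    c9 = Ixx / gamma
    return c1, c2, c3, c4, c5, c6, c7, c8, c9
```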

The instantaneous resultant external moment \({\varvec{M}}\) about the pole G acting on the air vehicle is the sum of the resultant aerodynamic moment \({\varvec{M}}^\mathrm {(A)}\) and of the resultant moment \({\varvec{M}}^\mathrm {(T)}\) due to thrust line eccentricity with respect to G. When \({\varvec{M}}\) is projected onto \({\mathcal {F}}_\textrm{B}\), it is commonly expressed as:

$$\begin{aligned} \left\{ \begin{array}{c} {\mathcal {L}} \\ {\mathcal {M}}\\ {\mathcal {N}} \end{array}\right\} = \left\{ \begin{array}{c} {\mathcal {L}}^{\mathrm {(A)}} \\ {\mathcal {M}}^{\mathrm {(A)}}\\ {\mathcal {N}}^{\mathrm {(A)}} \end{array}\right\} + \left\{ \begin{array}{c} 0 \\ {\mathcal {M}}^{\mathrm {(T)}}\\ 0 \end{array}\right\} \end{aligned}$$
(13)

in the case of symmetric propulsion (thrust line in the aircraft symmetry plane).

Equations (11) are actually set in closed form because the aerodynamic moments about the roll, pitch and yaw axes \(({\mathcal {L}}^{\mathrm {(A)}}, {\mathcal {M}}^{\mathrm {(A)}}, {\mathcal {N}}^{\mathrm {(A)}})\) are modelled as functions of aircraft state variables and of external inputs. The same applies to the pitching moment \({\mathcal {M}}^{\mathrm {(T)}}\).

The body-axis components of \({\varvec{M}}^{\mathrm {(A)}}\) are commonly expressed in terms of their coefficients according to the conventional formulas:

$$\begin{aligned} \left\{ \begin{array}{c} {\mathcal {L}}^{\mathrm {(A)}} \\ {\mathcal {M}}^{\mathrm {(A)}}\\ {\mathcal {N}}^{\mathrm {(A)}} \end{array}\right\} = \dfrac{1}{2}\rho V^2 \, S \, \left\{ \begin{array}{c} b\,C_{\mathcal {L}} \\ {\bar{c}}\,C_{\mathcal {M}}\\ b\,C_{\mathcal {N}} \end{array}\right\} \end{aligned}$$
(14)

where b and \({\bar{c}}\) are reference lengths known from the aeroplane’s geometry. Appendix A presents a high-fidelity nonlinear model of the aerodynamic roll-, pitch-, and yaw-moment coefficients \((C_{\mathcal {L}},C_{\mathcal {M}},C_{\mathcal {N}})\) for the F-16 fighter jet.

The pitching moment due to thrust \({\mathcal {M}}^{\mathrm {(T)}}\) is given by the direct action of the thrust vector:

$$\begin{aligned} {\mathcal {M}}^{\mathrm {(T)}} = T \, e_T \end{aligned}$$
(15)

where the eccentricity \(e_T\) is a known parameter, taken as positive when the thrust line is located beneath the centre of gravity (the configuration shown in Fig. 3 corresponds to \(e_T<0\)).

2.1.3 Flight path equations (FPEs)

Systems (1a)–(1b)–(1c) and (11a)–(11b)–(11c) of CLMEs and CAMEs projected onto the moving frame \({\mathcal {F}}_\textrm{B}\) must necessarily be augmented with two additional systems of equations to solve the aircraft dynamics and propagate its state in time. One such system is needed for expressing the trajectory of the aircraft with respect to the Earth-based inertial frame. The flight path equations (FPEs) are specifically used for this purpose. The outputs of this system of differential equations form the instantaneous position \(\{x_{\textrm{E},G}(t),y_{\textrm{E},G}(t),z_{\textrm{E},G}(t)\}\) of the aircraft centre of gravity G in the frame \({\mathcal {F}}_\textrm{E}\). The reduced 2D version \(\{x_{\textrm{E},G}(t),y_{\textrm{E},G}(t)\}\) of the FPEs provides the so-called ground track of the aircraft flight.

The FPEs are simply derived from the component transformation of vector \({\varvec{V}}\) from frame \({\mathcal {F}}_\textrm{B}\) to frame \({\mathcal {F}}_\textrm{E}\)

$$\begin{aligned} \left\{ \begin{array}{c} {\dot{x}}_{\textrm{E},G} \\ {\dot{y}}_{\textrm{E},G} \\ {\dot{z}}_{\textrm{E},G} \end{array} \right\} = \left[ T_{\textrm{EB}}\right] \left\{ \begin{array}{c} u \\ v \\ w \end{array}\right\} \end{aligned}$$
(16)

knowing that \(\left[ T_{\textrm{EB}}\right] = \left[ T_{\textrm{BE}}\right] ^{\textrm{T}}\) and accounting for definition (3). The FPEs are then written in matrix format as follows:

$$\begin{aligned}{} & {} \left\{ \begin{array}{c} {\dot{x}}_{\textrm{E},G} \\ {\dot{y}}_{\textrm{E},G} \\ {\dot{z}}_{\textrm{E},G} \end{array} \right\} = \left[ \begin{array}{ccc} q_0^2{+}q_x^2{-}q_y^2{-}q_z^2 &{} 2\big (q_x\, q_y{-}q_0\, q_z\big ) &{} 2\big (q_z\, q_x{+}q_0\, q_y\big ) \\ 2\big (q_x\, q_y{+}q_0\, q_z\big ) &{} q_0^2{-}q_x^2{+}q_y^2{-}q_z^2 &{} 2\big (q_y\, q_z{-}q_0\, q_x\big ) \\ 2\big (q_z\, q_x-q_0\, q_y\big ) &{} 2\big (q_y\, q_z+q_0\, q_x\big ) &{} q_0^2{-}q_x^2{-}q_y^2{+}q_z^2 \end{array} \right] {}\nonumber \\{} & {} \qquad \left\{ \begin{array}{c} u \\ v\\ w \end{array} \right\} \end{aligned}$$
(17)

The inputs for the FPEs are the aircraft attitude quaternion components along with the components (u, v, w), which are provided by the solution of the combined (CLMEs)-(CAMEs) system.

The quaternion components, in turn, are derived from the body-frame components (p, q, r) of the aircraft angular velocity vector \(\varvec{\Omega }\) through the solution of another set of equations introduced next.

2.1.4 Kinematic equations (KEs)

The rigid-body kinematic equations (KEs) based on the aircraft attitude quaternion components [1] are written in matrix format as follows:

$$\begin{aligned} \left\{ \begin{array}{c} {\dot{q}}_0 \\ {\dot{q}}_x \\ {\dot{q}}_{y} \\ {\dot{q}}_{z} \end{array} \right\} = \frac{1}{2}\, \left[ \begin{array}{cccc} 0 &{} -p &{} -q &{} -r \\ p &{} \,\,\,0 &{} \,\,\,r &{} -q \\ q &{} -r &{} \,\,\,0 &{} \,\,\,p \\ r &{} \,\,\,q &{} -p &{} \,\,\,0 \end{array} \right] \left\{ \begin{array}{c} q_0 \\ q_x \\ q_{y} \\ q_{z} \end{array} \right\} \end{aligned}$$
(18)

The inputs to the KEs are the angular velocity components (p, q, r) in \({\mathcal {F}}_\textrm{B}\). Their solution provides the numerical values of the kinematic state variables \((q_0,q_x,q_y,q_z)\).
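A compact sketch of Eq. (18) is given below. The explicit renormalization is not part of Eq. (18) itself, but it is commonly applied after each numerical integration step to keep the quaternion at unit norm.

```python
import numpy as np

def quaternion_rates(quat: np.ndarray, p: float, q: float, r: float) -> np.ndarray:
    """Quaternion kinematic equations, Eq. (18): time derivative of
    (q0, qx, qy, qz) given the body-axis angular rates (p, q, r)."""
    omega = np.array([
        [0.0, -p,  -q,  -r],
        [p,    0.0, r,  -q],
        [q,   -r,   0.0, p],
        [r,    q,  -p,   0.0],
    ])
    return 0.5 * omega @ quat

def renormalize(quat: np.ndarray) -> np.ndarray:
    """Keep the attitude quaternion at unit norm after integration."""
    return quat / np.linalg.norm(quat)
```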

Fig. 4  Aircraft Euler angles and attitude with respect to the Earth frame

Fig. 5  Simplified scheme of a General Dynamics F-16 flight control system. See Appendices B and C for more details on the flight controller and engine controller

The above system is the set of differential equations of choice, nowadays, in large-scale simulations because it prevents the singularity known as ‘gimbal lock’ associated with the alternative formulation based on aircraft Euler angles. However, at each instant t when a quadruplet of quaternion components is known, the Euler angles \((\psi ,\theta ,\phi )\), shown in Fig. 4, are calculated according to a standard algorithm [32].

2.1.5 Summary of the equations and of system inputs

The system of (CLMEs)-(CAMEs)-(FPEs)-(KEs), that is, (1)–(11)–(17)–(18), is the full set of 13 coupled nonlinear differential equations governing the 6-DoF, rigid-body dynamics of atmospheric flight. They are in a closed form once the aerodynamic as well as the propulsive external forces and moments are completely modelled as functions of 13 state variables

$$\begin{aligned} {\varvec{x}}= \big [u,v,w,\,p,q,r,\,x_{\textrm{E},G},y_{\textrm{E},G},z_{\textrm{E},G},\,q_0,q_x,q_y,q_z\big ]^{\textrm{T}}\nonumber \\ \end{aligned}$$
(19)

i.e. of a state vector \({\varvec{x}}\), and of a number of external inputs grouped into an input vector, commonly known as vector \({\varvec{u}}\).

The F-16 public-domain model used for this research features a rather elaborate, high-fidelity flight control system (FCS), whose simplified scheme is depicted in Fig. 5. The FCS, which receives state feedback from the aircraft dynamics block, includes the following channels: (i) Roll command \({\tilde{\delta }}_\textrm{a}\) (acting on the right aileron deflection angle \(\delta _\textrm{a}\) and on the antisymmetric left aileron deflection), (ii) Pitch command \({\tilde{\delta }}_\textrm{e}\) (acting on elevon deflection angle \(\delta _\textrm{e}\)), (iii) Yaw command \({\tilde{\delta }}_\textrm{r}\) (acting on rudder deflection angle \(\delta _\textrm{r}\)), (iv) Throttle lever command \({\tilde{\delta }}_T\) (acting on throttle setting \(\delta _T\), with the possibility to trigger the jet engine afterburner), (v) Speed brake command \({\tilde{\delta }}_\textrm{sb}\) (acting on speed brake deflection angle \(\delta _\textrm{sb}\)), (vi) Wing trailing-edge flap command \({\tilde{\delta }}_\textrm{f,TE}\) (acting on trailing-edge flap deflection angle \(\delta _\textrm{f,TE}\)) and (vii) Wing leading-edge flap command \({\tilde{\delta }}_\textrm{f,LE}\) (acting on leading-edge flap deflection angle \(\delta _\textrm{f,LE}\)).

Some of these channels are associated with actual pilot input command signals, namely the primary flight controls \({\tilde{\delta }}_\textrm{a}\), \({\tilde{\delta }}_\textrm{e}\), \({\tilde{\delta }}_\textrm{r}\), \({\tilde{\delta }}_T\). The remaining signals, for instance \({\tilde{\delta }}_\textrm{f,TE}\), \({\tilde{\delta }}_\textrm{f,LE}\) and \({\tilde{\delta }}_\textrm{sb}\), are mainly actuated and controlled by the FCS. Additional details on the FCS, including a summary of the control logic, are reported in Appendix C.

Fig. 6  General Dynamics F-16 Fighting Falcon. Aerodynamic control surface deflections

A subset of the above inputs forms the vector \(\varvec{{\tilde{u}}}_\textrm{agt}\) of normalized commands used by the agent to interact with the system in all training sessions:

$$\begin{aligned} \varvec{{\tilde{u}}}_\textrm{agt} = \big [{{\tilde{\delta }}}_{\textrm{a}},{{\tilde{\delta }}}_{\textrm{e}},{{\tilde{\delta }}}_{\textrm{r}},{\tilde{\delta }}_T\big ]^{\textrm{T}} \end{aligned}$$
(20)

The full input vector to the nonlinear aircraft flight dynamics model during all simulations performed in this work—resulting from the action of the agent in combination with the FCS logics—is then the following:

$$\begin{aligned} {\varvec{u}} = \big [\delta _{\textrm{a}},\delta _{\textrm{e}},\delta _{\textrm{r}},\delta _{T},\delta _\textrm{f,TE},\delta _\textrm{f,LE}, \delta _\textrm{sb}\big ]^{\textrm{T}} \end{aligned}$$
(21)

The deflection angles of the F-16 aerodynamic control surfaces are depicted in Fig. 6.

3 Deep reinforcement learning approach applied to flight control

There is a growing interest, nowadays, in AI-based pilot models, which are going to augment the capabilities of manned legacy fighters and might enable them to compete with next-generation air dominance systems. This is the main motivation for the present research. A related motivation lies in the possibility of implementing more effective AI-assisted pilot training procedures for high-precision tasks, such as dogfights and formation flights [20, 25,26,27].

DRL is a promising approach to be combined with established flight control design techniques, and it provides the following advantages: (i) it can make a flight control system learn optimal policies, by interacting with the environment, when the aircraft nonlinear dynamics are not completely known or are difficult to model; (ii) it can deal with changing situations in uncertain and strongly dynamic environments. In this section, we recall the basic concepts needed to explain how DRL, in particular the DDPG algorithm, is applicable to the flight control scenarios presented later in the article.

3.1 The reinforcement learning framework

RL is a significant branch of machine learning concerned with learning from experience the control laws and policies that make a dynamical system interact with a complex environment so as to accomplish a given task. Both control theory and machine learning fundamentally rely on optimization, and likewise RL involves a set of optimization techniques within an observational framework for learning how to interact with the environment. In this sense, RL stands at the intersection of control theory and machine learning. This is explained rigorously in the well-known introductory book by Sutton and Barto [8]. In their recent textbook on data-driven science and engineering, Brunton and Kutz [33] use a modern and unified notation to present an overview of state-of-the-art RL techniques applied to various fields, including a mathematical formulation for DRL. The reader may refer to these references for a comprehensive explanation of the theory behind all variants of RL approaches.

RL methods apply to problems where an agent interacts with its environment in discrete time steps [8, 33], as depicted in Fig. 7. At time step k, the agent senses the state \({\varvec{s}}_k\) of its environment and decides to perform an action \({\varvec{a}}_k\) that leads to a new state \({\varvec{s}}_{k+1}\) at the next time step, obtaining a reward \(r_{k}\). The scheme is reiterated with advancing time steps, \(k\leftarrow k+1\), and the RL agent learns to take appropriate actions to achieve optimal immediate or delayed rewards.
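In code, this interaction scheme reduces to a simple loop. The sketch below is purely illustrative: `env` and `agent` are placeholder objects with the interface assumed in the comments, not the MATLAB/Simulink tooling actually used in this work.

```python
def run_episode(env, agent, max_steps: int = 1000) -> float:
    """One RL episode: the agent observes s_k, applies a_k, and receives r_k
    (see Fig. 7). `env.reset()` returns the initial state; `env.step(a)` is
    assumed to return (next_state, reward, done)."""
    state = env.reset()
    total_reward = 0.0
    for k in range(max_steps):
        action = agent.act(state)                         # a_k from the current policy
        next_state, reward, done = env.step(action)       # environment transition
        agent.observe(state, action, reward, next_state)  # store experience / learn
        total_reward += reward
        state = next_state
        if done:                                          # terminal state reached
            break
    return total_reward
```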

Fig. 7  Schematic of reinforcement learning, in the particular case of an agent’s policy \(\pi \) represented by a deep neural network (adapted from [33]). In the case of flight control, the action \({\varvec{a}}\) is the input vector \(\varvec{{\tilde{u}}}_\textrm{agt}\) defined by Eq. (20); the observed state is the vector \({\varvec{s}} = [{\varvec{x}}, \dot{{\varvec{x}}},{\varvec{u}},\dot{{\varvec{u}}}]^{\textrm{T}}\), where \({\varvec{x}}\) and \({\varvec{u}}\) are defined by Eqs. (19) and (21), respectively

The agent’s behaviour over time is given by the finite sequence of states and actions \({\varvec{e}}= \big [({\varvec{s}}_0,{\varvec{a}}_0), ({\varvec{s}}_1,{\varvec{a}}_1), \ldots ,({\varvec{s}}_{\mathcal {T}},{\varvec{a}}_{\mathcal {T}})\big ]\), also called an episode (or trajectory), where \({\mathcal {T}}\) is the episode’s final time step. An episode may end when a target terminal state is reached, or else when \({\mathcal {T}}\) becomes the maximum allowed number of time steps.

In RL, the environment and the agent are dynamic. The goal of learning is to find an optimal policy, i.e. a control law (or behaviour) that generates an action sequence able to obtain the best possible outcome. A behaviour is optimal when it collects the highest cumulated reward. This is accomplished by allowing a piece of software, the agent, to explore the environment, interact with it, and learn by observing how it evolves over time. Whenever the agent takes an action, it affects the environment, which transitions to a new state; the environment may change even when the agent takes no action. Within an entire episode, the evolving environment produces a sequence of rewards, and using this information the agent can adjust its future behaviour and learn from the process. In the particular scheme of Fig. 7, for the sake of example, the agent is a deep neural network that receives a large number of input signals and then adjusts its parameters \(\varvec{\vartheta }\) during training so that it can control the environment as requested. In the specific environment of an aeroplane in controlled flight, the action \({\varvec{a}}\) is the control input vector \(\varvec{{\tilde{u}}}_\textrm{agt}\) defined by Eq. (20), while the observed state \({\varvec{s}} = [{\varvec{x}}, \dot{{\varvec{x}}},{\varvec{u}},\dot{{\varvec{u}}}]^{\textrm{T}}\) incorporates the vectors \({\varvec{x}}\) and \({\varvec{u}}\) (Eqs. (19) and (21), respectively) and their time derivatives. More details are given in Sect. 3.7 and in Fig. 9.

An important point to observe is that in RL the environment is everything that exists outside of the agent. Practically speaking, it is where the agent sends actions and what generates rewards and observations. In this context, the notion of environment differs from what control engineers usually mean by the term, namely everything outside of the controller and the plant. In classical flight control approaches, disturbances such as wind gusts that impact the system represent the environment. In RL, the environment is everything outside the controller, as shown in Fig. 5, and this includes the plant dynamics as well. The agent is just the piece of software that generates the actions and updates the control laws through learning. In the present case study, the agent acts like the ‘brain’ of a pilot learning to control the aircraft.

The state \({\varvec{s}}\) is a collection of variables observed by the agent while interacting with the environment, which is taken from the list \(({\varvec{x}}, {\varvec{u}})\) defined by (19) and (21). For the agent that controls the F-16 flight dynamics, the action \({\varvec{a}}\) is given by the inputs \(\varvec{{\tilde{u}}}_\textrm{agt}\) defined by (20).

An important characteristic of reinforcement learning is that the agent is not required to know in advance the environment’s dynamics. It will learn on its own only by observation, provided that it has access to the states and that rewards are appropriately calculated. This makes RL a powerful approach that can lead to an effective model-free control design. In flight control, for instance, the learning agent does not need to initially know anything about a flying vehicle. It will still figure out how to collect rewards without knowing the aircraft weight, how the aerosurfaces move, or how effective they are. The training technique used in this research applies a method designed originally for model-free applications to a problem of flight control where a high-fidelity model of the environment is known.

An RL framework can be applied to a simulated environment, which must be modelled, or even to a physical environment. In the present study, the agent is trained within a high-fidelity model of the environment; hence, the learning experience is carried out with flight simulations. Learning is a process that often requires millions or tens of millions of samples, which means a huge number of trial simulations, error calculations, and corrections. The obvious advantage of a simulated environment is that the learning process can be run faster than real time and that simulations may be executed in parallel on multiprocessor machines or GPU clusters.

An important step in RL-based control design workflows is the deployment of the control algorithm on the target hardware or within a chosen simulation environment. When learning ends successfully, the agent’s policy can be considered optimal and frozen, and a static instance of the policy can be deployed onto the target as any other form of control law. Examples of use of an optimized agent deployed in different simulation environments are presented in Sect. 4.

3.2 Rewards

Given the environment, the RL problem formulation must define how the agent should behave and how it will be rewarded for doing what the assigned task prescribes. The reward can be earned at every time step (immediate), or be sparse (delayed), or can only come at the very end of an episode after long periods of time. In the F-16 flight simulations, an immediate reward is available at each environment transition from time \(t_k\) to \(t_{k+1}\) as a function \(r({\varvec{s}}_k, {\varvec{a}}_k, {\varvec{s}}_{k+1})\) of the two consecutive states and of the action that causes the transition.

In RL, there is no restriction on the form of the reward function, unlike in optimal control formulations such as LQR, where the cost function is quadratic. The reward can be calculated from a nonlinear function or using thousands of parameters. It depends entirely on what it takes to effectively train the agent. For instance, if a flight attitude is requested, with a prescribed heading \(\psi _\textrm{c}\), then one should intuitively give more reward to the agent as the heading angle \(\psi \) gets closer to the commanded direction. If one wants to take controller effort into account, then reward should be subtracted as actuator use increases.

A reward function can really be any function one can think of, but crafting an effective one requires ingenuity. Unfortunately, there is no straightforward way of shaping a reward that guarantees the agent will converge on the desired control. The definition of a reward calculation function appropriate to a prescribed control task is called reward shaping and is one of the most difficult tasks in RL. Section 3.11 introduces a convenient reward calculation function, which can be adopted in a flight control task where a target altitude and heading are commanded.
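As an illustration only (this is not the reward function of Sect. 3.11), a shaped reward for an altitude- and heading-hold task could penalize the weighted tracking errors together with the actuator effort; the gains below are arbitrary placeholders.

```python
import numpy as np

def step_reward(altitude: float, altitude_cmd: float,
                heading: float, heading_cmd: float,
                commands: np.ndarray,
                k_h: float = 1e-3, k_psi: float = 1.0, k_u: float = 0.1) -> float:
    """Illustrative shaped reward: negative weighted altitude and heading
    errors plus a penalty on actuator use (all gains are placeholders)."""
    heading_err = np.arctan2(np.sin(heading - heading_cmd),
                             np.cos(heading - heading_cmd))  # wrap to [-pi, pi]
    altitude_err = altitude - altitude_cmd
    effort = float(np.sum(commands**2))
    return -(k_h * abs(altitude_err) + k_psi * abs(heading_err) + k_u * effort)
```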

3.3 Policy function

Besides the environment that provides the rewards, an RL-based control must have a well-designed agent. The agent comprises its policy and the learning algorithm, two things that are closely intertwined. Many learning algorithms require a specific policy structure, and the choice of algorithm depends on the nature of the environment.

A policy \(\pi ({\varvec{s}},{\varvec{a}})\) defines the behaviour of the agent, determining which action should be taken in each state. Formally speaking, it is a function \(\pi :({\varvec{s}},{\varvec{a}}) \mapsto [0,1]\) that maps a state–action pair to a scalar value between 0 and 1, regarded as the conditional probability of taking the action \({\varvec{a}}\) while observing the state \({\varvec{s}}\) (stochastic policy). A policy function can also be deterministic, in which case, given a state \({\varvec{s}}\), the action \({\varvec{a}}\) is nonrandom, i.e. \(\pi : {\varvec{s}} \mapsto {\varvec{a}}\).

The term ‘policy’ means that the agent has to make a decision based on the current state. Training an agent to accomplish a given task in the simulated environment means generating many simulated episodes and updating the policy parameters along the way so as to maximize the policy performance. The policy function tells the agent what to do, and learning a policy function is the most important part of an RL algorithm.

As shown in Fig. 7, the policy is a function that takes in state observations and outputs actions; therefore, any function with that input-output relationship can work. Environments with a continuous state–action space require a continuous policy function, and it makes sense to represent \(\pi \) with a general-purpose function approximator, i.e. something that can handle a continuous state \({\varvec{s}}(t)\) and a continuous action \({\varvec{a}}(t)\) without having to set the nonlinear function structure ahead of time. This is accomplished with deep neural networks, the function approximation approach which forms the basis of DRL. The training algorithm selected for this research, known as DDPG and available in the MATLAB/Simulink Reinforcement Learning Toolbox, adopts a deterministic policy, i.e. a function \(\pi _{\varvec{\vartheta }}({\varvec{s}})\) parameterized by a finite vector \(\varvec{\vartheta }\). With this structure, hundreds of time samples of the environment state are used, gathering multi-dimensional observations of the F-16 in simulated flight as the input to \(\pi _{\varvec{\vartheta }}\), which outputs the actuator commands that drive the aerosurface movements and the thrust lever. Even though the required mapping might be extremely complex, a neural network of some kind can approximate it.

3.4 Value function

The policy is used to explore the environment to generate episodes and calculate rewards. A performance index \(R_k\) of the policy at time step k is also called discounted return, i.e. the weighted sum of all rewards that the agent is expected to receive from time \(t_k\) onwards:

$$\begin{aligned} R_k = \sum _{i=0}^{{\mathcal {T}}} \gamma ^i \, r_{k+i} \end{aligned}$$
(22)

where \(\gamma \in \,]0,1]\) is called the discount rate. The discount rate is a hyperparameter tuned by the user, and it represents how much future events lose their value in terms of policy performance according to how far away in time they are. Future rewards are discounted, reflecting the economic principle that current rewards are more valuable than future rewards. In general, \(R_k\) is a random variable resulting from future states and actions that are unknown at time \(t_k\) during learning.
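In code, Eq. (22) is a one-liner; the values in the usage comment merely show the effect of the discount.

```python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Discounted return R_k of Eq. (22) for the reward sequence
    [r_k, r_{k+1}, ...] collected from time step k onwards."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Example: discounted_return([1.0, 1.0, 1.0], gamma=0.9) -> 1 + 0.9 + 0.81 = 2.71
```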

Given a policy \(\pi \), a value function is defined as the function that quantifies the desirability of being in a given state:

$$\begin{aligned} V_{\pi }({\varvec{s}}) = {\mathbb {E}}_{\pi }\big (\,R_k \mid {\varvec{s}}_k = {\varvec{s}}\,\big ) \end{aligned}$$
(23)

where \({\mathbb {E}}_{\pi }\) indicates the expectation of the return accumulated over the time steps from k to \({\mathcal {T}}\), when the state at time step k is \({\varvec{s}}\). When the subscript \(\pi \) is omitted from the notation \(V_\pi \), one refers to the value function for the best possible policy:

$$\begin{aligned} V({\varvec{s}}) = \max _{\pi } V_{\pi }({\varvec{s}}) \end{aligned}$$
(24)

One of the most important properties of the value function is that the value \(V({\varvec{s}}_k)\) at a time step k—also called V-value—is given by an elegant recursive formula known as Bellman equation for V:

$$\begin{aligned} V({\varvec{s}}) = \max _{\pi }\, {\mathbb {E}}_{\pi }\big [\,r_0 + \gamma V({\varvec{s}}')\,\big ] \end{aligned}$$
(25)

where \({\varvec{s}}' = {\varvec{s}}_{k+1}\) is the next state after \({\varvec{s}} = {\varvec{s}}_k\) given the action \({\varvec{a}}_k\) rewarded with \(r_0=r_k\), and the expectation is over actions selected by the optimal policy \(\pi _\star \). The value function V is the unique solution to its Bellman equation, which forms the basis of a number of ways to compute, approximate, and learn V. This formula, besides its implications to modern RL methods, is derived and extensively explained in [8].

Learning the value function V and jointly the optimal policy:

$$\begin{aligned} \pi _\star = \mathop {\mathrm{arg\,max}}\limits _{\pi }\, {\mathbb {E}}_{\pi }\big [\,r_0 + \gamma V({\varvec{s}}')\,\big ] \end{aligned}$$
(26)

is the central challenge in RL. All methods that find an optimal value function V and the corresponding optimal policy \(\pi _\star \) in a two-step procedure based on (25) and (26) are called policy iteration learning algorithms. A large number of trials must often be evaluated in order to determine an optimal policy by iteration; in practice, reinforcement learning may become very expensive to train. Yet, RL is well suited for the flight control problem stated in this research, where a model of the environment is known and evaluating a policy is relatively affordable, as there are sufficient resources to perform a near brute-force optimization.

In the present research, the agent learns through a hybrid technique that mixes policy-iteration learning with a strategy known as Q-learning (see Sect. 3.6). Moreover, all complexities in prediction functions are delegated to function approximators based on deep neural networks, to learn the policy \(\pi _\star \), the value function V, and the quality function Q. The latter is introduced in the next subsection.

3.5 Quality function

The quality function, or Q-value, of a state–action pair \(({\varvec{s}},{\varvec{a}})\) is defined as the expected value

$$\begin{aligned} Q({\varvec{s}}, {\varvec{a}}) = {\mathbb {E}}\big [\,r({\varvec{s}},{\varvec{a}},{\varvec{s}}') + \gamma V({\varvec{s}}')\,\big ] \end{aligned}$$
(27)

when at time step k the state is \({\varvec{s}}\), the agent is assumed to follow the optimal policy, and the generic action \({\varvec{a}}\) is taken. This leads to the next state \({\varvec{s}}'\), resulting in the immediate reward \(r({\varvec{s}},{\varvec{a}},{\varvec{s}}')\) and a discounted future cumulated reward \(\gamma V({\varvec{s}}')\).

While the value function V tells what is the value of being in a current state \({\varvec{s}}\), the function Q, also called the action-value function, is a joint quality or value of taking an action \({\varvec{a}}\) given a current state \({\varvec{s}}\). The quality function is richer, as it quantifies the value of being in a given state for any action the agent might take. For this reason, Q is sometimes also called the ‘critic’, because it can look at the possible actions and be used to criticize the agent’s choices. This function can be approximated by a deep neural network as well.

The optimal policy \(\pi _\star ({\varvec{s}},{\varvec{a}})\) and the optimal value function \(V({\varvec{s}})\) contain redundant information, as one can be determined from the other via the quality function \(Q({\varvec{s}},{\varvec{a}})\):

$$\begin{aligned} \pi _\star ({\varvec{s}}, {\varvec{a}}) = \mathop {\mathrm{arg\,max}}\limits _{{\varvec{a}}} Q({\varvec{s}}, {\varvec{a}}),\qquad V({\varvec{s}}) = \max _{{\varvec{a}}} Q({\varvec{s}}, {\varvec{a}})\nonumber \\ \end{aligned}$$
(28)
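For a tabular Q with discrete states and actions, the two relations in Eq. (28) reduce to a row-wise arg-max and max; the toy table below uses arbitrary values purely for illustration.

```python
import numpy as np

# Toy tabular Q(s, a): 4 discrete states, 3 discrete actions (arbitrary values).
Q = np.array([[0.1, 0.5, 0.2],
              [0.7, 0.3, 0.0],
              [0.2, 0.2, 0.9],
              [0.4, 0.1, 0.3]])

pi_star = np.argmax(Q, axis=1)  # optimal action in each state (left relation)
V = np.max(Q, axis=1)           # optimal value of each state (right relation)
```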

This formulation is used to define the Q-learning strategy, which is recalled in the next subsection.

3.6 Temporal difference and Q-learning

In an approach to learning from trial-and-error experience, the value function V or quality function Q is learned through a repeated evaluation of many policies [8, 33]. In this work, the chosen learning process is not episodic (does not wait for the end of a control trajectory to update the policy), but instead, it is implemented in such a way as to learn continuously by bootstrapping. This technique is based on current estimates of V or Q, which are then repeatedly updated by scanning the successive states in the same control trajectory. In the simplest case, at each iteration of this learning technique, the value function is updated by means of a one-step look ahead, namely a value prediction for the next state \({\varvec{s}}'\) given the current state \({\varvec{s}}\) and action \({\varvec{a}}\). This approach relies on Bellman’s principle of optimality, which states that a large multi-step control policy must also be locally optimal in every subsequence of steps [33].

The temporal difference learning method known as TD(0) simply approximates the expected discounted return with an estimate based on the reward immediately received plus the value of the next state. Given a control trajectory generated through an optimal policy \(\pi _\star \), by Bellman’s optimality condition the V-value of state \({\varvec{s}}_k\) is given by:

$$\begin{aligned} V({\varvec{s}}_k) = {\mathbb {E}}_{\pi _\star }\big [\,r_k + \gamma V({\varvec{s}}_{k+1})\,\big ] \end{aligned}$$
(29)

where \(r_k + \gamma V({\varvec{s}}_{k+1})\) acts as an unbiased estimator of \(V({\varvec{s}}_k)\). For non-optimal policies \(\pi \), this same idea is used to update the value function based on the value function one step ahead in the future, thus approximating the expected return as:

$$\begin{aligned} R_k \approx r({\varvec{s}}_k, {\varvec{a}}_k, {\varvec{s}}_{k+1}) + \gamma \, V_{\pi }({\varvec{s}}_{k+1}) \end{aligned}$$
(30)

or, using the Q-value, as:

$$\begin{aligned} R_k \approx r({\varvec{s}}_k, {\varvec{a}}_k, {\varvec{s}}_{k+1}) + \gamma \, Q_{\pi }({\varvec{s}}_{k+1}, {\varvec{a}}_{k+1}) \end{aligned}$$
(31)

These approximations lead to the following learning rules, or update equations:

$$\begin{aligned} \begin{aligned} V_{\pi }({\varvec{s}}_{k})&\leftarrow V_{\pi }({\varvec{s}}_{k}) + \eta \, \delta _\textrm{RPE}({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) \\ Q_{\pi }({\varvec{s}}_{k}, {\varvec{a}}_{k})&\leftarrow Q_{\pi }({\varvec{s}}_{k}, {\varvec{a}}_{k}) + \eta \,\delta _\textrm{TDE}({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) \end{aligned} \end{aligned}$$
(32)

where \(\eta \) is a learning rate between 0 and 1, and the quantities:

$$\begin{aligned} \begin{aligned} \delta _\textrm{RPE}&= r({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) + \gamma \, V_{\pi }({\varvec{s}}_{k+1}) - V_{\pi }({\varvec{s}}_{k}) \\ \delta _\textrm{TDE}&= r({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) + \gamma \, Q_{\pi }({\varvec{s}}_{k+1}, {\varvec{a}}_{k+1}) \\&\quad - Q_{\pi }({\varvec{s}}_{k}, {\varvec{a}}_{k}) \end{aligned} \end{aligned}$$
(33)

are called the reward-prediction error (RPE) and the temporal difference error (TDE), respectively. If the error is positive, the transition was positively surprising: the agent obtains more reward or lands in a better state than expected, so the initial state or action was underrated and its estimated value must be increased. Similarly, if the error is negative, the transition was negatively surprising: the initial state or action was overrated, and its value must be decreased. All methods based on the update equations (32) are called value-iteration learning algorithms.
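As an illustration of the update rules (32)-(33), the following sketch applies one TD(0) step to tabular V and Q estimates; the indices, learning rate and discount factor are illustrative, and the paper's agent uses neural approximators instead of tables.

```python
import numpy as np

def td_updates(V, Q, s, a, r, s_next, a_next, gamma=0.99, eta=0.1):
    """One TD(0) update of a tabular V and Q, following Eqs. (32)-(33).

    V: 1-D array indexed by state; Q: 2-D array indexed by (state, action).
    (s, a, r, s_next, a_next) is a single observed transition.
    """
    delta_rpe = r + gamma * V[s_next] - V[s]                 # reward-prediction error, Eq. (33)
    delta_tde = r + gamma * Q[s_next, a_next] - Q[s, a]      # temporal difference error, Eq. (33)
    V[s] += eta * delta_rpe                                  # value update, Eq. (32)
    Q[s, a] += eta * delta_tde                               # action-value update, Eq. (32)
    return delta_rpe, delta_tde

# Tiny usage example with made-up numbers
V = np.zeros(3)
Q = np.zeros((3, 2))
print(td_updates(V, Q, s=0, a=1, r=1.0, s_next=2, a_next=0))
```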

Fig. 8 The Q-learning algorithm for estimating the optimal policy \(\pi _\star \)

TD-based learning offers the advantage that the V-value or Q-value updates can be applied immediately after each state transition; there is no need to wait until an entire episode is completed. This process allows very fast learning and is called online learning. The policy updates may be applied at every single transition (TD(0), or 1-step look ahead), or the learning may proceed from batches of consecutive state transitions (TD(n), or n-step look ahead).

Q-learning is a technique derived from TD learning and is particularly suitable for model-free RL. In Q-learning, the Q function is learned directly by observing the evolving environment, in an approach that post-processes the generated control trajectories. It can be seen as a generalization of the many available model-based learning strategies, being applicable to all those control scenarios that are difficult or impossible to model. As seen from (27), the Q function incorporates in its definition the very concept of a one-step look ahead and does not need a model of the environment. From the learned Q function, the optimal policy and the value function may then be extracted as in (28).

In Q-learning, the Q function update equation is [8, 33]:

$$\begin{aligned} Q({\varvec{s}}_{k}, {\varvec{a}}_{k}) \leftarrow Q({\varvec{s}}_{k}, {\varvec{a}}_{k}) + \eta \, {\hat{\delta }}_\textrm{TDE}({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) \end{aligned}$$
(34)

where

$$\begin{aligned} {\hat{\delta }}_\textrm{TDE}({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) = r({\varvec{s}}_{k}, {\varvec{a}}_{k}, {\varvec{s}}_{k+1}) + \gamma \max _{{\varvec{a}}}Q({\varvec{s}}_{k+1}, {\varvec{a}}) - Q({\varvec{s}}_{k}, {\varvec{a}}_{k}) \end{aligned}$$
(35)

is called the off-policy TDE. Because the maximizing action \({\varvec{a}}\) is used in (35) to determine the correction \({\hat{\delta }}_\textrm{TDE}\) from the current estimate of Q, while a different action \({\varvec{a}}_{k+1}\), chosen by a separate behaviour policy, is taken for the next state transition, Q-learning is called an off-policy technique. Thus, Q-learning takes sub-optimal actions to explore future states but uses the optimal action \({\varvec{a}}\) to improve the Q function at each state transition. The Q-learning algorithm for estimating the optimal policy \(\pi _\star \) is summarized in Fig. 8. In practice, rather than finding the true Q-value of a state–action pair in one go at each time step \(t_k\), the agent improves its approximation of the function Q over time through the Bellman equation.
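The loop summarized in Fig. 8 can be sketched as follows for a purely hypothetical, discrete toy environment (a short corridor with a rewarded terminal state); the epsilon-greedy behaviour policy, the environment and all numerical values are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical 1-D corridor: states 0..3, actions 0 (left) / 1 (right),
# reward +1 on reaching state 3 (terminal). Purely illustrative of Fig. 8,
# not the flight control environment used in this work.
N_S, N_A, GOAL = 4, 2, 3

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL

rng = np.random.default_rng(1)
Q = np.zeros((N_S, N_A))
gamma, eta, eps = 0.95, 0.5, 0.3

for episode in range(300):
    s = int(rng.integers(GOAL))                   # random non-terminal start state
    for _ in range(500):                          # step cap keeps episodes bounded
        # epsilon-greedy behaviour policy (exploration vs. exploitation)
        a = int(rng.integers(N_A)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # off-policy TD error, Eq. (35), and Q update, Eq. (34)
        delta = r + gamma * Q[s_next].max() - Q[s, a]
        Q[s, a] += eta * delta
        s = s_next
        if done:
            break

print(Q.argmax(axis=1))   # learned greedy policy: action 1 (right) in states 0-2
```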

Q-learning was formalized for solving problems with discrete action and state spaces governed by a finite Markov decision process. When the action and state spaces are continuous, as in flight control problems, time is discretized at an appropriate frequency and the Q function is approximated, so that Q-learning remains applicable. Its off-policy character has made Q-learning particularly well suited to many real-world applications, enabling the RL agent to improve even when its behaviour policy is sub-optimal. In deep Q-learning, the Q function is conveniently represented by a neural network.

In all RL approaches, the concept of learning rate is very important: it determines to what extent, during each training episode, newly acquired information overrides old information. A sufficient level of exploration must also be ensured so that the estimates converge to the optimal values; this is known as the exploration-exploitation problem. If, at successive time steps within a generic episode, the agent always selects the same action policy from the beginning (exploitation), it will never discover better (or worse) alternatives. On the other hand, if the policy is updated so that it picks random actions (exploration), this random sub-optimal policy has a chance to bring new information to the learning process. The available Q-learning techniques ensure a trade-off between exploitation and exploration: usually, a lot of exploration happens at the beginning of the training to accumulate knowledge about the environment and the control task, and less towards the end, so as to exploit the acquired knowledge and perform optimally. Generally, Q-learning converges to a near-optimal solution faster than alternative techniques.

Fig. 9 Actor-critic architecture scheme

3.7 The actor-critic architecture

In RL, an actor-critic method consists in simultaneously learning a policy function and a value function, conveniently mixing value-iteration and policy-iteration learning. As shown in Fig. 9, this agent architecture comprises an actor network, which is policy-based, and a critic network, which is value-based. The temporal difference signal from the critic is used to update the actor's policy parameters. In this scheme, the actor tries to take what it considers to be the best action given the current state (just as in a simpler policy function method, Fig. 7), while the critic estimates the Q-value associated with the state and with the action that the actor just took (as in Q-learning methods).

The actor-critic scheme works for continuous action spaces because the critic, which plays the role of an approximate optimal Q function evaluator for the current condition, only needs to consider a single action, the one that the actor just took. When the actor selects an action, it is applied to the environment, the critic estimates the value of that state–action pair, and the reward returned by the environment is then used as a metric to determine how good the Q-value prediction was. The error is the difference between the new estimated value of the previous state and the old value of the previous state held by the critic network. The critic uses this error to update itself so that it produces a better prediction the next time it visits that state. The actor network also updates its parameters using the critic's response and the error term, so that it can improve its future behaviour. This research uses the DDPG training algorithm, which is based on the actor-critic architecture [12]. The algorithm can learn from environments with continuous state and action spaces and, since it estimates a deterministic policy, it learns much faster than techniques based on stochastic policies.

To compute the prediction errors, many successive samples (single transitions) are usually gathered and concatenated into mini-batches, so that the critic's neural network can learn to minimize the prediction error over these chunks of data. However, the successive transitions concatenated inside a mini-batch are not independent of each other but correlated: \(({\varvec{s}}_k, {\varvec{a}}_k, r_{k+1}, {\varvec{s}}_{k+1})\) is followed by \(({\varvec{s}}_{k+1}, {\varvec{a}}_{k+1}, r_{k+2}, {\varvec{s}}_{k+2})\), and so on, which is not a set of random samples. This is a major problem that promotes the tendency of the involved neural networks to overfit and fall into local minima.

Another major problem occurs when the actor and critic are implemented as neural networks, because their loss functions have non-stationary targets. As opposed to classification or regression problems, where the desired values are fixed throughout the network update iterations, in Q-learning the target \( r({\varvec{s}}_{k},{\varvec{a}}_{k},{\varvec{s}}_{k+1}) + \gamma \max _{{\varvec{a}}} Q_{\varvec{\vartheta }}({\varvec{s}}_{k+1}, {\varvec{a}}) \) changes because the function approximator \(Q_{\varvec{\vartheta }}\) depends on the weights \(\varvec{\vartheta }\). This circumstance can make the actor-critic pair particularly inefficient, especially if the networks are implemented as feedforward networks and the control task features a moving reference.

3.8 Deep Q-networks

The problem caused in DRL by correlated samples within mini-batches has been solved by a learning technique called the deep Q-network (DQN) algorithm [34]. The approach relies on data structures called experience replay memories (ERM), or replay buffers: large buffers where hundreds of thousands of successive transitions are stored. The agent is trained by randomly sampling mini-batches from the ERM, which is cyclically emptied and refilled with new samples.

To address the inherent unsteadiness of the loss function targets, a DQN algorithm additionally uses a cyclically frozen copy of the agent called the target network, which computes the transitions that feed the ERM. The target network is updated only every few thousand iterations, so that the loss function targets remain stationary in between.
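A minimal sketch of these two ingredients, a replay buffer sampled at random and a periodically synchronized target network, is given below; the capacity, batch size and update period are illustrative values, not those used in this work.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Minimal experience replay memory (ERM): stores transitions and returns
    de-correlated random mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)       # oldest samples are dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)

def maybe_sync_target(step, trained_weights, target_weights, period=5_000):
    """Hard update of the frozen target network every `period` gradient steps,
    so that the Q-learning targets stay stationary in between."""
    if step % period == 0:
        target_weights[:] = trained_weights
    return target_weights

# Usage with dummy data
erm = ReplayMemory(capacity=1_000)
for k in range(100):
    erm.push(s=k, a=0, r=0.0, s_next=k + 1, done=False)
mini_batch = erm.sample(batch_size=8)              # 8 randomly drawn, de-correlated transitions

w_trained, w_target = np.ones(4), np.zeros(4)
maybe_sync_target(step=0, trained_weights=w_trained, target_weights=w_target)
```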

The learning speed of DQNs is known to be lower than that of other available approaches. This is due to their sample complexity, i.e. to the fact that the agent must inevitably experience a huge number of transitions, in the order of millions, to obtain a satisfying policy.

The function of the DQN within the agent’s behaviour is depicted in the left part of Fig. 9 showing how the DQN trains the critic network to estimate the future rewards and to update the actor’s policy (see also Fig. 8). In the case of flight control, the updated policy determines the action on aerosurfaces and engine throttle that maximizes the expected reward. The reward shaping approach for the control problem introduced by this work is presented in Sect. 3.11.

3.9 Policy gradient methods

In policy-based function approximation, which is the part of the actor-critic scheme that updates the actor, when the policy \(\pi \) is parameterized by \(\varvec{\vartheta }\) it is possible to use gradient optimization on the parameters to improve the policy much faster than with other iterative methods.

The objective is to learn a policy that maximizes the expected return. The goal of the neural network is to maximize an objective function given by the return (22) computed over a set of trajectories E selected by the policy:

$$\begin{aligned} {\mathcal {J}}(\varvec{\vartheta }) = {\mathbb {E}}_E\big [ R_k \big ] = {\mathbb {E}}_E\left[ \;\sum _{i=0}^{{\mathcal {T}}} \gamma ^i r_{k+i} \;\right] \end{aligned}$$
(36)

The algorithm known as the policy gradient method applies gradient ascent to the weights in order to maximize \({\mathcal {J}}\) in the space of the \(\varvec{\vartheta }\)'s. All that is actually required by this technique is the gradient of the objective function with respect to the weights, \( \nabla _{\!\varvec{\vartheta }} {\mathcal {J}} = \partial {\mathcal {J}}/\partial \varvec{\vartheta }\). Once a suitable estimate of this policy gradient is obtained, the gradient ascent formula is straightforward:

$$\begin{aligned} \varvec{\vartheta } \leftarrow \varvec{\vartheta } + \eta \, \nabla _{\!\varvec{\vartheta }} {\mathcal {J}}\big (\varvec{\vartheta }\big ) \end{aligned}$$
(37)

The reader is referred to the article by Peters and Schaal [35] for an overview of policy gradient methods, with details on how to estimate the policy gradient and how to improve the sample complexity.
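For intuition only, the following sketch applies the gradient-ascent rule (37) to a toy, stateless problem, estimating the gradient with a basic score-function (REINFORCE-style) estimator for a Gaussian policy; this is not the estimator used in this work, which relies on the deterministic policy gradient of Sect. 3.10, and all numerical values are made up.

```python
import numpy as np

# Gaussian policy with learnable mean `theta` and fixed sigma, rewarded for
# producing actions close to 3. The expected return is maximized at theta = 3.
rng = np.random.default_rng(0)
theta, sigma, eta = 0.0, 1.0, 0.05

for it in range(300):
    a = rng.normal(theta, sigma, size=64)          # sample actions from pi_theta
    r = -(a - 3.0) ** 2                            # return of each sampled action
    score = (a - theta) / sigma**2                 # grad_theta log pi_theta(a)
    grad_J = np.mean((r - r.mean()) * score)       # baseline-subtracted gradient estimate
    theta += eta * grad_J                          # gradient ascent, Eq. (37)

print(theta)    # approaches 3, the reward-maximizing action
```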

3.10 Deep deterministic policy gradient

In the application presented here, the DRL agent learns a deterministic policy. Deterministic policy gradient (DPG) algorithms form a family of methods developed to approximate the gradient of the objective function when the policy is deterministic [36]. The DPG approach has been improved to work with nonlinear function approximators [11], resulting in the method known as deep deterministic policy gradient. DDPG combines concepts from DQN and DPG, obtaining an algorithm able to effectively solve continuous control problems with an off-policy method. As in DQN, an ERM (to store past transitions and learn off-policy) and target networks (to stabilize the learning) are used. In DQN, however, the target networks are updated only every couple of thousand steps: they change significantly between two updates, but not very often. During the development of DDPG, it turned out to be more efficient to make the target networks slowly track the trained networks. For this reason, the target networks and the trained networks are updated together, using the following update rule:

$$\begin{aligned} \varvec{\vartheta }_{k+1} \leftarrow \tau \, \varvec{\vartheta }_k + (1-\tau ) \, \varvec{\vartheta }_{k+1} \qquad \text {with }\tau \le 1 \end{aligned}$$
(38)

where \(\tau \) is a smoothing factor (usually much less than 1) that defines how much delay exists between target networks and trained networks. The above update strategy improves the stability in the Q function learning process.

The key idea taken from DPG is the policy gradient for the actor, while the critic is learned through regular Q-learning. However, because the policy is deterministic, it can quickly settle on always producing the same actions, which raises an exploration problem. Some environments are naturally noisy, which improves exploration, but this cannot be assumed in the general case. To deal with this, DDPG perturbs the deterministic action with an additive noise \(\varvec{\xi }\) generated by a stochastic noise model, such that \({\varvec{a}}_k \leftarrow {\varvec{a}}_k + \varvec{\xi }_k\), in order to force the exploration of the environment.
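The two DDPG ingredients just described, the soft target update (38) and the exploration noise, can be sketched as follows; the smoothing factor, the Gaussian noise model and the numerical values are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

TAU = 0.005          # smoothing factor tau, much less than 1 (illustrative value)
SIGMA_NOISE = 0.1    # exploration noise scale (illustrative value)
rng = np.random.default_rng(0)

def soft_update(target, trained, tau=TAU):
    """Polyak averaging of Eq. (38): the target slowly tracks the trained network."""
    return tau * trained + (1.0 - tau) * target

def explore(action, sigma=SIGMA_NOISE):
    """Perturb the deterministic action, a_k <- a_k + xi_k, using a simple
    Gaussian noise model as a stand-in for the agent's stochastic noise process."""
    return action + rng.normal(0.0, sigma, size=np.shape(action))

# Usage on dummy weight vectors and a dummy 4-channel action (stick, pedals, throttle)
w_target = np.zeros(5)
w_trained = np.ones(5)
w_target = soft_update(w_target, w_trained)        # moves 0.5% of the way per call
a = explore(np.array([0.1, 0.0, 0.0, 0.7]))        # noisy action sent to the plant
```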

Fig. 10 Penalty functions tested in this research. (Left) Logarithmic barrier function. See Eq. (39), with \(x_{\min }=-2\), \(x_{\max }=2\), \(r_{\min } = -5\), \(C= 1\). (Right) Hyperbolic function. See Eq. (40), with \(x_{\min }=-2\), \(x_{\max }=2\), \(\lambda = 1\), \(\tau = 0.5\)

3.11 Reward function shaping

Finally, we define a basic reward calculation function r that determines the agent's reward \(r_k\) at time \(t_k\). Generally speaking, the reward is based on the current values of the aircraft state and observable variables and is defined for specific control tasks. The formulation proposed here, which might appear limited to one particular control scenario, has in practice proven to be flexible and widely applicable.

The reward signal is the way of communicating to the agent what has to be achieved, not how it has to be achieved. Therefore, the reward function is constructed so as to properly guide the agent to the desired state, taking care not to impart a priori knowledge about how the agent is supposed to accomplish its goal. Moreover, the reward should not be shaped so that the agent can achieve subgoals and prefer them to the ultimate control aim: a properly rewarded agent does not find ways to reach a subgoal without achieving the real end target. The reward calculation is therefore accomplished by evaluating some properly designed, simple functions of one or more scalar variables. These act as penalty functions, because their output is much higher (closer to zero) when their arguments lie within the desired ranges.

A logarithmic barrier penalty reward function was first used in this work to test the default settings of the MATLAB Reinforcement Learning Toolbox. This function is defined as follows:

$$\begin{aligned} r(x) = \left\{ \begin{array}{ll} \max \Big \{ r_{\min }, \, C\Big [\log \big ((x-x_{\min })(x_{\max }-x)\big ) - \log \big (\tfrac{1}{4}(x_{\max }-x_{\min })^2\big )\Big ] \Big \} &{} \text {if} \; x \in [x_{\min },x_{\max }] \\ r_{\min } &{} \text {if} \; x \not \in [x_{\min },x_{\max }] \end{array}\right. \end{aligned}$$
(39)

where \(r_{\min }<0\) is the minimum allowed reward, C is a nonnegative curvature parameter, and \([x_{\min },x_{\max }]\) is the range where the reward cannot be as low as \(r_{\min }\).

After some investigation, a second basic reward calculation method, named the hyperbolic penalty function, turned out to give better flight control results than the logarithmic barrier penalty. The hyperbolic penalty evaluates to nearly constant values inside a given range of the independent variable and exhibits a nearly linear behaviour outside that range. It is defined as follows:

$$\begin{aligned} \begin{aligned} r(x) =&\lambda (x - x_{\min }) - \sqrt{\lambda ^2(x-x_{\min })^2+\tau ^2} \\&+\lambda (x_{\max }-x) - \sqrt{\lambda ^2(x_{\max }-x)^2+\tau ^2} \end{aligned} \end{aligned}$$
(40)

where \(\tau \) and \(\lambda \) are nonnegative shape parameters. In particular, \(\pm \lambda \) are the slopes of the linear segments outside of the interval \([x_{\min }, x_{\max }]\). Two examples of logarithmic and hyperbolic penalty functions are shown in Fig. 10.

The above-defined functions are used to construct a reward \(r_{k}\) in terms of all, or a subset of, the variables in the triplet \(({\varvec{s}}_{k},{\varvec{a}}_{k},{\varvec{s}}_{k+1})\). Care must be taken when linking a higher reward to good behaviour, because with a poorly defined \(r_{k}\) the agent may prefer to maximize its reward at the cost of not reaching the desired state. For instance, in a waypoint following task assigned to an aircraft controller, the agent might approach the target point in space and fly around the given waypoint in order to accumulate as much reward as possible, instead of passing through the target and then finishing the episode (which would of course result in a lower total reward). For this reason, the maximum of the chosen function r(x) is never positive.

Therefore, assuming an array \(\varvec{\chi }=(\chi _1, \chi _2,\ldots ,\chi _n)\) of n scalars, where each \(\chi _i\) is a function of the variables in \(({\varvec{s}}_{k},{\varvec{a}}_{k},{\varvec{s}}_{k+1})\), the total reward of a transition from time step k to \(k+1\) is:

$$\begin{aligned} r_{k} = \sum _{i=1}^{n} r\big ( \chi _i \big ) \end{aligned}$$
(41)

where r can be the logarithmic barrier (39) or the hyperbolic penalty function (40). The number n depends on the type of control task under study.
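The reward shaping of Eqs. (39)-(41) can be sketched as follows; the parameter values reproduce those quoted in the caption of Fig. 10, the sample values of \(\chi _i\) are made up, and the logarithmic barrier is coded so that it peaks at zero at the centre of the admissible range and is clipped at \(r_{\min }\) towards its edges, consistent with Eq. (39) and with the requirement that the reward never be positive.

```python
import numpy as np

def log_barrier_penalty(x, x_min=-2.0, x_max=2.0, r_min=-5.0, C=1.0):
    """Logarithmic barrier penalty, Eq. (39), with the Fig. 10 example parameters."""
    if x <= x_min or x >= x_max:
        return r_min
    inner = C * (np.log((x - x_min) * (x_max - x))
                 - np.log(0.25 * (x_max - x_min) ** 2))
    return max(r_min, inner)

def hyperbolic_penalty(x, x_min=-2.0, x_max=2.0, lam=1.0, tau=0.5):
    """Hyperbolic penalty, Eq. (40): nearly flat inside [x_min, x_max],
    nearly linear outside (Fig. 10 example parameters)."""
    return (lam * (x - x_min) - np.sqrt(lam**2 * (x - x_min) ** 2 + tau**2)
            + lam * (x_max - x) - np.sqrt(lam**2 * (x_max - x) ** 2 + tau**2))

def total_reward(chi, penalty=hyperbolic_penalty):
    """Total transition reward r_k, Eq. (41): sum of penalties over chi_1..chi_n."""
    return sum(penalty(c) for c in chi)

# Small errors give a reward close to zero, a large error a strongly negative one
# (the chi values are made up for illustration).
print(total_reward([0.1, -0.3, 0.0]))
print(total_reward([4.0, -0.3, 0.0]))
```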

Section 4 introduces the flight dynamics model used in all simulations and then presents the main learning experiment, with its set of hyperparameters, that trains an agent to follow a given combination of heading and altitude. Subsequently, some additional test cases are reported that validate this DRL-based control strategy.

4 Control strategy validation

The flight dynamics model used for this research is provided by the JSBSim software library. JSBSim is a multi-platform, general-purpose, object-oriented FDM written in C++. The FDM is essentially the physics/maths model that defines the 6-DoF movement of a flying vehicle, subject to the interaction with its natural environment, under the forces and moments applied to it through the various control mechanisms. The mathematical engine of JSBSim solves the nonlinear system (1)–(11)–(17)–(18) of differential equations starting from a given set of initial conditions, with inputs either prescribed as time laws or determined by a pilot-in-the-loop operating mode. The FDM implements fully customizable, data-driven flight control systems, aerodynamic models, propulsive models, and landing gear arrangements through a set of configuration text files in XML format.

The software can be run as an engineering flight simulator in a standalone batch mode (no graphical displays), or it can be integrated with other simulation environments. JSBSim includes a MATLAB S-function that makes it usable as a simulation block in Simulink. This feature, together with the MATLAB/Simulink Reinforcement Learning Toolbox, has made possible all the simulations and learning strategies performed in this research.

A validation of JSBSim as a flight dynamics software library has been reported by several authors [30, 37].

4.1 Agent training

Fig. 11 Simulation scheme in Simulink

Several learning experiments were carried out with various control goals, in order to assess the appropriate tuning of the hyperparameters related to the DDPG algorithm and to the reward calculation function. A representative example of agent training is presented here, whose overall Simulink scheme is reported in Fig. 11, while a selection of successful control examples in different scenarios is presented in the next section.

The agent training process was tuned for a flight control task where a reference F-16 FDM was: (i) set in flight at an altitude of 30,000 ft and at an assigned speed, with a randomly generated initial heading, and (ii) required to reach a target commanded heading \(\psi _\textrm{c}={0}{\hbox { deg}}\) and a target commanded altitude \(h_\textrm{c}={27{,}000}{\hbox { ft}}\) within a time \(t_{\mathcal {T}}={30}{\hbox { s}}\), (iii) with a final wings-level and horizontal fuselage attitude, that is, following zero commanded roll and elevation angles \(\phi _\textrm{c}=\theta _\textrm{c}={0}{\hbox { deg}}\). The target flight condition is a translational flight, i.e. a motion with zero commanded angular speeds \(p_\textrm{c} = q_\textrm{c} = r_\textrm{c} = {0}{\hbox { deg}/\hbox {s}}\).

Thousands of simulations were performed to train the agent within the MATLAB/Simulink environment and to reach a fine-tuned control for the assigned task. In the initial trials, the agent was trained with the two different types of reward functions presented in Sect. 3.11, and finally, it was determined that the hyperbolic penalty function (40) was the one that gave the best results.

By defining the error variables:

$$\begin{aligned} \epsilon _h = h_\textrm{c}-h, \quad \epsilon _\phi = \phi _\textrm{c}-\phi ,\quad \epsilon _\theta = \theta _\textrm{c}-\theta ,\quad \epsilon _\psi = \psi _\textrm{c}-\psi ,\quad \epsilon _p=p_\textrm{c}-p, \quad \epsilon _q=q_\textrm{c}-q \end{aligned}$$
(42)

the observation vector \(\varvec{\chi }\) in this scenario is defined as follows:

$$\begin{aligned} \varvec{\chi } = \big [\, \epsilon _h,\, \epsilon _\phi ,\, \epsilon _\theta ,\, \epsilon _\psi ,\, \epsilon _p,\, \epsilon _q,\, \alpha ,\, \beta ,\, {\tilde{\delta }}_T,\, {\tilde{\delta }}_\textrm{a},\, {\tilde{\delta }}_\textrm{e},\, {\tilde{\delta }}_\textrm{r},\, {\tilde{\delta }}_\textrm{f} \,\big ] \end{aligned}$$
(43)

All simulations reaching a terminal state at the final time \(t_{\mathcal {T}}\) with an altitude error \(|\epsilon _h|> {2000}{\hbox { ft}}\) were marked with a final cumulative reward \(R_{\mathcal {T}} = -1000\) (control target unattained).

Table 1 summarizes the initial conditions of all the simulations required to train the agent to follow a given heading and altitude. The main hyperparameters of the fine-tuned training process with the DDPG algorithm are reported in Table 2, while the reward function hyperparameters are listed in Table 3. The major training setup options are reported in Table 4. The computational cost of running the simulations on a personal computer equipped with an Intel i7-9750H CPU, 32 GB of DDR4 RAM and an Nvidia RTX 2060 GPU is summarized in Table 5. Finally, the cumulative reward history, \(R_e\) as a function of the number of episodes, is plotted in Fig. 12.

Table 1 Initial conditions for the heading and altitude control training
Table 2 Agent hyperparameters for the heading and altitude control training
Table 3 Hyperbolic penalty parameters for the heading and altitude control scenario
Table 4 Training options for the heading and altitude control scenario
Fig. 12 Episode reward history during the training for the heading and altitude control scenario

4.2 Simulation scenarios

This section presents the results of various test case simulations where different control tasks are successfully accomplished by the same agent presented in Sect. 4.1.

With reference to the scheme of Fig. 5, the pilot's commanded inputs are essentially replaced by the agent's action on the primary controls, i.e. the stick and pedals \(({\tilde{\delta }}_\textrm{a},{\tilde{\delta }}_\textrm{e},{\tilde{\delta }}_\textrm{r})\) and the throttle lever \({\tilde{\delta }}_T\), forming the four-dimensional input vector \(\varvec{{\tilde{u}}}_\textrm{agt}\) defined in (20). These signals are filtered by the FCS, whose output is then converted into aerosurface deflections and an actual throttle setting and passed to the aircraft dynamics simulation block (directly interfaced to JSBSim), together with the other control effector signals produced by the FCS logics. Together these form the full input vector \({\varvec{u}}\) defined in (21).

Table 5 Statistics for the heading and altitude control agent training scenario (hardware: Intel i7-9750H CPU, 32 GB of DDR4 RAM, Nvidia RTX 2060 GPU)
Fig. 13 F-16 agent-controlled heading and altitude following simulation scenario. Normalized flight command histories, as provided by the agent and filtered by the FCS

4.2.1 Heading and altitude following

Fig. 14 F-16 agent-controlled heading and altitude following simulation scenario. Actual primary aerosurface deflections corresponding to the command inputs coming from the FCS. See also Fig. 5

Fig. 15 F-16 agent-controlled heading and altitude following simulation scenario. Aerosurface deflections \(\delta _{\textrm{f,TE}}\) and \(\delta _{\textrm{f,LE}}\) generated by the FCS logics. See also Fig. 5

Fig. 16 F-16 agent-controlled heading and altitude following simulation scenario. Time histories of altitude, attitude angles and aerodynamic angles

The heading and altitude following scenario is the one that was actually set up to train the agent and introduced in Sect. 4.1. The details of a representative simulation with control inputs provided by the trained agent are presented here.

The task of achieving a zero heading angle and an assigned new flight altitude is accomplished within the prescribed 30 s. The agent's inputs as well as the FCS outputs are plotted as normalized flight command time histories in Fig. 13, where throttle setting values above 1 mean that the jet engine afterburner is being used. The time histories of the primary aerosurface deflections, the actual inputs to the aircraft dynamics model, are shown in Fig. 14. Time histories of aircraft state variables, such as attitude angles and aerodynamic angles, velocity components, normal load factor, Mach number and angular velocity components, are reported in Figs. 16 and 17.

4.2.2 Waypoint following

This test case generalizes the previous scenario by introducing a number of sequentially generated random waypoints during the simulation, which the aircraft is required to reach under the agent's control. As shown in Fig. 18, the same agent that was trained to accomplish the simpler task presented in the previous example is now deployed in a new simulation scenario where the reference values \(\psi _\textrm{c}\) and \(h_\textrm{c}\) change over time. The additional logic with respect to the previous case receives a multi-dimensional reference signal, calculates the error terms, and injects them into a reward calculation block. The reward thus calculated directs the agent to follow a given flight path marked by multiple waypoints.

The waypoints form a discrete sequence \(\big \{(l_\textrm{c},\mu _\textrm{c},h_\textrm{c})_i\mid i=1,2,\ldots ,n\big \}\) of n locations in space, of geographic coordinates \((l,\mu )\) and altitude h, generated at subsequent time instants \(t_1, t_2, \ldots , t_n\). In the example presented here, a sequence of \(n=10\) waypoints has been considered.

Starting from a random initial flight condition, with random heading, speed and altitude, the goal is to reach the next random waypoint, and then, one by one, all the other waypoints as they reveal themselves to the agent along the way. For \(0=t_0\le t <t_1\) the aircraft points towards waypoint \((l_\textrm{c},\mu _\textrm{c},h_\textrm{c})_1\); as soon as, at \(t=t_1\), the vehicle is labelled as sufficiently close to this first target, the waypoint \((l_\textrm{c},\mu _\textrm{c},h_\textrm{c})_2\) is generated and pursued for \(t_1\le t <t_2\); the scheme repeats itself until the last waypoint is reached after time \(t_n\). The geographic coordinates and the general aeroplane position tracking are handled through the JSBSim internal Earth model.

At the generic instant t of the simulation, with the aircraft centre of gravity at geographic longitude l(t) and latitude \(\mu (t)\), once the vehicle is commanded to fly towards the next, ith, waypoint, the heading \(\psi _{\textrm{c},i}\) that the agent is required to follow after time \(t_{i-1}\) is calculated as:

$$\begin{aligned} \psi _{\textrm{c},i} (t) = \textrm{atan2} \Big \{ \sin \big [l_{\textrm{c},i} - l(t)\big ]\cos \mu _{\textrm{c},i},\; \cos \mu (t)\sin \mu _{\textrm{c},i} -\sin \mu (t)\cos \mu _{\textrm{c},i}\cos \big [l_{\textrm{c},i} - l(t)\big ] \Big \} \end{aligned}$$
(44)

for \(t_{i-1}\le t < t_i\). Therefore, the altitude and heading errors considered in the reward calculation function become:

$$\begin{aligned} \epsilon _h(t) = h_{\textrm{c},i}-h(t), \quad \epsilon _\psi (t) = \psi _{\textrm{c},i}-\psi (t) \quad \end{aligned}$$
(45)

for \(t_{i-1}\le t < t_i\) and \(i=1,\ldots ,n\). The above formulas represent the time-varying references provided to the control agent.

The agent looks at one waypoint at a time in this particular test case. When the aircraft comes within an assigned threshold distance of the currently pursued waypoint, the next one is generated and passed to the agent. The instantaneous distance from the aircraft centre of gravity to the current waypoint is calculated with the Haversine formula [38].
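A sketch of the guidance geometry is given below: the commanded heading of Eq. (44) and the Haversine distance used for the waypoint-capture test. The sample coordinates are made up, and the mean Earth radius is an illustrative constant (JSBSim uses its own internal Earth model).

```python
import math

def commanded_heading(lon, lat, lon_wp, lat_wp):
    """Great-circle initial bearing from the aircraft position to the current
    waypoint, as in Eq. (44). Angles in radians, bearing measured from North."""
    d_lon = lon_wp - lon
    x = math.sin(d_lon) * math.cos(lat_wp)
    y = (math.cos(lat) * math.sin(lat_wp)
         - math.sin(lat) * math.cos(lat_wp) * math.cos(d_lon))
    return math.atan2(x, y)

def haversine_distance(lon1, lat1, lon2, lat2, radius_m=6_371_000.0):
    """Haversine great-circle distance [38] used to decide when a waypoint is
    reached. The mean Earth radius is an illustrative value."""
    d_lat, d_lon = lat2 - lat1, lon2 - lon1
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(d_lon / 2) ** 2)
    return 2.0 * radius_m * math.asin(math.sqrt(a))

# Example: aircraft at (14.30 E, 40.85 N) with a waypoint to the north-east
lon, lat = map(math.radians, (14.30, 40.85))
lon_wp, lat_wp = map(math.radians, (14.60, 41.10))
print(math.degrees(commanded_heading(lon, lat, lon_wp, lat_wp)))   # roughly 42 deg
print(haversine_distance(lon, lat, lon_wp, lat_wp) / 1000.0)       # roughly 37 km
```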

The behaviour of the agent in the multiple waypoint following task can be appreciated from the simulation results presented below. The agent's inputs as well as the FCS outputs are plotted as normalized flight command time histories in Fig. 19. The time histories of the primary aerosurface deflections, the actual inputs to the aircraft dynamics model, are shown in Fig. 20. Time histories of aircraft state variables, such as attitude angles and aerodynamic angles, velocity components, normal load factor, Mach number and angular velocity components, are reported in Figs. 21 and 22.

Figure 23 reports the actual and commanded values of heading \(\psi \), longitude l and latitude \(\mu \). This scenario is also represented on the map of Fig. 24 reporting the ground track of the aircraft trajectory, the sequence of 10 waypoints and the commanded waypoint altitudes. Finally, the three-dimensional flight path and aircraft body attitude evolution are shown in Fig. 25.

Fig. 17 F-16 agent-controlled heading and altitude following simulation scenario. Time histories of velocity components, normal load factor, Mach number and angular velocity components

4.2.3 Varying target with sensor noise

Fig. 18 Control scheme of a waypoint following scenario. The references given by successive waypoints and altitudes are injected into the reward calculation

This test case is set up to investigate the agent's behaviour when external disturbances are injected into the environment. The simulation scenario is similar to the heading and altitude following example presented in Sect. 4.2.1, but in this case the commanded altitude \(h_\textrm{c}\) and heading \(\psi _\textrm{c}\) change over time according to assigned stepwise constant functions. In addition, the state signals that define the agent's observation vector (43), and that are also fed back to the FCS, are perturbed with prescribed noisy disturbances.

In particular, a set of normally distributed (Gaussian) random noise signals is generated and added to some state variables in order to simulate uncertainty in the data available to the controllers. The results of a simulation case with zero-mean perturbations (assuming an accurately calibrated set of sensors, as typically expected for the class of aircraft considered in this study) are reported in Fig. 26. Table 6 lists the main parameters of the additive noise signals.
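A minimal sketch of this perturbation is shown below; the noise standard deviations are placeholders, not the values of Table 6.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_sensor_noise(observation, sigma):
    """Perturb the observation vector (43) with zero-mean Gaussian noise before it
    is fed back to the agent and the FCS. `sigma` holds one standard deviation per
    channel; the values below are placeholders."""
    return observation + rng.normal(0.0, sigma, size=observation.shape)

# 13-element observation as in Eq. (43); numbers are made up for illustration.
obs = np.zeros(13)
sigma = np.full(13, 0.01)
noisy_obs = add_sensor_noise(obs, sigma)
```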

The results of this simulation case are similar to those reported in previous examples. The outcome of the agent’s control actions is represented by the time histories plotted in Fig. 26, which shows the instantaneous aircraft altitude and heading beside their corresponding commanded values. The same figure also reports the aerosurface deflection angles as provided to the aircraft FDM.

Fig. 19 F-16 agent-controlled waypoint following simulation scenario. Normalized flight command histories, as provided by the agent and filtered by the FCS

Fig. 20 F-16 agent-controlled waypoint following simulation scenario. Actual primary aerosurface deflections corresponding to the command inputs coming from the FCS

Fig. 21 F-16 agent-controlled waypoint following simulation scenario. Time histories of altitude, attitude angles and aerodynamic angles

4.2.4 Prey–chaser scenario

This test case evolves from the waypoint following scenario presented in Sect. 4.2.2 and provides a basis for possible applications of the present agent-based control approach to military fighter pilot training (for dogfighting and formation flight, for instance). In a prey–chaser scenario, two aeroplane models coexist and share the same flight environment: the first aircraft acts as the prey and is chased by the second one, the chaser.

In the particular test case presented here, both the chaser and the prey are piloted by an RL-based agent. The simulation environment contains two replicas of the same F-16 model, piloted by two identical copies of the trained agent presented in Sect. 4.1 (Agent vs. Agent). The first agent controls the prey aircraft to follow a given sequence of random waypoints, much like the previous multiple waypoint following example. The second agent controls the chaser aircraft by acquiring the successive positions of the prey at a given frequency and following them as if they were virtual waypoints.

The prey–chaser interaction within the waypoint subsequence 1 to 8 is shown in detail in the maps of Fig. 27. The complete picture for the full waypoint sequence 1 to 10 is shown in Fig. 28, where the random initial positions of the two aeroplanes are also marked.

This particular simulation demonstrates the ability of the chaser agent to tighten its trajectory when appropriate, for instance when the prey passes from waypoint 5 to waypoint 6. Another interesting behaviour is observed when the chaser aircraft overtakes the prey as waypoint 8 is reached and passed. In this case, the agent-controlled chaser performs a complete turn to position itself behind the prey and continue the target following task.

5 Discussion

All the simulation examples introduced in the previous section demonstrate the validity of the trained agent when it is directed to execute different control tasks of progressive difficulty.

5.1 Path control with fixed reference

Fig. 22 F-16 agent-controlled waypoint following simulation scenario. Time histories of velocity components, normal load factor, Mach number and angular velocity components

Fig. 23 F-16 agent-controlled waypoint following simulation scenario. Actual and commanded values of heading \(\psi \), longitude l and latitude \(\mu \). See also Eq. (44)

The simplest example, showing a case of path control with a fixed reference, is presented in Sect. 4.2.1. The assigned control task is precisely what the agent was trained for, i.e. a case of exploitation. The aircraft is able to reach a target heading angle and an assigned new flight altitude, starting from a randomly generated initial flight condition (random heading, altitude, and speed), within a flight time of 30 s. The agent achieves this result by providing the input actions, i.e. the normalized flight commands, shown in the time histories of Fig. 13. The agent inputs are filtered by the FCS, whose output signals are also plotted in the same figure. In particular, it is seen that to reach the assigned goal as quickly as possible the agent demands a high thrust level, thus requiring the use of the jet engine afterburner (see the topmost plot of Fig. 13, where the output \({\tilde{\delta }}_\textrm{T}\) from the FCS, for \({12}{\hbox { s}} \le t \le {22}{\hbox { s}}\), becomes higher than 1). Time histories of the other primary and secondary controls, as well as of the aircraft state variables, clearly show an initial left turn, combined with a dive, to reach the prescribed lower-altitude, North-pointing flight path. A typical FCS behaviour for this type of high-performance fighter jet is observed in the plot at the bottom of Fig. 13: the output \({\tilde{\delta }}_\textrm{r}\) from the FCS for \({12}{\hbox { s}} \le t \le {22}{\hbox { s}}\) exhibits severe filtering of the agent's rudder command in order to keep the sideslip angle as low as possible (see \(\beta \) within the same time interval in Fig. 16). The manoeuvre is confirmed by the time histories of primary aerosurface deflections shown in Fig. 14 and of wing leading-edge/trailing-edge flap deflections reported in Fig. 15. The left turn results from the negative (right) aileron deflection \(\delta _\textrm{a}\) (right aileron down, left aileron up) for \({1}{\hbox { s}} \le t \le {6}{\hbox { s}}\). The initial dive results from the combined action of tail and wing leading-edge flap deflections, \(\delta _\textrm{e}\) and \(\delta _{\textrm{f,LE}}\), respectively, for \({0}{\hbox { s}} \le t \le {6}{\hbox { s}}\), leading to a pitch-down manoeuvre (see the negative \(\theta \) within the same time interval in Fig. 16). Figure 15 reveals an inherent complexity of such a controlled flight example, showing time-varying deflections of the leading-edge and trailing-edge flaps mounted on the main wing. These deflections are due to the high-fidelity implementation of the FCS control logics, which can trigger the actuation of high-lift devices when some aircraft state variables fall within prescribed ranges (see Appendix C). Time histories of aircraft state variables, such as altitude, attitude angles and aerodynamic angles reported in Fig. 16, and velocity components, normal load factor, flight Mach number and angular velocity components reported in Fig. 17, confirm the left turn/dive manoeuvre to reach the prescribed terminal state. From Fig. 16, in particular, the errors \(\epsilon _h=h_\textrm{c} - h\), \(\epsilon _\psi = -\psi \), \(\epsilon _\theta = -\theta \), \(\epsilon _\phi = -\phi \) and their vanishing behaviour with time are easily deduced and confirm the effectiveness of the agent's control actions. This test case clearly shows a valid AI-based control in the presence of nonlinear effects. For instance, the significant variations of altitude, flight Mach number and angle of attack trigger all the nonlinearities accurately modelled in the aircraft FDM (see Appendices A and B).

5.2 Path control with varying reference

Fig. 24 F-16 agent-controlled waypoint following simulation scenario. Ground track of the aircraft trajectory, sequence of assigned waypoints and commanded waypoint altitudes

A multiple waypoint following task is introduced in Sect. 4.2.2. This scenario provides an interesting example of controlled flight with a moving reference. The simulation is performed using the same control agent that was trained for the simpler single-reference heading/altitude following task. Interestingly, the simulation results presented in Figs. 19 to 23 demonstrate that the multiple waypoint following exercise is also successfully accomplished. The agent reacts appropriately to each randomly generated new reference and effectively enacts its policy to control the plant dynamics. The ground track reported in Fig. 24 and the three-dimensional trajectory presented in Fig. 25 show the sequence of turns, dives and climbs flown to accomplish the assigned task.

5.3 Control validation in the presence of noise

Section 4.2.3 reports an example of how the trained agent behaves when the state observations are perturbed by additive noise signals. This test case, whose simulation results are shown in Fig. 26, proves the robustness of the agent's control actions with respect to external disturbances, under the assumption of well-calibrated sensors.

5.4 Simulation of a prey–chaser scenario

Finally, Sect. 4.2.4 presents an interesting prey–chaser simulation scenario, demonstrated by the ground tracks reported in Figs. 27 and 28. Two instances of the same trained agent control two instances of the same flying vehicle model, simulating an air engagement. A virtual simulation environment based on the FlightGear flight simulation software has been set up as a means to visualize the air combat in a proper scenery. The scheme of Fig. 29 depicts how the two aircraft instances and their states are represented in the chosen airspace. Four successive screen captures of the virtual simulation environment are shown in Fig. 30, taken while the prey aircraft, after having reached waypoint 3, is pursuing waypoint 4. The figure shows two camera views on the left, following the prey and the chaser aircraft at fixed distances, respectively. The views are synchronized with the evolving ground tracks shown on the right. In this excerpt of the simulation, the chasing fighter enters the field of view of the first camera as it flies at a higher Mach number, while the leading aircraft is still pursuing waypoint 4. Eventually, the chaser reaches its target, arriving at the prey's tail. This test case is one example among several other interesting simulation possibilities. Assuming, for example, that the chaser is piloted by the agent discussed here, the prey could be piloted by a completely different agent instructed to accomplish a prescribed manoeuvre or, alternatively, the prey could be a human-in-the-loop piloted model. These are possible applications of the flight control approach presented in this study that have the potential to enhance pilot training procedures by means of AI-augmented simulation environments.

Fig. 25 F-16 agent-controlled waypoint following simulation scenario. Three-dimensional flight path and aircraft body attitude evolution. Aircraft geometry not to scale (magnified 1250 times for clarity)

Table 6 Aircraft state variables affected by additive zero-mean noise disturbances in the test case of Sect. 4.2.3
Fig. 26 F-16 agent-controlled simulation scenario with varying commanded heading and altitude, and zero-mean sensor noise. Time histories of altitude, heading and aerosurface deflection angles

Fig. 27 Details of the waypoint sequence 1–8, and of the two ground tracks in the prey–chaser simulation scenario

Fig. 28 Map projection of the aircraft trajectory and assigned waypoints for a prey–chaser scenario simulation

Fig. 29 Virtual simulation environment based on JSBSim, MATLAB/Simulink and FlightGear

Fig. 30 Screen captures at four successive instants (times 1 and 2, times 3 and 4) of the one-to-one air combat virtual simulation

6 Conclusion

This research presents a high-performance aircraft flight control technique based on reinforcement learning and provides an example of how AI can generate a valid controller. The proposed approach is validated using a reference simulation environment where the nonlinear, high-fidelity flight dynamics model of a military fighter jet is used to train an agent on a selected set of controlled flight tasks. The simulation results demonstrate the effectiveness of the controller in making certain manoeuvres fully automatic in highly dynamic scenarios, even in the presence of sensor noise and atmospheric disturbances.

A future research direction evolving from this study is the comparison of the proposed AI-based controller with other standard types of flight control.

7 Supplementary information

This article has no accompanying supplementary file.