1 Introduction

Unmanned aerial vehicles (UAVs) are often deployed in reconnaissance, inspection and search and rescue tasks (González-deSantos et al., 2020; Sa et al., 2015; Morando et al., 2022; Wang et al., 2019; Albanese et al., 2022) due to their high maneuverability in full 3D space. However, this mobility comes at the cost of increased teleoperation complexity. The teleoperation challenges include the pilot's perception of the UAV's position relative to nearby objects (Smolyanskiy & Gonzalez-Franco, 2017; Kim et al., 2020), coping with disturbances caused by the ground effect (Kan et al., 2019) and understanding the mapping of teleoperation controls to UAV movement. Due to these challenges, it is difficult for novice pilots to safely land a UAV, particularly in perceptually challenging environments.

Although autonomous solutions have been proposed for UAV landing (Baca et al., 2019; Feng et al., 2018; Polvara et al., 2020), they often require prior knowledge or setup of the environment, or the specification of a known landing target. Fully autonomous solutions are unable to dynamically adapt their objective in response to context-sensitive cues or external events within the environment that are not observable with the robot's sensors or specified a priori by the designers. In general, the difficulties associated with replicating high-level human decision making in a time- and safety-critical context make teleoperation control schemes preferred over full autonomy (Shaqura et al., 2018; Perez-Grau et al., 2017), despite the need for expert pilots. Therefore, to reduce the training costs of potential pilots and improve the accessibility of UAV piloting for all, we aim to develop an assistive landing strategy that allows novice pilots to land with a proficiency at least equal to that of expert pilots.

Shared autonomy combines the control inputs of the pilot with those of an artificial intelligence to collaboratively complete a set of objectives. However, two prominent challenges when developing a shared autonomy system are predicting the intent of the user and deciding the control outputs based on the user's predicted intent and control actions. In our previous work (Backman et al., 2021), we proposed a shared autonomy system that learns to estimate the pilot's intent by training with simulated users, and demonstrated it to be effective in providing assistance to human pilots landing in simulated environments.

The contributions of this paper are the following:

  • Rigorous validation in a physical environment with naive users. Prior shared autonomy UAV works are either validated in simulated user studies (Backman et al., 2021; Zhang et al., 2021), limited to physical demonstrations (Sa et al., 2015; Perez-Grau et al., 2017), or evaluated in a small-sample (n=4) physical user study (Reddy et al., 2018) in which the goal is explicitly stated. The proposed work is also the first physical UAV shared autonomy system to account for multiple ambiguous goals; prior works (Reddy et al., 2018; Sa et al., 2015) are limited to a single potential goal within the physical environment.

  • Reformulation of the state transition process for critics in reinforcement learning shared autonomy systems as a fully observable Markov decision process (MDP), instead of the traditional partially observable Markov decision process (POMDP), by including privileged information about the simulated user's hidden internal state during training. The proposed reformulation significantly reduces model convergence time and improves policy performance, as MDPs are easier to solve than POMDPs.

  • Empirical comparison of the proposed approach with the related shared autonomy works of Javdani et al. (2018) and Reddy et al. (2018), assessing the performance and limitations of each approach's user and assistant models on UAV landing tasks. The experiment highlights the proposed approach's superior performance and robustness to unseen conditions without requiring prior knowledge of the environment.

1.1 Related work

Previous works related to UAV landing primarily focus on full autonomy rather than shared autonomy. Polvara et al. (2020) and Xia et al. (2021) demonstrate autonomous UAVs visually seeking and landing on marked landing zones. Polvara et al. (2020) uses a sequence of deep Q-learning networks (DQNs) trained in an end-to-end manner in simulation. The model was transferred to reality successfully, but learning the large state-space directly from images required two separate networks to be trained to counter the sample inefficiency caused by sparse rewards. Xia et al. (2021) uses a Markov decision process for collision avoidance to autonomously detect landing zones and plan a path towards them. However, the approach is only demonstrated in simulation and requires the landing target location to be known a priori. Both approaches (Polvara et al., 2020; Xia et al., 2021) are restricted to a discrete set of actions, limiting the manoeuvrability of the UAV.

Continuous-action approaches are demonstrated by Rodriguez-Ramos et al. (2019) and Shi et al. (2019): Rodriguez-Ramos et al. (2019) implements the continuous reinforcement learning algorithm DDPG (Lillicrap et al., 2016) to land on a moving platform, while Shi et al. (2019) learns an unknown disturbance force caused by the ground effect that is fed into a non-linear trajectory tracking controller. Both approaches facilitate continuous control for UAV landing but require the desired landing position to be known a priori, limiting their feasibility in unexplored environments.

Autonomous landing zone proposal approaches have been developed to eliminate the need to explicitly specify the landing target a priori (Maturana & Scherer, 2015; Carney et al., 2019; Kaljahi et al., 2019). Maturana and Scherer (2015) uses globally registered LiDAR point clouds to build a volumetric density map, from which a convolutional neural network creates a voxelised map of safe landing area probabilities. Carney et al. (2019) uses publicly available population density, elevation, terrain ruggedness and land classification data in a heuristic algorithm to detect viable landing strips, from which a single candidate is selected using a weighting algorithm. Kaljahi et al. (2019) applies a series of Gabor filters to RGB images, from which a Markov chain code process is used to cluster and classify pixels. However, such landing proposal approaches do not take into consideration the potential objectives or preferences of the pilot and are yet to be validated for use in UAV landings.

Shared autonomy approaches circumvent the need for predefined landing zones, as the AI assistant infers the landing position from the pilot's actions. Reddy et al. (2018) implements a shared autonomy system using a model-free DQN to assist participants (n=12) in safely landing in the Lunar Lander game, where the locations of safe landing zones are known only to the user. However, when tested in a user study with a physical drone (n=4), the landing pad location is included within the network's state space.

Researchers have also proposed to estimate the intent of pilots using eye gaze tracking (Pfeiffer et al., 2022) and the flight trajectory of pilots (Patrikar et al., 2022). However, these models are trained in a supervised manner with ground truth labels and are yet to be implemented in the shared autonomy context.

Our previous work (Backman et al., 2021) demonstrated a shared autonomy system where pilots were tasked to land a simulated UAV on one of several platforms with the help of an assistant. The assistant comprised two modules: a perception module, which encodes visual information onto a latent vector, and a policy module, trained using the reinforcement learning algorithm TD3 (Fujimoto et al., 2018), which augments the pilot's actions.

Our proposed approach builds upon our prior work (Backman et al., 2021), making significant changes to the perception and policy modules with a focus on real-world deployment without the need for real-world data. The proposed perception module encodes information from two RGB-D cameras for a greater field of view whilst maintaining a compressed latent embedding, combining the camera images with a novel camera projection model that is only attainable in simulation. For improved generality, the latent embedding encodes a segmentation map indicating potential safe landing areas, compared to the single-platform prior in Backman et al. (2021). For robustness against sensor noise, noise-generating functions are dynamically applied per training batch, making the module highly invariant to disturbances.

The proposed policy module is improved with the inclusion of an LSTM cell to address the inconsistency concerns raised by participants in our prior work. The simulated user model is expanded to capture how a human pilot operates a joystick controller, compared to the keyboard presses in Backman et al. (2021). Model convergence time is significantly reduced by providing the critic with hidden information about the simulated user's intent that can only be attained in simulation, as well as by a concurrent exploration training architecture. Overall task complexity is increased due to additional safe landing requirements, unrestricted platform layouts compared to the one-dimensional platform configurations in Backman et al. (2021), and user study validation in an unseen physical environment. We demonstrate that the proposed approach, with perception and policy modules trained only in simulation, is able to successfully assist human pilots during physical shared autonomy drone landings.

1.2 Problem statement and model overview

We consider an unknown environment in which a UAV is operating, which includes a number of suitable landing locations, named landing platforms. Our objective is to develop an assistant that aids pilots of all proficiencies to safely land a UAV on a user-selected landing platform, when neither the number nor the locations of landing platforms are known to the assistant a priori. A landing is considered safe when the UAV remains at rest atop a landing platform and the vertical and horizontal landing velocities are below a given threshold. The pilot controls the UAV by providing linear XYZ velocities using a radio joystick controller for the onboard flight controller to follow. Because the pilot must maintain a safe viewing distance, their ability to perceive the depth of the UAV is hindered.

Fig. 1 Overview of system architecture

Our approach is summarised in Fig. 1 and consists of two learning components: (i) a compressed latent representation of the environment to perceive the location of potentially safe landing sites and (ii) a policy network to provide control inputs that assist the pilot in safely landing. The first component takes noisy images from a downwards-facing stereo RGB-D camera pair and encodes information about the structure of the scene and the location of safe landing zones onto a low-dimensional latent vector. The second component takes the current state of the UAV, which includes the latent vector representation, the UAV's dynamics and the pilot's current action, as well as the previous state concatenated with the assistant's previous action, and outputs a linear XYZ target velocity. The final target velocity for the UAV's flight controller to follow is determined by averaging the actions taken by the pilot and the assistant.
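For illustration, the action-blending step reduces to an element-wise mean of the two velocity commands. The following minimal Python sketch (function and variable names are ours, not from the system's codebase) shows the computation:

```python
import numpy as np

def blend_actions(pilot_action, assistant_action):
    """Equal-weight blending of pilot and assistant linear XYZ velocity commands.

    Both inputs are 3-vectors of target velocities (m/s); the mean is the target
    velocity forwarded to the onboard flight controller.
    """
    return 0.5 * (np.asarray(pilot_action) + np.asarray(assistant_action))

# Hypothetical example: the pilot flies forward while the assistant corrects laterally.
pilot = np.array([0.6, 0.0, -0.2])       # vx, vy, vz
assistant = np.array([0.2, 0.3, -0.2])
print(blend_actions(pilot, assistant))   # -> [0.4, 0.15, -0.2]
```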

2 Learning latent space representation

The purpose of learning a latent space embedding is to reduce the dimensionality of the input state-space the policy network learns from. As the policy network gathers experience by interacting with the environment in real time, a reduction in the dimensionality of the state space leads to accelerated convergence (Curran et al., 2017).

Spurr et al. (2018) introduces a cross-modal variational auto-encoder (CM-VAE), which trains a latent vector representation from multiple data sources. Given an input data source \(x_k\), where \(k\) represents the modality such as an RGB-D image or an arbitrary sensor reading, the encoder \(q_k\) embeds \(x_k\) onto a latent representation as a vector of means \(\mu \) and variances \(\sigma ^2\) of a normal distribution. Decoder \(p_l\) samples the latent embedding \(z \sim {\mathcal {N}}(\mu , \sigma ^2)\) and reconstructs \(z\) into the desired output \(y_l\), where \(l\) denotes the output modality. As encoder \(q_k\) maps \(x_k\) onto a normal distribution, the Kullback–Leibler divergence is used as a regularizing term to enforce that the properties of the target distribution are met, while information is encoded onto the latent vector by minimising the reconstruction loss between the predicted output \({\hat{y}}_l\) and the true output \(y_l\).

2.1 CM-VAE implementation

Our implementation of the CM-VAE architecture differs from our previous work (Backman et al., 2021) by including two input RGB-D images taken from downwards-facing cameras situated at the front \(x_{F}\) = [\(x_{FRGB}\), \(x_{FD}\)] and back \(x_B\) = [\(x_{BRGB}\), \(x_{BD}\)] of the UAV. The cameras are strategically placed such that the encoded field of view is maximised over the pilot's depth axis, the axis of greatest uncertainty. The CM-VAE reconstructs two output modalities in the form of a combined depth map \(y_D\) and a binary segmentation map \(y_S\) that classifies safe-to-land areas, where a pixel is considered safe-to-land if all four legs of the UAV are able to contact the surface without the rest of the UAV intersecting the environment. The reconstructed output \(y_S\) replaces \(y_{XYZ}\), the relative position of the nearest landing platform in our prior work (Backman et al., 2021), as a more general approach to defining potential landing zones, capable of representing multiple potential landing zones rather than a single platform. Both \(x_F\) and \(x_B\) are passed through a Siamese feature extraction layer \(F_{DN}\) using DroNet (Loquercio et al., 2018), from which the resultant vectors are concatenated and encoded onto the latent embedding with encoder \(q_{RGBD}\), i.e. \([\mu , \sigma ^2 ] = q_{RGBD}( [F_{DN}(x_{F}), F_{DN}(x_B)] )\). To reconstruct \(y_D\) and \(y_S\), a sample is taken from the latent embedding \(z \sim {\mathcal {N}}(\mu , \sigma ^2)\) and passed through decoders \(p_D\) and \(p_S\): \({\hat{y}}_D = p_D(z)\) and \({\hat{y}}_S = p_S(z)\).
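A minimal PyTorch-style sketch of this forward pass is given below. The flattened feature extractor stands in for the Siamese DroNet backbone \(F_{DN}\), and all layer sizes are illustrative placeholders rather than the architecture used in the paper:

```python
import torch
import torch.nn as nn

class CMVAE(nn.Module):
    """Sketch of the CM-VAE forward pass: shared feature extraction of both
    RGB-D views, joint encoding to a Gaussian latent, and two decoders."""
    def __init__(self, feat_dim=128, latent_dim=64):
        super().__init__()
        self.feature_extractor = nn.Sequential(            # stands in for F_DN (shared weights)
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.encoder = nn.Linear(2 * feat_dim, 2 * latent_dim)    # q_RGBD -> [mu, log sigma^2]
        self.depth_decoder = nn.Linear(latent_dim, 80 * 40)       # p_D
        self.seg_decoder = nn.Sequential(nn.Linear(latent_dim, 80 * 40), nn.Sigmoid())  # p_S

    def forward(self, x_front, x_back):
        # Siamese feature extraction for both views, then joint encoding.
        f = torch.cat([self.feature_extractor(x_front),
                       self.feature_extractor(x_back)], dim=-1)
        mu, log_var = self.encoder(f).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterised sample
        y_depth = self.depth_decoder(z).view(-1, 40, 80)          # combined depth map
        y_seg = self.seg_decoder(z).view(-1, 40, 80)              # safe-landing segmentation
        return y_depth, y_seg, mu, log_var
```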

A consolidated latent representation for the stereo RGB-D images is chosen, rather than individually embedding \(x_F\) and \(x_B\) as \(\mu _F\) and \(\mu _B\) and reconstructing the depths separately as \({\hat{y}}_{FD}\) and \({\hat{y}}_{BD}\), in order to create a lean latent embedding by reducing redundant information that would be shared amongst both cameras' fields of view. This also increases robustness against image noise, as degraded visual information from one camera can be supplemented by the other. To achieve a consolidated latent representation we implement a novel camera projection model to generate \(y_D\) and \(y_S\) that encompasses the entire field of view of the input stereo image pair. The combined image is sub-divided into three segments: (i) a front-camera segment and (ii) a back-camera segment, which follow the standard pin-hole camera projection model and occupy opposite halves along the width of the image, and (iii) a middle segment, which places \(N_c\) intermediate one-dimensional pin-hole cameras evenly distributed between the front and back cameras. The intermediate cameras share the same image height and focal lengths as the front and back cameras but have an image width of one. To generate the combined depth and segmentation map we use an image size for the front and back cameras of 141w \(\times \) 80h with \(N_c\) = 18 intermediate cameras, for a combined image size of 160w \(\times \) 80h, which is subsequently down-sampled to 80w \(\times \) 40h. The combined image projection is demonstrated in Fig. 2.

Fig. 2 Camera projection model used to combine stereo depth images into a single image
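The column budget of the combined image can be summarised as follows; this is a rough sketch, and the exact split of the 141-pixel-wide front and back views into the combined image is our assumption:

```python
# Assumed column layout of the combined projection: the front and back views
# (141 px wide each) each contribute roughly half of the combined width, with
# N_c = 18 one-pixel-wide intermediate cameras filling the middle segment.
FRONT_COLS, BACK_COLS, N_C = 71, 71, 18
COMBINED_W = FRONT_COLS + N_C + BACK_COLS                          # 160 px
COMBINED_H = 80
DOWNSAMPLED_W, DOWNSAMPLED_H = COMBINED_W // 2, COMBINED_H // 2    # 80 x 40
```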

To generate data to train the network we use the AirSim (Shah et al., 2018) Unreal Engine plugin to capture RGB-D images of generated scenes. Each generated scene consists of \(N_p\) randomly generated landing platforms that are placed such that adjacent platforms are not within a minimum distance threshold. Lighting conditions are altered and textures are applied to the platforms, walls and floor from a database of 500 materials. For each scene, a total of 20 front and back RGB-D images, segmentation images and combined depth images are taken by sweeping a camera through the environment. As images taken from AirSim are free of noise, which is not indicative of real-life sensors, a noise-generating function \(G_n\) is used to generate \(x_F\) and \(x_B\) from a clean input batch of depth and RGB images. For greater domain randomisation, noised images are dynamically generated during training by parallel worker threads that send the resultant noised image batches to the network training thread.
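The dynamic noise generation can be organised as a simple producer-consumer pipeline. The sketch below uses a hypothetical Gaussian-noise-plus-dropout model for \(G_n\); the actual noise functions and threading layout used in the paper are not reproduced here:

```python
import queue
import threading
import numpy as np

def noise_fn(rgb, depth, rng):
    """Hypothetical G_n: Gaussian pixel noise plus random depth dropout."""
    noisy_rgb = np.clip(rgb + rng.normal(0.0, 0.02, rgb.shape), 0.0, 1.0)
    dropout = rng.random(depth.shape) < 0.05          # simulate missing depth returns
    noisy_depth = np.where(dropout, 0.0, depth + rng.normal(0.0, 0.01, depth.shape))
    return noisy_rgb, noisy_depth

def worker(clean_batches, out_queue, seed):
    """Producer thread: noise clean batches and hand them to the training thread."""
    rng = np.random.default_rng(seed)
    for rgb, depth in clean_batches:
        out_queue.put(noise_fn(rgb, depth, rng))

batch_queue = queue.Queue(maxsize=8)
# threading.Thread(target=worker, args=(clean_batch_iter, batch_queue, 0), daemon=True).start()
```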

Each datapoint within a training batch consists of a ground truth combined depth map, a segmentation map and a pair of noisy front and back RGB-D images. The losses used fall into two categories: (i) those that aim to embed information onto the latent space and (ii) those that aim to condition the latent space. For (i) we use the mean squared error between the true combined depth map and the estimated depth map \((y_D, {\hat{y}}_D)\) with a weighting of 1.0, and the binary cross entropy loss between the true segmentation map and the estimated segmentation map \((y_S, {\hat{y}}_S)\) with a weighting of 0.5. For (ii) the Kullback-Leibler divergence loss is given a weighting of 4.0. An overview of the CM-VAE architecture can be seen in Fig. 3.
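Assuming a PyTorch-style implementation, the weighted objective can be written as follows (the KL term uses the standard closed form for a diagonal Gaussian against a unit Gaussian prior):

```python
import torch
import torch.nn.functional as F

def cmvae_loss(y_depth_hat, y_depth, y_seg_hat, y_seg, mu, log_var):
    """Weighted CM-VAE objective with the weightings stated above (1.0 / 0.5 / 4.0)."""
    depth_loss = F.mse_loss(y_depth_hat, y_depth)                     # (i) depth reconstruction
    seg_loss = F.binary_cross_entropy(y_seg_hat, y_seg)               # (i) safe-landing segmentation
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())   # (ii) latent regularisation
    return 1.0 * depth_loss + 0.5 * seg_loss + 4.0 * kl
```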

Fig. 3 Overview of the CM-VAE architecture

3 Policy learning: TD3

The aim of the policy network is to assist pilots in safely landing a UAV on a desired platform by providing a target velocity that is averaged with the pilot's current input, which the onboard flight controller then follows. We define our problem as a partially observable Markov decision process (POMDP) based on Javdani et al. (2018)'s formalization of shared autonomy systems. We define the set of all possible states as \({\mathcal {S}}\), where the pilot's goal is treated as a hidden state that must be inferred by the assistant through the observation function \({\mathcal {O}}: {\mathcal {S}} \times \Omega \rightarrow [0, 1]\), where \(\Omega \) is the set of observations. The set of all actions available to the assistant is denoted as \({\mathcal {A}}\). The state transition process is defined as a stochastic process \(T: {\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow [0, 1]\), due to the uncertainty in predicting UAV dynamics under unobservable forces such as the turbulence caused by the ground effect. The reward function is defined as \(R: {\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow {\mathbb {R}}\), where the aim of reinforcement learning is to find the closest approximation to the optimal policy \(\pi : {\mathcal {S}} \rightarrow {\mathcal {A}}\) which maximises the expected future reward with discount factor \(\gamma \in [0,1]\).

For optimal policy approximation, we follow our previous work (Backman et al., 2021) and use the reinforcement learning algorithm TD3 (Fujimoto et al., 2018) due to its model-free, continuous action-space control, whilst alleviating the instability concerns of its predecessor DDPG (Lillicrap et al., 2016). As TD3 is an actor-critic approach, the actor follows the aforementioned POMDP definition, where the partial observability originates from inferring the pilot's goal without explicit definition. As the critic is an artifact of the training process that is not used during deployment, we model the problem for the critic as an MDP by providing the pilot's goal information within the observation space \({\mathcal {O}}\), reducing problem complexity for the critic.

3.1 TD3 implementation

To train the assistant we utilize AirSim as our simulation environment for modelling UAV dynamics. We generate a variety of training scenes with varying numbers and locations of potential landing platforms. Each generated scene consists of \(N_p\) randomly generated landing platforms that are either placed randomly, such that adjacent platforms are not within a minimum distance threshold, or arranged in a grid-like pattern with random spacing.

3.1.1 Simulating users

As training reinforcement learning algorithms takes substantial time, a population of simulated users is developed instead of training with human participants. We characterise a simulated user using four parameters: \(\alpha \in [0,1]\) describes the simulated user's conformance to the assistant's actions and models how likely a pilot is to adopt the policy of the assistant. \(\beta \in [0,1]\) describes the user's proficiency, defined as the ability to improve one's estimate of the goal by perceiving the relative depth of the UAV to the landing platform. Both \(\alpha \) and \(\beta \) influence the simulated user's estimate of the current goal position by:

$$\begin{aligned} {\hat{G}}_{i+1} = {\hat{G}}_i + \alpha \frac{a_a - a_u}{K_\alpha } + \beta \frac{G - {\hat{G}}_i}{K_\beta }, \end{aligned}$$
(1)

where \(a_a\) and \(a_u\) are the previous actions taken by the assistant and the simulated user respectively, while \(K_\alpha \) and \(K_\beta \) are scaling constants. In our previous work (Backman et al., 2021), we used a two-parameter model consisting of \(\alpha \) and \(\beta \) to model simulated users. However, these two parameters alone do not sufficiently capture the variability in user behaviour, such as their desired speed and acceleration, nor do they model how pilots operate a physical joystick interface. Here we introduce two new parameters, \(\Psi \in [0,1]\) and \(\Phi \in [0,1]\), to model the simulated user's dexterity in exercising fine motor control of a continuous joystick controller, compared to the discrete keyboard inputs in Backman et al. (2021). \(\Psi \) describes how aggressively a pilot accelerates their thumbs on a joystick controller, modeling their tendency to generate smooth or sharp trajectories. \(\Phi \) describes how daring a user is in terms of their maximum desired flight speeds, modeling how far a pilot is likely to push down on a joystick and their ability to provide fine adjustments when landing.
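Equation (1) translates directly into code; \(K_\alpha \) and \(K_\beta \) are unspecified scaling constants, so the default values in the sketch below are placeholders:

```python
def update_goal_estimate(G_hat, G, a_assistant, a_user, alpha, beta,
                         K_alpha=10.0, K_beta=20.0):
    """Eq. (1): simulated user's goal-estimate update.

    K_alpha and K_beta are placeholder values; inputs may be scalars or
    numpy arrays of matching shape.
    """
    return (G_hat
            + alpha * (a_assistant - a_user) / K_alpha
            + beta * (G - G_hat) / K_beta)
```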

Fig. 4 Overview of simulated user's design

The simulated user is modelled as a state machine comprising two states: (i) the approach state, where the simulated user travels towards its current estimate of the goal position at a given altitude, and (ii) the descent state, where the simulated user attempts to land at the current goal estimate \({\hat{G}}_i\). The desired trajectory and velocity of the simulated user are determined by the trajectory planning module, as seen in Fig. 4. The trajectory planning module is responsible for setting waypoints for the UAV to travel towards, which are updated according to the current estimate of the target landing platform \({\hat{G}}_i\). The desired speed at which the simulated user travels towards the set waypoint is proportional to \(\Phi \), resulting in the simulated user's desired target velocity \(V_t\). The output velocity \(V_t\) of the trajectory planning module is then fed into the velocity mapping controller, which determines the final output action of the simulated user, as seen in Fig. 4. The velocity mapping controller is split into two submodules: joystick control, which aims to model how a human pilot's thumbs operate a joystick controller, and adaptability control, which aims to model how a pilot may react to not traveling at their desired velocity \(V_t\) due to disturbances from the assistant. Given that the current joystick position corresponds to the velocity \(J_t\), the joystick control submodule updates the simulated user's joystick position using a proportional controller based on the difference between the desired target velocity \(V_t\) and the current joystick input \(J_t\), i.e.

$$\begin{aligned} J_{t+1} = J_t + (V_t - J_t) P_{gain} \end{aligned}$$

where \(P_{gain} \propto \Psi \). To adapt the simulated user's control input in accordance with the actions taken by the assistant, the adaptability control submodule uses an integral controller based on the difference between the previous actions of the simulated user \(a_u\) and the assistant \(a_a\). This action difference integral \(I_t\) is updated as:

$$\begin{aligned} I_{t+1} = I_t + (a_u - a_a)(1 - \alpha ), \end{aligned}$$

which is subsequently decayed at each time step. The simulated user's final output velocity command is then the sum of the joystick control and adaptability control submodules' outputs:

$$\begin{aligned} a_u = J_{t+1} + I_{t+1} I_{gain}. \end{aligned}$$
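The velocity mapping controller described above can be sketched as follows; the gains, the integral decay factor and the point at which the decay is applied are our assumptions rather than values from the paper:

```python
import numpy as np

class VelocityMappingController:
    """Sketch of the joystick-control and adaptability-control submodules."""
    def __init__(self, psi, alpha, p_gain_scale=1.0, i_gain=0.1, i_decay=0.9):
        self.p_gain = p_gain_scale * psi    # P_gain is proportional to Psi
        self.alpha = alpha
        self.i_gain = i_gain
        self.i_decay = i_decay
        self.J = np.zeros(3)                # current joystick velocity J_t
        self.I = np.zeros(3)                # action-difference integral I_t

    def step(self, v_target, a_user_prev, a_assist_prev):
        # Joystick control: proportional controller towards the desired velocity V_t.
        self.J = self.J + (v_target - self.J) * self.p_gain
        # Adaptability control: integrate disagreement with the assistant, then decay.
        self.I = (self.I + (a_user_prev - a_assist_prev) * (1.0 - self.alpha)) * self.i_decay
        # Final simulated-user action a_u.
        return self.J + self.I * self.i_gain
```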

During training, a new simulated user is sampled after each landing trial: each of the four parameters \(\alpha \), \(\beta \), \(\Psi \) and \(\Phi \) is independently sampled from a uniform distribution, and the user's random number generators are seeded based on the current time. Each simulated user contains two random number generators to control for deterministic and non-deterministic decisions. Deterministic decisions are those that are guaranteed to occur in a specific order regardless of the actions taken by the assistant, and include the initial starting altitude and whether the simulated user should fly along the principal axis or directly when approaching the target platform. Non-deterministic decisions are those that cannot be guaranteed to occur in a set order due to the actions taken by the assistant, and include decisions such as pauses or altitude changes mid-flight. Two random number generators are included so that the exact characteristics and decisions made by the simulated user can be replayed when validating across multiple models for a fair comparison. Examples of the simulated user flying in various simulated environments can be seen in the supplementary video.

3.1.2 Policy network

The actor network architecture used by the assistant to generate target velocities contains two branches of fully connected layers. The first branch extracts features of the current state \(s_t\), while the second branch extracts features of the previous state \(s_{t-1}\), which are then sequentially fed into an LSTM cell for temporal information extraction. The two branches are then concatenated and fed into additional fully connected layers to generate a vector in \({\mathbb {R}}^3\), denoting the assistant's desired control input target velocity for the UAV. A multi-branch model is chosen over the architecture of our prior work (Backman et al., 2021) due to the latter's poor handling of temporal information. Previously the pilot's input was averaged over a fixed time window as a cheap means to observe the pilot's intent over time, which occasionally led to inconsistent assistance being delivered. The actor and critic networks share identical architectural designs aside from the activation layer of their final output, where the actor uses a tanh activation function scaled to generate velocity magnitudes between \([0.0, 1.2]\) m/s while the critic uses a linear activation function to generate the Q-value for the given state. The network architecture is summarised in Fig. 5.
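A PyTorch-style sketch of the two-branch actor is shown below; layer widths, the input dimensionality and the exact placement of the fully connected layers are illustrative only, and the \(\pm 1.2\) m/s scaling reflects our reading of the tanh output scaling:

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Two-branch actor sketch: current state and previous state+action branches,
    with the latter feeding an LSTM cell, merged into a 3D velocity head."""
    def __init__(self, state_dim=80, hidden=128):
        super().__init__()
        self.current_branch = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.prev_branch = nn.Sequential(nn.Linear(state_dim + 3, hidden), nn.ReLU())
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 3), nn.Tanh())

    def forward(self, s_t, s_prev_and_action, lstm_state):
        cur = self.current_branch(s_t)
        h, c = self.lstm(self.prev_branch(s_prev_and_action), lstm_state)
        # Per-axis output scaled to roughly [-1.2, 1.2] m/s.
        action = 1.2 * self.head(torch.cat([cur, h], dim=-1))
        return action, (h, c)
```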

Fig. 5 Overview of assistant training architecture

At each iteration the actor receives three inputs: (i) the current state vector \(s_t\), which contains the UAV position, velocity, pilot's action and mean latent vector \(\mu \) from \(q_{RGBD}\); (ii) the previous state \(s_{t-1}\) concatenated with the action taken by the assistant in the previous iteration; and (iii) the LSTM cell's memory state from the previous iteration.

The actor is unaware of the pilot's goal, which has to be implicitly inferred through observations of the actions taken by the pilot in the context of the current state. As the critic is not required during inference, the state space for the critic can be augmented with additional information that is only attainable during training in simulation, such as the pilot's true goal landing position. When training the critic we concatenate the current state \(s_t\) with the true goal position \(G\) as a means of learning with privileged information (Vapnik & Izmailov, 2015) to decrease network convergence time. As shown by Salter et al. (2021), using privileged information during reinforcement learning improves learning speed and leads to improved final performance and better generalisation.
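In practice this asymmetry amounts to concatenating the true goal position onto the critic's copy of the state while leaving the actor's input untouched, as sketched below (names are ours):

```python
import torch

def critic_state(s_t, goal_xyz):
    """Privileged critic input: the current state concatenated with the true goal
    position G (available only in simulation); the actor never receives goal_xyz."""
    return torch.cat([s_t, goal_xyz], dim=-1)

# During a TD3 update (sketch):
#   q_value = critic(critic_state(s_t, G), prev_state_action, action)
#   actor_action, lstm_state = actor(s_t, prev_state_action, lstm_state)   # no G
```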

The reward function used to train the policy includes a total of 5 terms as seen in Eq. (2):

$$\begin{aligned} \begin{aligned} R&= k_{0}R_\textrm{ActionDiff} + k_{1}R_\textrm{LandingError} \\&\quad + k_{2}R_\textrm{SafePos} + k_{3}R_\textrm{HVel} + k_{4}R_\textrm{VVel}. \end{aligned} \end{aligned}$$
(2)

\(R_\textrm{ActionDiff}\) is the difference between the action taken by the user \(a_u\) and the assistant \(a_a\): \(R_\textrm{ActionDiff} = {-}\left\Vert a_u - a_a\right\Vert \), to promote the assistant only exerting control when necessary. \(R_\textrm{LandingError}\) is the UAV landing error relative to the target platform \(G\): \(R_\textrm{LandingError} = {-}\left\Vert G - XY\right\Vert \), while \(R_\textrm{SafePos}\) indicates whether the UAV landed at a safe location, defined as landing with all four legs contacting the platform (1) or not (-1). \(R_\textrm{LandingError}\) is included to ensure the assistant lands near the platform desired by the pilot, whereas \(R_\textrm{SafePos}\) promotes safely landing on any platform. \(R_\textrm{HVel}\) and \(R_\textrm{VVel}\) relate to the speed of the UAV when landing in the horizontal \(v_h\) and vertical \(v_v\) directions respectively. The reward is scaled based on the height \(H\) of the UAV up to a threshold height \(H_T\) above the target platform, where the magnitude of \(R_\textrm{HVel}\) and \(R_\textrm{VVel}\) grows as the UAV gets closer to the platform:

$$\begin{aligned} R_\textrm{HVel} = {\left\{ \begin{array}{ll} \frac{({H - H_T})}{H_T} v_h^2 &{}\text {if }H < H_{T} \\ 0 &{}\text {else} \end{array}\right. } \end{aligned}$$
(3)

and

$$\begin{aligned} R_\textrm{VVel} = {\left\{ \begin{array}{ll} \frac{({H - H_T})}{H_T} v_v^2 &{}\text {if }H < H_{T} \\ 0 &{}\text {else} \end{array}\right. }. \end{aligned}$$
(4)

\(R_\textrm{HVel}\) and \(R_\textrm{VVel}\) are included to promote a landing velocity that is safe for the UAV, where separate weightings are applied along the vertical and horizontal directions to encourage the assistant to generate trajectories with minimal horizontal movement when landing. The scaling term \(\frac{({H - H_T})}{H_T}\) is included to promote a smoother trajectory for the UAV, rather than the assistant employing a just-in-time approach to slowing down for landing, which would result from only considering the final landing velocity in the vertical and horizontal directions. The actor aims to maximise the future discounted reward with a discount factor of 0.99, where the weighting coefficients \(k_{0}\)–\(k_{4}\) are given values of 0.375, 12.0, 5.0, 40.0 and 3.5 respectively.
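The full reward of Eq. (2) with the coefficients above can be sketched as a single function. Note that the sketch simply evaluates every term for the inputs it is given; in training, the landing-error and safe-position terms only apply at touchdown:

```python
import numpy as np

def landing_reward(a_user, a_assist, goal_xy, land_xy, landed_safe,
                   v_h, v_v, height, H_T):
    """Eq. (2) with k0..k4 = 0.375, 12.0, 5.0, 40.0, 3.5."""
    r_action = -np.linalg.norm(np.asarray(a_user) - np.asarray(a_assist))
    r_land = -np.linalg.norm(np.asarray(goal_xy) - np.asarray(land_xy))
    r_safe = 1.0 if landed_safe else -1.0
    # Height scaling: zero above H_T, growing in magnitude (negatively) near the platform.
    scale = (height - H_T) / H_T if height < H_T else 0.0
    r_hvel = scale * v_h ** 2
    r_vvel = scale * v_v ** 2
    return 0.375 * r_action + 12.0 * r_land + 5.0 * r_safe + 40.0 * r_hvel + 3.5 * r_vvel
```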

For training, the exploration process is parallelised with 16 concurrent UAVs running across four separate instances of the Unreal Engine. Each Unreal Engine management thread collects the current state of the four UAVs within its environment and the simulated users' actions, and requests assistant actions from a centralised network inference thread. The returned assistant actions then have exploratory noise applied to them from an Ornstein-Uhlenbeck process, which is decayed after successive epochs. The assistant actions are then averaged with the simulated users' actions, after which normally distributed noise is added with variance dependent on the UAV's distance to the ground to emulate disturbances caused by the ground effect. The final target velocity is then sent to AirSim, where an iteration is performed every 200ms. After each epoch the network inference thread loads the newly collected experiences from the 16 UAVs into the experience replay buffer and then performs optimisation iterations in accordance with the total exploration iterations performed in the previously completed epoch, with a batch size of 64. An epoch is considered complete when all 16 UAVs have landed or a total of 1.5 min have elapsed. Training and model validation results are reported in Sect. 5.2.
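The exploration noise applied to the assistant's actions follows a standard Ornstein-Uhlenbeck process; a minimal sketch is shown below, where the \(\theta \), \(\sigma \) and decay values are placeholders rather than the values used in training:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise with per-epoch decay (placeholder parameters)."""
    def __init__(self, dim=3, theta=0.15, sigma=0.2, decay=0.999):
        self.theta, self.sigma, self.decay = theta, sigma, decay
        self.x = np.zeros(dim)

    def sample(self):
        # Discrete-time OU update (unit time step).
        self.x += -self.theta * self.x + self.sigma * np.random.randn(*self.x.shape)
        return self.x

    def end_of_epoch(self):
        # Decay the exploration magnitude after successive epochs.
        self.sigma *= self.decay
```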

4 User study

To assess the performance of the proposed assistant with human pilots, a user study was conducted. The study was approved by the Monash University Human Research Ethics Committee (MUHREC), project ID 29565. The physical arena consisted of nine labeled platforms of size 0.5\(\times \)0.5\(\times \)0.12m, arranged in a 3\(\times \)3 grid with 1.4m spacing between platform centroids. Each participant performed a total of twenty landings, split into a sequence of ten unassisted landings and ten assisted landings, where participants with odd ID numbers completed the ten assisted landings first and participants with even ID numbers completed the ten unassisted landings first. During the study, participants stood as far back as possible within the physical dimensions of the experiment laboratory, measured to be 11m behind the centre platform, and were told to remain stationary during the flight. The physical layout of the study can be seen in Fig. 6.

Fig. 6 Physical layout of user study environment. (Left) View of the arena from participant's perspective. (Top right) UAV used in user study resting atop of landing platform. (Bottom right) Participants' relative position to the arena

Participants flew a custom-built UAV using the Pixhawk Cube flight controller running ArduPilot. The UAV contained an Odroid-N2 as an onboard companion computer to retrieve RGB-D images from two Intel RealSense D435i cameras situated at the front and back of the UAV. The Odroid-N2 wirelessly streamed the camera feeds to a base station whilst receiving target velocities and UAV poses, recorded using fourteen Bonita-10 Vicon motion capture cameras, which were forwarded to the onboard flight controller. The motion capture cameras were mounted around a 7.2m diameter circular truss, inside which the participants flew, constrained within a 6.8m diameter safe-to-fly zone by an automatic safety system that alerted and forcibly took control of the UAV if the pilot's actions were deemed unsafe. During the study, a single participant in the unassisted condition triggered the safety system by unintentionally attempting to fly out of bounds.

Participants were initially asked to fill out a demographics survey detailing their previous experience with UAVs and joystick controllers, after which they were shown an introductory video explaining the process of the study. Participants were then briefly given the opportunity to practice flying the UAV, where they were instructed not to perform any landings but to develop an understanding of the mapping of control inputs to physical changes in the UAV's state. After the conclusion of the practice flight, the ceiling lights were turned off and eight LED flood lights situated around the base of the truss were turned on to illuminate the arena and to prevent participants from relying on shadows cast by the UAV as a substitute for depth perception. Participants were then tasked with performing the predefined sequence of ten landings either unassisted or assisted, depending on their participant ID, after which they were given a NASA Task Load Index (TLX) (Hart & Staveland, 1988) survey detailing how they perceived the task workload for the given condition.

To avoid the need for participants to teleoperate the UAV take-off, the first author initiated the take-off sequence and ascended the UAV to an altitude of 1.4m using a separate master RC transmitter. The participant was then instructed which platform they were to land on using the ID number inscribed on the platform. The task commenced once the participant pushed either of the joysticks. The task concluded when the UAV descended to an altitude at which the UAV's legs would make contact with the target platform, at which point the UAV's motors were turned off and the final landing position was recorded. The sequence of ten landings was then repeated under the opposite condition (unassisted or assisted), followed by the appropriate TLX survey. Finally, participants were instructed to complete the final survey, which comprised fourteen multiple-choice questions asking about specific aspects of the task, followed by seven short answer questions. The full list of survey questions can be seen in Tables 2, 3 and 4, alongside the user study results in Sect. 5.4.

5 Results

5.1 Perception module results

To measure the effect that domain randomization has on the CM-VAE's ability to reconstruct real-life scenes, an experiment was conducted where three identical CM-VAE models were trained using three unique datasets. The proposed CM-VAE used to train the assistant follows the methodology outlined in Sect. 2 and uses the complete dataset. To test the impact of the noise-generating functions, a second model is trained using the complete dataset but without noise applied to the images before they are fed into the network. To test the impact of training over a diverse set of environments, a third model is trained using the same number of unique training images as the complete dataset, but with limited variation of textures, lighting conditions and platform configurations. A total of 25 textures, 5% of those used in the complete dataset, are used for the platforms and flooring. Lighting conditions are restricted to a selection of ceiling lights, and platform dimensions are sampled from a discrete collection of side lengths, compared to continuous sampling in the complete dataset. Each dataset consists of a total of 80,000 examples and each model is trained over one million iterations.

To evaluate each model's ability to reconstruct real-life scenes, images collected from participants in the user study are used to reconstruct the flight arena, with poses collected from the motion capture system. Each model reconstructs the combined depth map given input RGB-D images taken by the front and back cameras of the UAV, and the mean error is computed against the ground truth combined depth map. The ground truth combined depth map is created by using the pose of the UAV, the cameras' poses relative to the UAV's coordinate frame and the camera intrinsic parameters to project rays onto a simplified version of the environment. The simplified environment mimics the intended user study arena, with platforms of size 0.5\(\times \)0.5\(\times \)0.12m arranged in a 3\(\times \)3 grid with 1.4m spacing, but does not consider imperfections in the platforms or flooring.

Fig. 7 Example CM-VAE reconstruction errors taken from a participant's flight trajectory. Example input images are shown to highlight events such as spikes in the no noise model caused from input images containing missing depth values, or increases in reconstruction error for the limited domain randomization model from being unable to generalize to certain unseen configurations

Fig. 8 Architecture ablation study results. Each metric is averaged over the three training initialisations. For the bottom plots, a landing is considered a success given that the UAV lands on the intended landing platform with all four legs contacting the surface, whilst landing with a horizontal and vertical velocity of below 0.2 and 0.6m/s respectively. A Model performance whilst training; B Model performance during the simulated validation task, \(\beta \) acts as a measure of the simulated user's proficiency; C Model performance during the physical validation task

A total of 65,338 samples were used to calculate each model's mean reconstruction error. The mean errors for the models trained on the complete dataset, the limited domain randomization dataset and the no-noise dataset are 0.035m, 0.039m and 0.051m respectively, compared to the average input depth map error of 0.042m. To test whether a statistically significant difference in reconstruction performance exists between the proposed model and the alternative models, two-sample Welch's t-tests are used at a 99% confidence level. The proposed model trained over the complete dataset achieved a statistically significantly lower mean reconstruction error than both alternative models.
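For reference, this comparison reduces to a two-sample Welch's t-test over the per-image error arrays; the sketch below uses dummy normally distributed data with the reported means rather than the actual error samples:

```python
import numpy as np
from scipy import stats

# Dummy per-image reconstruction errors with the reported means (illustration only).
rng = np.random.default_rng(0)
errors_complete = rng.normal(0.035, 0.010, 65338)
errors_no_noise = rng.normal(0.051, 0.015, 65338)

# Welch's t-test (unequal variances); significant at the 99% level if p < 0.01.
t_stat, p_value = stats.ttest_ind(errors_complete, errors_no_noise, equal_var=False)
print(p_value < 0.01)
```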

An example plot of the reconstruction error for each input image in a participant's trajectory can be seen in Fig. 7. For the model trained without noise applied to training images, large spikes in reconstruction error can be observed for input images with missing depth values or large depth values caused by triangulating mismatched pixels. The model trained with limited domain randomization reconstructed the environment consistently; however, it had difficulty reconstructing certain sets of consecutive images. It is assumed that this is caused by difficulty in generalising to a broader set of unseen samples.

5.2 Policy learning results

To quantify the impact of network architecture decisions, an ablation study was performed where four unique models were trained. (i) LSTM-CriticGoal, the proposed model architecture, which includes an LSTM cell and where only the critic has access to the true landing position \(G\). (ii) LSTM-NoCriticGoal, where the critic is not provided with \(G\), to test what impact providing additional information to the critic has on the training process, similar to our prior work (Backman et al., 2021). (iii) NoLSTM-CriticGoal, where the LSTM cell is omitted from the network architecture and the fully connected layers from the current and previous state branches are directly concatenated, to test the impact of the LSTM cell on performance. (iv) LSTM-Oracle, where \(G\) is provided to both actor and critic, to test the impact of keeping the pilot's intent hidden from the actor. Each model was trained over three random initialisations, where each initialisation consisted of two million training iterations. The training results can be seen in Fig. 8A.

Providing the critic with the true goal information (available only in simulation) has the greatest impact on training results. The final correct safe landing rate for the LSTM-NoCriticGoal model was 55%, which was achieved within the first 22% of training iterations by the alternative models that included the true goal within the critic's state space. LSTM-Oracle was the highest performing model during training, with a final average landing error 57% lower than the average of LSTM-CriticGoal and NoLSTM-CriticGoal. Unlike the alternative models, LSTM-Oracle does not mistakenly land on incorrect platforms by wrongly inferring the simulated user's intent, as the intent is provided to it directly.

Each trained model was then validated on a standardised validation sequence consisting of ten landings. The simulated environment was modeled to reflect the physical environment used in the user study, with nine platforms of size 0.5\(\times \)0.5\(\times \)0.12m arranged in a 3\(\times \)3 grid with 1.4m spacing between platform centroids. For each model, the sequence of ten landings was performed a total of 51 times, where \(\beta \) was swept from 0 to 1 in 0.02 increments whilst the remaining simulated user parameters \(\alpha \), \(\Psi \) and \(\Phi \) were held constant at a value of 0.5. To ensure the simulated user's policy remained consistent across all validated models, each of the ten landings was assigned a unique number to seed the simulated user's random number generators. The simulated validation results can be seen in Fig. 8B.

Fig. 9 Reward ablation study, simulated validation results

The highest performing architecture was the oracle configuration with an average landing error of 0.042m, followed by LSTM-CriticGoal with 0.053m, NoLSTM-CriticGoal with 0.062m and LSTM-NoCriticGoal with 0.109m, all of which outperformed the baseline of the unassisted simulated user with an average landing error of 0.362m. Comparing the assisted approaches to the unassisted simulated user for both landing error and task success rate, the performance of the assisted approaches is invariant to the proficiency of the simulated user.

To assess the performance of the proposed approach when transferring from simulation to reality, the LSTM-CriticGoal architecture was validated within the physical environment used in the user study. The simulated user was used to pilot the UAV, and the sequence of ten landings was performed for values of \(\beta \) from 0 to 1 in 0.25 increments, for a total of fifty landings. \(\alpha \), \(\Psi \) and \(\Phi \) were held at constant values of 0.5 and the random number generators were seeded with identical values to those of the simulated validation. The results of the physical validation can be seen in Fig. 8C.

The baseline performance of the unassisted simulated user in the physical validation environment was comparable to that in simulation, where the range of average landing errors with respect to \(\beta \) was [0.200, 0.550]m for the physical validation and [0.226, 0.518]m for the simulated validation. Despite the proposed model LSTM-CriticGoal being trained purely on synthetic data, it achieved a landing success rate of 98% on the physical validation sequence, where only a single failed landing was recorded, for a \(\beta = 0.0\) simulated user. The simulation-reality gap accounted for an additional 0.027m in the average landing error, which can be attributed to a mixture of imperfect dynamics modelling of the UAV and the additional latency induced by wireless data streaming.

To assess the impact of the reward function terms when training the proposed model LSTM-CriticGoal, an additional ablation study was performed where each of the terms in Eq. 2 was individually omitted, aside from \(R_\textrm{ActionDiff}\). Each model was trained over a single random initialisation and then subjected to a simulated validation sequence identical to that of the network architecture ablation study, the results of which can be seen in Fig. 9.

Omitting \(R_\textrm{HVel}\) and \(R_\textrm{VVel}\) resulted in the largest decrease in success rate due to the UAV landing with unsafe velocities, but had little impact on the landing error. Omitting \(R_\textrm{LandingError}\) did not appear to affect the performance of the assistant for simulated users with \(\beta > 0.2\), but caused the assistant to land on the incorrect platform for novice simulated users due to their larger initial errors. As \(R_\textrm{SafePos}\) rewards the assistant for landing in any safe landing location, there is no incentive for the assistant to land at the location desired by the simulated user when \(R_\textrm{LandingError}\) is removed. Omitting \(R_\textrm{SafePos}\) caused the assistant to land in the general vicinity of the platform. \(R_\textrm{SafePos}\) provides a binary reward for landing on a platform, resulting in strong gradients for the assistant to ensure a safe landing. As \(R_\textrm{LandingError}\) is a continuous reward, a small decrease in the landing error results in only a small decrease in the landing penalty. It therefore becomes more beneficial for the assistant to focus on minimising \(R_\textrm{HVel}\), \(R_\textrm{VVel}\) and \(R_\textrm{ActionDiff}\) instead of further reducing the landing error, resulting in an overall higher average landing error and decreased success in landing at a safe position.

The simulated user in our prior work (Backman et al., 2021) used a two-parameter model comprising \(\alpha \) and \(\beta \). To assess the impact of introducing the two additional parameters \(\Psi \) and \(\Phi \) and the velocity mapping controller, an additional ablation study was performed where the proposed model LSTM-CriticGoal was trained on simulated users following the previous two-parameter model in Backman et al. (2021). To validate the difference in performance between the two assistants trained on different simulated user models, a standardized validation sequence consisting of 1000 landings was used. The validation sequence was constructed by uniformly sampling the simulated user's parameters \(\alpha \), \(\beta \), \(\Psi \) and \(\Phi \) for each landing, in contrast to the prior ablation study where \(\alpha \), \(\Psi \) and \(\Phi \) remained constant at 0.5. Each landing in the validation sequence was performed in the simulated replica of the environment used in the physical user study, with a random start and goal platform selected for each landing.

Both assistant models were then tested on identical validation sequences. The average success rate and landing error for the model trained using the four-parameter simulated user model were 96.4% and 0.061m respectively, while for the assistant trained on the simpler two-parameter simulated user model they were 81.6% and 0.078m respectively. The lower success rate of the model trained with \(\Psi \) and \(\Phi \) omitted from its simulated user model is predominantly caused by landing slightly off the platform, which is attributable to only experiencing simulated users during training that do not react to disturbances caused by the assistant via the adaptability control submodule of the velocity mapping controller.

Accounting for \(\Psi \) and \(\Phi \) is important for accurately reflecting human piloting characteristics: the results obtained in Sect. 5.4 show that the average non-zero action for unassisted participants ranged from 0.09m/s to 0.44m/s, demonstrating variability in the equivalent simulated user parameter \(\Phi \), while the average non-steady-state acceleration in participants' actions ranged from 0.25 to 1.17m/s\(^2\), demonstrating variability in the equivalent simulated user parameter \(\Psi \).

5.3 Related work comparison

To compare the proposed work to alternative shared autonomy works, the validation sequence used in Sect. 5.2 was applied to the two comparison approaches proposed by Javdani et al. (2018) and Reddy et al. (2018). Each approach formulates a user model and an assistance policy.

Javdani et al. (2018) assists humans in robotic arm object grasping tasks and formulates a user model that outputs an action directed straight at the goal location, with noise added to the output. The assistance policy requires the positions of all goals within the environment to be known and computes an optimal action directed to each potential goal, where the policy's output action is the sum of the goal-directed actions weighted by the associated goal probabilities.

Reddy et al. (2018) assists humans in landing in the Lunar Lander game and formulates a simulated user model with a DQN trained using two rewards based on minimising the distance to the goal and attaining landing success. To model imperfections in the optimal policy, the simulated user repeats its previous action with a fixed probability. The assistant copilot was trained using an identical DQN network following an identical reward structure.

The two comparison approaches (Javdani et al., 2018; Reddy et al., 2018) were implemented following the respective authors' GitHub repositories (Javdani, 2016; Reddy, 2018), with modifications made to Javdani et al. (2018)'s parameters to ensure appropriate scaling for the given task and environment. Our implementation of Reddy et al. (2018)'s copilot was trained using inputs identical to those of the current-state branch in our proposed work, while the network model was increased in depth and width to match the output branch of our proposed model. Reddy et al. (2018)'s DQN model had to be trained exclusively in the validation environment, as training in randomly generated environments caused the model to fail to converge. The DQN was trained for two million optimisation iterations, identical to our proposed work.

Each of the three assistance strategies was validated against each of the three simulated user models. Each assistance strategy implemented its own action blending policy (combining the input of the simulated user and the assistant) as outlined within the respective works. For Reddy et al. (2018), the simulated user actions were discretised into 27 actions, as per our prior work (Backman et al., 2021), for calculating the control sharing policy. Each model evaluation repeated the ten validation landings a total of five times, where our proposed simulated user model had its \(\beta \) parameter swept from 0.0 to 1.0 in 0.25 increments, whilst the remaining simulated user models remained constant as per their implementations.

Fig. 10 Summary results comparing the proposed work to Javdani et al. (2018) and Reddy et al. (2018) in terms of landing error (A), trajectory efficiency (B) and assistance delivered (C). Exerted control is calculated as the Euclidean distance between the UAV's target velocity and the simulated user's output target velocity, which considers each assistance approach's chosen policy blending formulation

A summary of the simulated shared autonomy comparison results can be seen in Fig. 10. Our proposed approach demonstrated the best performance in successfully landing, with an average landing error of 0.08m, compared to 0.33m and 0.40m for Reddy et al. (2018) and Javdani et al. (2018) respectively. Importantly, our proposed approach achieves the lowest landing error for all three user models. Javdani et al. (2018) generated the most efficient trajectories, defined by minimising \(\frac{\text {time} \times \text {distance travelled}}{\text {init. distance}^2}\), with an efficiency score of 3.20, followed by 6.64 and 22.01 for our proposed work and Reddy et al. (2018) respectively. In terms of the degree of control exerted onto the system, our proposed approach was shown to be the least intrusive, with an average Euclidean distance between the UAV's target velocity and the simulated user's output target velocity of 0.22, followed by 0.41 and 0.57 for Javdani et al. (2018) and Reddy et al. (2018) respectively.

The main limitation of Javdani et al. (2018) is its lack of dynamics modelling, where the inertia of the UAV often results in overshooting the target; this is further exacerbated when the UAV drifts closer to incorrect goals, increasing the action probabilities towards those targets. Javdani et al. (2018)'s success in generating efficient trajectories is due to constantly providing actions that move the UAV closer to the target; however, this success comes at the cost of requiring the locations of all targets within the environment to be known a priori, limiting its feasibility in practical settings.

The main limitation of Reddy et al. (2018) is its inability to generalise to different users or environments. Reddy et al. (2018) could not be trained on multiple randomly generated environments and performed poorly when validated against our proposed simulated user model. Unlike the user models of Javdani et al. (2018) and Reddy et al. (2018), the proposed user model does not fly directly to the goal but performs separate approach and descent phases that more accurately reflect human pilot behaviour; this behaviour is not modelled by the other related works, causing their suboptimal performance under these unseen conditions.

The proposed model demonstrated robust performance across all simulated user models, despite not being trained on the other user models, which perform early and aggressive descent manoeuvres. Unlike the other assistance strategies, the proposed approach does not require a priori knowledge of the environment, such as the locations of all targets in Javdani et al. (2018) or exclusive training in the validation arena in Reddy et al. (2018). The proposed work demonstrated the greatest performance in minimising the landing error whilst exerting the least amount of control over the system. However, the lower degree of control over the system resulted in less efficient trajectories compared to Javdani et al. (2018), as the proposed approach aims to refine the pilot's policy in order to leverage high-level human decision making, which is more effective when assisting human pilots than simulated users.

5.4 User study results

Twenty-eight participants completed the user study for a total of 560 landings (280 unassisted and 280 assisted), with an average study completion time of 1.5 h per participant. A summary of key performance metrics can be viewed in Table 1.

Table 1 User study performance metrics summary

The poorest performing unassisted participant achieved a success rate of 10%, whilst the highest performing unassisted participant achieved 80%. In the assisted condition, the lowest observed success rate was 90%, and the majority of assisted participants achieved 100%. A landing was considered successful if the UAV remained at rest atop the designated platform, touched down with horizontal and vertical velocities below 0.2 m/s and 0.6 m/s respectively, and did not engage the automatic safety system by attempting to fly outside the arena.
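The success criterion can be summarised as a simple predicate. The following minimal sketch assumes boolean flags for platform contact and safety-system engagement, which are illustrative rather than taken from the study's logging format.

```python
def landing_success(on_platform: bool,
                    horizontal_vel: float,
                    vertical_vel: float,
                    safety_engaged: bool) -> bool:
    """A landing succeeds if the UAV comes to rest atop the designated platform,
    touches down below 0.2 m/s horizontally and 0.6 m/s vertically, and never
    triggers the automatic safety system by attempting to leave the arena."""
    return (on_platform
            and horizontal_vel < 0.2
            and vertical_vel < 0.6
            and not safety_engaged)
```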

To measure the efficiency of participants' flight strategies, we assess their trajectories under two metrics: (i) the time taken to complete the task and (ii) the total distance traveled. Both metrics are normalised by the initial starting distance to account for landing sequences where the starting platform is further from the goal platform. In the assisted condition, pilots on average required two-thirds of the time to complete the same task compared to flying unassisted. For traveled distance, assisted participants generated trajectories 18 percentage points closer to the optimal value of 1.0 for the distance traveled / initial starting distance metric than unassisted participants.

To test whether a statistically significant difference exists between the unassisted and assisted conditions for the metrics in Table 1, two-sample statistical tests at a 99% confidence level were performed for each metric. Significance of the success rate, median landing error and landing error variance was tested with McNemar's test, Mood's median test and Levene's test respectively, whilst the remaining metrics were subjected to Welch's t-tests. All metrics were found to differ significantly between the unassisted and assisted conditions.
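The per-metric tests map directly onto standard SciPy and statsmodels routines. The sketch below uses placeholder arrays and counts in place of the study data and assumes a paired-contingency layout for McNemar's test.

```python
import numpy as np
from scipy.stats import median_test, levene, ttest_ind
from statsmodels.stats.contingency_tables import mcnemar

ALPHA = 0.01  # 99% confidence level

# Success rate: paired success/failure counts (rows: unassisted, cols: assisted);
# the counts here are illustrative placeholders, not the study data.
paired_counts = np.array([[140, 4],
                          [135, 1]])
print(mcnemar(paired_counts, exact=True).pvalue < ALPHA)

rng = np.random.default_rng(0)
unassisted_error = rng.normal(0.30, 0.15, 28)  # placeholder landing errors (m)
assisted_error = rng.normal(0.08, 0.04, 28)

print(median_test(unassisted_error, assisted_error)[1] < ALPHA)   # Mood's median test
print(levene(unassisted_error, assisted_error)[1] < ALPHA)        # Levene's test (variance)
print(ttest_ind(unassisted_error, assisted_error,
                equal_var=False).pvalue < ALPHA)                  # Welch's t-test
```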

To examine whether a learning effect exists from performing the unassisted or assisted condition first rather than last, statistical analysis was performed on the metrics in Table 1 using the aforementioned tests at a 99% confidence level. For all metrics across both conditions, there was insufficient evidence to suggest that the order in which the conditions were performed had an impact on the results.

Fig. 11

(Left) Participant performing the task unassisted. They initially undershoot the target platform, then over-correct, failing the landing. (Right) Participant performing the task assisted. The participant commands a premature descent; the assistant alters the trajectory to land on the platform center

Example trajectories of a participant performing the task in the unassisted and assisted conditions are shown in Fig. 11. Unassisted participants tended to undershoot their initial descent and then showed signs of uncertainty about the UAV's state, making multiple adjustments along the depth axis before finally landing. In the assisted condition participants continued to undershoot their initial descent; however, the assistant altered the trajectory to approach the platform whilst descending to ensure a safe landing. The primary cause of failure in the unassisted condition was undershooting the landing from the initial start location (63.0%), followed by overshooting the intended platform (32.6%). Further unassisted and assisted landings performed by participants can be seen in the attached supplementary video.

Fig. 12

Average assistance (black) and interquartile range (green) delivered by the assistant as a function of the XY distance from the goal location. Assistance is calculated as the XYZ distance between the pilot’s XYZ target velocity to that of the assistant’s XYZ target velocity: \(\left\Vert a_p - a_a\right\Vert \)

Fig. 13

Results of NASA TLX survey

To measure how the delivered assistance varies over trajectories, the XYZ distance between participants' actions and the assistant's actions is plotted against the XY distance from the goal, as seen in Fig. 12. For distances far from the goal (1.5 m+), the assistant exerts minimal control over the UAV due to uncertainty in the pilot's intent. As the UAV approaches the goal location, the average assistance increases as the assistant's confidence in the target grows, before declining at distances below 0.3 m, where less control is required to achieve task success.
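The profile in Fig. 12 can be reproduced by binning the per-step action difference by XY distance to the goal. The following sketch assumes matching arrays of pilot and assistant target velocities; the bin width is an illustrative choice.

```python
import numpy as np

def assistance_profile(pilot_actions, assistant_actions, xy_dist_to_goal, bin_width=0.1):
    """Mean assistance ||a_p - a_a|| and its interquartile range per XY-distance bin."""
    assistance = np.linalg.norm(pilot_actions - assistant_actions, axis=1)
    edges = np.arange(0.0, xy_dist_to_goal.max() + bin_width, bin_width)
    idx = np.digitize(xy_dist_to_goal, edges)
    centres, mean, q25, q75 = [], [], [], []
    for i in range(1, len(edges)):
        vals = assistance[idx == i]
        if vals.size:
            centres.append(edges[i - 1] + bin_width / 2)
            mean.append(vals.mean())
            q25.append(np.percentile(vals, 25))
            q75.append(np.percentile(vals, 75))
    return (np.asarray(centres), np.asarray(mean), np.asarray(q25), np.asarray(q75))
```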

A summary of participants' perception of the task is shown in Fig. 13. From the TLX survey, participants perceived a large degree of additional effort, mental demand and frustration when performing the task unassisted, and recognised a large difference in performance when performing the task assisted. Participants also perceived lower physical and temporal demand in the assisted condition, albeit to a lesser extent. The TLX survey results were analysed with Welch's t-tests at a 95% confidence level to determine whether a statistically significant difference exists between participants' perception of the task in the unassisted and assisted conditions. A statistically significant difference was found for all metrics aside from physical demand.

The full list of survey questions and responses can be seen in Tables 2, 3 and 4. From the final survey, participants strongly agreed that with the help of the assistant, task performance and the time required to complete the task were better than in the unassisted condition. Compared to our previous work (Backman et al., 2021), a stronger disparity in participant confidence was observed: participants felt less confident flying unassisted, presumably due to the presence of physical risks, whilst being more confident when flying assisted. Previously (Backman et al., 2021) participants were neutral about the trust they placed in the assistant, due to inconsistencies in the degree of assistance provided as well as aggressive actions based on incorrect inference of the participants' intent. In this work, participants' final survey responses displayed greater trust in the assistant, recognising a consistent degree of assistance and correct inference of their intent. This additional trust and consistency may explain the greater overall confidence participants felt in the assisted condition compared to previously, despite the additional risks and increased task difficulty.

For the worded responses, when asked about the most difficult aspect of the task, 68% of participants made explicit reference to the difficulty of estimating the depth of the UAV. 18% of participants mentioned difficulties associated with making small adjustments, particularly due to ground effect, noting that the "drone became unstable as it neared the platforms".

For additional features that participants would like implemented, 21% of participants recommended a downwards-facing laser light, 18% a downwards-facing camera and 21% auditory or visual feedback when the drone was above a platform or when the assistant was performing an action.

Table 2 Participant Demographics
Table 3 Final survey

When asked about their most preferred aspect of the assistant, 29% of participants mentioned the subtleness of the assistance provided, stating that it was "non-invasive and well integrated with manual control", whilst some participants "did not notice it was there". A further 29% of participants most appreciated the reduced effort required to land, needing only to approximately approach the platform, while 21% of participants valued the improved success rate.

When asked about their most disliked aspect of the assistant, 21% of participants referred to uncertainty about the assistant's intent, stating that "I wasn't sure what it was doing" and that "it was completely invisible". Although many participants most enjoyed the assistant's subtleness, it became a double-edged sword for others, where the lack of feedback and communication resulted in uncertainty about what was expected of themselves and the assistant. 18% of participants thought the assistant landed the drone with too great a vertical velocity, and a further 18% commented that they disliked the assistant's automatic landing when close to the platform. These participants wanted to be able to "readjust the position of the drone" when close to the platform and would have preferred if the assistant "only controlled the x-y position of the drone", giving full control of the throttle to the participant.

Table 4 Final survey worded questions
Table 5 Participant proficiency regression results

To measure the impact of participants' prior experience and to determine whether the assistant allows novice pilots to perform the task at a proficiency equal to or greater than that of experienced pilots, a multi-variable linear regression model was fitted to assess participants' expertise. Participants' responses to the demographic survey (Table 2) were used as independent variables, along with an additional variable denoting whether the participant performed the unassisted condition in the second half of the study, to control for potential learning effects. The regression used participants' average unassisted landing error as the dependent variable, and backwards elimination was applied until all remaining independent variables were statistically significant under a two-tailed t-test at a significance level of \(\alpha = 0.05\). The results of the backwards elimination regression are shown in Table 5.
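Backwards elimination of this kind can be implemented with an ordinary least squares fit that repeatedly drops the least significant predictor. The sketch below uses statsmodels; the DataFrame and its column names are illustrative assumptions rather than the exact survey variables.

```python
import pandas as pd
import statsmodels.api as sm

def backwards_elimination(df: pd.DataFrame, target: str, alpha: float = 0.05):
    """Iteratively drop the least significant predictor until all p-values < alpha."""
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[target], X).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return model  # all remaining predictors are significant
        predictors.remove(worst)
    return None

# Hypothetical usage, with assumed column names:
# df columns: drone_experience, joystick_experience, video_game_freq,
#             unassisted_second_half, ..., mean_unassisted_landing_error
# model = backwards_elimination(df, "mean_unassisted_landing_error")
```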

The prior drone piloting experience, joystick experience and video game frequency variables were found to be statistically significant under \(\alpha = 0.05\). The model suggests that participants with greater piloting and joystick experience on average had a lower unassisted landing error, while those who frequently played video games tended to perform worse, possibly due to unrealistic expectations of real-world UAV dynamics. There is insufficient evidence to suggest that the remaining independent variables affected participants' performance. The order in which participants performed the conditions had no significant effect on their unassisted landing results.

To determine a participant's proficiency score, the regressed model was used to estimate their unassisted landing error, which was then remapped to the range \([0, 1]\), where zero denotes a novice pilot and one denotes an expert pilot. The resulting proficiency score is plotted against key performance metrics to measure whether pilot proficiency has an effect in the unassisted and assisted conditions; the results are shown in Fig. 14.
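As a minimal sketch of the remapping, assuming a min–max normalisation of the model's predicted landing error with an inversion so that low predicted error corresponds to high proficiency (the exact transform is not prescribed here):

```python
import numpy as np

def proficiency_score(predicted_error: np.ndarray) -> np.ndarray:
    """Remap predicted unassisted landing error to [0, 1]; 0 = novice, 1 = expert."""
    lo, hi = predicted_error.min(), predicted_error.max()
    return 1.0 - (predicted_error - lo) / (hi - lo)  # lower error -> higher proficiency
```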

Fig. 14

Plots of participants’ averaged performance metrics against proficiency for unassisted and assisted conditions

Plot A in Fig. 14 shows a strong negative correlation between participants' proficiency and their unassisted landing error, which intuitively leads to higher unassisted success rates in plot B. This strong relationship is expected, as the regression model estimating participants' proficiency was fitted against their unassisted landing error. In the assisted condition, participants' landing error and success rate were invariant to prior piloting experience. Regardless of proficiency, participants performed consistently in the assisted condition, at a level exceeding that of the highest performing unassisted participants.

For the efficiency metrics in plots C and D of Fig. 14, a negative correlation exists between participants' proficiency and both the time taken to complete the task and the distance traveled; on average, the more proficient a participant, the shorter the flight trajectory and the less time required to complete the task. In the assisted condition, the average time taken did not depend on participants' proficiency, although a weak negative correlation is observed for trajectory distance.

To confirm whether a statistical relationship exists between the aforementioned metrics and participant proficiency, the slope coefficients are subjected to a two-tailed t-test at significance level \(\alpha = 0.05\). For the unassisted condition, sufficient evidence exists to suggest that participant proficiency affects all metrics shown in Fig. 14. For the assisted condition, there is insufficient evidence to suggest that participant proficiency influences any of the tested metrics; regardless of prior piloting experience, participants are projected to perform equally across all metrics with the help of the assistant.

Fig. 15

Total trajectories from user study in the unassisted condition (left) and assisted condition (middle), alongside an equivalent sample of randomly selected simulated users (right) performing the same task in the simulated validation environment. Ellipses are centered at the average landing location where principal component analysis is performed to demonstrate the axis with greatest deviation for each landing platform

During the assisted portion of the user study, it was observed that participants developed riskier or lower-effort flying habits. In the later assisted landings, participants provided lower-effort initial estimates of the target platform compared to earlier assisted landings. Participants also commented on becoming "reckless" in their post-study responses, which was reflected in participants flying at greater speeds with aggressive descents in the later portion of assisted landings. In general, as participants' confidence in the assistant grew, so did the level of risk they were willing to take. The observations of low-effort initial estimates and increased risk-taking are empirically supported by the final four landings in the assisted condition: a statistically significant difference in landing error between the initial (0.08 m) and later (0.11 m) portions of the assisted landings was observed using a Welch's t-test (p = 0.008) at a 95% confidence level. Of the five failed assisted landings, 80% occurred in the final four landings. This performance degradation was not observed in the unassisted condition, where participants performed better in the later portion of the unassisted landings.

6 Discussion

The success of our proposed shared autonomy approach can be attributed to three main factors: (i) domain randomization, (ii) simulated user modeling and (iii) training efficiency. Models trained purely in simulation must cross the simulation-to-reality gap, where visual perception accounts for the most significant portion of the gap due to the difficulty of replicating photorealistic images, often causing simulation-trained models to fail once transferred (Bousmalis et al., 2018). Domain randomization reduces the need for photorealistic images, aiming to make the trained network invariant to unimportant information, i.e., illumination and background textures, and has shown success in other UAV applications (Loquercio et al., 2020). By training the perception module over a wide variety of simulated scenes under intense dynamically generated noise, the perception module becomes robust to visual disturbances. This is reflected in the results in Sect. 5.1, where the perception module trained with high domain randomization was found to have lower reconstruction errors when transferred to real scenes. Domain randomization is also a key consideration when developing simulated users, as human pilots vary greatly in strategy and proficiency.
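As an illustration of the kind of randomization described above, the sketch below perturbs illumination, colour and noise on simulated RGB frames; the ranges and the specific augmentations are assumptions, not the perturbation set used to train the perception module.

```python
import numpy as np

def randomize_frame(rgb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """rgb: HxWx3 float image in [0, 1]; returns a randomly perturbed copy."""
    img = rgb * rng.uniform(0.4, 1.6)                               # global illumination change
    img = img + rng.uniform(-0.1, 0.1, size=3)                      # per-channel colour shift
    img = img + rng.normal(0.0, rng.uniform(0.0, 0.08), rgb.shape)  # dynamically scaled noise
    return np.clip(img, 0.0, 1.0)
```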

Developing simulated users for a shared autonomy system is the most difficult challenge, as they must cover a broad range of policies to reflect variability amongst users and realistically react to actions taken by the actor. Focusing the development of the simulated user around four base parameters simplifies the task while remaining capable of displaying a wide range of policies through variation of those parameters. Figure 15 shows the experimentally observed flight trajectories of participants in the unassisted and assisted conditions alongside an equivalent sample of unassisted simulated users. Participants approached the target platform in two ways: by flying along the principal axes or by flying directly to the platform, as modeled by the simulated user. During descent, unassisted participants appear to follow a complex trajectory, which is unmodeled by the simulated user. In the assisted condition, the trajectories reflect the smoothness of the simulated user's model but with lower landing error variance. The trajectory complexity in the unassisted condition stems from uncertainty felt by the participants, which is absent in the assisted condition, leading to simpler trajectories. The robustness of the user model is also illustrated in Fig. 10, which shows that the proposed model achieves the best performance even when tested against unseen user models.

The third factor behind the success of our shared autonomy approach was training efficiency. The ablation study results in Sect. 5.2 showed that providing the critic with additional information only available in simulation greatly improved convergence time. Training the actor took approximately 1.5 days with 16 UAVs flying concurrently. Our previous work (Backman et al., 2021) did not provide additional information to the critic or perform concurrent exploration; instead, it relied heavily on preloading the experience replay buffer with state transitions from prior failed models and performing additional optimization iterations to converge within a reasonable time. The improved training efficiency made it practical to perform hyper-parameter optimization and reward structure modification, and ensured proper model convergence before deployment.
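A minimal sketch of the privileged-critic idea: during training, the critic additionally receives the simulated user's hidden state (e.g. the intended goal), while the actor still conditions only on its observation, so the deployed policy requires no privileged input. Network sizes and argument names are assumptions.

```python
import torch
import torch.nn as nn

class PrivilegedCritic(nn.Module):
    """Q-network that concatenates observation, action and the simulated user's
    hidden state (available only in simulation) into a fully observable critic input."""
    def __init__(self, obs_dim: int, act_dim: int, goal_dim: int = 3, hidden: int = 256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(obs_dim + act_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor,
                user_goal: torch.Tensor) -> torch.Tensor:
        # The privileged user_goal is dropped at deployment because only the actor runs.
        return self.q(torch.cat([obs, action, user_goal], dim=-1))
```

TD3 maintains two such critics; a single Q-network is shown here for brevity.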

A major focus of the proposed approach was developing an assistant that is subtle and non-intrusive by implicitly inferring the pilot's intent and providing assistance only as needed. From the responses outlined in the user study results in Sect. 5.4, some participants preferred this non-intrusive behavior for its intuitive controls, while others found the lack of communication and feedback unsettling. This may be because participants were not told how the assistant operates, or because they used the assistant only for a short period of time; further work is needed to investigate whether a non-intrusive approach or one that actively communicates with the pilot is more appropriate.

Another concern is the behavioral change participants displayed in the later portion of the assisted condition, showing decreased effort and increased risk-taking in the expectation that the assistant would cover for them. Although the role of the assistant is to ensure safe landings with reduced effort and skill requirements, prolonged exposure may degrade piloting proficiency over extended use. However, the decrease in performance may also be due to participants being tested in a low-stakes environment, where the curiosity of exploring the limits and behavior of the assistant became more interesting than the task itself.

A limitation of the proposed work is the reliance on motion capture for pose estimation when flying in the GPS-denied environment. To extend the proposed approach to outdoor conditions, additional onboard sensing would be needed to fuse GPS, IMU and visual odometry sensor streams. Visual odometry information can be obtained using the Intel RealSense Tracking Camera T265, which has shown success in autonomous UAV landing tasks (Nogar, 2020; Wang et al., 2022).

Although the proposed approach has been fine-tuned for UAV landing tasks, its principles have the potential to transfer to alternative tasks. Three components are required for replication in alternative environments: (1) a perception module that perceives the environment by encoding information into a latent vector, achievable with any auto-encoder network architecture trained to reconstruct or segment images; (2) a parametric simulated user model, where the parameters account for adaptability towards the assistant's actions (\(\alpha \)), the proficiency of the simulated user in completing the task by itself (\(\beta \)) and the dynamics of the output control variables (\(\psi \) & \(\phi \)); and (3) a policy module trained using TD3 with the critic MDP formulation that includes the simulated user's objective, with reward terms focusing on task success (\(R_\textrm{LandingError}\), \(R_\textrm{SafePos}\), \(R_\textrm{HVel}\) & \(R_\textrm{VVel}\)) and agreement with the user (\(R_\textrm{ActionDiff}\)). A minimal sketch of the latter two components is given below.
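The sketch below illustrates the latter two components under stated assumptions: the dataclass fields mirror the (\(\alpha, \beta, \psi, \phi\)) parameterisation and the reward is a weighted sum of the named terms; the weights, field semantics and function signature are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class SimulatedUserParams:
    alpha: float  # adaptability towards the assistant's actions
    beta: float   # proficiency at completing the task unaided
    psi: float    # output control dynamics parameter (assumed role)
    phi: float    # output control dynamics parameter (assumed role)

def total_reward(r_landing_error: float, r_safe_pos: float, r_h_vel: float,
                 r_v_vel: float, r_action_diff: float,
                 weights=(1.0, 1.0, 1.0, 1.0, 1.0)) -> float:
    """Task-success terms plus an agreement-with-the-user term, linearly combined."""
    terms = (r_landing_error, r_safe_pos, r_h_vel, r_v_vel, r_action_diff)
    return sum(w * r for w, r in zip(weights, terms))
```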

7 Conclusion

In this work we propose a shared autonomy approach that assists pilots of all skill levels to safely land a UAV under conditions where depth perception is difficult. The assistant has no knowledge of the pilot's intent nor of the structure of the environment, and must disambiguate the task using observations of the pilot's actions and its immediate surroundings. The proposed approach comprises two fundamental components: (i) a perception module that encodes information about the structure of the environment from two RGB-D cameras and (ii) a policy network that provides control inputs to assist the pilot in safely landing the UAV.

A user study (\(n=28\)) was conducted to validate the assistant's performance, in which participants were instructed to land the drone on one of nine platforms in both the unassisted and assisted conditions. With the help of the assistant, participants' success rates increased significantly from 51.4% to 98.2%, and they completed the task more efficiently with respect to time taken and distance traveled. Regression analysis showed that, regardless of prior experience, the assistant allowed participants of all piloting proficiencies to perform equally with one another and better than the most proficient unassisted pilots.

Compared to the unassisted condition, participants perceived the assisted condition to require a lower task load and to be the more favorable approach according to the NASA TLX and final surveys. Participants stated that their most preferred aspect of the assistant was its non-invasive, subtle nature as well as the overall reduction in effort required to successfully complete the task. However, the assistant's subtleness was divisive: for some participants, the lack of communication and inability to give feedback about its intent led to uncertainty about what was expected of them. Further work is needed to address how an AI in a shared autonomy setup can effectively communicate its intent and provide feedback on the user's actions whilst maintaining intuitive, non-intrusive behavior.