Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation

We introduce Air Learning, an open-source simulator and gym environment for deep reinforcement learning research on resource-constrained aerial robots. Equipped with domain randomization, Air Learning exposes a UAV agent to a diverse set of challenging scenarios. We seed the toolset with point-to-point obstacle avoidance tasks in three different environments and with Deep Q Network (DQN) and Proximal Policy Optimization (PPO) trainers. Air Learning assesses the policies' performance under various quality-of-flight (QoF) metrics, such as the energy consumed, endurance, and the average trajectory length, on resource-constrained embedded platforms like a Raspberry Pi. We find that the trajectories on an embedded Ras-Pi are vastly different from those predicted on a high-end desktop system, resulting in up to 40% longer trajectories in one of the environments. To understand the source of such discrepancies, we use Air Learning to artificially degrade high-end desktop performance to mimic what happens on a low-end embedded system. We then propose a mitigation technique that uses hardware-in-the-loop to determine the latency distribution of running the policy on the target platform (the onboard compute on the aerial robot). A randomly sampled latency from this distribution is then added as an artificial delay within the training loop. Training the policy with artificial delays allows us to minimize the hardware gap (the discrepancy in the flight time metric is reduced from 37.73% to 0.5%). Thus, Air Learning with hardware-in-the-loop characterizes those differences and exposes how the choice of onboard compute affects the aerial robot's performance. We also conduct reliability studies to assess the effect of sensor failures on the learned policies. All put together, Air Learning enables a broad class of deep RL research on UAVs. The source code is available at: http://bit.ly/2JNAVb6.


I. INTRODUCTION
Deep Reinforcement Learning (DRL) has shown promising results in domains like sensorimotor control for cars [1], indoor robots [2], as well as UAVs [3], [4]. Deep RL's ability to adapt and learn with minimal a priori knowledge makes it attractive for use in complex systems [5].
Unmanned Aerial Vehicles (UAVs) serve as a great platform for advancing the state of the art in DRL research. UAVs have practical applications, such as search and rescue [6], package delivery [7], [8], and construction inspection [9]. Compared to other robots such as self-driving cars and robotic arms, they are vastly cheaper to prototype and build, which makes them truly scalable. Also, UAVs have fairly diverse control requirements. Targeting low-level UAV control (e.g., attitude control) requires continuous control (e.g., angular velocities), whereas targeting high-level tasks such as point-to-point navigation can use discrete control. Last but not least, at deployment time they must be fully autonomous systems, running on onboard computationally- and energy-constrained computing hardware.

Fig. 1: Aerial robotics is a cross-layer, interdisciplinary field. Designing an autonomous aerial robot to perform a task involves interactions between various boundaries, spanning from environment modeling down to the choice of hardware for the onboard compute.
But despite the promise of Deep RL, there are several practical challenges in adopting reinforcement learning for the UAV navigation task, as shown in Figure 1. Broadly, the problems can be grouped into four main categories: (1) environment simulator, (2) learning algorithms, (3) policy architecture, and (4) deployment on resource-constrained UAVs. To address these challenges, the boundaries between reinforcement learning algorithms, robotics control, and the underlying hardware must soften. The figure illustrates the cross-layer, interdisciplinary nature of the field, spanning from environment modeling to the underlying system. Each layer, in isolation, has a complex design space that needs to be explored for optimization. In addition, there are interactions across the layers that are also important to consider (e.g., policy size on a power-constrained mobile or embedded computing system). Hence, there is a need for a platform that can aid interdisciplinary research. More specifically, we need a research platform that can benchmark each of the layers individually (for depth), as well as end-to-end execution to capture the interactions across the layers (for breadth).
In this paper, we present Air Learning (Section IV), an open-source Deep RL research simulation suite and benchmark for autonomous UAVs. As a simulation suite of tools, Air Learning provides scalable and cost-effective infrastructure for applied reinforcement learning research. It augments existing frameworks such as AirSim [10] with capabilities that make it suitable for Deep RL experimentation. As a gym, Air Learning enables RL research for resource-constrained systems. Air Learning addresses the simulator-level challenge by providing domain randomization. We develop a configurable environment generator with a range of knobs to generate different environments with varying difficulty levels. The knobs (randomly) tune the number of static and dynamic obstacles, their speed (if relevant), texture and color, arena size, etc. In the context of our benchmark autonomous UAV navigation task, the knobs help the learning algorithms generalize well without overfitting to a specific instance of an environment. Air Learning addresses the learning challenges (RL algorithm, policy design, and reward optimization) by exposing the environment generator through an OpenAI gym [11] interface and integrating it with Baselines [12], which has high-quality implementations of state-of-the-art RL algorithms. We provide templates that researchers can use for building multi-modal input policies based on Keras/TensorFlow. And as a DRL benchmark, the OpenAI gym interface enables easy addition of new deep RL algorithms. At the time of writing, we provide two reinforcement learning algorithms: Deep Q-Networks (DQN) [13] and Proximal Policy Optimization (PPO) [14]. DQN is an off-policy, discrete-action RL algorithm, and PPO is an on-policy algorithm for continuous-action control of UAVs. Both come ready with curriculum learning [15] support.
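To make the gym-style workflow concrete, the following is a minimal sketch of a point-to-point navigation environment with a DQN-style discrete action space, driven by a random policy. The class name, dynamics, and reward values are illustrative stand-ins, not the actual Air Learning API; the real environment wraps AirSim and returns camera/IMU observations.

```python
import random

class PointToPointEnv:
    """Minimal gym-style sketch of an Air Learning-like navigation task.
    Names, dynamics, and rewards are illustrative, not the real API."""
    ACTIONS = ["forward", "backward", "left", "right"]  # discrete, DQN-style

    def __init__(self, arena=10, seed=0):
        self.arena = arena
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.pos = [0, 0]
        self.goal = [self.rng.randrange(self.arena),
                     self.rng.randrange(self.arena)]
        self.steps = 0
        return self._obs()

    def _obs(self):
        # Observation: relative goal vector (a real policy would also
        # consume depth images and IMU readings).
        return (self.goal[0] - self.pos[0], self.goal[1] - self.pos[1])

    def step(self, action):
        dx, dy = {"forward": (1, 0), "backward": (-1, 0),
                  "left": (0, -1), "right": (0, 1)}[self.ACTIONS[action]]
        self.pos[0] += dx
        self.pos[1] += dy
        self.steps += 1
        dist = abs(self.goal[0] - self.pos[0]) + abs(self.goal[1] - self.pos[1])
        done = dist == 0 or self.steps >= 100
        reward = 10.0 if dist == 0 else -0.1  # goal bonus vs. step penalty
        return self._obs(), reward, done, {}

# Roll out one episode with a random policy (a trainer such as a
# Baselines DQN would plug in here via the same reset/step interface).
env = PointToPointEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(env.rng.randrange(4))
```

The reset/step signature mirrors the OpenAI gym convention, which is what lets off-the-shelf trainers drive the environment without modification.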
To address the resource-constraint challenge early in the design and development of Deep RL algorithms and policies, Air Learning uses a "hardware-in-the-loop" (HIL) [16] method to enable robust hardware evaluation without risking a real UAV platform. Hardware-in-the-loop, which requires plugging the computing platform used in the UAV into the software simulation, is a form of real-time simulation that allows us to understand how the UAV responds to simulated stimuli on a target hardware platform. HIL simulation helps us quantify the real-time performance of reinforcement learning policies on various compute platforms, without risking experiments on real robot platforms before they are ready.
We use HIL simulation to understand how a policy performs on an embedded compute platform that might potentially be the onboard computer of the UAV. To enable systematic HIL evaluation, we use a variety of Quality-of-Flight (QoF) metrics, such as the total energy consumed by the UAV, the average length of its trajectory, and endurance, to compare different reinforcement learning policies. To demonstrate that Air Learning's HIL simulation is essential and that it can reveal interesting insights, we take the best-performing policy from our policy exploration stage, evaluate it on a resource-constrained, low-performance platform (Ras-Pi 4), and compare it with a high-performance desktop counterpart. The difference between the Ras-Pi 4 and Core-i9 based performance for the policy is startling: the Ras-Pi 4 sometimes takes trajectories that are nearly 40% longer in some environments. We investigate the reason for the difference in the policy's performance on the Ras-Pi 4 versus the Intel Core-i9 and show that the choice of onboard compute platform directly affects the policy processing latency, and hence the trajectory lengths. This discrepancy in policy behavior between training and deployment hardware is a challenge that must be taken into account when designing DRL algorithms for resource-constrained robots. We call this behavior the 'Hardware induced gap' because it stems from the performance gap between the training machine and the deployment machine. We use a variety of metrics to quantify the hardware gap, such as the percentage change in QoF metrics including flight time, success rate, energy of flight, and trajectory distance.
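The hardware-induced gap can be quantified as the percentage change in each QoF metric between the training machine and the deployment platform. A minimal sketch of that computation follows; the numbers are illustrative placeholders, not measurements from the paper's experiments.

```python
def hardware_gap(train_qof, deploy_qof):
    """Percentage change in each quality-of-flight metric between the
    training machine and the deployment (onboard) platform."""
    return {k: 100.0 * abs(deploy_qof[k] - train_qof[k]) / train_qof[k]
            for k in train_qof}

# Illustrative numbers only (not the paper's measured values).
desktop = {"flight_time_s": 20.0, "trajectory_m": 25.0, "energy_kj": 5.0}
raspi   = {"flight_time_s": 27.5, "trajectory_m": 35.0, "energy_kj": 5.8}
gap = hardware_gap(desktop, raspi)  # e.g. 40% longer trajectory
```

A per-metric breakdown like this is what lets one say, for example, that the gap in trajectory length was reduced from 16.03% to 1.37% after mitigation.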
In summary, we present an open-source gym environment and research platform for Deep RL research on autonomous aerial vehicles. The contributions within this context include:
• We present an open-source benchmark to develop and train different RL algorithms, policies, and reward optimizations using regular and curriculum learning.
• We present a UAV mapless navigation task benchmark for RL research on resource-constrained systems.
• We present a random environment generator for domain randomization to enable RL generalization.
• We introduce and demonstrate the 'Hardware induced gap': the policy's behavior depends on the computing platform it runs on, and the same policy can result in very different behavior if the target deployment platform differs significantly from the training platform.
• We describe the significance of taking energy consumption and the platform's processing capabilities into account when evaluating policy success rates.
• To alleviate the hardware-induced gap, we train a policy using HIL to match the target platform's latencies. Using this mitigation technique, we reduce the hardware gap between the training platform and the resource-constrained target platform from 38% to less than 0.5% on flight time, from 16.03% to 1.37% on trajectory length, and from 15.49% to 0.1% on the energy of flight metric.
Air Learning will be of interest to both the fundamental and the applied RL research communities. The point-to-point UAV navigation benchmark can yield progress on fundamental RL algorithm development for resource-constrained systems where training and deployment platforms differ. From that point of view, Air Learning is another OpenAI Gym environment. For applied RL researchers interested in RL applications in UAV domains such as source seeking, search and rescue, etc., Air Learning serves as a simulation platform and toolset for full-stack research and development.
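The HIL-based mitigation above can be sketched as follows: draw a latency from the distribution measured on the target platform and sleep for that long inside the training loop before the action takes effect, so the simulated world "moves on" as it would on the slower onboard computer. Function names and the latency values are illustrative assumptions.

```python
import random
import time

def sample_latency(latency_samples_ms, rng):
    """Draw one policy-inference latency from the HIL-measured samples."""
    return rng.choice(latency_samples_ms)

def train_step_with_delay(env_step, action, latency_samples_ms, rng):
    """Apply an artificial delay matching the target platform before the
    action takes effect, mimicking onboard inference latency during
    training."""
    delay_ms = sample_latency(latency_samples_ms, rng)
    time.sleep(delay_ms / 1000.0)  # artificial delay inside the training loop
    return env_step(action)

# Hypothetical latency samples for a Ras-Pi 4-class platform, and a stub
# environment step, purely for illustration.
hil_latencies_ms = [30, 35, 40, 55]
rng = random.Random(0)

def fake_env_step(action):
    return ("obs", -0.1, False, {})

obs, reward, done, info = train_step_with_delay(
    fake_env_step, 0, hil_latencies_ms, rng)
```

Because the delays are sampled per step rather than fixed, the policy is exposed to the whole latency distribution of the target platform, not just its mean.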

II. REAL WORLD CHALLENGES
We describe the real-world challenges associated with developing Deep RL algorithms for resource-constrained UAVs. We consolidate the challenges into four categories: environment simulator challenges, learning algorithm challenges, policy selection challenges, and hardware-level challenges.
Environment Simulator Challenges: The first challenge is that Deep RL algorithms targeted at robotics need a simulator. Collecting large amounts of real-world data is challenging because most commercial, off-the-shelf UAVs operate for less than 30 minutes. To put this into perspective, creating a dataset as large as the latest "ImageNet" by Tencent ML Images [17] would take close to 8000 flights (assuming a standard 30 FPS camera), making data collection a logistically challenging issue. Perhaps an even more critical and difficult aspect of this data collection is the need for negative experiences, such as obstacle collisions, which can severely drive up the cost and logistics of collecting data [4]. More importantly, it has been shown that a high-fidelity environment simulator with the ability to perform domain randomization aids the generalization of reinforcement learning algorithms [18]. Hence, any infrastructure for Deep RL must have features that address these challenges in order to deploy RL policies in real-world robotics applications.
Learning Algorithm Challenges: The second challenge is associated with the reinforcement learning algorithms themselves. Choosing the right variant of a reinforcement learning algorithm for a given task requires fairly exhaustive exploration. Furthermore, since the performance and efficiency of a particular reinforcement learning algorithm are greatly influenced by its reward function, getting good performance requires design exploration across reinforcement learning algorithms and their reward functions. Though these challenges are innate to the Deep RL domain, having the environment simulator exposed through a simple interface [11] allows us to efficiently automate RL algorithm selection, reward shaping, and hyperparameter tuning [19].
Policy Selection Challenges: The third challenge is associated with the selection of policies for robot control. Choosing the right policy architecture is a fairly exhaustive task. Depending on the available sensor suite on the robot, the policy can be uni-modal or multi-modal in nature. Also, for effective learning, the hyperparameters associated with the policy architecture have to be tuned appropriately. Hyperparameter tuning and policy architecture search are still active areas of research, which have led to techniques such as AutoML [20] for determining the optimal neural network architecture. In the context of DRL policy selection, having a standard machine learning back-end tool such as TensorFlow/Keras [21] allows DRL researchers (or roboticists) to automate the policy architecture search.
Hardware-Level Challenges: The fourth challenge concerns the deployment of Deep RL policies on resource-constrained UAVs. Since UAVs are mobile machines, they need to accomplish their tasks with a limited amount of onboard energy. Because onboard compute is a scarce resource and RL policies are computationally intensive, we need to carefully co-design the policies with the underlying hardware so that the compute platform can meet real-time requirements under power constraints. As UAV size decreases, the problem is exacerbated because battery capacity (i.e., size) decreases, which reduces the total onboard energy, even though the level of intelligence required remains the same. For instance, a nano-UAV such as a CrazyFlie [22] must have the same autonomous navigation capabilities as its larger mini counterpart (e.g., the DJI Mavic Pro [23]), while the CrazyFlie's onboard energy is 1/15th that of the Mavic Pro. Typically, in Deep RL research for robotics, the system and onboard computers are based on commercial off-the-shelf hardware platforms. However, whether the selection of these compute platforms is optimal is mostly unknown. Hence, having the ability to characterize the onboard computing platform early on can lead to resource-friendly DRL policies.
Air Learning is built with features to overcome the challenges listed above. Due to the interdisciplinary nature of the tool, it provides flexibility to researchers to focus on a given layer (e.g., policy architecture design) while also understanding its impact on the subsequent layer (e.g., hardware performance). In the next section, we describe the related work and list of features that Air Learning supports out of the box.

III. RELATED WORK
Related work on Deep RL toolsets and benchmarks can be divided into three categories. The first category includes environments for designing and benchmarking new DRL algorithms. The second category includes tools used specifically for DRL-based aerial robots. The third category includes other learning-based toolsets that support features important for Deep RL training. The feature list and a comparison of related work with Air Learning are tabulated in Table I.
Benchmarking Environments: The first category of related work includes benchmarking environments such as OpenAI Gym [11], Arcade Learning Environments [24], and MuJoCo [25]. These environments are simple by design and allow designing and benchmarking new Deep RL algorithms. However, using these environments for real-life applications such as robotics is challenging because they do not address the hardware-level challenges (Section II) of transferring trained RL policies to real robots. Air Learning addresses these limitations by introducing Hardware-in-the-Loop (HIL), which allows the end-user to benchmark and characterize RL policy performance on a given onboard computing platform.
UAV-Specific Deep RL Benchmarks: The second category of related work includes benchmarks that focus on UAVs. For example, AirSim [10] provides high-fidelity simulation and dynamics for UAVs in the form of a plugin that can be imported into any UE4 (Unreal Engine 4) [26] project. However, there are three AirSim limitations that Air Learning addresses. First, the generation of environments, including domain randomization for the UAV task, is left to the end-user to either develop or source from the UE4 marketplace. Domain randomization [18] is critical for the generalization of learning algorithms, and we address this limitation in AirSim using the Air Learning environment generator.
Second, AirSim does not model UAV energy consumption. Energy is a scarce resource on UAVs that affects overall mission capability; hence, learning algorithms need to be evaluated for energy efficiency. Air Learning uses an energy model [27] within AirSim to evaluate learned policies. Air Learning also allows studying the impact of the performance of the onboard compute platform on the overall energy of UAVs, allowing us to estimate in simulation how many missions a UAV can complete, without flying real experiments.
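The accounting structure behind such mission-level estimates can be sketched as follows. The actual energy model [27] is more detailed (it accounts for velocity, acceleration, and payload); this sketch only integrates an assumed constant power over a trajectory and divides the battery capacity by the per-mission energy. All numbers below are hypothetical.

```python
def step_energy_j(power_w, dt_s):
    """Energy consumed in one simulation step: E = P * dt (joules)."""
    return power_w * dt_s

def mission_capacity(battery_wh, avg_mission_energy_wh):
    """How many missions a given battery could support, estimated in
    simulation rather than with real flights."""
    return int(battery_wh // avg_mission_energy_wh)

# Hypothetical trajectory: 300 steps of 0.1 s at a constant 150 W draw.
trajectory = [(150.0, 0.1)] * 300
total_j = sum(step_energy_j(p, dt) for p, dt in trajectory)
total_wh = total_j / 3600.0            # joules -> watt-hours
missions = mission_capacity(battery_wh=43.6, avg_mission_energy_wh=total_wh)
```

A slower onboard computer that lengthens trajectories would raise the per-mission energy and shrink the mission count, which is exactly the system-level effect Air Learning aims to surface.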
Third, AirSim does not offer interfaces to OpenAI gym or other reinforcement learning frameworks such as Stable Baselines [12]. We address this drawback by exposing the Air Learning random environment generator through an OpenAI gym interface and integrating it with high-quality implementations of reinforcement learning algorithms available in frameworks such as Baselines [12] and Keras-RL [28]. Using Air Learning, we can quickly explore and evaluate different RL algorithms for various UAV tasks.
Another related work that uses a simulator and an OpenAI gym interface in the context of UAVs is GYMFC [31]. GYMFC uses the Gazebo [32] simulator and OpenAI gym interfaces for training an attitude controller for UAVs using reinforcement learning. That work primarily focuses on replacing the conventional flight controller with a real-time controller based on a neural network, which is a highly specific, low-level task. We focus on high-level tasks, such as point-to-point UAV navigation in an environment with static and dynamic obstacles, and we provide the necessary infrastructure to carry out research on on-edge autonomous navigation in UAVs. Adapting GYMFC to support a high-level task such as navigation would involve overcoming the limitations of Gazebo, specifically in the context of photorealism. Indeed, one of the motivations for building AirSim was to overcome the limitations of Gazebo by using state-of-the-art rendering techniques for modeling the environment, achieved using robust game engines such as Unreal Engine 4 [26] and Unity [33].
UAV-Agnostic Deep RL Benchmarks: The third category of related work includes Deep RL benchmarks used for other robot tasks, such as grasping by a robotic arm or self-driving cars. These works are highly relevant to Air Learning because they contain essential features that improve the utility and performance of Deep RL algorithms.
The most prominent work in learning-based approaches for self-driving cars is CARLA [34]. It supports a photorealistic environment built on top of a game engine. It also exposes the environment through an OpenAI gym interface, which allows researchers to experiment with different Deep RL algorithms. The physics is based on the game engine, and CARLA does not model energy or focus on compute hardware performance. Since CARLA was built explicitly for self-driving cars, porting these features to UAVs would require significant engineering effort.
For the robotic arm grasping/manipulation task, prior works [35], [36], [29], [37] include infrastructure support to train and deploy Deep RL algorithms on these robots. In [38], the authors introduce collective learning, providing distributed infrastructure to collect large amounts of data with real platform experiments. They introduce an asynchronous variant of guided policy search to maximize utilization (compute and synchronization between different agents), where each agent trains a local policy while a single global policy is trained on the data collected from the individual agents. However, these kinds of robots are fixed in place; hence, they are limited neither by energy nor by onboard compute capability. An inability to process or compute the policy's outcome in real time only slows down the grasping rate; it does not cause instability. In UAVs, which have a higher control loop rate, uncertainty due to slow processing latency can cause fatal crashes [39], [40].
For mobile robots with or without grasping, such as LoCoBot [41], PyRobot [42] and ROBEL [29] provide open-source tools and benchmarks for training and deploying Deep RL policies on the LoCoBot. The simulation infrastructure is based on Gazebo or MuJoCo, and hence it lacks photorealism in the environment and other domain randomization features. Similar to CARLA and the robot grasping benchmarks, PyRobot does not model energy or focus on computing hardware performance.
In softlearning [43], the authors apply a soft actor-critic algorithm to a quadrupedal robot. They use an Nvidia TX2 on the robot for data collection and for running the policy. The collected data is used to train a global policy, which is then periodically pushed back to the robot. In contrast, in our work we show that training a policy on a high-end machine can result in a performance discrepancy on the aerial robot platform. Aerial robots are much more complex to control and more unstable than ground-based quadrupedal robots, so small differences in processing time can hinder their safety. We propose training a policy using the HIL technique with the target platform's latency distribution to mitigate the difference.
Effect of Action Time in RL Agents: Prior works [44], [45] have studied the relationship between decision-making time (i.e., the time taken to decide an action) and task performance in RL agents. The authors propose a new "reactive SARSA" algorithm that orders computational components, without affecting training convergence, to make decision making faster. In Air Learning, we expose a similar effect, where differences between training hardware (high-end CPU/GPU) and deployment hardware (embedded CPUs) can result in entirely different agent behavior. To that end, we propose a novel action scaling technique based on hardware-in-the-loop that minimizes the differences between training and deployment of the agent on resource-constrained hardware. Unlike "reactive SARSA" [44], we do not make any changes to the RL algorithm.

Fig. 2: An overview of the Air Learning toolset. First, it has a configurable random environment generator built on top of UE4, a photo-realistic game engine that can be used to create a variety of different randomized environments. Second, the random environment generator is integrated with AirSim, OpenAI gym, and baselines for agile development and prototyping of different state-of-the-art reinforcement learning algorithms and policies for autonomous aerial vehicles. Third, its back end uses tools like Keras/TensorFlow that allow the design and exploration of different policies. Lastly, Air Learning uses the "hardware in the loop" methodology for characterizing the performance of the learned policies on real embedded hardware platforms. In short, it is an interdisciplinary tool that allows researchers to work from algorithm to hardware with the intent of enabling intra- and inter-layer understanding of execution. It also outputs a set of "Quality-of-Flight" metrics to understand execution.
Another related work [46] studies the impact of delays in action time in the context of a robotic arm. The authors use the previously computed action until a new action is available. We study the same problem in aerial robots, where we show that differences between training and deployment hardware are another, often overlooked, source of processing delays. Since drones are deployed in more dynamic environments, delayed actions reduce a drone's reactivity and can severely hinder its safety. To mitigate this performance gap (the hardware gap), we use the HIL methodology to model the target hardware's delays and use them when training the policy.
In summary, Air Learning provides an open-source toolset and benchmark loaded with features for developing Deep RL based applications for UAVs. It helps design effective policies and characterize them on an onboard computer using the HIL methodology and quality-of-flight metrics. With that in mind, it becomes possible to optimize algorithms for UAVs, treating the entire UAV and its operation as a system.

IV. AIR LEARNING
In this section, we describe the various Air Learning components. The different stages are shown in Figure 2, which allows researchers to develop and benchmark learning algorithms for autonomous UAVs. Air Learning consists of six key components: an environment generator, an algorithm exploration framework, a closed-loop, real-time hardware-in-the-loop setup, an energy and power model for UAVs, quality-of-flight metrics that are conscious of the UAV's resource constraints, and a runtime system that orchestrates all of these components. By using these components in unison, Air Learning allows us to carefully fine-tune algorithms for the underlying hardware.

A. Environment Generator
Learning algorithms are data hungry, and the availability of high-quality data is vital for the learning process. An environment that is good to learn from should also include scenarios that are challenging for the robot; by training on such challenging situations, the robot learns to solve them. For instance, to teach a robot to navigate around obstacles, the training data should contain a wide variety of obstacles (materials, textures, speeds, etc.).
We designed an environment generator specifically targeted at autonomous UAVs. Air Learning's environment generator creates high-fidelity, photo-realistic environments for the UAVs to fly in. The environment generator is built on top of UE4 and uses the AirSim UE4 [10] plugin for the UAV model and flight physics. The environment generator with the AirSim plugin is exposed as an OpenAI gym interface. The environment generator has different configuration knobs for generating challenging environments. The knobs available in the current version fall into two categories: parameters that can be controlled via a game configuration file, and parameters that must be controlled outside the game configuration file. The full list of controllable parameters is tabulated in Table II. For more information on these parameters, please refer to the appendix.
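As an illustration of the first category of knobs, a game configuration file might be generated as below. The knob names here are hypothetical stand-ins; the actual parameter list is given in Table II.

```python
import json

# Hypothetical knob names for illustration; see Table II for the real list.
env_config = {
    "ArenaSize": [50, 50, 20],              # arena dimensions in metres
    "NumStaticObstacles": 10,
    "NumDynamicObstacles": 3,
    "DynamicObstacleVelocity": [1.0, 5.0],  # sampled range, m/s
    "RandomizeTextures": True,              # domain randomization knobs
    "RandomizeColors": True,
    "Seed": 42,
}

# The game reads this file at startup to build a randomized environment.
with open("environment_config.json", "w") as f:
    json.dump(env_config, f, indent=2)
```

Regenerating the file with a new seed between episodes is one way domain randomization can be driven from outside the game engine.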

B. Algorithm Exploration
Deep reinforcement learning is still a nascent field that is rapidly evolving, so there is significant infrastructure overhead in integrating a random environment generator and evaluating new deep reinforcement learning algorithms for UAVs.
So, we expose our random environment generator and the AirSim UE4 plugin as an OpenAI gym interface and integrate it with the popular reinforcement learning framework Stable Baselines [12], which is based on OpenAI Baselines. To expose our random environment generator as an OpenAI gym interface, we extend the work of AirGym [47] to add support for environment randomization, a wide range of sensors from AirSim (depth image, Inertial Measurement Unit (IMU) data, RGB image, etc.), and support for exploring multi-modal policies.
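A sketch of how the multi-modal sensor streams might be bundled into one observation for such a policy follows. The function names, normalization range, and modality layout are illustrative assumptions; the real interface returns AirSim sensor data.

```python
def normalize_depth(depth, max_range_m=100.0):
    """Clip raw depth readings and scale them to [0, 1] for policy input."""
    return [min(max(d, 0.0), max_range_m) / max_range_m for d in depth]

def make_observation(depth, imu, position, goal):
    """Combine depth, IMU, and the relative goal vector into one
    multi-modal observation, keyed by modality."""
    rel_goal = [g - p for g, p in zip(goal, position)]
    return {"depth": normalize_depth(depth), "imu": list(imu),
            "goal": rel_goal}

# Illustrative readings: three depth samples, one IMU tuple, goal offset.
obs = make_observation(depth=[0.0, 50.0, 200.0],
                       imu=(0.1, 0.0, 9.8),
                       position=[0, 0, 0],
                       goal=[5, 3, 1])
```

Keeping each modality under its own key lets a multi-modal policy route depth through convolutional layers and the low-dimensional IMU/goal inputs through dense layers before fusing them.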
We seed the Air Learning algorithm suite with two popular and commonly used reinforcement learning algorithms: Deep Q Network (DQN) [13] and Proximal Policy Optimization (PPO) [14]. DQN is a discrete-action algorithm, where the action space consists of high-level commands ('move forward', 'move left', etc.), whereas PPO is a continuous-action algorithm (e.g., the policy predicts continuous values of the velocity vector). For each algorithm variant, we also support an option to train the agent using curriculum learning [15]. For both algorithms, we keep the observation space, policy architecture, and reward structure the same and compare agent performance. The environment configuration used in training PPO/DQN, the policy architecture, and the reward function are described in the appendix (Appendix B).

Fig. 4: Normalized reward during training for algorithm exploration between PPO-C and DQN-C. We find that the DQN agent performs better than the PPO agent irrespective of whether the agent was trained using curriculum learning or non-curriculum learning. The rewards are averaged over five runs with random seeds.

Figure 4a shows the normalized reward of the DQN agent (DQN-NC) and the PPO agent (PPO-NC) trained using non-curriculum learning. One observation is that the PPO agent trained using non-curriculum learning consistently accrues negative reward throughout the training duration. In contrast, the DQN agent trained using non-curriculum learning starts at the same reward as the PPO agent, but begins to accrue more reward from the 2000th episode. Figure 4b shows the normalized episodic reward for the DQN (DQN-C) and PPO (PPO-C) agents trained using curriculum learning. We observe a similar trend to the agents trained using non-curriculum learning, where the DQN agent outperforms the PPO agent. However, in this case, the PPO agent has a positive total reward.
The DQN agent, however, starts to accrue more reward from the 1000th episode. The slight dip in reward at the 3800th episode is due to the change in curriculum (increased difficulty).
Reflecting on the results gathered in Figure 4a and Figure 4b: continuous-action reinforcement learning algorithms such as PPO have generally shown promising results for low-level flight controller tasks used for stabilizing UAVs [48]. However, as our results indicate, applying these algorithms to a complex task, such as end-to-end navigation in a photo-realistic simulator, can be challenging for a couple of reasons.
First, we believe that the action space for the PPO agent limits exploration compared to the DQN agent. For the PPO agent, the action space consists of the components of the velocity vector, vx and vy, whose values can vary within [-5 m/s, 5 m/s].
Having such an action space can be a constraining factor for PPO. For instance, if the agent observes an obstacle in front, it needs to take an action that moves it right or left. For the PPO agent, since the action space consists of continuous values of [vx, vy], to move purely forward in the x-direction, vx can be any positive number while the vy component has to be exactly '0'. It can be quite challenging for the PPO agent (or any continuous-action algorithm) to learn this behavior, and it might require a much more sophisticated reward function that identifies these scenarios and rewards or penalizes them accordingly. In contrast, the action space for the DQN agent is much simpler, since it only has to yaw (i.e., turn left or right) and then move forward, or vice versa.
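The asymmetry between the two action spaces can be made concrete with a small sketch: a discrete action is a yaw-bucket index that always maps to a valid fixed-speed motion, whereas a continuous action must emit (and the policy must learn) a vy of exactly zero to fly straight. The mapping below is illustrative, not Air Learning's exact action set.

```python
import math

def discrete_to_velocity(action_index, speed=2.5, yaw_steps=5):
    """DQN-style discrete action: pick one of `yaw_steps` headings, then
    move forward at a fixed speed. Every index yields a valid motion."""
    angle = (action_index - yaw_steps // 2) * (math.pi / yaw_steps)
    return (speed * math.cos(angle), speed * math.sin(angle))

def clip_continuous(vx, vy, vmax=5.0):
    """PPO-style continuous action: the policy emits (vx, vy) directly,
    clipped to the [-5, 5] m/s range used in our experiments."""
    clamp = lambda v: max(-vmax, min(vmax, v))
    return (clamp(vx), clamp(vy))
```

With the discrete mapping, "fly straight ahead" is a single index (the center yaw bucket); with the continuous mapping, the policy must learn to output vy = 0 exactly, which is a measure-zero target in a continuous range.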
Second, in our evaluation, we keep the reward function, input observations, and policy architecture the same for the DQN and PPO agents. We fix these because we want to focus on showcasing the capability of the Air Learning infrastructure. Since RL algorithms are sensitive to hyperparameters and to the choice of reward function, it is possible that our reward function and policy architecture inadvertently favored the DQN agent over the PPO agent. The sensitivity of RL algorithms to the policy and reward is still an open research problem [49], [50].
The takeaway is that Air Learning enables algorithm exploration studies. For a high-level task like point-to-point navigation, discrete-action reinforcement learning algorithms like DQN allow more flexibility than continuous-action algorithms like PPO. We also demonstrate that incorporating techniques such as curriculum learning can benefit overall learning.

C. Policy Exploration
Another essential aspect of deep reinforcement learning is the policy, which determines the best action to take given a particular state so as to maximize the reward. A neural network approximates the policy. To assist researchers in exploring effective policies, we use Keras/TensorFlow [51] as the machine learning back-end. Later on, we demonstrate how one can perform algorithm and policy exploration for tasks like autonomous navigation, though Air Learning is by no means limited to this task alone.

D. Hardware Exploration
Aerial roboticists often port algorithms onto UAVs to validate their functionality. These UAVs can be custom built [52] or commercially available off-the-shelf (COTS) UAVs [53], [54], but they mostly have fixed hardware serving as onboard compute. A critical shortcoming of this approach is that the roboticist cannot experiment with hardware changes. More powerful hardware may (or may not) unlock additional capabilities during flight, but there is no way to know until the hardware is available on a real UAV and the roboticist can physically experiment with the platform.
Reasons for wanting such exploration include understanding the computational requirements of the system, quantifying the energy consumption implications arising from interactions between the algorithm and the hardware, and so forth. Such evaluation is crucial to determine whether an algorithm is, in fact, feasible when ported to a real UAV with a specific hardware configuration and battery constraints.
For instance, a Parrot Bebop [55] comes with a P7 dual-core CPU Cortex A9 and a quad-core GPU. It is not possible to fly the UAV assuming a different piece of hardware, such as the NVIDIA Xavier [56], which is significantly more powerful; at the time of this writing, no COTS UAV contains the Xavier platform. So one would have to wait until a commercially viable platform is available. Using Air Learning, however, one can experiment with how the UAV would behave with a Xavier, since the UAV flies virtually.
Hardware exploration in Air Learning allows evaluation of the best reinforcement learning algorithm and its policy on different hardware, without being limited by the onboard compute available on the real robot. Once the best algorithm and policy are determined, Air Learning allows characterizing their performance on different types of hardware platforms. It also enables researchers to carefully fine-tune and co-design algorithms and policies while being mindful of the resource constraints and other limitations of the hardware.
A HIL simulation combines the benefits of the real design and the simulation by allowing them to interact with one another, as shown in Figure 5. There are three core components in Air Learning's HIL methodology: (1) a high-end desktop that simulates a virtual environment in which the UAV flies (top); (2) an embedded system that runs the operating system, the deep reinforcement learning algorithms, policies, and associated software stack (left); and (3) a flight controller that controls the flight of the UAV in the simulated environment (right).
The simulated environment models the various sensors (RGB/depth cameras), actuators (rotors), and the physical world surrounding the agent (obstacles). This data is fed into the reinforcement learning algorithms running on the embedded companion computer, which processes the input and outputs flight commands to the flight controller. The controller then relays those commands to the virtual UAV flying inside the simulated game environment.
The interaction between the three components is what allows us to evaluate the algorithms and policy on various embedded computing platforms. The HIL setup we present allows for the swap-ability of the embedded platform under test. The methodology enables us to effectively measure both the performance and energy of the agent holistically and more accurately, since one can evaluate how well an algorithm performs on a variety of different platforms.
In our evaluation, which we discuss later, we use a Raspberry Pi (Ras-Pi 4) as the embedded hardware platform to evaluate the best performing deep reinforcement learning algorithm and its associated policy. The HIL setup includes running the environment generator on a high-end desktop with a GPU. The reinforcement learning algorithm and its associated policy run on the Ras-Pi 4. The state information (depth image, RGB image, IMU) is requested by the Ras-Pi 4 using AirSim plugin APIs, which involves an RPC (remote procedure call) over a TCP/IP network (the high-end desktop and the Ras-Pi 4 are connected by Ethernet). The policy computes actions based on the state information it receives from the high-end desktop. The actions are relayed back to the high-end desktop through the AirSim flight-controller APIs.
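The HIL loop described above can be sketched as follows. `StubClient` stands in for the real AirSim RPC client (in practice, `airsim.MultirotorClient` over TCP/IP); its method names and the placeholder policy are illustrative assumptions, shown only to make the t1/t2/t3 structure of each step explicit.

```python
import time

# Minimal sketch of Air Learning's HIL step: the embedded board (SUT) fetches
# state from the simulation server over RPC, evaluates the policy locally,
# and sends actions back to the flight controller.
class StubClient:
    """Stand-in for the AirSim RPC client; method names are illustrative."""
    def get_state(self):
        # Real setup: depth image, RGB image, and IMU fetched over TCP/IP.
        return {"depth": [[0.0] * 4] * 4, "imu": (0.0, 0.0, 0.0)}

    def send_action(self, action):
        self.last_action = action  # real setup: flight-controller API call

def hil_step(client, policy):
    t0 = time.monotonic()
    state = client.get_state()      # t1: state-fetch latency
    action = policy(state)          # t2: on-device policy evaluation
    client.send_action(action)      # t3: actuation for a fixed duration
    return action, time.monotonic() - t0

# Trivial placeholder policy: always command forward flight.
forward_policy = lambda state: (1.0, 0.0)
```

Because the SUT is swappable, running this same loop on an Intel Core-i9 versus a Ras-Pi 4 isolates the effect of the t2 latency on flight behavior.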

E. Energy Model in AirSim Plugin
In Air Learning, we use the energy simulator we developed in our prior work [27]. The AirSim plugin is extended with a battery and an energy model. The energy model is a function of the UAV's velocity and acceleration. The values of velocity and acceleration are sampled continuously, and from these we estimate power as proposed in prior work [57]. The power is calculated using the following formula:

P = [β1 β2 β3] · [‖v_xy‖, ‖a_xy‖, ‖v_xy‖·‖a_xy‖]^T + [β4 β5 β6] · [‖v_z‖, ‖a_z‖, ‖v_z‖·‖a_z‖]^T + [β7 β8 β9] · [m, v_xy·w_xy, 1]^T    (1)

In Eq. 1, v_xy and a_xy are the velocity and acceleration in the horizontal direction, and v_z and a_z denote the velocity and acceleration in the z direction. w_xy is the wind speed in the horizontal direction, and m denotes the mass of the payload. β1 to β9 are coefficients that depend on the model of the UAV used in the simulation. For energy, we use the coulomb-counter technique described in prior work [58]: the simulator integrates the total charge (in coulombs) drawn from the battery over every cycle. Using the energy model, Air Learning allows us to monitor energy continuously during training or during evaluation of the reinforcement learning algorithm.
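A minimal sketch of this power model and the coulomb-counting integration is below. The β coefficients are placeholders (all 1.0), not fitted values for any particular UAV, and the wind-interaction term (β8) is omitted since wind is not simulated in this sketch.

```python
# Hedged sketch of the Eq. 1 power model plus coulomb-counting energy
# estimation. BETA values are placeholders, not real fitted coefficients.
BETA = [1.0] * 9  # beta_1 ... beta_9, UAV-model dependent

def power(v_xy, a_xy, v_z, a_z, m, beta=BETA):
    """v_xy, a_xy: horizontal speed/accel magnitudes; v_z, a_z: vertical."""
    b1, b2, b3, b4, b5, b6, b7, b8, b9 = beta
    horizontal = b1 * v_xy + b2 * a_xy + b3 * v_xy * a_xy
    vertical = b4 * v_z + b5 * a_z + b6 * v_z * a_z
    payload = b7 * m + b9  # b8 (wind-interaction) term omitted in this sketch
    return horizontal + vertical + payload

def coulombs(samples, voltage, dt):
    """Coulomb counting: integrate current (P/V) over fixed time steps."""
    return sum(power(*s) / voltage * dt for s in samples)
```

Sampling (velocity, acceleration) each simulation tick and feeding it through `coulombs` gives a running charge estimate against the battery capacity.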

F. Quality of Flight Metrics
Reinforcement learning algorithms are often evaluated based on success rate, where success is defined as whether the algorithm completed the mission. This metric captures only the functionality of the algorithm and grossly ignores how well it performs in the real world. In the real world, a UAV faces additional constraints, such as limited onboard compute capability and battery capacity.
Hence, we need additional metrics that quantify the performance of learning algorithms more holistically. To this end, Air Learning introduces Quality-of-Flight (QoF) metrics that capture not only the functionality of the algorithm but also how well it performs when ported to onboard compute in real UAVs. For instance, algorithms and policies are only useful if they accomplish their goals within the finite energy available on the UAV. Hence, algorithms and policies need to be evaluated on metrics that describe the quality of flight, such as mission time and distance flown. In the first version of Air Learning, we consider the following metrics.
Success Rate: The percentage of episodes in which the UAV reaches the goal without colliding or running out of battery. Ideally, this number is close to 100%, as it reflects the algorithm's functionality while taking resource constraints into account.
Time to Completion: The total time the UAV spends finishing a mission within the simulated world.
Energy Consumed: The total energy spent while carrying out the mission. The limited battery available onboard constrains the mission time. Hence, monitoring energy usage is of utmost importance for autonomous aerial vehicles, and it should be a measure of a policy's efficiency.
Distance Traveled: The total distance flown while carrying out the mission. This metric is the average trajectory length and can be used to measure how efficiently the policy flies.
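Aggregating these QoF metrics over a batch of evaluation episodes is straightforward; the sketch below shows one way to do it. The episode field names are illustrative assumptions, not Air Learning's actual log format.

```python
# Hedged sketch: summarizing Quality-of-Flight metrics over evaluation
# episodes. Field names ('success', 'time_s', ...) are illustrative.
def qof_summary(episodes):
    """episodes: list of dicts with 'success', 'time_s', 'energy_j', 'dist_m'."""
    n = len(episodes)
    return {
        "success_rate": 100.0 * sum(e["success"] for e in episodes) / n,
        "avg_time_s": sum(e["time_s"] for e in episodes) / n,
        "avg_energy_j": sum(e["energy_j"] for e in episodes) / n,
        "avg_dist_m": sum(e["dist_m"] for e in episodes) / n,
    }
```

Reporting all four numbers side by side is what exposes the hardware gap later: two platforms can have similar success rates yet very different flight times and energy totals.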

G. Runtime System
The final part is the runtime system that orchestrates the overall execution. The runtime system starts the game engine with the correct environment configuration before the agent starts. It monitors the episodic progress of the reinforcement learning algorithm and ensures that, before a new episode starts, the different parameters are randomized, so the agent statistically gets a new environment. It also has resiliency built in to resume training in case any component (for example, the UE4 engine) crashes.
In summary, using the Air Learning environment generator, researchers can develop various challenging scenarios to design better learning algorithms. Using Air Learning's interfaces to OpenAI Gym, stable-baselines, and the TensorFlow backend, they can rapidly evaluate different reinforcement learning algorithms and their associated policies. Using the Air Learning HIL methodology and QoF metrics, they can benchmark the performance of learning algorithms and policies on resource-constrained onboard compute platforms.

V. EXPERIMENTAL EVALUATION PRELUDE
The next few sections demonstrate Air Learning's value through focused studies. As a prelude, this section presents the highlights so the reader can focus on the big picture.
Policy Evaluation (Section VI): We show how Air Learning can be used to explore different reinforcement-learning-based policies. We use the best algorithm determined during the algorithm exploration step to explore the best policy. In this work, we use the Air Learning environment generator to generate three environments, namely No Obstacles, Static Obstacles, and Dynamic Obstacles. These three environments create varying levels of difficulty for the autonomous navigation task by changing the number of static and dynamic obstacles. We also show how Air Learning allows end users to benchmark policies, using two examples.

(Figure caption: In curriculum learning, we split the arena into virtual partitions; the end goal is placed within a specific zone and gradually moved to a higher zone once the agent succeeds in more than 50% of the latest 1000 episodes.)

In the first example,
we show how well the policies trained in one environment generalize to the other environments. In the second example, we show which of the sensor inputs the policy is most sensitive to. This insight can be used when designing the network architecture of the policy. For instance, we show that the image input has the highest sensitivity among the inputs; hence, a future iteration of the policy can dedicate more feature extractors (increasing the depth of filters) to the image input.
System Evaluation (Section VII): We show the importance of benchmarking algorithm performance on resource-constrained hardware such as a typical UAV compute platform. In this work, we use a Raspberry Pi 4 (Ras-Pi 4) as an example of resource-constrained hardware. We use the best policies determined in the policy exploration step (Section VI) to compare performance between an Intel Core-i9 and a Ras-Pi 4 using HIL and the QoF metrics available in Air Learning. We also artificially degrade the performance of the Intel Core-i9 to show how compute performance can affect the behavior of a policy when it is ported to a real aerial robot.
In summary, using these focused studies, we demonstrate how Air Learning can be used by researchers to design and benchmark algorithm-hardware interactions in autonomous aerial vehicles, as shown previously in Figure 2.

VI. POLICY EXPLORATION
In this section, we perform policy exploration for the DQN agent with curriculum learning [15]. The policy exploration phase aims to determine the best neural network policy architecture for each of the tasks (i.e., autonomous navigation) in different environments with and without obstacles.
We start with a basic template architecture, as shown in Figure 7. The architecture is multi-modal and takes a depth image, velocity, and position data as its input. Using this template, we sweep two parameters, namely # Layers and # Filters (making the policy wider and deeper). To simplify the search, for convolution layers, we restrict filter sizes to 3 x 3 with stride 1. This choice ensures that there is no loss of pixel information. Likewise, for fully-connected layers, the # Filters parameter denotes the number of hidden neurons in that layer. Using the # Layers and # Filters parameters to control both the convolution and fully-connected layers keeps the search over the large NN hyperparameter design space manageable.
The # Layers and # Filters parameters and the template policy architecture can be used to construct a variety of different policies. For example, the tuple (# Filters = 32, # Layers = 5) results in a policy architecture with five convolution layers of 32 filters (3 x 3) each, followed by five fully-connected layers with 32 hidden neurons each. For each of the navigation tasks (in different environments), we sweep the template parameters (# Layers and # Filters) to explore multiple policy architectures for the DQN agent.
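The expansion of a (# Filters, # Layers) tuple into a concrete architecture can be sketched as below. The spec is returned as plain dicts rather than Keras layers so the structure is easy to inspect; the dict keys are illustrative, not Air Learning's actual configuration format.

```python
# Hedged sketch: expand a (# Filters, # Layers) tuple into the template
# architecture: N conv layers (3x3, stride 1) followed by N dense layers
# of the same width. Dict keys are illustrative.
def build_policy_spec(num_filters, num_layers):
    conv = [{"type": "conv2d", "filters": num_filters,
             "kernel": (3, 3), "stride": 1} for _ in range(num_layers)]
    dense = [{"type": "dense", "units": num_filters}
             for _ in range(num_layers)]
    return conv + dense

# Example: the (32, 5) point in the sweep -> 5 conv + 5 dense layers.
spec = build_policy_spec(32, 5)
```

Sweeping `num_filters` over {32, 48, 64} and `num_layers` over the per-environment ranges reproduces the grid of candidate policies evaluated in the next subsection.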

A. Training and Testing Methodology
The training and testing methodology for the DQN agent running in the different environments is described below.
Environments: For the point-to-point autonomous navigation task for UAVs, we create three randomly generated environments, namely No Obstacles, Static Obstacles, and Dynamic Obstacles, with varying levels of static and dynamic obstacles. The environment size for all three levels is 50 m x 50 m. For the No Obstacles environment, there are no obstacles in the main arena, but the goal position is changed every episode. For Static Obstacles, the number of obstacles varies from five to ten and is changed every four episodes; the end goal and the positions of the obstacles are changed every episode. For Dynamic Obstacles, along with five static obstacles, we introduce up to five dynamic obstacles whose velocities range from 1 m/s to 2.5 m/s. The obstacles and goals are placed in random locations every episode to ensure that the policy does not over-fit.
Training Methodology: We train the DQN agent using curriculum learning in the environments described above. We use the same methodology described in Appendix B, where we checkpoint policy in each zone for the three environments. The hardware used in training is an Intel Core-i9 CPU with an Nvidia GTX 2080-TI GPU.
Testing Methodology: For testing the policies, we evaluate the checkpoints saved in the final zone. Each policy is evaluated on 100 randomly generated goal/obstacle configurations (controlled by the 'Seed' parameter in Table II). The same 100 randomly generated environment configurations are used across different policy evaluations. The hardware we use for testing the policies is the same as the hardware used for training them (Intel Core-i9 with Nvidia GTX 2080-TI).

B. Policy Selection
The policy architecture search for No Obstacles, Static Obstacles, and Dynamic Obstacles is shown in Figure 8. Figure 8a, Figure 8b, and Figure 8c show the success rate of the different policy architectures searched for the DQN agent trained using curriculum learning on the No Obstacles, Static Obstacles, and Dynamic Obstacles environments, respectively. In the figures, the x-axis corresponds to the # Filters values (32, 48, or 64), and the y-axis corresponds to # Layers (2, 3, 4, 5, and 6) for the No Obstacles/Static Obstacles environments and # Layers (5, 6, 7, 8, 9) for the Dynamic Obstacles environment. We sweep different (larger) policies because Dynamic Obstacles is a harder task, and a deeper policy might improve the success rate compared to a shallow policy. Each cell corresponds to a unique policy architecture based on the template defined in Figure 7. The value in each cell is the success rate for that policy architecture, and the ± denotes the standard deviation (error bounds) across five seeds. For instance, in Figure 8a, the policy architecture with # Filters of 32 and # Layers of 2 achieves a 72% success rate, with a standard deviation of ±8% across five seeds. For evaluation, we always choose the best performing policy (i.e., the policy that achieves the best success rate).
Based on the policy architecture search, we notice that as the task complexity increases (obstacle density increases), a larger policy improves the task success rate. For instance, in the No Obstacles case (Figure 8a), the policy with # Filters of 32 and # Layers of 5 achieves the highest success rate of 91%. Even though we name the environment No Obstacles, the UAV agent can still collide with the arena walls, which lowers the success rate. For Static Obstacles case (Figure 8b), the policy with # Filters of 48 and # Layers of 4 achieves the best success rate of 84%. Likewise, for Dynamic Obstacles case (Figure 8c), the policy architecture with # Filters of 32 and # Layers of 7 achieves the best success rate of 61%. The success rate loss in Static Obstacles and Dynamic Obstacles cases can be attributed to an increase in the possibility of collisions with static and dynamic obstacles.

C. Success Rate Across the Different Environments
To study how a policy trained in one environment performs in other environments, we take the best policy trained in the No Obstacles environment and evaluate it on the Static Obstacles and Dynamic Obstacles environments. We do the same for the best policy trained on Dynamic Obstacles and assess it on the No Obstacles and Static Obstacles environments.
The results of the generalization study are tabulated in Table III. We see that the policy trained in the No Obstacles environment has a steep drop in success rate, from 91% to 53% in the Static Obstacles environment and to 32% in the Dynamic Obstacles environment. In contrast, the policy trained in the Dynamic Obstacles environment improves its success rate from 61% to 89% in the No Obstacles environment and to 74% in the Static Obstacles environment. The drop in success rate for the policy trained in the No Obstacles environment is expected because, during training, the agent did not encounter the variety of static and dynamic obstacles it would have seen in the other two environments. The same reasoning also explains the improvement in success rate observed for the policy trained in the Dynamic Obstacles environment when it is evaluated on the No Obstacles and Static Obstacles environments.

(Fig. 8 caption: Each cell shows the success rate of the policy for the corresponding values of # Layers and # Filters. The success rate is evaluated in Zone 3, which is the region not used during training. Each policy is evaluated on the same 100 randomly generated environment configurations (controlled by the 'Seed' parameter described in Table II). The policy architecture with the highest success rate is chosen as the best policy for the DQN agent in the environments with no obstacles, static obstacles, and dynamic obstacles. The standard deviation across seeds is denoted by the (±) sign. For the No Obstacles environment, the policy with # Layers of 5 and # Filters of 32 is chosen as the best performing policy. Likewise, for the Dynamic Obstacles environment, the policy architecture with # Layers of 7 and # Filters of 32 is chosen as the best policy.)
In general, the agent performs best in the environment where it was trained, which is expected. But we also observe that training an agent in a more challenging environment can yield good results when evaluating in a much less challenging environment. Hence, having a random environment generator, such as the one in Air Learning, can help the policy generalize well by creating a wide variety of experiences for the agent during training.

D. Success Rate Sensitivity to Sensor Input Ablation
In doing policy exploration, one is also interested in the policy's sensitivity to each particular sensor input. So we ablate the policy's inputs one by one and observe the impact of each ablation on the success rate. It is important to note that we do not re-train the policy with ablated inputs. This is done to perform a reliability study, simulating the real-world scenario in which a particular sensing modality is corrupted.
(Fig. 9: The effect of ablating the sensor inputs on the success rate. We observe that the depth image contributes the most to the policy's success, whereas the velocity input affects it the least. All policy evaluations are in Zone 3 on the Intel Core-i9 platform.)

The policy architecture we used for the DQN agent in this work is multi-modal in nature; it receives a depth image, a velocity measurement V_t, and a position vector X_t as inputs.
Here V_t = [v_x, v_y, v_z], where v_x, v_y, and v_z are the components of the velocity vector in the x, y, and z directions at time 't'. X_t is a 1-dimensional vector of the form [X_goal, Y_goal, D_goal], where X_goal and Y_goal are the relative 'x' and 'y' distances with respect to the goal position and D_goal is the Euclidean distance to the goal from the agent's current position.
The baseline success rate we use in this study is when all the three inputs are fed to the policy. The velocity ablation study refers to removing the velocity input measurements from policy inputs. Likewise, the position ablation study and depth image ablation study refer to removing the position vector and depth image from the policy's input stream. The results of various input ablation studies are plotted in Figure 9.
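The ablation procedure described above can be sketched as zeroing out one modality of the multi-modal observation at evaluation time, without retraining. The modality names and shapes below are illustrative placeholders for the depth/velocity/position inputs.

```python
# Hedged sketch of input ablation: zero one modality of the observation
# at evaluation time; the policy itself is never retrained.
ZEROED = {
    "depth": [[0.0] * 4] * 4,   # placeholder shape for the depth image
    "velocity": [0.0] * 3,      # [v_x, v_y, v_z]
    "position": [0.0] * 3,      # [X_goal, Y_goal, D_goal]
}

def ablate(obs, modality):
    """Return a copy of the observation with one input zeroed out."""
    out = dict(obs)
    out[modality] = ZEROED[modality]
    return out
```

Running the same 100 evaluation configurations once per ablated modality, and comparing success rates against the unablated baseline, yields the sensitivity numbers reported below.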
For the No Obstacles environment, the policy success rate drops from 91% to 53% when velocity measurements are ablated. When the depth image is ablated, the success rate drops to 7%, and when the position vector is ablated, it drops to 42%. Similarly, for Static Obstacles, if the depth image input is ablated, the agent fails to reach the destination; when the velocity and position inputs are ablated, the success rate drops from 84% to 33%. We make a similar observation in the Dynamic Obstacles environment, where the success rate drops to 0% when the depth image is ablated. The depth image is thus the highest contributor to the policy's success, whereas the velocity input is significant but contributes the least of the three inputs. The drop in success rate due to depth-image ablation is evident from the policy architecture, since the depth image contributes far more features to the flattened layer than the velocity and position inputs (both 1 x 3 vectors). Another interesting observation is that when the position input is ablated, the agent also loses information about its goal. The lack of a goal position results in an exploration policy that is still capable of avoiding obstacles (due to the depth-image input). In the No Obstacles environment (where there are no obstacles except walls), the agent is free to explore unless it collides with the walls or exhausts the maximum allowed steps. Due to this exploration, the agent reaches the goal position 42 out of 100 times. Our results are in line with prior work [60], [61], where such random action-based exploration yields some amount of success. However, in a cluttered environment, random exploration may result in sub-optimal performance due to a higher probability of collision or of exhausting the maximum allowed steps (a proxy for limited battery energy).
Using Air Learning, researchers can gain better insight into how reliable a particular set of inputs is in the case of sensor failures. Such reliability studies and their impact on learning algorithms are essential given the kinds of applications autonomous aerial vehicles target. Also, understanding the sensitivity of a particular input towards success can lead to better policies, where more feature extraction is assigned to those inputs.

VII. SYSTEM EVALUATION
This section demonstrates how Air Learning can benchmark an algorithm and policy's performance on a resource-constrained onboard compute platform, post-training. We use the HIL methodology (Section IV-D) and QoF metrics (Section IV-F) for benchmarking the DQN agent and its policy. We evaluate them in the three randomly generated environments described in Section VI.

A. Experimental Setup
The experimental setup has two components, namely, the server and System Under Test (SUT), as shown in Figure 10. The server component is responsible for rendering the environment (for example, No Obstacles). The server consists of an 18 core Intel Core-i9 processor with an Nvidia RTX-2080. The SUT component is the system on which we want to evaluate the policy. The SUT is the proxy for the onboard compute system used in UAVs. In this work, we compare the policies' performance on two systems, namely Intel Core-i9 and Ras-Pi 4. The key differences between the Intel Core-i9 and Ras-Pi 4 platform are tabulated in Table IV. The systems are vastly different in their performance capabilities and represent ends of the performance spectrum.
Three latencies affect the overall processing time. The first is t1, the latency to extract the state information (depth image, RGB image, etc.) from the server; the state information is fetched from the server to the SUT using TCP/IP as the communication protocol. Initially, we found that the Ethernet adapter on the Intel Core-i9 was faster than the Ethernet adapter on the Ras-Pi 4. We make the t1 latencies the same for the Intel Core-i9 and the Ras-Pi 4 by adding artificial sleep on the Intel Core-i9 platform. (The sleep latency added to the Intel Core-i9 was determined by a ping test with a packet size equal to the size of the data (depth image) we fetch from the server, averaged over 50 iterations.)
The second latency is t2, the policy evaluation time on the SUT (i.e., the Intel Core-i9 or the Ras-Pi 4). The policies are evaluated on the SUT, which predicts the output actions based on the input state information received from the server. The policy architectures used in this work have 40.3 million parameters (No Obstacles and Static Obstacles) and 161.77 million parameters (Dynamic Obstacles). The t2 latency for the No Obstacles policy is 396 ms on the Ras-Pi 4, while on the desktop, equipped with a GTX 2080 Ti GPU and an Intel Core-i9 CPU, it is 11 ms; the desktop is 36× faster.
The third latency is t3. Once a policy is evaluated, its predicted actions are converted to low-level actuation using the AirSim flight-controller APIs (https://github.com/Microsoft/AirSim/blob/master/docs/apis.md). These APIs have a duration parameter that controls how long a particular action is applied. This duration is denoted by t3, and it is kept the same for both SUTs.
To evaluate the impact of the SUT's performance on the overall learning behavior, we keep the t1 and t3 latencies constant for both the Intel Core-i9 and the Ras-Pi 4 systems. We focus only on the difference in policy evaluation time (i.e., t2) and study how it affects the overall performance.
Using this setup, we evaluate the best policy determined in Section VI for environments with no obstacles, static obstacles, and dynamic obstacles.

B. Desktop vs. Embedded SUT Performance
In Table V, we compare the performance of the policy on a Intel Core-i9 (high-end desktop) and the Ras-Pi 4. We evaluate the best policy on the No Obstacles, Static Obstacles and Dynamic Obstacles environments described previously in Section VI.
In the No Obstacles case, the policy running on the high-end desktop is 11% more successful than the policy running on the Ras-Pi 4. Overall, across the three environments, the policy evaluated on the Ras-Pi 4 achieves a success rate within 13% of the policy evaluated on the desktop. While some degradation in performance is expected, the magnitude of the deterioration is more severe for the other QoF metrics, such as flight time, energy, and distance flown. This difference is significant because when the policies are ported to resource-constrained compute like the Ras-Pi 4 (a proxy for onboard compute in real UAVs), they could perform worse, for example being unable to finish the mission due to low battery.
In summary, the takeaway is that evaluating policies solely on a high-end machine does not accurately reflect real-time performance on an embedded compute system such as those available on UAVs. Hence, relying on success rate as the sole metric is insufficient, though this is by and large the state-of-the-art means of reporting success. Using Air Learning's HIL methodology and QoF metrics, we can understand to what extent the choice of onboard compute affects the performance of the algorithm.

C. Root-cause Analysis of SUT Performance Differences
It is important to understand why the policy performs differently on the Intel Core-i9 versus the Ras-Pi 4, so we perform two experiments. First, we plot the policy's trajectories on the Ras-Pi 4 and compare them to those on the Intel Core-i9 to see whether there is a flight-path difference; visualizing the trajectories helps build intuition about the variations between the two platforms. Second, we take an Intel Core-i9 platform and degrade its performance by adding artificial sleep such that the policy evaluation times are similar to those of the Ras-Pi 4. This helps us validate whether the processing time gives rise to the QoF metric discrepancy.
To plot the trajectories, we fix the position of the end goal and the obstacles and evaluate 100 trajectories with the same configuration in the No Obstacles, Static Obstacles, and Dynamic Obstacles environments. The trajectories are shown in Figure 11a, Figure 11b, and Figure 11c. They are representative of repeated trajectories between the start and end goals. The trajectories on the desktop and the Ras-Pi 4 are very different: the desktop trajectory orients towards the goal and then proceeds directly, whereas the Ras-Pi 4 trajectory starts toward the goal but then drifts, resulting in a longer trajectory. This is likely a result of actions taken on stale sensory information due to the longer inference time; recall the large difference in inference time between the desktop and the Ras-Pi 4 (Section VII-A and Table V).
To further root-cause whether the slower processing time (t2) gives rise to the longer trajectories, we take the best performing policy trained on the high-end desktop in the Static Obstacles environment and gradually degrade the policy's evaluation time by introducing artificial sleep times into the program. Sleep-time injection allows us to model the differences in the behavior of the same policy and its sensitivity to onboard compute performance. We degrade the performance of the Intel Core-i9 by 150 ms and 300 ms, respectively. As performance deteriorates from 3 ms to 300 ms, the flight time degrades by 97%, the trajectory distance degrades by 21%, and energy degrades by 43%. We visualize the impact of degradation by plotting the same policy's trajectories on the baseline Intel Core-i9 system and the degraded versions (150 ms and 300 ms). The trajectory results are shown in Figure 12. As we degrade further, the drift in the trajectories widens, which increases the trajectory length to reach the goal position and thus degrades the QoF metrics. We also see that the trajectory of the degraded Intel Core-i9 closely resembles the Ras-Pi 4 trajectory.
In summary, the choice of onboard compute and algorithm profoundly affects the resulting UAV behavior and the shape of the trajectory. The additional quality-of-flight metrics (energy, distance, etc.) capture these differences better than the success rate alone. Moreover, evaluations done purely on a high-end desktop might show lower energy consumption in a mission, but when the solution is ported to a real robot it might consume more energy due to the sub-par performance of the onboard compute. The hardware-in-the-loop (HIL) methodology allows us to identify these differences, and other performance bottlenecks that arise from the onboard compute, without porting anything to a real robot. Hence, a tool like Air Learning with its HIL methodology helps identify such differences at an early stage.
In the next section, we show how Air Learning HIL can mitigate the hardware gap by characterizing the end-to-end learning algorithm's latency and modeling those characteristics to create robust, performance-aware policies.

VIII. MITIGATING THE HARDWARE GAP
In this section, we demonstrate how the Air Learning HIL technique can minimize the hardware gap caused by differences between the training hardware and the deployment hardware (onboard compute). To that end, we propose a general methodology in which we train a policy on the high-end machine with added latencies that mimic the onboard compute's performance. Using this method, we reduce the hardware gap from 38% to less than 0.5% on the flight time metric, from 16.03% to 1.37% on the trajectory length metric, and from 15.49% to 0.1% on the energy of flight metric.

One alternative for mitigating the hardware gap is to train the policy directly on the robot's onboard computer [36], [63], [29]. Though on-device RL training is practical for ground-based or fixed robots to overcome the 'sim2real gap' [64], [65], training an RL policy on-device during flight has logistical limitations for UAVs and is not scalable (as explained in Section II). Moreover, some of the onboard computers on these UAVs lack the hardware resources required for on-device RL training. For instance, most hobbyist drones and research UAV platforms (e.g., CrazyFlie) are powered by microcontrollers with a total of 1 MB of memory, which is insufficient to store the policy weights for most vision-based navigation tasks. These resource constraints make on-device RL training extremely challenging.
To overcome these resource constraints, we introduce a methodology that uses HIL when training the RL policy. It allows us to train the RL policy on a high-end machine (e.g., an Intel Core-i9 with GPUs) while capturing the latencies incurred in processing the policy on the onboard computer. We describe the methodology below.

A. Methodology
The methodology is divided into three phases, namely 'Phase 1', 'Phase 2', and 'Phase 3', as shown in Figure 13. In Phase 1 (Figure 13a), we use HIL to determine the three latencies t_1, t_2, and t_3 defined in Section VII-A. We capture the latency distribution when the policy runs on-device (e.g., on a Ras-Pi). The distribution captures the variation in decision-making times when the policy is deployed on the onboard computer.
Fig. 13: A three-phase methodology for mitigating the hardware gap using hardware-in-the-loop training. (a) In Phase 1, we use the hardware-in-the-loop methodology on a candidate policy to obtain the policy's latency distribution on the target hardware (Ras-Pi 4). We use prior work [62] as the cyber-physical model to determine the upper bound on maximum velocity. (b) In Phase 2, we use the latency distribution to randomly sample the delay added to the policy's training. (c) In Phase 3, the HIL-trained policy is deployed on the target hardware for evaluation.

Once the latency distribution is captured, we calculate the maximum achievable velocities for safe navigation based on the decision-making time [66]. This ensures that the drone can navigate without colliding with an obstacle. We evaluate the maximum safe velocity the aerial robot can travel using the visual performance model proposed in prior work [62]. The model considers the time-to-action latency and the drone's physics (e.g., thrust-to-weight ratio, sensing distance) to determine the drone's maximum safe velocity.
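The exact model of [62] accounts for thrust-to-weight ratio and other drone physics; as a simplified stand-in (an assumption on our part, not the paper's model), one can use a stopping-distance bound: the drone travels v·t during the decision latency and then needs v²/(2a) to brake, and both must fit within the sensing distance d.

```python
import math

def max_safe_velocity(sense_dist_m, latency_s, decel_mps2):
    """Simplified kinematic bound (NOT the full model of [62]):
    require v*t + v^2/(2a) <= d, i.e. the drone can react and brake
    within its sensing distance. Solving v^2 + 2*a*t*v - 2*a*d = 0
    for the positive root gives the maximum safe velocity."""
    a, t, d = decel_mps2, latency_s, sense_dist_m
    return -a * t + math.sqrt((a * t) ** 2 + 2 * a * d)
```

The bound captures the qualitative effect used in the paper: longer decision latencies force lower maximum safe velocities, which is why the action space is rescaled in Phase 2.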
In Phase 2 (Figure 13b), we train the policy while adding extra delays sampled from the latency distribution determined in Phase 1. The delays added to the decision-making loop mimic the processing delay incurred when the policy is deployed on the resource-constrained onboard computer. The policy's action space is also scaled according to the maximum velocity achievable given the decision-making time [62], [66].
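A sketch of one such Phase 2 decision step, with a delay sampled from the empirical Phase 1 distribution and the action rescaled to respect the safe-velocity cap (names and the gym-style interface are illustrative assumptions, not Air Learning's API):

```python
import random
import time

def delayed_training_step(policy, env, obs, latencies_ms, v_scale):
    """One decision step of Phase 2 training: inject a randomly sampled
    onboard-latency delay, then scale the velocity action so it stays
    below the maximum safe velocity derived in Phase 1."""
    action = policy(obs)
    delay_ms = random.choice(latencies_ms)     # sample from Phase 1 distribution
    time.sleep(delay_ms / 1000.0)              # mimic onboard inference time
    scaled = [a * v_scale for a in action]     # cap the commanded velocity
    return env.step(scaled)
```

Because the delay is resampled every step, the policy learns under the same jittery decision-making cadence it will experience on the Ras-Pi 4.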
Once the policy is trained, in Phase 3 (Figure 13c), we deploy the trained policy on the onboard compute (Ras-Pi) and evaluate its performance and quality-of-flight metrics.

B. Experimental Setup and Evaluation
To validate the methodology, we train a policy in the Static Obstacles environment with at most two to three obstacles. The candidate policy architecture has 5 layers with 32 filters, based on the template defined in Figure 8.
We use the HIL setup described in Figure 10 to evaluate the decision-making latency on the Ras-Pi 4, our target resource-constrained hardware platform. The simulation environment is rendered on the Intel Core-i9 server. At this stage, we deploy a randomly initialized policy on the Ras-Pi 4 to benchmark the latencies, performing a rollout of 1000 steps using HIL to capture the variation in decision-making times.
On the high-end server (Intel Core-i9 with a GTX 2080 Ti), we train the candidate policy for the same task (i.e., Static Obstacles) with a delay element added to the decision-making loop. The delay element's value is randomly sampled from the latency distribution obtained for the candidate policy (5 layers with 32 filters) running on the Ras-Pi 4. Based on the maximum latency in the distribution, we also estimate the upper limit of the drone's safe velocity [66], [62]. This upper limit is then used to scale the action space so that the drone's velocity at each step never exceeds the maximum safe velocity.
Once the candidate policy's training with added latency is complete, we deploy the policy on the Ras-Pi 4 platform (the target resource-constrained onboard compute). We use the HIL methodology to evaluate the quality-of-flight metrics on the Ras-Pi 4. The comparison of trajectories between the Core-i9 and the Ras-Pi 4 is shown in Figure 14. The two trajectories are very similar and do not suffer from the large drifts seen before. Table VII compares the quality-of-flight metrics. The performance gap (denoted "Perf Gap") is reduced from 38% to less than 0.5% on the flight time metric, from 16.03% to 1.37% on the trajectory length metric, and from 15.49% to 0.1% on the energy of flight metric.

Fig. 14: Comparison of the trajectory of a policy that uses the mitigation technique (labeled "With mitigation") with that of a policy that does not (labeled "Without mitigation"). The policy's trajectory on the training machine (labeled "HIL") is also plotted for comparison. Using the mitigation technique, we reduce the trajectory length degradation from 34.15 m to 29.03 m (to within 1.37%).
In summary, we show that training the policy with added delays that mimic the target platform minimizes the hardware gap, i.e., the performance difference between the training machine and the resource-constrained onboard compute.

IX. FUTURE WORK
The Air Learning toolset and benchmark can be used to address several open problems related to UAVs spanning multiple disciplines. The goal of this work was to demonstrate the breadth of Air Learning as an interdisciplinary tool. In the future, Air Learning can be used to address numerous other questions, including but not limited to the following.

Table VII: Quality-of-flight metrics on the Ras-Pi 4 and Intel Core-i9 with and without mitigation. After using the methodology to minimize the hardware gap, we reduce the gap from 38% to less than 0.5% on flight time, from 16.03% to 1.37% on trajectory length, and from 15.49% to 0.1% on energy of flight.

Environments: In this work, we focus primarily on UAV navigation for indoor applications [67]. Future work can extend Air Learning's environment generator to explore robust reinforcement learning policies for UAV control under harsh environmental conditions. For instance, the AirSim weather APIs can be coupled with the Air Learning environment generator to explore reinforcement learning algorithms for UAV control under different weather conditions.

Algorithm Design: Reinforcement learning algorithms are sensitive to hyperparameter tuning, policy architecture, and the reward function. Future work could use techniques such as AutoML [20] and AutoRL [19] to determine the best hyperparameters and to explore new policy architectures for different UAV tasks.
Policy Exploration: We designed a simple multi-modal policy and kept the policy architecture the same across the DQN and PPO agents. Future work could explore other policy architectures, such as LSTMs [68] and recurrent reinforcement learning [69]. It could also explore energy-efficient policies using Air Learning's capability to monitor energy consumption continuously; energy-aware policies relate to open problems in mobile robots, such as the charging-station problem [70].
System Optimization Studies: Future work on system optimization falls into two categories. First, one can perform a thorough workload characterization to reduce reinforcement learning training time. Such system optimizations would speed up the training process, allowing more complex policies and strategies [71] for solving open problems in UAVs. Second, one can explore custom hardware accelerators to improve onboard compute performance; specialized onboard hardware would enable better real-time performance for UAVs.

X. CONCLUSION
We present Air Learning, a deep RL gym and cross-disciplinary toolset that enables deep RL research for resource-constrained systems and end-to-end, holistic, applied RL research for autonomous aerial vehicles. We use Air Learning to compare the performance of two reinforcement learning algorithms, DQN and PPO, in a configurable environment with varying static and dynamic obstacles. We show that for an end-to-end autonomous navigation task, DQN performs better than PPO for fixed observation inputs, policy architecture, and reward function. We show that the curriculum-learning-based DQN agent has a better success rate than the non-curriculum-learning DQN agent with the same amount of experience (steps). We then take the best policy trained using curriculum learning and expose the difference in the aerial robot's behavior by quantifying the policy's performance with the HIL methodology on a resource-constrained Ras-Pi 4, evaluating quality-of-flight metrics such as flight time, energy consumed, and total distance traveled. We show that there is a non-trivial behavior change, and up to a 40% difference in performance, between the policy evaluated on a high-end desktop and on the resource-constrained Ras-Pi 4. When we artificially degrade the performance of the high-end desktop on which we trained the policy, we observe variations in the trajectory and the other QoF metrics similar to those observed on the Ras-Pi 4, showing how onboard compute performance can affect the behavior of policies when they are ported to real UAVs. We also show the impact of the energy QoF on the mission success rate. Finally, we propose a mitigation technique using HIL that minimizes the hardware gap from 38% to less than 0.5% on the flight time metric, from 16.03% to 1.37% on the trajectory length metric, and from 15.49% to 0.1% on the energy of flight metric.

Fig. 15: The network architecture for the policy in the PPO and DQN agents. Both agents take a depth image, a velocity vector, and a position vector as inputs. The depth image passes through four convolution layers, after which the result is concatenated with the velocity and position vectors. In a 32 (4x4) convolution filter, 32 is the depth of the filter and (4x4) is the filter size. The combined vector is fed to three fully connected layers, each with 256 hidden units. The action space determines the number of hidden units in the last fully connected layer: the DQN agent has twenty-five actions, while the PPO agent has two actions that control the UAV's velocity in the X and Y directions.
of the magnitude of velocity lie anywhere between 1 m/s and 5 m/s. We use the MaxDegreeOfFreedom option in the AirSim API, which calculates the yaw rates automatically, to make sure the drone points in the direction it moves.
Reward: The reward function is the same for the PPO and DQN agents and is defined as follows:

r = 1000 * α − 1000 * β − D_g + D_c * γ    (2)

α is a binary variable: '1' if the goal is reached, else '0'. β is a binary variable: '1' if there is a collision with walls, obstacles, or the ground, else '0'. D_g is the distance to the goal from the agent's current position at any time step; if the agent moves away from the goal, the distance to the goal increases, penalizing the agent. γ is a binary variable set to '1' if the agent is close to the goal. D_c is the distance correction, which penalizes the agent if it chooses actions that speed it away from the goal:

D_c = (V_max − V_now) * t_max    (3)

where V_max is the agent's maximum possible velocity (fixed at 5 m/s for DQN; for PPO the outputs are scaled to lie between 1 m/s and 5 m/s), V_now is the agent's current velocity, and t_max is the duration of the actuation.
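The reward in Eq. (2) can be written out directly. Note that the distance-correction term D_c = (V_max − V_now) * t_max is our reconstruction from the variable definitions in the text, so this sketch should be read with that caveat:

```python
def reward(reached_goal, collided, dist_to_goal, v_max, v_now, t_max,
           near_goal):
    """r = 1000*alpha - 1000*beta - D_g + D_c*gamma  (Eq. 2), with
    D_c = (v_max - v_now) * t_max (our reading of the text)."""
    alpha = 1.0 if reached_goal else 0.0   # goal reached
    beta = 1.0 if collided else 0.0        # collision with wall/obstacle/ground
    gamma = 1.0 if near_goal else 0.0      # distance correction applies near goal
    d_c = (v_max - v_now) * t_max          # penalty for moving below v_max
    return 1000.0 * alpha - 1000.0 * beta - dist_to_goal + d_c * gamma
```

For example, reaching the goal at full speed yields the full +1000, while a collision 10 m from the goal yields −1010 (the −1000 collision penalty plus the −D_g term).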

C. Policy Architecture vs Runtime Latency Tradeoffs
Air Learning HIL can also be used to understand the tradeoff between policy selection and the onboard hardware. In this section, we study the latency tradeoffs for various policies trained for point-to-point navigation in the No Obstacles, Static Obstacles, and Dynamic Obstacles environments. Figure 16 shows the tradeoff between the size of the policy and its execution latency on the Ras-Pi 4. As the policy becomes wider or deeper, its execution time increases, which translates to increased decision-making time. Hence, when selecting a policy architecture, one must also account for the hardware latency.
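To make the size-versus-latency relationship concrete, a rough first-order proxy for per-decision compute is the number of multiply-accumulate (MAC) operations. The sketch below counts MACs for only the fully connected portion of a policy (a simplification that ignores the convolution layers, memory traffic, and cache effects, which HIL measurement captures in full):

```python
def dense_policy_macs(input_dim, hidden_units, n_hidden_layers, output_dim):
    """Count multiply-accumulate operations for the fully connected part
    of a policy network. Each layer of shape (a -> b) costs a*b MACs, so
    widening or deepening the network inflates per-decision compute,
    and hence onboard inference latency."""
    dims = [input_dim] + [hidden_units] * n_hidden_layers + [output_dim]
    return sum(a * b for a, b in zip(dims, dims[1:]))
```

Doubling the hidden width roughly quadruples the cost of each hidden-to-hidden layer (256→512 units turns each 256×256 block into 512×512), which is consistent with the widening/deepening trend Figure 16 reports on the Ras-Pi 4.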