# Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control

## Abstract

A new method for enabling a quadrotor micro air vehicle (MAV) to navigate unknown environments using reinforcement learning (RL) and model predictive control (MPC) is developed. An efficient implementation of MPC provides vehicle control and obstacle avoidance. RL is used to guide the MAV through complex environments where dead-end corridors may be encountered and backtracking is necessary. All of the presented algorithms were deployed on embedded hardware using automatic code generation from Simulink. Results are given for flight tests, demonstrating that the algorithms perform well with modest computing requirements and robust navigation.

## Keywords

Model predictive control Reinforcement learning Exploration Micro air vehicle## 1 Introduction

This paper introduces a method for navigation and control of quadrotors within a non-convex obstacle field. The method uses online optimization within a model predictive control (MPC) framework, taking advantage of Fast MPC (Wang and Boyd 2010) with soft constraint modifications (Richards 2015) to provide a real-time controller on embedded hardware. Furthermore, the use of reinforcement learning (RL) enables autonomous navigation by providing high level path planning decisions for navigation of previously unexplored spaces. Flight test experiments demonstrate the methods within a two dimensional control scenario. The experiments use off-board localization by motion capture and synthesized sensing of obstacles: although these also include important challenges, the focus here is on the decision-making.

Trajectory generation in the presence of obstacles is NP-hard (Reif 1979) and has been the subject of considerable algorithm development, including randomized methods (LaValle and Kuffner 1999; Garcia and How 2005), integer programming (Richards and How 2002) and nonlinear optimization (Milam et al. 2000; Borrelli et al. 2006; Cowling et al. 2010). Another approach has been to separate the problem into multiple planning layers, for example combining the travelling salesman problem and potential field methods (Nieuwenhuisen et al. 2014).

The MPC framework introduced in this paper adopts a two stage process to avoid high computational requirements. Like Augugliaro et al. (2012), Deits and Tedrake (2015) and Sharma (2011), a local, convex optimization problem is derived from the harder global problem. Other approaches along these lines include following tunnels (Vitus et al. 2008), receding horizon optimization (Bellingham et al. 2002) or combining path generation with dynamic optimization (Hoffmann et al. 2008) or feasibility testing (Hehn and D’Andrea 2011). The authors’ approach here is to decompose the problem geometrically to form a local convex problem and then deploy a quadratic program (QP) optimization. This is similar in spirit to the ellipsoidal tunneling method by Sharma (2011) but without requiring quadratic constraints.

MPC provides a framework to unify control, including stability and robustness analysis, with motion planning, including dynamics and operating constraints (Maciejowski 2002). Liu and Chen (2013) illustrated its potential for UAV control with obstacle avoidance constraints. Considerable work has focussed on tailoring fast real-time optimizers for MPC, of which the work by Wang and Boyd (2010) is adopted here. Faster solvers can be implemented using FPGAs (Hartley et al. 2014) but this is beyond the scope of this paper.

Planning within environments with uncertain maps also poses challenges. For example, when using simultaneous localisation and mapping (SLAM) the robot’s map of the world can become distorted and relies on methods such as loop closure (Williams et al. 2009) to re-align features in the map. A key feature of the proposed method is that it does not depend on an accurate global map for far-term planning, unlike other MPC-based methods (Bellingham et al. 2002). In principle, this makes the method robustly compatible with different mapping and localization strategies.

Autonomous exploration has been demonstrated in the past using frontier-based mapping (Yamauchi 1997), where an evidence grid is formed in order to map locations that are occupied by obstacles. This relies on maintaining a global map of the obstacle locations, but has shown good performance in complex and cluttered environments. Other methods such as the subsumption architecture (Brooks 1986) for boundary tracing and mapping (Mataric 1992) enable the robot to map an area whilst avoiding people moving within its environment. For more variety of sparsity in the environment Fraundorfer et al. (2012) use a combination of frontier-based mapping and the bug algorithm (Choset et al. 2005). The common theme between these methods is that they perform exploration by maintaining a map of the locations of the environment’s obstacles.

Exploration of previously unmapped environments is addressed here by developing a reinforcement learning (Sutton and Barto 1998) (RL) algorithm. RL is a machine learning technique that updates its knowledge about the world based upon rewards following actions taken. A discrete set of node locations in the world are given a weighting, or *cost*, and the RL algorithm targets the lowest cost node visible. The obstacle avoidance is handled separately by the MPC controller. Should the MAV be unable to make progress, for example by spending time stuck in a dead end, it will learn to turn back and explore a different path. The full explanation of the control architecture is given in Sect. 3.

Other works have explored different combinations of MPC, machine learning, and UAVs. Zhang et al. (2016) used MPC as the supervisor for their learning algorithm, resulting in a deep neural network policy for obstacle avoidance, while Aswani et al. (2013) used a learning ‘oracle’ to refine the prediction model for a general MPC. Sharma and Taylor (2012) used RL for waypoint generation to improve the performance of their avoidance algorithm based on local ellipsoid constraints (Sharma 2011). Learning for exploration has been used in previous work with random neural Q-learning (Yang et al. 2016) to navigate and avoid obstacles in unknown environments. This approach to use learning for exploration has similarities to the work here, one of which is that it provides continuous control whilst discrete action choices about where to go next must be made. In (Yang et al. 2016) a continuous action space is built from the learning.

The distinctions of the present paper are that MPC is used online, not purely as part of the training. RL is used solely for waypoint selection, hence for a discrete choice rather than continuous, and compensates for the susceptibility of the locally convex MPC to explore local minima. The benefits of using RL for exploration to perform autonomous exploration is that an accurate global map does not need to be constructed, stored and maintained. An evidence grid, for example, could grow quickly if exploring large areas, especially if over three dimensions. It would also be necessary to produce a grid of fine enough a resolution such that the MAV could pass through narrow openings. Perhaps an even greater concern is one of robustness; it is well known (Lu and Milios 1997) that maintaining good alignment of a map is challenging and so any exploration technique that relies on a well maintained global map could fail.

## 2 Experimental setup and modelling

Experiments were performed using the Parrot AR.Drone (Fig. 2) within the Bristol Robotics Laboratory’s flying arena. The flying arena is instrumented with ten Vicon motion capture system cameras, providing vehicle position information at 100Hz. The example obstacle fields (Fig. 1) were laid out over a six by eight metre area within the arena. The obstacle locations were defined numerically and detection is simulated through the knowledge of the vehicle’s position within the Vicon coordinate system. All of the algorithms presented were programmed in MATLAB Simulink, compiled using automated code generation and executed on a dSpace MicroAutoBox. The MicroAutoBox has a 900Mhz PowerPC processor and enables rapid prototyping of software for execution on embedded hardware.

In the scenario of exploration, speed is limited by the rate of decision-making, and the full agility of the quadrotor is not exploited and it remains close to hover configuration throughout. Therefore, we adopt the approach of Fraundorfer et al. (2012) and use a linear dynamics model determined by system identification techniques. The work uses level flight without rotation in yaw, controlled by downward-looking ultrasound, rate gyro and magnetometer, so only the lateral movement dynamics are considered. Feedback linearization (Helwa and Schoellig 2016) provides an alternative route to a linear model that would exploit more of the flight envelope, but this has not been pursued in the scope of this work.

*x*(

*k*) is the state at time step

*k*,

*u*(

*k*) is the control input,

*w*(

*k*) is an unknown disturbance and

*y*(

*k*) is the output. From Eq. 1, for a single axis (pitch or roll) operating at 10 Hz the system matrices \(A_1\) and \(B_1\) are

## 3 Exploration architecture

*x*(

*k*) of the vehicle is estimated by the state estimation block, using a Luenberger observer driven by the position measurements (5). To avoid the complexity of a Kalman filter with an augmented state, disturbance estimation is performed by low-pass filtering the measured prediction error (Tatjewski 2014):

The MAV position is measured externally via a Vicon motion capture system, which is passed into a virtual sensor block. The virtual sensor block uses the measured position and returns obstacle face information based upon hard-coded obstacle fields such as those described in Fig. 1. In future work, this block could be swapped for a real sensor so that obstacles would not have to be pre-defined in code. Also, goal locations would be based on recognition, not location. However, here we focus on control aspects, so simplifications are adopted in sensing.

## 4 Operating region calculation

The MPC method used employs quadratic programming (QP), which requires a convex solution space to be described in terms of linear constraints. The non-convex obstacle field must therefore also be decomposed into a convex solution space, or rather a convex ‘operating region’ within which the MAV is allowed to fly.

Vitus et al. (2008) propose two methods for decomposing a non-convex solution space into a sequence of convex polytopes in their paper on tunnel mixed integer linear programming (MILP). These are trapezoidal decomposition and Delaunay triangulation. This reduces the complexity of the online optimization but it still requires MILP. Augugliaro et al. (2012) went further and reduced the local problem to a local quadratic optimization by linearizing multi-vehicle separation constraints. Similarly, Sharma (2011) reformulated the region determination as a QP by adopting ellipsoidal regions. Deits and Tedrake (2015) present an iterative semidefinite programming method for finding convex regions that might also be suitable for defining the operating region within which the quadrotor may fly.

Two methods for computing convex operating regions from the non-convex obstacle fields are presented here. The first method provides rectangular operating regions with limits aligned with the global *x* and *y* axes, which for the purposes of the presented experiments will also always align with the vehicle *x* and *y* axes. The second method provides operating regions that also have four faces, but the faces are orientated in an effort to increase freedom in the direction of travel. Both methods provide operating regions that impose a fixed number of position constraints upon the MAV’s movement, which is desirable to avoid having to rebuild the MPC.

### 4.1 Rectangular operating regions

The rectangular convexification method presented here computes operating regions with orthogonal faces. The method therefore assumes all obstacles are rectangular or expanded to become rectangular. By aligning the obstacle and hence operating region faces with the *x* and *y* axes, it is possible to pose the MPC as two separate MPCs with one for each axis. Decoupling the MPCs is possible due to the symmetry of the MAV. It is anticipated that two small MPCs will be less computationally expensive than a single large one. The computational savings are identified in Sect. 7.

The algorithm for finding the operating region starts by defining the distance *d* that denotes the distance the MAV could travel if it were travelling at the maximum permissible velocity (\(v_{\lim }\)) for all *T* time steps. The operating region limits are then defined by the MAV position *p* plus or minus the distance *d*. The newly defined operating region \({\mathcal {R}}\) is then inspected to see if any of the obstacle faces \({\mathcal {F}}\) intersect, *i.e.* the operating region is not free of obstacles. Should any of the obstacle faces intersect the operating region, the operating region is shrunk until it is free of obstacles.

### 4.2 Quadrilateral operating regions

The rectangular convexification method presented in Sect. 4.1 assumed the *x* and *y* axes are decoupled, which is reasonable for the dynamics of the MAVs used. This assumption will, however, restrict the complexity of operating region shapes that can be used. The second method as described here lifts this requirement, which could provide more freedom for the MAV’s movement when operating close to obstacles that are not aligned with the vehicle’s axes.

The virtual sensor block allows the MAV to see obstacles at a distance of up to 2 metres. The process of defining the operating region consists of many stages, whereby an initial region that fits within the 2 metre radius is reduced in size until it no long intersects detected obstacles.

## 5 Application of model predictive control

Model predictive control (Maciejowski 2002) is used to drive the quadrotor to setpoints determined by the planner while respecting operating constraints on velocity and control. The operating constraints also include the position constraints imposed by the convexification scheme presented in the previous section. The algorithm implemented uses the “Fast MPC” formulation of Wang and Boyd (2010) to achieve fast solution times.

*Soft*constraints (Kerrigan and Maciejowski 2000) are also incorporated, which must be respected if possible but can be violated if no alternative exists

*Q*and

*R*are defined as

Since the hard constraints (14) only apply to the inputs, feasibility is guaranteed, without necessarily satisfying the state constraints. However, since the method inherits the properties from Maeder et al. (2009) and Limon et al. (2008), stability and satisfaction of *all* constraints is guaranteed provided that the soft constraints are sufficiently weighted (Kerrigan and Maciejowski 2000), the disturbance is constant and the constraints do not change. The dynamic convexification invalidates this last condition, although more recent methods (Bali and Richards 2017) provide a constrained convexification to promote recursive feasibility.

## 6 Reinforcement learning for navigation

Reinforcement learning (RL) is a machine learning technique that is employed here to help the exploration algorithms become ‘unstuck’ from dead ends and even unforeseen problems such as failures of the QP to converge. RL updates its knowledge about the world based upon rewards following actions taken. The MAV may therefore learn from time spent trying to progress via a dead end that it needs to turn back and explore a different path. The previous sections described the MPC and operating region calculation methods required to guide the MAV to a target position. By itself, however, the MPC is unable to guide the MAV throughout environments such as the example scenarios without leaving the MAV stuck down a dead end. To fix this problem, the RL algorithm dynamically chooses the target position \(y_S(k)\). The convexification and MPC algorithms then provide the control inputs for transition to the latest target.

Reinforcement learning has been used in the past (Richards and Boyle 2010) to learn the cost-to-go for a receding horizon planner over successive repetitive UAV missions. The work in this paper, however, aims to use RL to enable the MAV to robustly navigate previously unexplored environments. Specifically, temporal difference learning (Sutton and Barto 1998) is employed.

*k*. The initial cost of each node is estimated using the distance from the final goal position \((p_x^G,p_y^G)\) divided by the expected mean speed \({\bar{v}}\):

*i*. This is computed from the cost held by the node and the estimated time required to reach the node.

*i*. The node with the lowest cost-to-go is selected as

*N*(

*k*) and passed to the MPC algorithm to guide the MAV.

*N*(

*k*) and \(N(k-1)\) to point at the same node, which will indeed be the case for most time steps, and hence the updated cost of the node is based upon time spent aiming for it. It follows from (33) that the cost will be unchanged if the MAV gets closer to the same node at the estimated speed \({\bar{v}}\). However, if the MAV becomes stuck at a node, that nodes cost will increase. Subsequently, when a node is reached at a dead end location, the MAV will loiter for a few seconds whilst the cost builds up and it becomes ‘cheaper’ for it to turn around and backtrack.

## 7 Results

Section 7.1 provides simulation results for the Fast MPC algorithm, demonstrating the performance in both constraint satisfaction and computation time. Details of particular parts of the optimiser construction are discussed along with their impact on the results. Section 7.2 presents results for flight tests performed using all of the developed algorithms.

### 7.1 MPC performance

First, simulation results are presented in order to describe key features of the optimiser’s construction. Section 7.1.1 presents the best case results for a one dimensional moving setpoint demonstration using a numerical model of the AR.Drone. These results are then used for comparison in Sect. 7.1.2 where the computation cost of two separate MPC optimisations is compared to one larger MPC optimisation. Section 7.1.3 then presents flight test results using the MPC for control of the AR.Drone with the learning disabled.

#### 7.1.1 Simulation results for 1D

Initial results for the Fast MPC algorithm are presented in Fig. 7, where a simulated experiment was carried out on just the *x*-axis of the AR.Drone. The simulated plant used the dynamics identified in Sect. 2. The position target was alternated between + 1 m and - 1 m every 10 s. A soft constraint on the velocity of 0.5 ms\(^{-1}\) was imposed and all position constraints were relaxed.

*turnaround time*, had a mean value of 7.2 ms. Note that this figure was achieved by exploiting sparsity in the matrix structure, such as the patterns apparent in (8), (9) and (10) due to the decoupling of the axes dynamics. The version of code generator used did not exploit sparsity natively, so these were implemented in custom multiplier utilities. Without exploiting sparsity, turnaround time increased to 13.6 ms, almost doubled, highlighting the importance of exploiting structure in the solver.

At the bottom of the Fig. 7 the residuals for the optimisation are plotted. It can be seen that every ten seconds the residuals spike, demonstrating that the optimisation has not converged to the optimal. This spiking occurs as the setpoints change and is due to the low iteration count used within Fast MPC. The initial solutions used are from the previous iteration due to the warm start procedure, which will be far from the optimal solution when the setpoint changes significantly. This can be seen to have minimal, if any, impact on performance as the residual falls back down very quickly—usually after just one further iteration.

#### 7.1.2 Decoupling the control axes

In order for the quadrotor to navigate environments such as the examples depicted in Fig. 1, it is necessary for the MPC to operate on both the *x* and *y* axes. Due to the symmetry in the vehicle’s dynamics it is possible to decouple the *x* and *y* control into two separate MPC controllers. By decoupling the control axes it should be possible to reduce the computational complexity of the problem, although the representation of the operating region (for obstacle avoidance) must be simplified as demonstrated in Sect. 4.1.

First the single axis MPC controller used above was duplicated within the controller, such that the *x* and *y* axes were being optimized for separately. It was found that solutions did not change, as would be expected, but the turn around time did increase. The turnaround time increased from 7.2 ms for the single axis to 17.9 ms for the dual axis control. It is interesting that the turnaround time increased by a factor larger than two, as the number of operations required by the dual MPC algorithms is exactly double.

By solving for both axes within a single MPC optimizer, the turnaround time could be expected to take longer than double due to non-linear scaling of the matrix operations. The mean turnaround time was 43.9 ms in the combined case, showing that it is considerably more expensive. The computational cost may be worth it, however, as constraints may be posed in terms of both the *x* and *y* axes, allowing non-orthogonal operating regions to be represented. Crucially, this value indicates that real time operation at 10 Hz is feasible.

#### 7.1.3 Initial flight test results

A flight test result is shown in Fig. 8a where the MPC controller was used to control the quadrotor, but the learning algorithm was disabled. In this example the setpoint presented to the MPC is the closest point within the operating region to the goal. It can be seen that the quadrotor hugs the walls as it is attracted to the goal and finally ends up getting stuck down a dead end. The walls are artificially enlarged when sent to the function that computes the operating region, taking into account the size of the vehicle as well as some buffer for violation of the position soft constraints. The resulting path is therefore not seen to directly touch the walls. Given the orthogonal nature of the obstacles, the first convexification method of Sect. 4.1 was used to determine the operating regions

### 7.2 Results using MPC and learning

The same scenario from the previous section was flown, but this time with the addition of the reinforcement learning exploration algorithm presented in Sect. 6. Figure 8 shows four separate flight paths taken for each of the presented scenarios, demonstrating good repeatability. In Fig. 8b the obstacles are all aligned with the axes and the rectangular convexification technique from Sect. 4.1 was employed, together with decoupled control for each of the two axes of motion. Figure 8c shows the flight path taken for the second scenario in which obstacles are not orthogonally aligned and the second method for computing operating regions from Sect. 4.2 was required. In this case the slower combined MPC for both axes was employed, but still running in real time. A video of the experiments seen in Fig. 8b can be found at https://youtu.be/Vym7QEdG7OM.

Figure 9 shows how the cost of the nodes varied over the duration of one of the flights, whilst learning to navigate the first obstacle field. For brevity the node indices and where they map to in physical space are not shown; rather the plot illustrates how the cost of nodes increase one at a time depending on which is being aimed for. The figure also shows the time history of the node index being aimed for and the resulting position traversed by the MAV on its way to the goal. One interesting observation here is the occasional rapid switching between two node indices, here the cost of the pair of nodes plus the estimated cost to reach them (Eq. 31) is very similar. During this period of switching rapidly between target nodes the costs of both nodes increase until a distinct new node is selected.

The velocity history in Fig. 9 shows that the MAV navigates almost continually through the environment, close to its upper velocity magnitude of 0.5 m/s. As discussed in Sect. 5, despite the constraint for each plan to terminate in a stationary condition, the MAV itself rarely stops. An exception to this can be seen around 30 s into the flight: at this point, the MAV has reached a node that continues to have the lowest cost, which is the aforementioned dead end of obstacle field 1. This dead end node with cost coloured green (Fig. 9) must increase until the node in black (backtracking) becomes the best next choice. The rate at which nodes increase in cost can be tuned, as previously mentioned, through the weighting \(\alpha \) in Eq. 33.

## 8 Conclusions and further work

A new method for exploring unknown environments with a quadrotor was introduced. It was shown that reinforcement learning could be used to navigate non-convex obstacle fields without maintaining a global map of the world. Fast model predictive control was executed in real-time on modest hardware for control, providing the necessary obstacle avoidance. Aspects of the MPC construction that affected solve time and obstacle representation were highlighted. Automatic code generation was used for construction of the controllers, which reduced development time, although special attention to matrix structure was necessary to maintain the speed of the MPC algorithm. The immediate extension for this work is to integrate with real sensing, such as an RGBD camera or LIDAR, and real localization such as SLAM. A more fundamental challenge is to make the placement of the learning nodes automatic within the environment, attaching them to recognized features rather than fixed locations. By attaching nodes to recognised features with a minimum level of detail (*e.g.* not blank walls) confidence in localisation can be maintained, much like the ideas presented in Bose and Richards (2013).

## Notes

## References

- Aswani, A., Gonzalez, H., Sastry, S. S., & Tomlin, C. (2013). Provably safe and robust learning-based model predictive control.
*Automatica*,*49*(5), 1216–1226.MathSciNetCrossRefzbMATHGoogle Scholar - Augugliaro, F., Schoellig, A., & D’Andrea, R. (2012). Generation of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach. In
*2012 IEEE/RSJ international conference on intelligent robots and systems (IROS)*(pp. 1917–1922).Google Scholar - Bali, C., & Richards, A. (2017). Robot navigation using convex model predictive control and approximate operating region optimization. In
*2017 IEEE/RSJ international conference on intelligent robots and systems (IROS)*(pp. 2171–2176).Google Scholar - Bellingham, J. S., Richards, A. G., & How, J. P. (2002). Receding horizon control of autonomous vehicles. In
*Proceedings of the American control conference*.Google Scholar - Borrelli, F., Subramanian, D., Raghunathan, A. U., & Biegler, L. T. (2006). MILP and NLP techniques for centralized trajectory planning of multiple unmanned vehicles. In
*Proceedings of the American control conference*.Google Scholar - Bose, L. N., & Richards, A. G. (2013). Mav belief space planning in 3d environments with visual bearing observations. In
*International micro air vehicle conference and flight competition (IMAV2013)*.Google Scholar - Brooks, R. (1986). A robust layered control system for a mobile robot.
*IEEE Journal of Robotics and Automation*,*2*(1), 14–23.CrossRefGoogle Scholar - Choset, H., Lynch, K., Hutchinson, S., Kantor, G., Burgard, W., Kavraki, L., et al. (2005).
*Principles of robot motion: Theory, algorithms, and implementations*. Cambridge: MIT Press.zbMATHGoogle Scholar - Cowling, I., Yakimenko, O., Whidborne, J., & Cooke, A. (2010). Direct method based control system for an autonomous quadrotor.
*Journal of Intelligent & Robotic Systems*,*60*, 285–316.CrossRefzbMATHGoogle Scholar - Deits, R., & Tedrake, R. (2015). Computing large convex regions of obstacle-free space through semidefinite programming. In H. Levent Akin Nancy, M. Amato Volkan Isler & A. Frank van der Stappen (Eds.),
*Algorithmic foundations of robotics XI*(pp. 109–124). Berlin: Springer.Google Scholar - Fraundorfer, F., Heng, L., Honegger, D., Lee, G. H., Meier, L., Tanskanen, P., & Pollefeys, M. (2012). Vision-based autonomous mapping and exploration using a quadrotor mav. In
*2012 IEEE/RSJ international conference on intelligent robots and systems (IROS)*(pp. 4557–4564). IEEE.Google Scholar - Garcia, I., & How, J. (2005). Trajectory optimization for satellite reconfiguration maneuvers with position and attitude constraints. In
*Proceedings of the 2005 American control conference, 2005*(pp. 889–894). IEEE.Google Scholar - Hartley, E. N., Jerez, J. L., Suardi, A., Maciejowski, J. M., Kerrigan, E. C., & Constantinides, G. A. (2014). Predictive control using an fpga with application to aircraft control.
*IEEE Transactions on Control Systems Technology*,*22*(3), 1006–1017.CrossRefGoogle Scholar - Hehn, M., & D’Andrea, R. (2011). Quadrocopter trajectory generation and control. In
*Proceedings of the IFAC world congress*.Google Scholar - Helwa, M. K., & Schoellig, A. P. (2016). On the construction of safe controllable regions for affine systems with applications to robotics. In
*2016 IEEE 55th conference on decision and control (CDC)*, (pp. 3000–3005).Google Scholar - Hoffman, G., Rajnarayan, D. G., Waslander, S. L., Dostla, D., Jang, J. S., & Tomlin, C. J. (2004). The stanford testbed of autonomous rotorcraft for multi-agent control (starmac). In
*Proceedings of the 23rd digital avionics systems conference*.Google Scholar - Hoffmann, G., Waslander, S., & Tomlin, C. (2008). Quadrotor helicopter trajectory tracking control. In
*AIAA guidance, navigation and control conference and exhibit, Honolulu, Hawaii*(pp. 1–14). Citeseer.Google Scholar - How, J., Bethke, B., Frank, A., Dale, D., & Vian, J. (2008). Real-time indoor autonomous vehicle test environment.
*Control Systems, IEEE*,*28*(2), 51–64.MathSciNetCrossRefzbMATHGoogle Scholar - Kerrigan, E., & Maciejowski, J. (2000). Soft constraints and exact penalty functions in model predictive control. In
*Control 2000 conference, Cambridge*.Google Scholar - LaValle, S. M., & Kuffner, J. J. (1999). Randomized kinodynamic planning. In
*Proceedings of international conference on robotics and automation*.Google Scholar - Limon, D., Alvarado, I., Alamo, T., & Camacho, E. (2008). MPC for tracking piecewise constant references for constrained linear systems.
*Automatica*,*44*(9), 2382–2387.MathSciNetCrossRefzbMATHGoogle Scholar - Liu, C., & Chen, W.-H. (2013). Hierarchical path planning and flight control of small autonomous helicopters using mpc techniques. In
*Intelligent vehicles symposium (IV), 2013 IEEE*(pp. 417–422). IEEE.Google Scholar - Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for environment mapping.
*Autonomous Robots*,*4*(4), 333–349.CrossRefGoogle Scholar - Maciejowski, J. M. (2002).
*Predictive control with constraints*. Englewood Cliffs: Prentice Hall.zbMATHGoogle Scholar - Maeder, U., Borrelli, F., & Morari, M. (2009). Linear offset-free model predictive control.
*Automatica*,*45*(10), 2214–2222.MathSciNetCrossRefzbMATHGoogle Scholar - Mataric, M. J. (1992). Integration of representation into goal-driven behavior-based robots.
*IEEE Transactions on Robotics and Automation*,*8*(3), 304–312.CrossRefGoogle Scholar - Michael, N., Mellinger, D., Lindsey, Q., & Kumar, V. (2010). The grasp multiple micro-uav testbed.
*Robotics & Automation Magazine, IEEE*,*17*(3), 56–65.CrossRefGoogle Scholar - Milam, M. B., Mushambi, K., & Murray, R. M. (2000). A new computational approach to real-time trajectory generation for constrained mechanical systems. In
*Proceedings of the IEEE conference on decision and control*(pp. 845–851).Google Scholar - Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization.
*The Computer Journal*,*7*(4), 308–313.MathSciNetCrossRefzbMATHGoogle Scholar - Nieuwenhuisen, M., Droeschel, D., Beul, M., & Behnke, S. (2014). Obstacle detection and navigation planning for autonomous micro aerial vehicles. In
*2014 international conference on unmanned aircraft systems (ICUAS)*(pp. 1040–1047). IEEE.Google Scholar - Reif, J. H. (1979). Complexity of the movers problem and generalizations. In
*20th IEEE symposium on the foundations of computer science*(pp. 421–427).Google Scholar - Richards, A. (2015). Fast model predictive control with soft constraints.
*European Journal of Control*,*25*, 51–59.MathSciNetCrossRefzbMATHGoogle Scholar - Richards, A., & Boyle, P. (2010). Combining planning and learning for autonomous vehicle navigation. In
*AIAA guidance, navigation, and control conference*.Google Scholar - Richards, A. G., & How, J. P. (2002). Aircraft trajectory planning with collision avoidance using mixed integer linear programming. In
*Proceedings of American control conference*.Google Scholar - Schouwenaars, T., How, J. P., & Feron, E. (2004). Receding horizon path planning with implicit safety guarantees. In
*Proceedings of the American control conference*.Google Scholar - Sharma, S. (2011). Qcqp-tunneling: Ellipsoidal constrained agent navigation. In
*IASTED international conference on robotics*.Google Scholar - Sharma, S., & Taylor, M. E. (2012). Autonomous waypoint generation strategy for on-line navigation in unknown environments.
*environment*, 2:3D.Google Scholar - Sutton, R. S., & Barto, A. G. (1998).
*Reinforcement learning: An introduction*(Vol. 1). Cambridge: Cambridge University Press.Google Scholar - Tatjewski, P. (2014). Disturbance modeling and state estimation for offset-free predictive control with state-space process models.
*International Journal of Applied Mathematics and Computer Science*,*24*(2), 313–323.MathSciNetCrossRefzbMATHGoogle Scholar - Vitus, M., Pradeep, V., Hoffmann, G., Waslander, S., & Tomlin, C. (2008). Tunnel-milp: Path planning with sequential convex polytopes. In
*AIAA guidance, navigation, and control conference*.Google Scholar - Wang, Y., & Boyd, S. (2010). Fast model predictive control using online optimization.
*IEEE Transactions on Control Systems Technology*,*18*(2), 267–278.CrossRefGoogle Scholar - Williams, B., Cummins, M., Neira, J., Newman, P., Reid, I., & Tardós, J. (2009). A comparison of loop closing techniques in monocular slam.
*Robotics and Autonomous Systems*,*57*(12), 1188–1197.CrossRefGoogle Scholar - Yamauchi, B. (1997). A frontier-based approach for autonomous exploration. In
*1997 IEEE international symposium on computational intelligence in robotics and automation, 1997. CIRA’97., Proceedings*(pp. 146–151). IEEE.Google Scholar - Yang, J., Shi, Y., & Rong, H.-J. (2016). Random neural q-learning for obstacle avoidance of a mobile robot in unknown environments.
*Advances in Mechanical Engineering*,*8*(7), 1687814016656591.Google Scholar - Zhang, T., Kahn, G., Levine, S., & Abbeel, P. (2016). Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search. In
*2016 IEEE international conference on robotics and automation (ICRA)*(pp. 528–535). IEEE.Google Scholar

## Copyright information

**OpenAccess**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.