A Survey on 3D Visual Tracking of Multicopters

Three-dimensional (3D) visual tracking of a multicopter (where the camera is fixed while the multicopter is moving) means continuously recovering the six-degree-of-freedom pose of the multicopter relative to the camera. It can be used in many applications, such as precision terminal guidance and control algorithm validation for multicopters. However, it is difficult for many researchers to build a 3D visual tracking system for multicopters (VTSMs) by using cheap and off-the-shelf cameras. This paper first gives an overview of the three key technologies of a 3D VTSMs: multi-camera placement, multi-camera calibration and pose estimation for multicopters. Then, some representative 3D visual tracking systems for multicopters are introduced. Finally, the future development of the 3D VTSMs is analyzed and summarized.


Introduction
Multicopters have been widely used in recent years [1,2] , e.g., in aerial photography, goods transportation and search and rescue. Accurate and robust pose estimation (or motion estimation) of these vehicles is a crucial issue for their autonomous operation. With advantages in the aspects of accuracy, weight, cost, and applicable environment, vision sensors have become a popular choice for providing location (or three-dimensional (3D) tracking) results for multicopters [3,4] .
Note that 3D visual tracking means continuously recovering the six-degree-of-freedom pose of an object relative to the camera (the camera is fixed while the object is moving) or the camera relative to the scene (the scene is fixed while the camera is moving) [5] . Considering that small multicopters often feature CPUs with limited capabilities, this paper focuses on studying the former case. Compared to 3D visual tracking, the traditional 2D visual tracking aims at continuously recovering the size, the centroid or the trajectory of the object in the image [6,7] , but does not involve recovering the 3D position of the object. From the perspective of aims, 3D visual tracking goes further than 2D visual tracking and is more challenging. The relationship between 3D visual tracking and 2D visual tracking is shown in Fig. 1.
Although there are some commercial products for 3D visual tracking, such as Vicon [8] and OptiTrack [9] , they are expensive and proprietary. Moreover, these 3D visual tracking systems are not specially designed for multicopters and do not consider the force characteristics of multicopters (the thrust force is perpendicular to the propeller plane). As a result, the robustness of pose estimation for multicopters is limited. Therefore, researchers may want to build their own 3D visual tracking systems for multicopters by using cheap and off-the-shelf cameras.
As shown in Fig. 2, we have built a 3D visual tracking system for multicopters by using four MUC36M (MGYYO) infrared cameras equipped with four AZURE-0420MM lenses and four infrared light-emitting diodes (LEDs). These cameras are synchronized by a hardware synchronizer. The markers attached to the quadrotor do not emit light; they only reflect the infrared light emitted by the LEDs so that they can be detected by the cameras. Then, the camera images are transferred to a computer to compute the quadrotor pose. The estimated pose is sent to another computer to calculate the control command. Based on the above steps, the closed-loop control of the quadrotor is implemented.
Note that there are three key technologies in a 3D visual tracking system for multicopters (VTSMs): 1) Multi-camera placement (how to compute the optimal camera placement off-line); 2) Multi-camera calibration (how to effectively compute the intrinsic and extrinsic parameters of multiple cameras off-line); 3) Pose estimation for multicopters (how to robustly estimate the pose of multicopters on-line). To build a 3D VTSMs, researchers need to be familiar with these technologies. The main contribution of this paper is to give an overview of the key technologies. This paper aims to be helpful for researchers who want to build a 3D VTSMs by using cheap and off-the-shelf cameras.
Note that a 3D VTSMs generally consists of multiple fixed cameras because: 1) By fusing the information from multiple cameras, the total field of view (FOV) could be increased and the overall accuracy and robustness of pose estimation could be improved [10] ; 2) Compared to onboard vision, using fixed cameras enables the adoption of higher quality imaging devices and more sophisticated vision algorithms (there is no need to consider the constraints related to limited payload and onboard computational capabilities) [11] . The 3D VTSMs could provide accurate six-degree-of-freedom pose of single or multiple multicopters, and it is usually used as a testbed to provide a quick navigation solution for testing and evaluating flight control and guidance algorithms. This paper is organized as follows. An overview of multi-camera placement methods and multi-camera calibration methods is given in Sections 2 and 3, respectively. Then, in Section 4, a review of pose estimation methods for multicopters is presented, followed by the introduction of some 3D visual tracking systems for multicopters. Finally, challenging issues and conclusions are given in Sections 6 and 7, respectively.

Camera selection
To build a 3D VTSMs, two types of cameras can be used: visible-light cameras and infrared cameras. The images of visible-light cameras have rich information, but they are sensitive to illumination and do not facilitate marker detection. Therefore, it is better to use infrared cameras (850 nm) together with infrared markers, the same as Vicon and OptiTrack. A sample image of a MUC36M (MGYYO) infrared camera is shown in Fig. 3. By using an infrared camera, infrared markers in Fig. 4 could be easily detected because: 1) The light reflected by the markers lies in the infrared spectrum; 2) The outside light could be minimized by setting the exposure time of the camera to a small value.
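The detection step described above can be illustrated with a minimal sketch. It assumes that, with a short exposure time, the marker reflections are the only pixels above a brightness threshold, and it groups those pixels into blobs whose centroids serve as marker image points. The function name `detect_markers`, the threshold and the minimum blob size are illustrative assumptions and are not taken from any of the systems discussed in this paper.

```python
def detect_markers(image, threshold=200, min_pixels=2):
    """Return (row, col) centroids of bright 4-connected blobs.

    `image` is a grayscale image given as a list of rows of intensities.
    Blobs smaller than `min_pixels` are treated as noise and dropped.
    """
    rows, cols = len(image), len(image[0])
    visited = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if visited[r][c] or image[r][c] < threshold:
                continue
            # Flood-fill one blob starting from this bright pixel.
            stack, pixels = [(r, c)], []
            visited[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not visited[ny][nx]
                            and image[ny][nx] >= threshold):
                        visited[ny][nx] = True
                        stack.append((ny, nx))
            if len(pixels) >= min_pixels:
                cy = sum(p[0] for p in pixels) / len(pixels)
                cx = sum(p[1] for p in pixels) / len(pixels)
                centroids.append((cy, cx))
    return centroids


# Synthetic 6 x 8 image: two marker blobs and one isolated noisy pixel.
image = [[0] * 8 for _ in range(6)]
for y, x in ((1, 1), (1, 2), (2, 1), (2, 2)):
    image[y][x] = 255
for y, x in ((4, 5), (4, 6)):
    image[y][x] = 250
image[0][7] = 255  # single bright pixel, rejected by min_pixels

markers = detect_markers(image)
```

In practice, the threshold and the minimum blob size would have to be tuned to the LED power, the marker diameter and the exposure time.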

Existing methods
The placement (position and orientation) of multiple cameras determines the volume of the 3D visual tracking system and the 3D reconstruction accuracy of feature points. Therefore, it is very important to optimize the multi-camera placement. The cameras can be placed in the following ways: 1) Attached to tripods if the cameras are often moved; 2) Attached to the ceiling or a rigid structure if the cameras are rarely moved. The second way is recommended in real experiments since the cameras' positions and orientations are then less likely to shift. For commercial motion-capture systems such as OptiTrack, some multi-camera placement examples are given to the users [12]. For cheap and off-the-shelf cameras, a camera placement example with cameras attached to the ceiling is shown in Fig. 5.
In the literature on stereo-vision reconstruction, the camera placement problem has been studied and related methods can be roughly divided into the following two categories: 1) Generate-and-test methods [13,14] . These methods, also called trial-and-error methods, are the original methods of solving the camera placement problem. The principle is to first generate the parameters of the cameras, and then estimate them with respect to task constraints. A target-centric grid sphere (see Fig. 6 [15] ) is usually used to discretize the observation space. The radius of the However, calculating and searching the high-dimensional grid parameter space requires a lot of computations, and there is a problem with the grid size ratio (sampling rate).
2) Synthesis methods [16][17][18] . These methods are also called the constrained optimization methods, which use analytical functions to model constraints (i.e., constructing constraint functions) and task requirements (i.e., constructing the objective function) so that the camera parameters satisfying the constraints can be directly computed. Compared to generate-and-test methods, synthesis methods can be combined with various optimization techniques and actively understand the relationships between the camera parameters to be planned and the task requirements, rather than searching exhaustively.
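To make the generate-and-test idea concrete, the following minimal sketch generates candidate camera positions on a target-centric grid sphere and exhaustively tests every camera pair. The score used here (the sine of the angle between the two viewing rays to the target, which is largest when the rays are perpendicular) is only an illustrative stand-in for the task constraints used in the cited works, and all function names are hypothetical.

```python
import itertools
import math


def sphere_grid(radius, n_azimuth, n_elevation):
    """Generate candidate positions on the upper half of a target-centric sphere."""
    candidates = []
    for i in range(n_azimuth):
        az = 2.0 * math.pi * i / n_azimuth
        for j in range(1, n_elevation + 1):
            # Elevations strictly between 0 and 90 degrees.
            el = (math.pi / 2.0) * j / (n_elevation + 1)
            candidates.append((radius * math.cos(el) * math.cos(az),
                               radius * math.cos(el) * math.sin(az),
                               radius * math.sin(el)))
    return candidates


def pair_score(p, q):
    """Sine of the angle between the viewing rays from p and q to the origin."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.sqrt(1.0 - cos_angle * cos_angle)


def best_pair(candidates):
    """Test every pair of candidates and keep the one with the highest score."""
    return max(itertools.combinations(candidates, 2),
               key=lambda pq: pair_score(*pq))


candidates = sphere_grid(radius=3.0, n_azimuth=8, n_elevation=2)
best = best_pair(candidates)
best_score = pair_score(*best)
```

The number of pairs grows quadratically with the grid resolution (and faster for more cameras), which illustrates the computational burden and the sampling-rate issue mentioned above.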

Discussions
Synthesis methods are now popular for solving the multi-camera placement problem. However, most of the existing synthesis methods focus on optimizing the positioning accuracy of 3D feature points. This is suitable for pose estimation of static rigid objects. For moving rigid objects, however, we should consider not only the positioning accuracy of 3D feature points on the rigid objects but also the motion characteristics of the rigid objects (including multicopters). In this way, the pose estimation accuracy of moving rigid objects (including multicopters) could be further improved.
Note that the camera placement problem still needs to be studied. It is related to many factors, such as the field of view of the camera, the power of the infrared LEDs and the diameter of the marker. Therefore, there is no general solution for the camera placement problem. Based on our experience and the advice given by OptiTrack, a simple and effective method for researchers is to place the cameras uniformly, as in Fig. 5.

Existing methods
Since there are errors in the multi-camera placement in practice and the camera intrinsic parameters are unknown, it is necessary to perform multi-camera calibration to accurately compute the intrinsic parameters (principal point, lens distortion, etc.) and the extrinsic parameters (rotation matrix and translation vector between the camera coordinate system and the reference coordinate system) of each camera. Multi-camera calibration is the basis of pose estimation for multicopters. The pose estimation accuracy of a 3D VTSMs is directly determined by the calibration accuracy of the multiple cameras, so the process of multi-camera calibration is very important. According to the dimension of the calibration object, multi-camera calibration methods can be roughly divided into the following six categories.

Three-dimensional calibration methods
Three-dimensional calibration methods require a 3D calibration object with known geometry in 3D space. For example, a calibration object with known 3D geometric information (see Fig. 7) is imaged by a single camera. Note that the 3D calibration object can also be made by using several 2D calibration patterns [19]. Constraint equations are established according to the correspondences between the 3D points of the calibration object and the image points, in order to perform camera calibration. This kind of method can calibrate the intrinsic and extrinsic parameters of multiple cameras simultaneously, but the calibration object is not easy to manufacture and must be placed in the common field of view of the multiple cameras.

Two-dimensional calibration methods
The calibration object commonly used in two-dimensional (2D) calibration methods is a checkerboard pattern with black and white squares (see Fig. 8). Multiple images of the checkerboard pattern are taken from different views, and camera calibration is achieved by establishing constraint equations based on the corresponding relationships between the space points and the image points of the planar pattern. This kind of method is easy to use and does not require motion information of the planar pattern. For a monocular camera, the typical calibration method is Zhang's method [20] that can estimate the intrinsic and extrinsic parameters of the camera with radial distortion. This method requires the camera to take a few (at least two) images of the planar pattern from different orientations, and the intrinsic parameters of the camera are constrained by the homography matrix of each image. Zhang's method is a two-step method, i.e., first computing the initial values of some parameters linearly, then using the maximum likelihood criterion to optimize the computation results with radial distortion considered. Finally, the extrinsic parameters are obtained by using the camera intrinsic matrix and the homography matrix.
Most of the existing 2D calibration methods for multiple cameras are an extension of Zhang's method. The problem is that it is not easy for the planar pattern to be observed by all the cameras, so it is difficult to obtain the extrinsic parameters accurately. However, some efforts have been made to solve this problem. As shown in Fig. 9, Theobalt et al. [21] put the planar pattern on the floor in order to make it visible to all the cameras on the ceiling. They chose a corner of the calibration pattern as the origin of the inertial (or world) coordinate system. Then the intrinsic and extrinsic parameters of each camera can be easily obtained by using Zhang's method.
If the calibration pattern is not placed to be observed simultaneously by all the cameras, transformations of the extrinsic parameter matrix are needed. The optical center of one camera is usually chosen as the origin of the world coordinate system, and extrinsic parameters of the other cameras are computed relative to the world coordinate system. In order to ensure the accuracy of calibration results, it is necessary to perform a global optimization, or directly establish a multi-camera calibration model [22] .
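The transformation of extrinsic parameters mentioned above can be sketched as follows. Assume two cameras A and B both observe the same pattern, so that a pattern point X_w maps to X_a = R_a X_w + t_a in camera A and to X_b = R_b X_w + t_b in camera B. Taking camera A as the world frame, the pose of camera B relative to A is R_ba = R_b R_a^T and t_ba = t_b - R_ba t_a. The helper names and numerical values below are hypothetical and only serve to check this relation; no global optimization is performed.

```python
import math


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def transform(R, t, X):
    """Apply the rigid transformation X -> R X + t."""
    return [a + b for a, b in zip(matvec(R, X), t)]


def relative_pose(R_a, t_a, R_b, t_b):
    """Pose of camera B relative to camera A from their poses w.r.t. one pattern."""
    R_ba = matmul(R_b, transpose(R_a))
    t_ba = [b - a for b, a in zip(t_b, matvec(R_ba, t_a))]
    return R_ba, t_ba


# Made-up pattern-to-camera extrinsics for two cameras.
R_a, t_a = rot_z(0.3), [0.1, -0.2, 2.0]
R_b, t_b = rot_z(-0.5), [-0.4, 0.3, 2.5]
R_ba, t_ba = relative_pose(R_a, t_a, R_b, t_b)

# A pattern point seen in camera A must map onto the same point seen in camera B.
X_w = [0.2, 0.1, 0.0]
X_a = transform(R_a, t_a, X_w)
X_b = transform(R_b, t_b, X_w)
X_b_from_a = transform(R_ba, t_ba, X_a)
```

Chaining several such pairwise transformations accumulates errors, which is why a subsequent global optimization is needed.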

One-and-half-dimensional calibration methods
As shown in Fig. 10, Sun et al. [23] used a one-and-half-dimensional (1.5D) calibration object (between a one-dimensional calibration object and a two-dimensional calibration object) to calibrate the intrinsic and extrinsic parameters of multiple cameras. The 1.5D calibration object has five points in the form of "+", similar to two one-dimensional calibration objects bound together. In this method, the calibration object moves freely and a linear solution is obtained first. Then, the accuracy of the linear solution is improved by using nonlinear optimization. However, only simulation experiments are given in the paper.

One-dimensional calibration methods
The first one-dimensional (1D) calibration method was proposed by Zhang [24], which uses a calibration object consisting of three or more collinear points with known distances (see Fig. 11). Six or more images of the 1D calibration object are taken to achieve camera calibration. However, this method needs one point to be fixed, and only allows the 1D calibration object to rotate around the fixed point. To improve the accuracy of Zhang's method [24], Wang et al. [25] proposed a 1D calibration method based on the heteroscedastic errors-in-variables (HEIV) model.
For multiple synchronized cameras, Mitchelson and Hilton [26] proposed a 1D calibration method for calibrating the intrinsic and extrinsic parameters simultaneously, without limiting the motion of the 1D calibration object in the common field of view of all cameras. In this method, stereo calibration is first performed to calculate the initial values of intrinsic and extrinsic parameters, assuming that the principal points of stereo cameras are known or have reasonable values. An iterative bundle adjustment method is then used to optimize the intrinsic and extrinsic parameters of all the cameras. Kurillo et al. [27] studied the problem of initial estimation and global optimization of extrinsic parameters for multiple synchronized cameras with known intrinsic parameters.
In addition, Wang et al. [28,29] proposed a method to linearly calibrate the intrinsic parameters of multiple synchronized cameras based on a freely-moving 1D calibration object. However, this method requires that the 1D calibration object moves in the common field of view of all the cameras. For synchronized multiple perspective cameras (obeying the pinhole camera model), Fu et al. [30,31] proposed a calibration method based on a freely-moving 1D calibration object (see Fig. 12). This method can simultaneously compute the intrinsic and extrinsic parameters of each camera and does not require the 1D calibration object to move in the common field of view of all the cameras. They also extended this method to a generic calibration method [32], which is suitable not only for synchronized multiple perspective cameras but also for synchronized multiple fish-eye cameras.

Point calibration methods
Svoboda et al. [33] developed a point calibration toolbox for calibrating the intrinsic and extrinsic parameters of at least three cameras simultaneously. As shown in Fig. 13, the calibration object used is made of a red or green transparent plastic covering a standard laser emitter. The only thing needed is to move the calibration object in the space to be calibrated, and the rest of the work is done automatically by the computer. The calibration object does not need to be observed by all the cameras simultaneously during the movement, because the method includes an algorithm that uses knowledge such as epipolar geometry to fill in the points that cannot be observed. The calibration accuracy is high, with about 1/5 pixel reprojection error, and some researchers have carried out relevant verification work [34]. The advantage of this method is that the calibration object is simple and the calibration process is highly automatic. The disadvantage is that a relatively dark environment is required so that the calibration object can be easily distinguished from the background.

Self-calibration methods
Self-calibration methods [35][36][37] usually calibrate the intrinsic and extrinsic parameters of multiple cameras by using the point correspondences among the images without relying on any calibration object. Therefore, they are also called zero-dimensional calibration methods. For example, Bruckner et al. [35] proposed an active self-calibration method for multi-camera systems consisting of pan-tilt zoom cameras, which exploited the rotation information provided by the pan-tilt unit and did not require any artificial calibration object or user interaction. Nguyen and Lhuillier [37] designed a self-calibration method for a moving multi-camera system, which simultaneously estimates intrinsic parameters, inter-camera poses, etc.
Self-calibration methods are more flexible than the other kinds of methods, but these methods are nonlinear and require complex computations without prior knowledge such as geometry information about the scene and motion information about the cameras. The calibration accuracy of these methods is not high (reprojection errors are usually less than 5 pixels). Therefore, self-calibration methods are not suitable for calibrating the cameras in a 3D VTSMs.

Discussions
Comparison of the multi-camera calibration methods mentioned above is shown in Table 1. The calibration accuracy is evaluated by the camera reprojection error. It can be concluded that compared to the other kinds of methods, 1D calibration methods are very suitable for multi-camera calibration due to their advantages such as being simple to manufacture, not requiring common FOV of all cameras and no self-occlusion problem. In addition, a 1D camera calibration toolbox for generic multiple cameras is published as open-source (available at http://rfly.buaa.edu.cn/resources.html) so that other researchers can use the toolbox. However, most of the existing 1D calibration methods are designed for hardware-synchronized multiple cameras. These methods are no longer suitable for unsynchronized cameras (e.g., wired cameras without a hardware synchronizer or wireless cameras). Practical 1D calibration methods for calibrating unsynchronized multiple cameras need to be proposed in the future.
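Since the calibration accuracy is evaluated by the reprojection error, a minimal sketch of how this quantity can be computed is given below. It assumes an ideal pinhole camera without lens distortion, with hypothetical intrinsic parameters fx, fy, cx, cy and extrinsic parameters (R, t), and reports the root-mean-square distance between the observed and the reprojected image points. The values are made up for illustration.

```python
import math


def project(point_w, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    Xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def rms_reprojection_error(points_w, observations, R, t, fx, fy, cx, cy):
    """Root-mean-square pixel distance between observed and reprojected points."""
    squared_sum = 0.0
    for point_w, (u, v) in zip(points_w, observations):
        pu, pv = project(point_w, R, t, fx, fy, cx, cy)
        squared_sum += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(squared_sum / len(points_w))


# Hypothetical calibration result: identity rotation, camera 2 m from the points.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0

points_w = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.25, 0.0)]
# Detected image points; the first one is off by (3, 4) pixels.
observations = [(323.0, 244.0), (520.0, 240.0), (320.0, 340.0)]

error = rms_reprojection_error(points_w, observations, R, t, fx, fy, cx, cy)
```

Here only the first point contributes a 5-pixel error, so the RMS error over three points is sqrt(25/3), i.e., about 2.89 pixels.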

Existing methods
According to whether there are markers (point markers are commonly used) on the rigid object, pose estimation methods for rigid objects can be divided into marker-based pose estimation methods and marker-free pose estimation methods. At present, marker-based pose estimation methods are often used, so this section will focus on reviewing the marker-based methods. In computer vision, estimating the pose of a calibrated camera from 3D-2D point correspondences is known as the Perspective-n-Point (PnP) problem [38]. It is easy to transform the problem of pose estimation for rigid objects into a PnP problem. Therefore, existing marker-based pose estimation methods for rigid objects (including multicopters) can be roughly divided into the following three categories.

Linear methods
Linear methods used to solve the PnP problem had high computation complexity in the early years, but recently there are some linear methods with computation complexity of O(n), which can handle arbitrary point sets. The first method is EPnP [39,40] (Efficient Perspective-n-Point), which converts the PnP problem into the problem of solving the 3D coordinates of four control points. It only considers the distance constraints among the four control points, and finally uses a simple linearization method to solve the derived quadratic polynomial. Then some methods with the computation complexity of O(n) have improved the accuracy of EPnP by replacing the linearization method with a polynomial solver. For example, the Direct-Least-Squares (DLS) method [41] establishes a nonlinear objective function and derives a fourth-order polynomial equation, which is solved by the Macaulay matrix method [42]. The main disadvantage of the DLS method is that there are singular points in the parameterization of the rotation matrix. In order to solve this problem, Zheng et al. [43] proposed an optimal Perspective-n-Point (OPnP) method that adopts a quaternion-parameterized rotation matrix and solves the polynomial equations based on the Gröbner basis. For multiple cameras, Martinez et al. [44,45] built a real-time vision system consisting of three ground cameras to estimate the pose of a rotary-wing unmanned aerial vehicle and then controlled it to achieve some tasks. The pose estimation method they used is a linear 3D reconstruction method [46] based on the perspective imaging model.
Note that the advantage of linear methods is that they are simple and intuitive. The disadvantage is that they are sensitive to noise.
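As a simple illustration of closed-form linear 3D reconstruction from multiple calibrated cameras, the following sketch triangulates a marker from two viewing rays by the midpoint method: it solves a 2 x 2 linear system for the closest points on the two rays and returns their midpoint. This is a stand-in chosen for brevity and is not claimed to be the method of [46]; the camera centers and the marker position are made-up values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = [q - p for p, q in zip(c1, c2)]          # vector from c1 to c2
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    e, f = dot(w, d1), dot(w, d2)
    det = b * b - a * c
    if abs(det) < 1e-12:
        raise ValueError("viewing rays are (nearly) parallel")
    # Normal equations: a*s - b*t = e and b*s - c*t = f.
    s = (b * f - c * e) / det
    t = (a * f - b * e) / det
    p1 = [p + s * d for p, d in zip(c1, d1)]
    p2 = [p + t * d for p, d in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two camera centers and the viewing rays towards a marker at (0.2, 0.3, 4.0).
marker = [0.2, 0.3, 4.0]
c1, c2 = [-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
d1 = [m - c for m, c in zip(marker, c1)]
d2 = [m - c for m, c in zip(marker, c2)]

estimate = triangulate_midpoint(c1, d1, c2, d2)
```

With noise-free rays the estimate coincides with the marker position. With noisy image measurements the two rays no longer intersect, and the estimate degrades accordingly, which reflects the noise sensitivity noted above.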

Iterative methods
Iterative methods solve the PnP problem by optimizing an objective function involving all the point correspondences. The commonly used objective function is a geometric error. For example, Faessler et al. [47] built a monocular-vision pose estimation system to control a quadrotor. The pose estimation method they used is the Perspective-3-Point (P3P) algorithm [48] followed by minimizing reprojection errors. In addition to geometric errors, algebraic errors can be used to make the methods more efficient. For example, Lu et al. [49] proposed an orthogonal iterative method for solving the PnP problem, which minimizes the line-of-sight deviations of 3D-2D point correspondences.
Based on the orthogonal iterative method [49] , Xu et al. [50] derived a generalized orthogonal iteration algorithm for multiple cameras. In this method, feature points observed by all the cameras can be used. Assa and Janabi-Sharifi [10] proposed a pose estimation method for multiple cameras based on virtual visual servoing (VVS), and designed two fusion structures, namely centralized and decentralized fusion. They pointed out that the centralized fusion structure offers higher accuracy at the cost of increased computation, while the decentralized fusion structure improves the computation speed at the price of lower accuracy.
Compared to linear methods, iterative methods are more accurate and robust, but they are computationally more intensive and prone to falling into local minima.
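The line-of-sight deviation minimized by the orthogonal iterative method can be sketched as follows. For an image vector v_i, the matrix V_i = v_i v_i^T / (v_i^T v_i) projects a 3D point onto the corresponding line of sight, and the error is the sum of ||(I - V_i)(R p_i + t)||^2 over all points. For a fixed rotation R, setting the gradient with respect to t to zero gives the closed-form translation t = (nI - sum V_i)^(-1) sum (V_i - I) R p_i. The sketch below covers only this translation step, not the rotation update of the full iteration, and all names and test values are illustrative assumptions.

```python
def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]


def los_projector(v):
    """Projection matrix onto the line of sight spanned by image vector v."""
    n2 = sum(x * x for x in v)
    return [[v[i] * v[j] / n2 for j in range(3)] for i in range(3)]


def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def solve3(A, b):
    """Solve the 3 x 3 system A x = b by Cramer's rule."""
    d = det3(A)
    return [det3([[b[i] if j == k else A[i][j] for j in range(3)]
                  for i in range(3)]) / d for k in range(3)]


def optimal_translation(R, points, image_vectors):
    """Translation minimizing the line-of-sight error for a fixed rotation R."""
    n = len(points)
    A = [[float(n) if i == j else 0.0 for j in range(3)] for i in range(3)]
    rhs = [0.0, 0.0, 0.0]
    for p, v in zip(points, image_vectors):
        V, Rp = los_projector(v), matvec(R, p)
        VRp = matvec(V, Rp)
        for i in range(3):
            rhs[i] += VRp[i] - Rp[i]
            for j in range(3):
                A[i][j] -= V[i][j]
    return solve3(A, rhs)


def object_space_error(R, t, points, image_vectors):
    """Sum of squared line-of-sight deviations over all correspondences."""
    total = 0.0
    for p, v in zip(points, image_vectors):
        q = [a + b for a, b in zip(matvec(R, p), t)]
        Vq = matvec(los_projector(v), q)
        total += sum((a - b) ** 2 for a, b in zip(q, Vq))
    return total


# Synthetic check: four markers, identity rotation, known translation.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_true = [0.1, -0.2, 3.0]
points = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.3, 0.0), (0.1, 0.1, 0.2)]
image_vectors = []
for p in points:
    q = [a + b for a, b in zip(p, t_true)]
    image_vectors.append((q[0] / q[2], q[1] / q[2], 1.0))

t_est = optimal_translation(R, points, image_vectors)
residual = object_space_error(R, t_est, points, image_vectors)
```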

Recursive methods
Recursive methods depend on time filtering methods, especially extended Kalman filter (EKF) methods (the measurement model is nonlinear in the system states due to the camera imaging model). Wilson et al. [51] designed a position-based robot visual servoing control framework using monocular vision, in which the relative pose between the robot and the work piece is computed in real time based on the traditional EKF. The main problem of the traditional EKF method is that it does not perform well when the statistical characteristics of noises change or the initial state estimation is not good. In order to deal with varying noise statistical characteristics, Ficocelli and Janabi-Sharifi [52] proposed an adaptive extended Kalman filter (AEKF) to update the process-noise-covariance matrix. To facilitate the initialization of EKF, Shademan and Janabi-Sharifi [53] proposed an iterated extended Kalman filter (IEKF) for robotic visual servoing applications. In order to deal with poor noise statistical characteristics and poor initial state estimation simultaneously, Janabi-Sharifi and Marey [54] proposed an iterative adaptive extended Kalman filter (IAEKF) method using monocular vision. Then, Assa and Janabi-Sharifi [55] extended the IAEKF method to the multi-camera case to improve the accuracy of pose estimation and the robustness to camera motion and image occlusion.
In addition, Fu et al. [56] proposed a nonlinear constant-velocity process model that incorporates the characteristics of multicopters. Based on this process model and monocular vision observations, an EKF pose estimation method was designed. Observability analysis shows that this method is more robust to occlusion than the traditional EKF method (only two feature points are needed to achieve the six-degree-of-freedom pose estimation for multicopters), but it is not suitable for multiple cameras. For the optical tracking system with four wireless cameras in Fig. 5, Rasmussen et al. [57] proposed an EKF method to fuse the unsynchronized multi-camera observations.
Note that if pose estimation of multicopters is achieved using filtering methods (e.g., the Kalman filter), the filter equations can be written in the following general form:
$$\Sigma_1:\quad \begin{cases} x_k = f(x_{k-1}) + w_{k-1} \\ z_k = h(x_k, \theta) + v_k \end{cases}$$
where $x$ is the state vector including the pose of multicopters, $z$ is the measurement vector, $w$ and $v$ are the process noise and the measurement noise, and $\theta$ is the vector whose elements are the placement (position and orientation) parameters of multiple cameras. It is known from $\Sigma_1$ that the placement parameters of multiple cameras can indeed determine the estimation accuracy of the states (including the pose of multicopters). In fact, the state estimation accuracy of $\Sigma_1$ is related to the degree of observability of the system. The degree of observability is used to describe the observability of a linear or nonlinear system quantitatively. The larger the degree of observability is, the higher the accuracy of state estimation would be. It has been applied to solve the problem of sensor placement in many areas, such as aeronautics and astronautics [58,59], and power systems [60,61]. Therefore, it is promising to use the degree of observability as a performance index to optimize the placement parameters of multiple cameras.

Note that the commonly-used process model in the system $\Sigma_1$ is a linear constant-velocity process model applicable to many rigid objects [51,54,55], which is not a very appropriate model for multicopters. As shown in Fig. 14 (take quadrotors as an example), multicopters have their own motion characteristics, i.e., they are under-actuated systems with four independent inputs (a thrust force perpendicular to the propeller plane and three moments) and six coordinate outputs [1]. Compared to adopting the linear constant-velocity process model applicable to many rigid objects [51,54,55] in the system $\Sigma_1$, it is better to use the nonlinear constant-velocity process model featured with the characteristics of multicopters [56].
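For reference, the linear constant-velocity process model can be sketched with a Kalman filter on a single axis. The state is [position, velocity], the prediction assumes constant velocity over one sampling period, and, as a simplifying assumption, the position is measured directly. In a real 3D VTSMs the measurement would be the nonlinear camera projection of the markers, which is why an EKF is needed. The noise variances q and r, the sampling period and the function names are illustrative assumptions.

```python
def kf_predict(x, P, dt, q):
    """Predict with x_k = F x_{k-1}, F = [[1, dt], [0, 1]], Q = diag(0, q)."""
    x_pred = [x[0] + dt * x[1], x[1]]
    P_pred = [
        [P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1],
         P[0][1] + dt * P[1][1]],
        [P[1][0] + dt * P[1][1],
         P[1][1] + q],
    ]
    return x_pred, P_pred


def kf_update(x, P, z, r):
    """Update with a direct position measurement z = [1, 0] x + noise."""
    S = P[0][0] + r                       # innovation variance
    K = [P[0][0] / S, P[1][0] / S]        # Kalman gain
    y = z - x[0]                          # innovation
    x_new = [x[0] + K[0] * y, x[1] + K[1] * y]
    P_new = [
        [(1.0 - K[0]) * P[0][0], (1.0 - K[0]) * P[0][1]],
        [P[1][0] - K[1] * P[0][0], P[1][1] - K[1] * P[0][1]],
    ]
    return x_new, P_new


# A target moving at 0.5 m/s along one axis, sampled every 0.1 s.
dt, q, r = 0.1, 1e-6, 1e-4
x, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for k in range(1, 51):
    x, P = kf_predict(x, P, dt, q)
    x, P = kf_update(x, P, 0.5 * dt * k, r)
```

After 50 steps the estimate is close to the true position of 2.5 m and the true velocity of 0.5 m/s. The model contains no information about how thrust and attitude couple to translational motion, which is the limitation that the multicopter-specific process model of [56] addresses.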

Discussions
Comparison of the pose estimation methods for multicopters mentioned above is shown in Table 2. It is found that compared to linear methods and iterative methods, recursive methods are accurate and computationally efficient, and are very suitable for image sequence processing. However, most of the existing recursive methods are designed for general rigid objects and monocular vision. Multicopters are under-actuated systems with four independent inputs (a thrust force perpendicular to the propeller plane and three moments) and six coordinate outputs [1] . Without considering the characteristics of multicopters, the accuracy and robustness of pose estimation for multicopters will be degraded. Therefore, new pose estimation methods based on the process model considering the characteristics of multicopters and synchronized or unsynchronized multiple cameras need to be designed.
Note that the pose estimation results can be sent to the quadrotor by Wi-Fi or Bluetooth communication. The transmission distance and bandwidth of Bluetooth communication are smaller than those of Wi-Fi communication. However, the power consumption of Wi-Fi communication is higher. The choice of communication depends on the application.

Existing systems
There are some representative 3D visual tracking systems for multicopters. The Flying Machine Arena (FMA) is a dual-purpose platform for both research and demonstrations, with fleets of small flying vehicles (mostly quadrotors) at the Swiss Federal Institute of Technology Zurich (ETH Zurich) [62,63]. The platform is designed similarly to the real-time indoor autonomous vehicle test environment (RAVEN) and the general robotics, automation, sensing and perception (GRASP) testbed at the Massachusetts Institute of Technology and the University of Pennsylvania, respectively, where all the agents communicate with a central network consisting of ground-based control computers and the agents. The control computers monitor the states of all the agents and communicate with them. Based on a motion-capture system consisting of eight 4-megapixel Vicon MX cameras, FMA enables prototyping of new control concepts and implementation of novel demonstrations. It has two versions: a permanent installation in Zurich with an impressively large size of 10 m × 10 m × 10 m and protective netting enclosing the workspace, and a mobile installation that has been exhibited at some public events. This platform primarily uses the Hummingbird quadrotor from Ascending Technologies as its flight vehicle, but other experimental systems (such as the distributed flight array [64] or the balancing cube [65]) can also be tested in it.
The University of Bologna has developed a multi-agent testbed [66] using an open-source and open-hardware quadrotor, i.e., Crazyflie 1.0 [67]. The core elements of the multi-agent testbed are the Crazyflie quadrotor, the OptiTrack system, the human-machine interface and the ground station. In this testbed, quadrotors can communicate with the ground station computer over Bluetooth. The ground station receives the position information of each quadrotor from a commercial OptiTrack motion-capture system (12 infrared cameras are employed). In addition, a human operator can communicate with the ground station using a joystick.
A multi-agent testbed called the Crazyswarm [68] has been developed by the University of Southern California. The testbed adopts the Crazyflie 2.0 quadrotor, which is the successor to the Crazyflie 1.0 quadrotor used in the multi-agent testbed of the University of Bologna. In Crazyswarm, quadrotors communicate with the ground station computer over a Bluetooth radio link with 39 quadrotors on just 3 radios. The pose of the quadrotors is given by a Vicon Vantage motion-capture system, which consists of 24 cameras with a working area of 6 m × 6 m × 3 m. However, the tracking software of Vicon is not used because the small physical size of the quadrotor makes it difficult to create many distinct marker arrangements. Instead, a tracking system based on the iterative closest point (ICP) algorithm [69] has been used, which allows every quadrotor to have the same marker arrangement. Reliable flights have been achieved with accurate tracking (less than 2 cm mean position error) by implementing the majority of computation onboard the quadrotor, including sensor fusion, control, and some trajectory planning. The software of Crazyswarm is published as open-source (available at https://github.com/USC-ACTLab/crazyswarm), making this work easily reusable for other researchers.
The Autonomous Vehicles Research Studio (AVRS) developed by Quanser is a solution for researchers who want to start a multi-vehicle (quadrotors and ground vehicles) research program in a short time [70]. Its quadrotor is the successor of the QBall 2 and is equipped with a powerful on-board Intel Aero Compute Board, multiple high-resolution cameras and built-in Wi-Fi capability. AVRS uses a commercial OptiTrack motion-capture system to locate the vehicles. This studio enables researchers to explore topics in advanced flight control, machine vision, simultaneous localization and mapping (SLAM), etc.
There are also some 3D visual tracking systems for multicopters that use low-cost off-the-shelf cameras instead of expensive commercial Vicon or OptiTrack cameras. For example, the Multi-Agent Test-bed for Real-time Indoor eXperiment (MATRIX) system was developed at Cranfield University to control an indoor unmanned aerial vehicle (UAV) [71,72]. It mainly consists of four parts: two FireWire charge-coupled device (CCD) cameras, a ground computer, onboard color markers, and quadrotors. Experimental results show that the MATRIX system provides pose estimates that are accurate and reliable enough to be used for controlling a quadrotor UAV.
In addition, a low-cost 3D visual tracking system for multicopters has been developed at Beihang University for the indoor control of quadrotors [73]. It consists of three MUC36M (MGYYO) infrared cameras (850 nm) and three infrared LEDs (with a power of up to 4 W). Experimental results demonstrate that, with the help of this 3D VTSMs, quadrotor hovering and line-tracking control can be achieved.

Discussions
A comparison of the 3D visual tracking systems for multicopters mentioned above is summarized in Table 3. The accuracy in the last column is evaluated by the marker positioning accuracy. Most of these systems are based on commercial motion capture systems (Vicon and OptiTrack), which are proprietary, expensive and not specially designed for multicopters. More effort needs to be put into research on 3D visual tracking systems for multicopters that use cheap and off-the-shelf cameras. In addition, most of the 3D visual tracking systems for multicopters adopt wired, hardware-synchronized cameras. This not only reduces the system frame rate [74] but also makes the system layout cumbersome. Therefore, 3D visual tracking systems for multicopters using wireless cameras need to be studied.

Challenging issues
As mentioned above, one research trend is to design 3D visual tracking systems for multicopters that use wireless cameras. One attempt, made in [57], is to design an EKF method to estimate the head and hand pose of users in a virtual environment. However, the multi-camera placement problem and the multi-camera calibration problem are not discussed in that work. Therefore, it would be promising to study how to effectively solve the optimal placement problem, the camera calibration problem and the robust pose estimation problem for multicopters when wireless cameras are used.
Another research trend is to design a 3D visual tracking system for multicopters that can be used outdoors. At present, most of the 3D visual tracking systems for multicopters can only be used indoors because of the interference of sunlight, which largely limits their application scenarios. Therefore, it would be important to study how to design an outdoor 3D visual tracking system for multicopters.
The third research trend is to design a 3D visual tracking system for multicopters that allows the cameras to move. At present, most of the 3D visual tracking systems require a fixed installation with multiple cameras, which largely limits their application scenarios. It would therefore be crucial to study how to design a 3D visual tracking system for multicopters with multiple movable cameras.

Conclusions
Three-dimensional visual tracking of multicopters plays an important role in the design and development of multicopters. This paper gives an overview of the three key technologies of a 3D visual tracking system for multicopters: multi-camera placement, multi-camera calibration and pose estimation for multicopters. Existing problems and development trends of 3D visual tracking systems are also pointed out. This paper aims to be helpful for researchers who want to build a 3D visual tracking system for multicopters by using cheap and off-the-shelf cameras.