Natural Feature-based Visual Servoing for Grasping Target with an Aerial Manipulator

Aerial transportation and manipulation have attracted increasing attention in the unmanned aerial vehicle field, and visual servoing methodology is widely used to achieve autonomous aerial grasping of a target object. However, existing marker-based solutions pose a challenge to the practical application of target grasping owing to the difficulty of attaching markers to targets. To address this problem, this study proposes a novel image-based visual servoing controller based on natural features instead of artificial markers. The natural features are extracted from the target images and further processed to provide servoing feature points. A six degree-of-freedom (6-DoF) aerial manipulator system is proposed, and its differential kinematics is deduced to achieve aerial grasping. Furthermore, a controller is designed for the case in which the target object is outside the manipulator's workspace by utilizing the degrees of freedom of both the unmanned aerial vehicle and the manipulator joints. Thereafter, a weight matrix is used as the basis to develop a multi-tasking visual servoing framework that integrates the controllers inside and outside the manipulator's workspace. Lastly, experimental results are provided to verify the effectiveness of the proposed approach.


Introduction
The Unmanned Aerial Manipulator (UAM), a type of Unmanned Aerial Vehicle (UAV) equipped with a multiple Degree-of-Freedom (DoF) robotic arm, is bio-inspired by flying birds and has attracted increasing attention in robotics research. UAMs have immense potential for various applications, including express transportation [1-3], construction and maintenance [4-6], manipulation [7-9], and cooperative operations [10-12] in places that are dangerous or difficult to reach for humans or ground mobile robots. Grasping is an important application for mobile manipulators. Compared with manipulators based on mobile ground robots, UAMs continue to present significant challenges in perception and control, mainly because of the considerably complex kinematics/dynamics and motion constraints of the coupled UAV-manipulator system.
Several approaches have been developed to achieve aerial grasping. Garimella et al. [13] proposed a nonlinear model-predictive control method to exploit the multi-body system dynamics and achieve optimized performance. In Ref. [14], a controller with a multi-level architecture was proposed, in which the outer loop comprises a trajectory generator and an impedance filter that modifies the trajectory to achieve compliant behavior in the end-effector space. Seo et al. [15] developed a method of generating locally optimal trajectories for grasping in constrained environments. In Ref. [16], an online active-set strategy was applied to generate a feasible trajectory of the manipulator joints, and a series of tasks with defined limitations and constraint inequalities was implemented. In Ref. [17], the trajectory planning of aerial grasping was modeled as a multi-objective optimization problem in which motion constraints and collision avoidance were considered. The aforementioned approaches require a priori knowledge of the target position and are difficult to apply when the target's position is unknown. In addition, their reference trajectories are generated in advance, which may lead to grasping failure when the target or the UAM itself moves.
By contrast, some approaches achieve grasping on the basis of visual information, thereby compensating for disturbances in the controlled dynamic system. The existing approaches generally rely on extracting, tracking, and matching a set of visual features, such as points, lines, circles, or moments. Thomas et al. [18,19] proposed a vision-based controller inspired by a rapidly moving hawk catching fish. The grasping controller was accomplished by mapping the dynamics of a quadrotor to a virtual image plane, thereby enabling dynamically feasible trajectory planning in the image space; however, the target to be grasped must be covered with a pure color. Kim et al. [20] developed an Image-Based Visual Servoing (IBVS) controller with image moments to obtain velocity control on the basis of the dynamic model of the entire system. Seo et al. [21] formulated the visual servoing problem as a stochastic model-predictive control framework to grasp a cylindrical object using an Aerial Manipulator (AM). Lippiello et al. [22,23] proposed a hybrid Visual Servoing (VS) scheme with task-priority control by sequentially considering several constraints [24]. In Ref. [25], an uncalibrated IBVS strategy was developed on the basis of a safety-related primary task. In Ref. [26], a complete second-order visual-impedance control law was designed for a dual-arm UAM, in which the visual information from the camera of one robotic arm was used to assist the task executed by the other arm. Fang et al. [17] modeled the grasping operation as an optimization problem and utilized visual information to compensate for the motion disturbance from the UAV body.
Note that the existing vision-based approaches for aerial grasping require artificial markers attached to target objects. However, these approaches cannot be applied in most practical applications because of the inconvenience of attaching artificial markers to targets. Accordingly, some approaches have developed direct VS, which does not use classic image features but other image information, such as the luminance of all pixels [27], histograms [28], and deep neural networks [29]. In Ref. [27], the luminance of all pixels in the image was considered, which requires no tracking or matching process. In Ref. [28], the probability of occurrence of intensity, color, or gradient orientation in an image was calculated as a histogram and used for VS. In Ref. [29], a convolutional neural network was used to estimate the relative position and orientation between the current and desired images, and a Position-Based Visual Servoing (PBVS) controller was used to reach the desired pose. However, these approaches are developed for large targets and suffer from high computational cost. Consequently, applying them directly to aerial grasping is difficult.
The present study aims to develop a novel VS controller for real-time aerial grasping control using natural features. The features of the VS are extracted in real time from a camera mounted on an end-effector of a manipulator, thereby avoiding the need to attach artificial markers on targets. To further address the limited workspace of the manipulator, we develop a hybrid VS framework on the basis of a weight matrix, thereby achieving VS control either outside or inside the manipulator's workspace. The main contributions of this research are threefold.
(1) The differential kinematics of the proposed AM is derived to describe the relationship between the camera's velocity and the UAM's joint and body velocities. This kinematics underpins the design of the VS controller.
(2) A novel VS controller is designed for aerial grasping based on ORB features, which can be extracted faster than SIFT and SURF features. The designed controller can grasp an object successfully without attached markers. Experiments illustrate the effectiveness of the new controller compared with a controller relying on attached artificial markers.
(3) A hybrid servoing strategy is further developed to address the limited workspace of the manipulator. When the UAM is distant from the target, one of the manipulator joints is used for servoing to retain the target in the camera view. Once the target is within the manipulator's workspace, only the manipulator is used for servoing to grasp the target. The two control processes are combined into a hybrid formulation through a weight matrix.

Modeling of AM
Inspired by flying birds, a UAM is developed in the current study to achieve aerial grasping (Fig. 1). The UAM consists of an X8 coaxial octocopter and a 6-DoF serial manipulator with a camera mounted on the end-effector. The 3D CAD model of the UAM system is labeled with the coordinate frames (Fig. 2). Let $\Sigma_o$ be the inertial frame, $\Sigma_b$ the body-fixed frame of the octocopter, $\Sigma_a$ the base frame of the manipulator, $\Sigma_e$ the end-effector frame of the manipulator, $\Sigma_c$ the camera frame, $\Sigma_t$ the object frame, and $\Sigma_d$ the object's handle frame. Let $\boldsymbol{q} \in \mathbb{R}^{6}$ be the angle vector of the manipulator joints. We define the system state $\boldsymbol{x}_s$ by collecting the octocopter pose and the manipulator joint angles; its derivative $\dot{\boldsymbol{x}}_s$ contains the linear velocity and the angular velocity $\boldsymbol{\omega}_b$ of frame $\Sigma_b$ together with the joint angular velocity vector $\dot{\boldsymbol{q}}$. Fig. 3 shows the coordinate system derived from the Denavit-Hartenberg (DH) parameters of the manipulator. Table 1 provides the DH parameters.
Given the DH parameters, the pose of the end-effector in the manipulator's base frame $\Sigma_a$ is obtained by chaining the link transformation matrices

$$ {}^{i-1}\boldsymbol{T}_{i} = \begin{bmatrix} \cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i & a_i\cos\theta_i \\ \sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i & a_i\sin\theta_i \\ 0 & \sin\alpha_i & \cos\alpha_i & d_i \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

The differential kinematics of the manipulator is required to propagate the camera velocity derived from the VS controller to the state velocity $\dot{\boldsymbol{x}}_s$. By the definition of a serial manipulator's forward kinematics [30], the relationship between the velocities of the joints and the end-effector is

$$ {}^{a}\boldsymbol{v}_{e} = {}^{a}\boldsymbol{J}_{e}(\boldsymbol{q})\,\dot{\boldsymbol{q}} $$

where ${}^{a}\boldsymbol{v}_{e}$ denotes the velocity of the end-effector in the base frame $\Sigma_a$ and the geometric Jacobian matrix is

$$ {}^{a}\boldsymbol{J}_{e} = \begin{bmatrix} \boldsymbol{J}_{p1} & \boldsymbol{J}_{p2} & \cdots & \boldsymbol{J}_{p6} \\ \boldsymbol{J}_{o1} & \boldsymbol{J}_{o2} & \cdots & \boldsymbol{J}_{o6} \end{bmatrix} $$

in which $\boldsymbol{J}_{pi} \in \mathbb{R}^{3\times1}$ and $\boldsymbol{J}_{oi} \in \mathbb{R}^{3\times1}$ denote the position and orientation Jacobian components, respectively.
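For concreteness, the following minimal Python/NumPy sketch evaluates the DH chain above to obtain the end-effector pose in $\Sigma_a$. The `dh_params` values are illustrative placeholders, not the manipulator's actual Table 1 parameters.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform between consecutive DH frames (matrix above)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_params):
    """End-effector pose in the manipulator base frame Sigma_a."""
    T = np.eye(4)
    for q_i, (d, a, alpha) in zip(q, dh_params):
        T = T @ dh_transform(q_i, d, a, alpha)
    return T

# Placeholder (d, a, alpha) triplets for the six links -- NOT Table 1 values.
dh_params = [(0.05, 0.10, np.pi / 2)] * 6
T_ae = forward_kinematics(np.zeros(6), dh_params)
```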
To facilitate the model deduction, the geometric Jacobian matrix is transformed into the analytical Jacobian matrix $\boldsymbol{J}_e(\boldsymbol{q})$ with respect to the end-effector coordinate frame:

$$ \boldsymbol{J}_e(\boldsymbol{q}) = \boldsymbol{J}_t\;{}^{a}\boldsymbol{J}_e(\boldsymbol{q}) $$

where $\boldsymbol{J}_t \in \mathbb{R}^{6\times6}$ denotes the transformation matrix from the geometric Jacobian matrix to the analytical Jacobian matrix.
During aerial manipulation, the movements of the octocopter body and the equipped manipulator are combined and affect each other. The velocity $\boldsymbol{v}_e \in \mathbb{R}^{6\times1}$ of the manipulator's end-effector in its own coordinate frame is the resultant of the velocity vectors contributed by the octocopter body and by the manipulator motion. For the case where only the octocopter body moves, the contribution to $\boldsymbol{v}_e$ is

$$ \boldsymbol{v}_e = \begin{bmatrix} {}^{e}\boldsymbol{R}_{b} & S({}^{e}\boldsymbol{t}_{b})\,{}^{e}\boldsymbol{R}_{b} \\ \boldsymbol{0}_{3\times3} & {}^{e}\boldsymbol{R}_{b} \end{bmatrix} \boldsymbol{v}_{o} = \boldsymbol{J}_1\,\boldsymbol{v}_{o} $$

where ${}^{e}\boldsymbol{R}_{b}$ and ${}^{e}\boldsymbol{t}_{b}$ denote the orientation and position, respectively, of the body-fixed frame in the end-effector coordinate frame; $\boldsymbol{v}_{o}$ stacks the linear and angular velocity vectors in the body-fixed frame of the UAV; and $S(\cdot)$ denotes the antisymmetric (skew-symmetric) matrix operation.
For another special case where the octocopter hovers and only the manipulator moves, $\boldsymbol{v}_e$ is obtained from Eqs. (3) and (5) as

$$ \boldsymbol{v}_e = \boldsymbol{J}_e(\boldsymbol{q})\,\dot{\boldsymbol{q}} $$

Combining Eqs. (6)-(8), the differential kinematic model of the UAM system is derived as

$$ \boldsymbol{v}_e = \boldsymbol{J}_1\,\boldsymbol{v}_{o} + \boldsymbol{J}_2\,\dot{\boldsymbol{q}} \tag{9} $$

where $\boldsymbol{J}_1$ is the body Jacobian matrix introduced above and $\boldsymbol{J}_2 = \boldsymbol{J}_e(\boldsymbol{q})$ is the joint Jacobian matrix.
As an under-actuated system, only the position and yaw angle of the octocopter are controllable. Accordingly, we define the controlled variables as a new vector $\boldsymbol{v}_c = [v_x, v_y, v_z, \omega_z, \dot{q}_1, \ldots, \dot{q}_6]^{T}$ and the uncontrollable angular velocities as $\boldsymbol{v}_{uc} = [\omega_x, \omega_y]^{T}$. The value of $\boldsymbol{v}_{uc}$ can be measured using the onboard Inertial Measurement Unit (IMU). Under the assumption of a classical time-scale separation between the attitude controller and the velocity controller [22], Eq. (9) is rewritten as

$$ \boldsymbol{v}_e = \boldsymbol{J}_c\,\boldsymbol{v}_c + \boldsymbol{J}_{uc}\,\boldsymbol{v}_{uc} \tag{10} $$

where $\boldsymbol{J}_c \in \mathbb{R}^{6\times10}$ and $\boldsymbol{J}_{uc} \in \mathbb{R}^{6\times2}$ are the task Jacobian matrices for the controlled and uncontrolled state variables, respectively. In practical applications, the UAV generally remains stable (i.e., the pitch and roll angles approximate 0). Therefore, we assume that the uncontrolled state variables do not affect the velocity, and Eq. (10) is approximated as

$$ \boldsymbol{v}_e \approx \boldsymbol{J}_c\,\boldsymbol{v}_c \tag{11} $$
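As a minimal sketch of how Eqs. (9)-(11) can be evaluated numerically, the snippet below assumes the body and joint Jacobians $\boldsymbol{J}_1$ and $\boldsymbol{J}_2$ have already been computed; the body-twist column ordering $[v_x, v_y, v_z, \omega_x, \omega_y, \omega_z]$ is an assumption for illustration.

```python
import numpy as np

def end_effector_velocity(J1, J2, v_o, q_dot):
    """Full differential model (Eq. (9)): body and joint contributions."""
    return J1 @ v_o + J2 @ q_dot          # v_o: 6-vector body twist, q_dot: 6-vector

def controlled_jacobian(J1, J2):
    """Task Jacobian J_c of Eq. (11): keep the [vx, vy, vz, yaw rate] columns
    of J1 plus the six joint-rate columns of J2; the roll/pitch rate columns
    are dropped, since they are treated as negligible near hover."""
    return np.hstack([J1[:, 0:3], J1[:, 5:6], J2])   # 6 x 10
```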

Visual servoing based on natural features
The VS methodology [26] can continuously and effectively control a robot to approach a target on the basis of the target's image or position information. To control the AM to automatically grasp a target object, VS technology is utilized owing to its high performance. In addition, the manipulator degrades the UAV stability owing to the force coupling between the UAV body and the manipulator; thus, the grasping control needs to be carefully designed. This problem is also addressed by the VS control technology: because the tracking errors of feature points in the image frame are used to achieve the grasping control, the force coupling between the UAV body and the manipulator does not affect the grasping process. The traditional IBVS and PBVS approaches [31] typically use artificial patterns, such as color patches or QR tags, to obtain features and location information. Such patterns must be attached to a target; however, attaching an artificial pattern to an object is generally impossible in many practical applications. To address this problem, the current study designs a novel approach that uses natural features on the targets instead of attached artificial patterns. Fig. 4 illustrates the algorithm framework of the proposed natural feature-based VS controller for aerial grasping. In this section, the VS controller for aerial grasping is first designed, followed by the feature detection method that provides the critical feedback information for the servoing controller.

Design of the visual servoing control law
IBVS is utilized to realize the VS control of aerial grasping because it exhibits the advantages of low computational burden and high accuracy. Similar to the classic controller design [31], we utilize feature points in the image coordinate frame to control the end-effector toward its target pose. The challenge of this study, which differs from the existing research, lies in the natural feature detection and in the coupled effect between the UAV body and the manipulator. From the camera model [31], a feature point with normalized image coordinates $(u, v)$ and depth $Z$ satisfies

$$ \begin{bmatrix} \dot{u} \\ \dot{v} \end{bmatrix} = \boldsymbol{L}_i\,\boldsymbol{v} $$

where $\boldsymbol{v}$ denotes the camera velocity and $\boldsymbol{L}_i \in \mathbb{R}^{2\times6}$ $(i = 1, \ldots, 4)$ is the interaction matrix of the $i$th feature:

$$ \boldsymbol{L}_i = \begin{bmatrix} -\dfrac{1}{Z} & 0 & \dfrac{u}{Z} & uv & -(1+u^{2}) & v \\[1mm] 0 & -\dfrac{1}{Z} & \dfrac{v}{Z} & 1+v^{2} & -uv & -u \end{bmatrix} \tag{13} $$

Stacking the four features yields $\dot{\boldsymbol{X}} = \boldsymbol{L}_c\,\boldsymbol{v}$, where $\dot{\boldsymbol{X}} \in \mathbb{R}^{8\times1}$ denotes the velocity vector of the four points in the image frame and $\boldsymbol{L}_c \in \mathbb{R}^{8\times6}$ denotes the stacked interaction matrix. The error between the current position $\boldsymbol{X}$ and the desired constant position $\boldsymbol{X}^{*}$ is defined as

$$ \boldsymbol{e} = \boldsymbol{X} - \boldsymbol{X}^{*} \tag{14} $$

Thereafter, we obtain $\dot{\boldsymbol{e}} = \boldsymbol{L}_c\,\boldsymbol{v}$.
We design $\dot{\boldsymbol{e}} = -\lambda\,\boldsymbol{e}$ to guarantee exponential asymptotic convergence of the error. Lastly, the VS control law is designed as

$$ \boldsymbol{v} = -\lambda\,\boldsymbol{L}_c^{+}\,\boldsymbol{e} \tag{17} $$

where $\boldsymbol{L}_c^{+}$ denotes the pseudo-inverse of $\boldsymbol{L}_c$. Eq. (13) shows that the image coordinates of the four features and their depth values must be obtained in each control loop. The following subsection presents the calculation of the features' image coordinates and depth values based on natural features.
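The control law above can be implemented directly. The sketch below builds the stacked interaction matrix from the four servoing points and computes the camera twist of Eq. (17); normalized image coordinates and estimated depths are assumed as inputs, and the gain value is illustrative.

```python
import numpy as np

def interaction_matrix(u, v, Z):
    """Interaction matrix L_i of one normalized image point (Eq. (13))."""
    return np.array([
        [-1.0 / Z, 0.0, u / Z, u * v, -(1.0 + u * u), v],
        [0.0, -1.0 / Z, v / Z, 1.0 + v * v, -u * v, -u],
    ])

def ibvs_camera_twist(points, points_des, depths, lam=0.5):
    """VS law of Eq. (17): v = -lambda * pinv(L_c) @ e.
    points, points_des: (4, 2) normalized image coordinates;
    depths: (4,) estimated depths of the current feature points."""
    L_c = np.vstack([interaction_matrix(u, v, Z)
                     for (u, v), Z in zip(points, depths)])   # stacked, 8 x 6
    e = (points - points_des).reshape(-1)                     # error, Eq. (14)
    return -lam * np.linalg.pinv(L_c) @ e                     # 6-DoF camera twist
```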

Servoing points generation based on the ORB feature extractor
Many feature extraction algorithms have been developed in the field of computer vision [32], among which SIFT [33], SURF [34], and ORB [35] are the three most popular. Although SIFT and SURF features have good robustness and stability, they are difficult to run in real time for feature detection. ORB features are highly invariant to viewpoint and robust for feature matching across different visual views. Compared with SIFT and SURF, ORB exhibits faster computation and matching; the existing literature provides a detailed comparison [35]. The ORB feature extractor is applied in the current study to provide the natural features because viewpoint invariance and real-time performance are required for VS. ORB (Oriented FAST and Rotated BRIEF) builds on FAST corners [35] but assigns each corner an orientation. Furthermore, the rotated BRIEF feature descriptor is used to keep the feature matching process at a low computational burden and a high matching accuracy.
The original ORB extractor suffers from problems such as uneven distribution and feature redundancy. To address this concern, we utilize the quad-tree homogenization algorithm [36] to sparsify and homogenize the feature distribution on an image. To detect sufficient feature points in each region of an image, a small threshold is selected to re-detect ORB features in areas with only a few corners. Thereafter, the best key points are selected through non-maximum suppression based on the response values. A demonstration result is shown in Fig. 5b, where the distribution of point features is improved compared with the original one (Fig. 5a). In the figure, feature points are clustered on objects with uneven distribution under the original method, whereas the distribution of feature points is more uniform under the improved one. The improved distribution benefits the feature matching process. The flow diagram of the feature extraction process is illustrated in Fig. 6.
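As an illustration, the OpenCV sketch below approximates the quad-tree homogenization with a simpler uniform grid: ORB corners are re-detected with a relaxed FAST threshold when too few are found, bucketed into grid cells, and pruned by response. The grid granularity and threshold values are illustrative assumptions, not the paper's actual parameters.

```python
import cv2

def homogenized_orb(gray, n_cells=8, per_cell=4, relaxed_thresh=7):
    """Detect ORB corners, then keep only the strongest few per grid cell
    (non-maximum suppression by response), approximating the quad-tree step."""
    orb = cv2.ORB_create(nfeatures=2000, fastThreshold=20)
    kps = orb.detect(gray, None)
    if len(kps) < 100:                     # sparse image: relax the FAST threshold
        orb.setFastThreshold(relaxed_thresh)
        kps = orb.detect(gray, None)
    h, w = gray.shape
    buckets = {}
    for kp in kps:                         # bucket key points into grid cells
        key = (int(kp.pt[1] * n_cells / h), int(kp.pt[0] * n_cells / w))
        buckets.setdefault(key, []).append(kp)
    kept = []
    for cell in buckets.values():
        cell.sort(key=lambda k: k.response, reverse=True)
        kept.extend(cell[:per_cell])       # strongest responses per cell
    return orb.compute(gray, kept)         # (keypoints, descriptors)
```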
It is worth noting that ORB cannot provide 3D information with only a monocular system. In addition, ORB has difficulty providing features that remain stable over the entire servoing process, especially in environments with a high dynamic range. Therefore, the extracted ORB features cannot be used directly as the servoing points mentioned in Eq. (14). To solve this problem, we develop an efficient method for generating servoing points based on the homography matrix, which is calculated from the matched ORB features between frames. In a manner similar to that of the classical VS methods, an image containing the target object is selected as the desired image, and the target's size in the world coordinate frame is assumed to be known a priori. The ORB features of the target object are extracted from the desired image and stored for further feature matching. Four servoing points are also defined around the ORB features on the target. Note that these points are virtual and are not themselves detected ORB features.
The ORB features of the subsequent images captured from the camera are extracted in real time and matched with the stored ORB features of the desired image. Several pairs of ORB features are thereby obtained from the matching results. Given the feature pairs, a homography matrix $\boldsymbol{H} \in \mathbb{R}^{3\times3}$ between the two views of the planar target can be estimated, which satisfies

$$ s\,\tilde{\boldsymbol{x}} = \boldsymbol{K}\,\boldsymbol{E}\,\tilde{\boldsymbol{X}} = \boldsymbol{H}\,\tilde{\boldsymbol{X}} $$

where $s$ denotes the scale factor; $\boldsymbol{K} \in \mathbb{R}^{3\times4}$ represents the camera projection matrix, whose entries include the focal lengths $f_x$ and $f_y$ along the $x$- and $y$-axes, respectively; and $\boldsymbol{E} \in \mathbb{R}^{4\times3}$ denotes the truncated extrinsic matrix, with $R_{ij}$ denoting the $i$th-row, $j$th-column element of its rotation part. The homography matrix contains the information on the position and orientation of the target frame relative to the camera frame. Extracting the extrinsic information from the homography matrix requires additional parameters, including the camera intrinsic matrix and the physical size of the object. Because the size of the target object is known a priori, the extrinsic transformation $\boldsymbol{T} \in SE(3)$ between the current and desired camera frames can be calculated without scale ambiguity [37]. RANSAC is applied to produce robust results because outliers may exist in the feature pairs. Because the servoing points are defined a priori on the target in the desired image, the corresponding points in the current image are then calculated by reprojecting the desired servoing points into the current image using the extrinsic transformation $\boldsymbol{T}$.
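A sketch of this servoing-point generation step, assuming OpenCV: matched ORB features give a RANSAC-estimated homography, with which the four virtual servoing points are reprojected. For brevity, the sketch maps the points directly through $\boldsymbol{H}$ rather than decomposing $\boldsymbol{H}$ into the extrinsic transformation $\boldsymbol{T}$, which is equivalent for points lying on the target plane; the match-count threshold is an illustrative choice.

```python
import cv2
import numpy as np

def track_servoing_points(kps_des, des_des, kps_cur, des_cur, pts_des):
    """Match ORB features, estimate the desired-to-current homography with
    RANSAC, and reproject the four virtual servoing points.
    pts_des: (4, 2) servoing points defined in the desired image."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_des, des_cur)
    if len(matches) < 8:                    # too few pairs for a robust estimate
        return None
    src = np.float32([kps_des[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kps_cur[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    if H is None:
        return None
    pts = np.float32(pts_des).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```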

Multi-task visual servoing strategy
The manipulator is mounted at the bottom of the UAV body; thus, the motions of the UAV body and the manipulator are coupled. Grasping while the UAV hovers is therefore easier to control than grasping while the UAV is in motion. However, when the UAV is distant from the target, the target lies outside the workspace of the manipulator, which prevents grasping from being accomplished with the UAV hovering. To solve this problem, the VS control methodology is utilized to drive the UAM system into the workspace, with the DoFs of the UAV and the manipulator controlled simultaneously. The camera at the end-effector is reused as the VS sensor for convenience. If only the DoFs of the UAV are used for VS, the target easily leaves the field of view of the camera. Thus, we use the manipulator joints to address this concern: the idea is to use the UAV to provide yaw servoing while using the manipulator to provide pitch servoing, thereby maintaining the field of view. Once the target enters the manipulator's workspace, the VS law (Eq. (17)) is utilized thereafter.
The manipulator has 6 DoFs, which are redundant for achieving pitch servoing; the 2nd, 3rd, and 5th joints can each be used to drive the camera and maintain the field of view (Fig. 3). A weight matrix is therefore introduced into Eq. (11) as

$$ \boldsymbol{v}_e = \boldsymbol{J}_c\,\boldsymbol{W}^{T}\,\boldsymbol{W}\,\boldsymbol{v}_c \tag{20} $$

where $\boldsymbol{W} \in \mathbb{R}^{n\times10}$ denotes the weight matrix, $\boldsymbol{W}\boldsymbol{v}_c$ pertains to the vector of actual control variables, and $n$ is the number of actual control variables. Matrix $\boldsymbol{W}$ is used to activate different control variables according to the control strategy. For simplicity, we use only one of the manipulator joints to provide the additional freedom for the camera. The 5th joint is selected for pitch-angle servoing because the 2nd and 3rd joints are distant from the camera's installation position and their movements cause a substantial change in the camera's field of view. The corresponding weight matrix is then

$$ \boldsymbol{W} = \begin{bmatrix} \boldsymbol{I}_{4\times4} & \boldsymbol{0}_{4\times4} & \boldsymbol{0}_{4\times1} & \boldsymbol{0}_{4\times1} \\ \boldsymbol{0}_{1\times4} & \boldsymbol{0}_{1\times4} & 1 & 0 \end{bmatrix} \tag{21} $$

With this choice, Eq. (20) simplifies to

$$ \boldsymbol{v}_e = \begin{bmatrix} \boldsymbol{J}_1(1\!:\!3) & \boldsymbol{J}_1(6) & \boldsymbol{J}_2(5) \end{bmatrix}\,\boldsymbol{W}\boldsymbol{v}_c \tag{22} $$

where $\boldsymbol{J}_1(1\!:\!3)$ denotes the first three columns of $\boldsymbol{J}_1$; $\boldsymbol{J}_1(6)$ stands for the 6th column of $\boldsymbol{J}_1$; and $\boldsymbol{J}_2(5)$ represents the 5th column of $\boldsymbol{J}_2$. Moreover, $\boldsymbol{J}_1$ and $\boldsymbol{J}_2$ are defined in Eqs. (7) and (8), respectively. With this $\boldsymbol{W}$, the UAV's state variables $x$, $y$, $z$, and yaw, and the angular velocity of the 5th manipulator joint are controlled. On the basis of the controller (Eq. (22)), the UAM approaches the target object. Once the target object is located within the workspace of the manipulator, the VS control is switched to utilize the manipulator only while the UAV keeps hovering. These two control tasks (i.e., VS inside and outside the manipulator's workspace) can be designed in a common framework by using a weight matrix. Thus, we design the corresponding weight matrix as

$$ \boldsymbol{W} = \begin{bmatrix} \boldsymbol{0}_{6\times4} & \boldsymbol{I}_{6\times6} \end{bmatrix} \tag{23} $$

Lastly, because the camera is rigidly mounted on the end-effector, the camera velocity commanded by Eq. (17) can be propagated through Eq. (20), which yields the following hybrid control law:

$$ \boldsymbol{W}\boldsymbol{v}_c = -\lambda\,\big(\boldsymbol{L}_c\,\boldsymbol{J}_c\,\boldsymbol{W}^{T}\big)^{+}\,\boldsymbol{e} \tag{24} $$

The desired VS controls inside and outside the manipulator's workspace are achieved by selecting the corresponding weight matrices in Eqs. (23) and (21), respectively.
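A minimal sketch of the hybrid law of Eq. (24), assuming the stacked interaction matrix $\boldsymbol{L}_c$ and the task Jacobian $\boldsymbol{J}_c$ are available; the weight matrices follow Eqs. (21) and (23), and the ordering of $\boldsymbol{v}_c$ matches the controlled-variable vector defined earlier.

```python
import numpy as np

def hybrid_vs_command(L_c, J_c, e, inside_workspace, lam=0.5):
    """Hybrid VS law (Eq. (24)); v_c is ordered [vx, vy, vz, yaw, qd1..qd6]."""
    if inside_workspace:
        W = np.hstack([np.zeros((6, 4)), np.eye(6)])   # Eq. (23): joints only
    else:
        W = np.zeros((5, 10))                          # Eq. (21): UAV + joint 5
        W[[0, 1, 2, 3], [0, 1, 2, 3]] = 1.0            # UAV x, y, z, yaw
        W[4, 8] = 1.0                                  # 5th manipulator joint
    u = -lam * np.linalg.pinv(L_c @ J_c @ W.T) @ e     # active commands W @ v_c
    return W.T @ u                                     # embedded back into v_c
```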

Experimental results
Several experiments are performed in this section to verify the effectiveness of the proposed system. The proposed UAM platform includes a coaxial eight-propeller structure (Fig. 1). The selected manipulator motors are Dynamixel servo motors, and the manipulator links are custom-built using 3D printing for lightweight considerations. The total length of the manipulator is 0.5 m, with a total weight of 0.9 kg. An Intel NUC computer is equipped onboard for image processing and control implementation. A motion capture system consisting of 10 OptiTrack cameras running at 120 Hz provides the UAV's pose information. A camera is attached to the end-effector of the manipulator for VS control. Note that the entire control process is automatic except for the start command triggered by a human operator. Fig. 7 shows the signal flow of the developed AM system.

Evaluation of the ORB-based servoing point detection
An experiment was first designed to evaluate the detection of servoing points, which were detected on the target object without attached artificial markers. A target object was selected for demonstration. The reference image and the detected servoing points are presented in Fig. 8a, where the small circles represent the detected ORB features. In addition, the camera was moved 1.5 m away and rotated to different angles (Figs. 8b-8d). The results indicate that the target was successfully matched and detected in different views. Moreover, the servoing points were provided stably even with a complex background and large camera movements.

Aerial grasping experiment
A validation experiment for aerial grasping was further performed on the basis of the robust detection of the servoing points. The grasping experiment was divided into three steps: VS outside the manipulator's workspace, VS inside the manipulator's workspace, and grasping. Fig. 9 shows the images of the grasping experiment. The AM automatically took off and flew toward the target under the proposed VS controller (Eq. (24)) with the weight matrix (Eq. (21)); the target was initially 3 m away from the AM and outside the manipulator's workspace. When the camera was 0.4 m away from the target, the UAV hovered to perform VS and drove the end-effector to approach the target. When the camera moved to a distance of 20 cm from the target, an open-loop grasping control based on the AM's dynamics was triggered to grasp the target object because the camera could no longer see the target. Lastly, the object was grasped successfully. Refer to the supplementary video for the grasping procedure.
To verify the performance, VS using artificial markers [17,26] as servoing features was also performed for comparison. Without loss of generality, an AprilTag was attached to the target as the marker. Figs. 10 and 11 illustrate the convergence performance of the AprilTag-based and natural feature-based visual servoing approaches, respectively, in terms of the camera's translational and rotational velocities, the joints' rotation speeds, and the tracking errors, when the UAV was hovering within the AM's workspace. The servoing points' trajectories in the image plane are also illustrated. From the figures, the convergence results of the servoing points in the image plane and of the joint angles in the joint space are similar: natural feature-based VS achieves the same control effect as the AprilTag-based approach, but without attaching a pattern to the target. Fig. 11 shows that the camera and joint velocities converged to zero, and Figs. 11c and 11d show that the positions of the point features eventually converged to the desired positions. Some errors remain, possibly caused by factors such as hand-eye calibration error, camera calibration error, and static error. Small errors in the image frame correspond to small position errors of the end-effector in the world frame; therefore, the end-effector was able to grasp the target object successfully.

Conclusion
This study develops an AM system that achieves object grasping without artificial landmarks on the target. The kinematic model of the proposed system is first deduced as the basis for the design of the VS controller. Thereafter, a novel VS controller is designed by utilizing ORB features detected from the captured images; this natural feature-based method does not require attaching artificial markers to targets. In addition, a VS controller for the case in which the target is outside the manipulator's workspace is developed by utilizing the DoFs of the UAV and the manipulator joints. By involving a weight matrix, the two VS controllers inside and outside the manipulator's workspace are further unified into a common framework. Lastly, experiments are carried out to verify the effectiveness of the proposed approach.
This paper considers only target objects with rich texture; however, targets may lack surface texture in practical applications. In addition, a motion capture system is used for the UAV stability control, which is generally unavailable in the field. To realize fully autonomous aerial grasping, further research is still needed on UAV localization and environmental 3D perception, robust object detection, grasping force control, and safety mechanisms. Therefore, our future work will focus on developing algorithms to promote the autonomy of the aerial manipulator system.

Acknowledgment
This study was partially supported by grants from the National Natural Science Foundation of China (Nos. U1713206, 61673131 and 51975550), the Bureau of Industry and Information Technology of Shenzhen (No. 20170505160946600), and Hong Kong Research Grant Council (No. 14204814).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.