Introduction

The growing interest in developing unmanned aerial vehicles (UAVs) is due to their versatility in applications such as rescue, transport, or surveillance. A particular type of UAV that has become popular recently is the micro aerial vehicle (MAV), thanks to its ability to fly in closed and confined spaces.

Robust guidance, navigation, and control systems for MAVs [1] depend on the input information obtained from on-board sensors such as cameras. Undesired movements are usually generated during flight as a result of the complex aerodynamic characteristics of the UAV. Unnecessary image rotations and translations appear in the video sequence, increasing the difficulty of controlling the vehicle.

There are multiple techniques in the literature [2–5] designed to compensate for the effects of undesired camera movements. Recently, the video stabilization algorithm ‘L1 Optimal’ provided by the YouTube editor was introduced in [6]. Another interesting proposal is Parrot’s Director Mode, implemented as an iOS (iPhone operating system) application for post-processing of videos captured with Parrot’s AR.Drones.

Usually, offline video stabilization techniques are divided into three stages:

  •  Local motion estimation

  •  Motion intention estimation

  •  Motion compensation

Local motion estimation

In this phase, the parameters that relate the uncompensated image to the image defined as reference are determined frame by frame. Optical flow [7, 8] and geometric transformation models [9–11] are two common approaches for local motion estimation. Our algorithm uses the latter.

Geometric transformation models are based on the estimation of the motion parameters. For this estimation, interest points must be detected and described. A list of techniques performing this task can be found in the literature [12–14], but Binary Robust Invariant Scalable Keypoints (BRISK) [15], Fast Retina Keypoint (FREAK) [16], Oriented FAST and Rotated BRIEF (ORB) [17], Scale Invariant Feature Transform (SIFT) [18], and Speeded Up Robust Features (SURF) [19] are common in computer vision problems [20]. We use SURF in this phase as a state-of-the-art algorithm, since our contribution does not focus on reducing the delay caused by interest point calculation, which is not significant; the delay due to smoothing techniques is higher.

The second part of the motion estimation process is matching interest points across consecutive frames. This is a critical part because the estimated motion parameters depend directly on the reliability of the matched points. False correspondences are removed using Random Sample Consensus (RANSAC), a widely used iterative technique for fitting a model to a set of points contaminated with outliers [21–24]. In our work, RANSAC uses a simple cost function based on gray level difference, minimizing the delay.

Motion intention estimation

In a second phase, to ensure coherence over the complete motion sequence, the parameters estimated previously are validated on the global motion and not just on the relative motion between consecutive frames. The main objective of motion intention estimation is to obtain the desired motion of the video sequence by suppressing high-frequency jitter from the accumulated global motion estimate.

Several motion smoothing methods are available for motion intention estimation, such as the particle filter [10], Kalman filter [11], Gaussian filter [25, 26], adaptive filter [26, 27], spline smoothing [28, 29], or point feature trajectory smoothing [30, 31]. In our approach, the control signal sent to the MAV is treated as known information; hence, a methodology new and different from those in the literature is considered. A combination of a second-order low-pass filter, using as few frames as possible, and the control action input is employed to estimate a reliable motion intention. We reduce the number of frames (time window) required by the smoothing signal through an optimization process.

Most of the techniques cited perform well in video stabilization applications, but there is an additional challenge, not studied in the literature and introduced in this paper, which we have called ‘phantom movement’.

Phantom movement

A phantom movement is essentially a false displacement generated in the scale and/or translation parameters by the compensation of high-frequency movements during motion smoothing. Sometimes the motion smoothing process removes real movements and/or delays them. Both cases are defined as phantom movements. This phenomenon is a problem when tele-operating the MAV, and its effects on other state-of-the-art algorithms will be shown in the ‘Results and discussion’ section.

Additionally, our proposal to solve this problem will be explained in the ‘Real-time video stabilization’ section.

Motion compensation

Finally, the current frame is warped using the parameters obtained in the previous estimation phase to generate a stable video sequence.
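
For illustration, the warping step can be sketched with OpenCV and NumPy. This is a minimal sketch, not the paper's exact implementation; it assumes the undesired similarity parameters (scale s, roll angle phi, translations t_x, t_y) have already been estimated for the current frame:

```python
import cv2
import numpy as np

def compensate_frame(frame, s, phi, tx, ty):
    """Warp `frame` so that the undesired similarity motion is undone.

    The 2x3 matrix is the top two rows of the nonreflective similarity
    transform used later in the paper; cv2.WARP_INVERSE_MAP applies its
    inverse mapping, removing the estimated motion from the image.
    """
    h, w = frame.shape[:2]
    H = np.array([[s * np.cos(phi), -s * np.sin(phi), tx],
                  [s * np.sin(phi),  s * np.cos(phi), ty]], dtype=np.float32)
    return cv2.warpAffine(frame, H, (w, h),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```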

This paper is organized as follows: the estimation of local motion parameters in our method is explained in the next section, where we also describe the combination of RANSAC and a gray level difference-based cost function for robust local motion estimation. In the ‘Motion intention estimation’ section, we present motion smoothing based on a low-pass filter. The ‘Real-time video stabilization’ section then focuses on optimizing the algorithm so that a minimum number of frames is used to estimate the motion intention, and proposes a novel approach to solve the problem of phantom movements. Experimental results and conclusions are presented in the final sections.

Robust local motion estimation

To obtain a stabilized video sequence, we estimate the inter-frame geometric transformation, i.e., the local motion parameters. In this phase, we determine the relationship between the current frame and the reference frame as a mathematical model. This process can be structurally divided into two parts: (a) interest point detection, description, and matching and (b) inter-frame geometric transformation estimation using matched points. An extra process, (c) robust cumulative motion parameters, ensures robustness.

Interest point detection, description, and matching

As mentioned in the previous section, there are several techniques for detecting and matching interest points. According to the results presented in [20], the computational cost of SURF is considerably lower than that of SIFT, with equivalent performance. Using the Hessian matrix and a scale-space function [32], SURF locates keypoints and describes their features using a 64-dimensional vector. Once the vector descriptors are obtained, interest points are matched by minimum Euclidean distance in the 64-dimensional feature space.
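
As a sketch of this stage, the detection, description, and minimum-distance matching can be reproduced with OpenCV (SURF lives in the opencv-contrib package and may require a build with non-free algorithms enabled; the Hessian threshold value below is illustrative):

```python
import cv2

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # 64-D descriptors

def detect_and_match(frame_ref, frame_cur):
    """Detect SURF keypoints in both frames and match their descriptors
    by minimum Euclidean distance (NORM_L2), with cross-checking."""
    kp_ref, des_ref = surf.detectAndCompute(frame_ref, None)
    kp_cur, des_cur = surf.detectAndCompute(frame_cur, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des_ref, des_cur), key=lambda m: m.distance)
    pts_ref = [kp_ref[m.queryIdx].pt for m in matches]
    pts_cur = [kp_cur[m.trainIdx].pt for m in matches]
    return pts_ref, pts_cur
```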

Inter-frame geometric transform estimation using matched points

From the matched interest points, motion parameters between the current and the reference frame can be estimated. Variations between two specific frames are mathematically expressed by the geometric transformation that relates feature points in the first frame with their correspondences in the second frame [33–35],

$$I_{sp} = H_t \cdot I_t \qquad (1)$$

where $I_{sp} = [x_{sp}, y_{sp}, 1]^T$ and $I_t = [x_t, y_t, 1]^T$ are the homogeneous coordinates of the interest points in the reference image and the uncompensated image, respectively, and $H_t$ is the 3 × 3 geometric transformation matrix. How to choose the reference image will be analyzed in the next subsection.

The geometric transformation represented by H t can be characterized by different models as appropriate. The most common models are translation, affine, and projective models.

In previous works [36, 37], we found experimentally that, both for handheld devices and for on-board cameras of flying robots, most of the undesired movements and parasitic vibrations in the image are significant only on the plane perpendicular to the roll axis (scale, rotation, and translations). This type of distortion can be modeled by a projective transformation, and we use the affine model in our algorithm as a particular case of the projective one [38]. The benefit is twofold: a lower computation time than for the projective model and the ability to extract the relevant motion parameters directly (scale, roll rotation, and translations in the xy-plane).

A general affine model would determine different roll angles from $H_t(1,1)$, $H_t(1,2)$, $H_t(2,1)$, and $H_t(2,2)$. Instead, we estimate the single mean angle that fits these values. The resulting model is called a nonreflective similarity and is a particular case of the affine model,

$$H_t = \begin{bmatrix} s\cos(\phi) & -s\sin(\phi) & t_x \\ s\sin(\phi) & s\cos(\phi) & t_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$
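
The following sketch shows how this model maps to code: composing $H_t$ from the four parameters and, conversely, recovering them from a fitted matrix (the angle recovered with arctan2 is the single mean roll angle referred to above):

```python
import numpy as np

def similarity_matrix(s, phi, tx, ty):
    """Nonreflective similarity H_t of Eq. (2)."""
    return np.array([[s * np.cos(phi), -s * np.sin(phi), tx],
                     [s * np.sin(phi),  s * np.cos(phi), ty],
                     [0.0,              0.0,             1.0]])

def decompose(H):
    """Recover (s, phi, tx, ty) from a fitted similarity matrix."""
    phi = np.arctan2(H[1, 0], H[0, 0])   # single roll angle
    s = np.hypot(H[0, 0], H[1, 0])       # isotropic scale
    return s, phi, H[0, 2], H[1, 2]
```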

Robust cumulative motion parameters

Robustness in our algorithm depends directly on the correct matching of interest points in consecutive frames. RANSAC (see Algorithm 1) is a reliable iterative technique for outlier rejection with respect to a mathematical model, in this case, the affine model.

An affine transform can be estimated from three pairs of noncollinear points, but SURF and other techniques obtain hundreds of such pairs. RANSAC is performed iteratively N times, using three pairs for each candidate $H_j$ and obtaining N different affine transforms. The value of N is chosen based on the required speed of the algorithm. An alternative is to use an accuracy threshold, but this procedure can be slower.

The cost function of the RANSAC algorithm is a key point. In our proposal, we use the absolute intensity difference, pixel by pixel, between the warped frame and the reference frame. The intensity of a pixel can be affected by common problems such as lighting changes; however, this effect is not significant between consecutive frames. Therefore, the parameters of the affine model $H_{opt}$ that minimize the cost function are

$$H_{opt} = \arg\min_{\phi,\, s,\, t_x,\, t_y} \sum_{j} \left| \mathrm{Frame}_j - \mathrm{Frame}_{sp} \right| \qquad (3)$$
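
A minimal sketch of this RANSAC loop with the gray-level cost is given below. It is illustrative rather than the paper's exact Algorithm 1: cv2.getAffineTransform fits a transform to the three sampled pairs, and grayscale frames are assumed:

```python
import cv2
import numpy as np

def ransac_graylevel(frame_cur, frame_ref, pts_cur, pts_ref, n_iters=50):
    """Fit a candidate transform to three random matched pairs per
    iteration and keep the one minimizing the absolute intensity
    difference against the reference frame (Eq. 3)."""
    h, w = frame_ref.shape[:2]
    pts_cur = np.float32(pts_cur)
    pts_ref = np.float32(pts_ref)
    best_H, best_cost = None, np.inf
    for _ in range(n_iters):
        idx = np.random.choice(len(pts_cur), 3, replace=False)
        H = cv2.getAffineTransform(pts_cur[idx], pts_ref[idx])
        warped = cv2.warpAffine(frame_cur, H, (w, h))
        cost = np.sum(np.abs(warped.astype(np.int32) -
                             frame_ref.astype(np.int32)))
        if cost < best_cost:
            best_H, best_cost = H, cost
    return best_H
```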

The affine model has been selected as the geometric transform between two frames, the one to be compensated and another used as reference, and there are several alternatives for selecting the reference frame. An experimental comparative study was carried out in [37] on three candidates for the reference frame: the initial frame ($\mathrm{Frame}_{sp} = \mathrm{Frame}_0$), the previous frame ($\mathrm{Frame}_{sp} = \mathrm{Frame}_{i-1}$), and the compensated previous frame ($\mathrm{Frame}_{sp} = \mathrm{Frame}'_{i-1}$). The analysis of the three approaches was performed using data obtained from the on-board camera of a real micro aerial vehicle. The results show that the previous frame is the best candidate for the reference.

Finally, the transformation matrix $H_{opt}$ is calculated and applied to the current frame to obtain a warped frame similar to the reference frame, i.e., a stable video sequence.

Motion intention estimation

The RANSAC algorithm, based on the minimization of the gray level difference, is enough to obtain high accuracy in the image compensation of static scenes (scenes without moving objects) [36, 37]. However, our goal is robust stabilization of video sequences obtained with on-board cameras of micro aerial vehicles. Most unstable videos captured with either flying robots or handheld devices contain dynamic scenes (scenes with moving objects), mainly due to camera motion. Hence, some movements of the capture device should not be eliminated but softly compensated, generating a stable motion video instead of a static scene.

The process of approximating the capture device’s movements is known as motion intention estimation. Several video stabilization algorithms use smoothing methods such as the Kalman filter, Gaussian filter, and particle filter for this purpose. Our approach is based on a second-order Butterworth filter, a low-pass filter used for signal smoothing [39].

Our experimental platform is a low-cost MAV, which shows complex dynamic behavior during indoor flight. Consequently, videos captured with its on-board camera usually contain significant displacements and high-speed movements on the plane perpendicular to the roll axis. Effects of wireless communication problems, such as frame-by-frame display, low-frequency video, and video freezing, should also be considered.

Using the low-pass filter as a motion intention estimator, most of the problems associated with indoor flight are avoided. However, freezing effects can still appear when communication quality is low. Motion parameters computed from frozen frames must be discarded before continuing with the estimation process.

Once the affine transformation parameters (scale, rotation, and translations in x and y) are extracted and the parameter values from frozen frames are removed, the low-pass filter computes the motion intention as an output free of high-frequency components. Low frequencies are associated with the intentional motion and high frequencies with undesired movements, so the cutoff frequency depends on the application and system characteristics. A higher cutoff frequency yields an output video similar to the original movements, including the undesired ones, while a lower value causes the output video to eliminate intentional movements. In our case, we use a second-order filter with the same cutoff frequency, 66.67 Hz, to smooth the signals of the four motion parameters. An alternative is to use a different filter for each motion parameter.
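
A sketch of this smoothing step with SciPy is shown below. The normalized cutoff Wn (a fraction of the Nyquist frequency) is illustrative, since in practice it is tuned to the capture system as discussed above; lfilter is used because it is causal, relying only on past samples:

```python
from scipy.signal import butter, lfilter

# Second-order Butterworth low-pass filter; Wn is an illustrative value.
b, a = butter(N=2, Wn=0.3, btype='low')

def motion_intention(accumulated):
    """Causally smooth one accumulated motion signal
    (scale, angle, tx, or ty) to estimate the motion intention."""
    return lfilter(b, a, accumulated)
```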

The undesired movement can be estimated by subtracting the motion intention from the accumulated motion, obtaining a high-frequency signal. This signal is then used in image warping to compensate vibrations while keeping intentional motions. Figure 1 shows, for the angle parameter, the motion intention signal estimated with the low-pass filter (top) and the high-frequency signal to be compensated (bottom, labeled as the warping parameter). Similar plots are obtained for the scale and the translations in the x-axis and y-axis (Figures 2, 3, and 4).

Figure 1. Angle. Top: accumulative (blue) and intentional (green) motion signals estimated with the low-pass filter. Bottom: high-frequency signal to be compensated.

Figure 2. Scale. Top: accumulative (blue) and intentional (green) motion signals estimated with the low-pass filter. Bottom: high-frequency signal to be compensated.

Figure 3. Translation in the x-axis. Top: accumulative (blue) and intentional (green) motion signals estimated with the low-pass filter. Bottom: high-frequency signal to be compensated.

Figure 4. Translation in the y-axis. Top: accumulative (blue) and intentional (green) motion signals estimated with the low-pass filter. Bottom: high-frequency signal to be compensated.

Real-time video stabilization

A robust post-processing algorithm for video stabilization has been detailed; however, the goal is a real-time version. In this context, it is worth noting that there are very few techniques for real-time video stabilization, the first challenge to be solved being computational cost. Calculation time is minimized in [40] by using efficient algorithms for interest point detection and description; that method reduces the time spent in motion intention estimation by means of a Gaussian filter without accumulative global motion, using the stabilized frames in addition to the original ones. Our proposal uses an off-line optimization process to obtain the minimum number of frames that allows real-time operation on the system at hand without decreasing the initial off-line video stabilization performance. Furthermore, this filter is combined with the known control action signal in order to eliminate the so-called ‘phantom’ movements (in fact, a sort of ‘freezing’) in the compensated video.

Optimized motion intention estimation

To minimize the number of frames required in the video stabilization process, an exhaustive search has been implemented by an algorithm that iteratively increases the number of frames used to estimate the motion intention; the results are plotted in Figure 5.

Figure 5. Minimization of inter-frame transformation fidelity (ITF). Using four (4) previous and four (4) posterior frames, or only six (6) previous frames, achieves a performance as good as using the complete video sequence to estimate the motion intention.

For the optimization process, it is necessary to define an evaluation metric for video stabilization performance. Subjective evaluation metrics can be found in the literature, such as the mean opinion score (MOS), which is very common in the quality evaluation of compressed multimedia [41]. The other possibility is to use objective evaluation metrics such as bounding boxes, reference lines, or synthetic sequences [42]. The advantage of these three objective metrics is that estimated motion parameters can be compared directly against the real motion. The inter-frame transformation fidelity (ITF) [40] is a widely used metric to measure the effectiveness and performance of video stabilization, whose mathematical expression is

$$\mathrm{ITF} = \frac{1}{N_f - 1} \sum_{k=1}^{N_f - 1} \mathrm{PSNR}(k) \qquad (4)$$

where $N_f$ is the number of video frames and

$$\mathrm{PSNR}(k) = 10 \log_{10} \frac{\mathrm{Ip}_{\mathrm{MAX}}^2}{\mathrm{MSE}(k)} \qquad (5)$$

is the peak signal-to-noise ratio between two consecutive frames, with

$$\mathrm{MSE}(k) = \frac{1}{M \cdot N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ \mathrm{Frame}_k(i,j) - \mathrm{Frame}_{k-1}(i,j) \right]^2 \qquad (6)$$

being the mean square error between monochromatic images of size $M \cdot N$, and $\mathrm{Ip}_{\mathrm{MAX}}$ the maximum pixel intensity in the frame.
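
For reference, Eqs. (4) to (6) translate directly into a few lines of NumPy (assuming 8-bit monochrome frames, so the maximum pixel intensity is 255):

```python
import numpy as np

def itf(frames, ip_max=255.0):
    """Inter-frame transformation fidelity, Eqs. (4)-(6): the mean PSNR
    between consecutive monochrome frames of a stabilized sequence."""
    psnrs = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        mse = np.mean((cur.astype(np.float64) - prev.astype(np.float64)) ** 2)
        psnrs.append(10.0 * np.log10(ip_max ** 2 / mse))
    return float(np.mean(psnrs))
```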

Based on the optimization of the objective evaluation metric ITF, two solutions have been obtained whose performance is as high as using the complete video sequence to estimate the motion intention: (a) using four (4) previous and four (4) posterior frames and (b) using only six (6) previous frames.
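
The exhaustive search itself can be sketched as follows, where stabilize(frames, window) is a hypothetical helper that runs the full pipeline with a given causal smoothing window and itf() is the metric defined above; one then picks the smallest window whose ITF reaches the plateau of Figure 5:

```python
def best_window(frames, stabilize, max_window=30):
    """Evaluate the ITF for an increasing number of previous frames
    used by the motion intention estimator (illustrative sketch)."""
    scores = {}
    for window in range(1, max_window + 1):
        scores[window] = itf(stabilize(frames, window))
    return scores
```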

Our work is focused on the real-time application, so it is important to analyze how this issue affects both options: (a) In the first case, using four previous and four posterior frames means that the algorithm starts four frames after the video sequence initialization, and each stabilized frame is ready only four frames after it is captured. (b) In the second case, the algorithm starts six frames after the video sequence initialization, two frames later than in the first case, but for the rest of the sequence it can be applied without added delay. Considering that the sample frequency of the system at hand was 10 Hz, a 0.4-s delay would be introduced in the total computational time when using the first option. Our algorithm uses the second option, which relies only on preceding information.

Phantom movements

Previous video stabilization approaches have obtained good results eliminating undesired movements in images captured with handheld devices and complex systems, but all of them were evaluated using the ITF as cost function. Although the final video achieves good ITF performance, i.e., a stable video, the motion smoothing process generates phantom movements.

For video post-processing applications, the main objective is to stabilize the video; hence, phantom movements do not represent a problem. However, for real-time applications, the objective is to obtain a video that is stable but as real as possible. In this sense, it is important to decrease the difference between the real and the estimated motion intention while preserving the ITF performance.

The root mean square error (RMSE) [43] is adopted in order to evaluate the reliability of the estimated motion with respect to the observed motion. A low RMSE means that the estimated motion intention is similar to the real motion intention.

Consequently, the proposed objective evaluation metric to be optimized is the difference between the estimated global motion and the observed motion, measured as RMSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{2F}\left[\sum_{j=0}^{F}\left(E_{x,j} - T_{x,j}\right)^2 + \sum_{i=0}^{F}\left(E_{y,i} - T_{y,i}\right)^2\right]} \qquad (7)$$

where $E_{x,j}$ and $E_{y,i}$ are the estimated global motion of the frame in the x-axis and y-axis, respectively, $T_{x,j}$ and $T_{y,i}$ are the corresponding observed motions, and $F$ denotes the number of frames in the sequence.
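
In code, Eq. (7) amounts to the following direct NumPy transcription:

```python
import numpy as np

def rmse(E_x, T_x, E_y, T_y):
    """RMSE of Eq. (7) between estimated (E) and observed (T) global
    motion along the x- and y-axes; lower means more faithful motion."""
    F = len(E_x)
    sq_x = (np.asarray(E_x) - np.asarray(T_x)) ** 2
    sq_y = (np.asarray(E_y) - np.asarray(T_y)) ** 2
    return float(np.sqrt((sq_x.sum() + sq_y.sum()) / (2.0 * F)))
```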

Two alternatives exist for obtaining the real motion intention ($T_{x,j}$, $T_{y,i}$): information from the on-board inertial measurement unit (IMU) or the control action data. The choice depends on the accuracy of each model. In our algorithm, the control action is employed, since IMU information is not very reliable in most micro aerial vehicles. In this way, the observed motion is defined as a combination of the control action and the smoothed motion signal.

Our motion intention estimation algorithm (see Algorithm 2) uses the control action as a logical gate, allowing the execution of the low-pass filter only when a tele-operated motion intention is present. Additionally, our algorithm inserts a hysteresis after the execution of the control action. The objective of this hysteresis is to let the system reach its maximal (or minimal, according to the control action signal) position before the effect of a new control action.
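
The sketch below illustrates one possible reading of this gating scheme; it is not a verbatim transcription of Algorithm 2. The per-sample filter step lowpass_step and the hysteresis length hold_frames are assumed names and tuning values:

```python
def gated_intention(accum, control_active, lowpass_step, hold_frames=5):
    """Run the low-pass filter only while a tele-operated command is
    active, plus a short hysteresis window so the vehicle can settle;
    otherwise hold the last intention value, so the smoother cannot
    invent displacements (phantom movements) on its own."""
    intention, last, countdown = [], 0.0, 0
    for x, active in zip(accum, control_active):
        if active:
            countdown = hold_frames      # re-arm hysteresis on each command
        if countdown > 0:
            last = lowpass_step(x)       # motion is intentional: filter it
            countdown -= 1
        # else: hold `last`; any new motion is jitter to be compensated
        intention.append(last)
    return intention
```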

We have defined a reliability parameter $0 < R_F < 1$. A value of $R_F$ close to one leads to a higher ITF value, i.e., a more stable video but with phantom movements. On the other hand, with a value of $R_F$ close to zero, we obtain a less stable video without phantom movements. Our complete algorithm is shown in Figure 6.

Figure 6. Flow chart of our proposal for video stabilization, using a combination of motion smoothing and the input control action.

Results and discussion

This section has been divided into three parts: experimental design, video stabilization performance, and comparison with another algorithm.

The experimental design

The AR.Drone 1.0, a low-cost quadrotor built by the French company Parrot (Paris, France), has been used as the experimental platform for several reasons: low cost, energy conservation, safe flight, and vehicle size. The proposed methodology has been implemented on a laptop with the following characteristics: Intel Core i7-2670QM processor at 2.20 GHz (Turbo Boost up to 3.1 GHz) and 16.0 GB of RAM. Real images of four different scenarios were obtained with the on-board camera (sample frequency = 10 Hz) and processed. Furthermore, a video was recorded with a zenith camera to capture the real motion of the flying robot in the xy-plane. RMSE is selected as the objective measure of motion reliability, comparing the estimated motion with the observed motion. To obtain a position measure, we use a tracker based on optical flow [44] and a camera calibration method for radial distortion [45]. The RMSE is then computed by comparing the estimated motion with the motion observed from the zenith camera.

Video stabilization performance

A visual perception of the results obtained for each experimental environment is shown^a in Figures 7, 8, 9, and 10. The experiments demonstrate that the approach based on motion intention estimation presented in this paper is robust to the presence of nearby objects, scenes with moving objects, and the common problems, described in previous sections, of on-board cameras for MAVs during indoor flight.

Figure 7. Scene 1. Top: original video. Bottom: stabilized video.

Figure 8. Scene 2. Top: original video. Bottom: stabilized video.

Figure 9. Scene 3. Top: original video. Bottom: stabilized video.

Figure 10. Scene 4. Top: original video. Bottom: stabilized video.

Presence of nearby objects

Nearby objects in the scene represent one of the main problems for video stabilization, because most of the interest points are generated in the objects’ region; the image compensation is then computed using the objects’ motion instead of the scene motion. However, our interest point matching process is based on the RANSAC algorithm with the gray level difference between consecutive frames as cost function. Consequently, motion estimation is not tied to the objects’ interest points but to the whole scene.

Scenes with moving objects

Moving objects are another common problem: objects with many interest points can cause the motion estimation to undesirably track those objects. Once more, the RANSAC-based interest point matching is not referenced only to moving objects but to the whole image.

Problems from on-board cameras for quadrotors

Frame-by-frame scenes, significant displacements, low-frequency video, freezing, and high-speed displacements are frequent problems in images captured with an on-board camera due to the complex dynamics of quadrotors during indoor flight. In all these cases, the change between two consecutive frames can be considerable, producing a critical problem for video stabilization. In our approach, motion intention estimation solves these problems, and the prior rejection of data above a threshold provides additional robustness.

Phantom movements

Phantom movements are a phenomenon present in previous video stabilization techniques but not previously reported. Independently of the approach used, the video stabilization process depends on a motion intention estimation phase. Phantom movements are generated during the elimination of high-frequency movements, because motion intention estimation reduces the frequency of the movements and previous motion intention estimators are not able to detect or correct this behavior. Our proposal eliminates these phantom movements using a combination of a low-pass filter, as motion intention estimator, with the control action. However, this method slightly decreases the ITF value.

Comparison

Our approach has been compared with the off-line method L1-Optimal [6], which is applied in the YouTube editor as a video stabilization option. Results on four different scenes are presented in Table 1, using two evaluation metrics: ITF and RMSE.

Table 1 Evaluation metrics

The obtained results show that our algorithm is comparable with the L1-Optimal method. The performance of our approach with respect to the ITF measure is slightly lower than that of L1-Optimal, which means that the output video is slightly less stable. However, the similarity between the stabilized video motion and the intentional motion, measured by the RMSE, is better in our approach.

It is worth noting that the ITF measure could be increased by varying the reliability factor $R_F \in [0,1]$, but the motion reliability (measured by the RMSE) would inevitably worsen. For post-production applications, $R_F$ can be set to one, but for applications of motion control based on camera information, the realism of the movement is important, so the reliability factor should approach zero. The observed and estimated scales are compared graphically in Figure 11 for the L1-Optimal technique and our approach.

Figure 11. Comparison of the scales: L1-Optimal (blue), our approach (green), and observed (red).

Conclusions

After conducting an initial study of motion smoothing methods, it has been experimentally verified that the low-pass filter performs well as an algorithm for motion intention estimation, eliminating undesired movements.

However, this method can be optimized to use a smaller number of frames without decreasing the ITF measure, as we have presented in this paper. The cutoff frequency depends on the characteristics of the system; hence, information about the capture system contributes considerably to the calibration phase.

Phantom movements are a phenomenon that had not yet been studied in the video stabilization literature, but they are a key point in the control of complex dynamic systems such as micro aerial vehicles, where realism in the movements can make the difference that prevents an accident.

The reliability factor is adapted to the purpose of the application: post-production, with a high ITF value and lower realism, or the opposite situation, real-time video stabilization used in tele-operation systems.

As future work, we will extend our video stabilization method, using the quadrotor model estimated in [46], to aggressive environments with turbulence and communication problems. We will also apply it to increase the performance of detection and tracking algorithms; in [47], we presented a first application to face detection.

Endnote

^a Video results are provided at http://www.youtube.com/user/VideoStabilizerMAV/videos.