Camera and inertial sensor fusion for the PnP problem: algorithms and experimental results

In this work, we address the problem of estimating the relative position and orientation of a camera and an object, when both are equipped with inertial measurement units (IMUs) and the object exhibits a set of n landmark points with known coordinates (the so-called pose estimation or PnP problem). We present two algorithms that, by fusing the information provided by the camera and the IMUs, solve the PnP problem with good accuracy. These algorithms use only the measurements given by the IMUs' inclinometers, as the magnetometers usually give inaccurate estimates of the Earth's magnetic vector. The effectiveness of the proposed methods is assessed by numerical simulations and experimental tests, and the results are compared with those of the most recent methods proposed in the literature.


Introduction
The pose estimation problem, or perspective-n-point (PnP) problem, was first introduced in [1] as: Given a set of 'landmarks' (control points), whose locations are known in some coordinate frame, determine the location (relative to the coordinate frame of the landmarks) of that point in space from which an image of the landmarks was obtained.
In other words, the problem consists in obtaining the relative position and orientation of an object (for example, a drone, or a robotic hand) with respect to a camera, using an image provided by the camera and the knowledge of a set of n landmarks on the object. Figure 1, where landmarks are displayed in blue, shows a typical physical configuration. This problem is relevant in computer vision [2,3], augmented reality [4], photogrammetry [5], and robotics [6][7][8].
It is worth noticing that the terms 'landmarks', 'landmark points', and 'control points' are used interchangeably in the literature, and that the set of those points is often referred to as a 'feature' on the object; accordingly, the landmarks themselves are often called 'feature points'.
Although the PnP problem dates back to the 1980s, several recent papers witness the enduring interest in this topic and its applications. In [9], a PnP module is used in a vision-based SLAM algorithm to recover the camera pose after a tracking failure, and the EPnP method presented in [10] is used to solve that problem. In [11], the problem of an unmanned aerial vehicle (UAV) landing on a runway is addressed, and it is shown that a PnP algorithm using landmark points on the runway outperforms the landing procedure based on GPS only; there, the PnP algorithm presented in [12] is used to compute the solution. In [13], a PnP framework is used to solve the perspective-n-lines (PnL) problem, which is a generalization of the former. The similarity between the two problems, particularly with the approach of [10], gives rise to many PnL algorithms "inspired by their PnP counterparts". The consumer electronics industry also makes wide use of PnP modules nowadays: as an example, Isaac, NVIDIA's open platform for intelligent robots, has a PnP module for navigation and manipulation [14].

Standard PnP problem
Owing to its relevance in applications, the PnP problem has been deeply investigated. It has been shown that a two-point feature (P2P) leads to infinitely many solutions for the position and the orientation, and that a three-point one (P3P) leads to a finite number of solutions [1]. In [15], it is shown that the P3P has at most four solutions, and conditions for the uniqueness of the solution are provided. For n = 4, a unique solution is obtained if the points are coplanar and at most two of them lie on the same line [16].
Because of the uniqueness of its solution, the P4P problem has been thoroughly investigated. Paper [1] introduced the RANSAC algorithm, which addresses the P4P by solving the P3P problem for each group of three points out of the four and then taking the intersection of their solutions.
PnP problems with n > 4 have also been considered, to obtain solutions less sensitive to measurement noise. This has been done, e.g., in [10,18] using linear algebra techniques, and in [12] using an iterative optimization method.

PnP problems and IMUs
Standard PnP methods only use the pictures taken by the camera; as of today, however, many 'intelligent objects' (e.g., mobile robots and drones) and cameras are equipped with inertial measurement units (IMUs), which are able to measure the Earth's gravity and magnetic vectors. This information can be used to improve the accuracy of the relative pose estimation. For instance, in [19,20] an extended Kalman filter is used to fuse the data coming from the camera and the IMUs and to obtain the relative pose, whilst in [21] the orientation provided by the IMUs is used to compute the translation vector between the object and the camera.
In this paper, we show how the measurements taken from the IMUs and the camera can be fused to solve the PnP problem without using the compass data from the IMUs, which are usually noisy; thanks to the high resolution of the IMUs' accelerometers, an accurate solution can be obtained.
An earlier work along this approach is [22], but there the authors assume that the coordinates of two landmarks are known in the absolute reference frame; then, using the roll and pitch provided by the IMU on the camera, they solve the P2P.
The present paper extends the preliminary results given in [23,24]; as in [22], we assume that both the camera and the object are equipped with IMUs, but no hypothesis on the landmarks' absolute coordinates is made. First, we address the P2P problem using inclinometers only: we prove that it admits two solutions, and we propose a geometrical test to mitigate the ambiguity between them. Also, we show that singular configurations and ambiguities can be avoided using three non-collinear landmarks (P3P). Then, we address the general PnP, formulated as the optimization of a least squares index.
The performance of the proposed algorithms is analyzed by both numerical simulations and experimental tests, and it is compared with that of the most recent methods from the literature.

Problem statement
Consider an object in the field of view of a camera and define two reference frames: one for the camera, denoted by O_xyz, whose origin is in the camera focus, and one for the object, denoted by O_uvw.
Moreover, let R = [r_1^T; r_2^T; r_3^T] and t = [t_x, t_y, t_z]^T be the rotation matrix and the translation vector from the object reference frame to the camera reference frame, respectively (r_i^T denotes the i-th row of R).
Assume that the object is provided with n landmarks P_i, i = 1, …, n, whose coordinates are known a priori in the object reference frame; the coordinates π_i of the landmark P_i in the camera reference frame O_xyz are

π_i = R P_i + t. (1)

Both the object and the camera are equipped with IMUs, which are able to measure the gravity and the magnetic unit vectors in the object reference frame O_uvw and in the camera reference frame O_xyz, respectively; the pairs (g_c, m_c) and (g_o, m_o) denote the same vectors in the two coordinate frames, i.e., g_c = R g_o and m_c = R m_o. We model the camera as a pinhole camera [25], defined by the distance-per-pixel dpx and the focal length f. The two-dimensional image reference frame is denoted by O_c x_c y_c, the image center is C_I = (x_C_I, y_C_I), and the image plane is z = f. Accordingly, each 3-D feature point P_i is projected onto the 2-D pixel

P̃_i = ( x_C_I + (f/dpx) π_i,x / π_i,z , y_C_I + (f/dpx) π_i,y / π_i,z ), (2)

which is the intersection between the image plane and the line connecting P_i and the camera focus: see Fig. 1. Our goal is to solve the PnP using the data g_o and g_c given by the IMUs, the landmarks' coordinates in the object reference frame, and their projections on the image plane, thus obtaining the transformation matrix R_t from O_uvw to O_xyz:

R_t = [ R t ; 0 1 ].
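As an illustration of the projection model, the following NumPy sketch maps a landmark to its pixel; the intrinsics f/dpx = 800 and C_I = (320, 240) are the values used later in the simulations, while the landmark and pose below are arbitrary placeholders:

```python
import numpy as np

def project(P, R, t, f_over_dpx=800.0, c=(320.0, 240.0)):
    """Pinhole projection: pi = R @ P + t in the camera frame, then Eq. (2)."""
    pi = R @ P + t
    assert pi[2] > 0, "landmark must be in front of the camera"
    return np.array([c[0] + f_over_dpx * pi[0] / pi[2],
                     c[1] + f_over_dpx * pi[1] / pi[2]])

# A landmark on the optical axis, 2 m ahead, projects onto the image center.
pix = project(np.zeros(3), np.eye(3), np.array([0.0, 0.0, 2.0]))
```

Note that the projection only fixes the direction of π_i, which is why a single point leaves the depth (and hence the translation) undetermined.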

The method
If the whole information from the IMUs were used, the rotation matrix R would be known immediately; however, we wish to use only the information coming from the IMUs' accelerometers (g_c and g_o), neglecting m_c and m_o for the noise reasons discussed before. To obtain the matrix R_t in this setting, we will find a parametrization of R with respect to the neglected data.

Let a reference frame O_h be given. If the rotation matrix R_h^c from this reference frame to the camera's one and the rotation matrix R_h^o from this frame to the object's one were known, then the overall rotation matrix, independently of O_h, would be

R = R_h^c [R_h^o]^T. (3)

To obtain the rotation matrix from one frame to another, we must represent the basis of the first frame in the second one. Thus, let us define an intermediate, artificial reference frame O_h of the North-East-Down (NED) type, where the Down vector is the gravity and the North vector is arbitrary. Using the O_h frame, the rotation matrices R_h^c and R_h^o will be found and R obtained by Eq. (3).

The first column of R_h^c will be the vector g_c. As for the second column, an artificial North vector m̂_c in the camera reference frame is taken, orthogonal to g_c (any unit vector in the plane orthogonal to g_c can be chosen). The third column of R_h^c will be n̂_c = g_c × m̂_c, and thus the rotation matrix from O_h to O_xyz will be

R_h^c = [g_c, m̂_c, n̂_c]. (6)

Now we build the rotation matrix R_h^o = [R_o^h]^T from O_h to the object's reference frame. The first column of R_h^o will be g_o; the second column will contain m̂_o, the artificial North in the object's reference frame.

The coordinates of m̂_o in O_uvw are unknown, but it has to lie on the intersection between the plane orthogonal to g_o and the unit sphere: see Fig. 2. This gives the parametrization

m̂_o(α) = m̂_1 cos α + m̂_2 sin α, (7)

where {m̂_1, m̂_2} is an orthonormal basis of the plane orthogonal to g_o and α is the unknown angle characterizing the artificial North vector. For the sake of simplicity, m̂_1 is chosen as a unit vector orthogonal to g_o, and m̂_2 is then given by m̂_2 = g_o × m̂_1 (8); the rotation matrix from the O_h reference frame to O_uvw is

R_h^o(α) = [g_o, m̂_o(α), g_o × m̂_o(α)]. (9)

Finally, given R_h^o(α) and R_h^c, we have

R(α) = R_h^c [R_h^o(α)]^T, (10)

and the whole transformation matrix will be

R_t(α) = [ R(α) t ; 0 1 ]. (11)
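The construction above can be sketched in NumPy as follows. The helper that picks m̂_c and m̂_1 (via a cross product with a coordinate axis) is one arbitrary admissible choice, since any orthonormal completion of the gravity vector works:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def any_orthonormal(g):
    """Any unit vector orthogonal to the unit vector g (arbitrary choice)."""
    e = np.array([0.0, 1.0, 0.0]) if abs(g[0]) > 0.9 else np.array([1.0, 0.0, 0.0])
    return unit(np.cross(g, e))

def rotation_from_gravity(g_c, g_o, alpha):
    """R(alpha) = R_h^c [R_h^o(alpha)]^T, built from the two gravity vectors."""
    g_c, g_o = unit(g_c), unit(g_o)
    m_c = any_orthonormal(g_c)                     # artificial North, camera side
    R_hc = np.column_stack([g_c, m_c, np.cross(g_c, m_c)])
    m1 = any_orthonormal(g_o)                      # basis of the plane orthogonal to g_o
    m2 = np.cross(g_o, m1)
    m_o = m1 * np.cos(alpha) + m2 * np.sin(alpha)  # Eq. (7)
    R_ho = np.column_stack([g_o, m_o, np.cross(g_o, m_o)])
    return R_hc @ R_ho.T

# Whatever alpha is, R(alpha) is a proper rotation mapping g_o onto g_c.
g_o = unit(np.array([0.1, 0.2, -0.97]))
g_c = unit(np.array([-0.3, 0.1, -0.95]))
R = rotation_from_gravity(g_c, g_o, 0.7)
```

The key property, easy to verify, is that R(α) g_o = g_c holds for every α, so the accelerometer data is honored exactly while α remains a free parameter.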

P2P and P3P using accelerometers only
As for the P2P, let P_1 and P_2 be the two landmarks; by Eq. (2), the two pixels are P̃_1 and P̃_2. Using R(α) in Eq. (12), a system in the four unknowns (t_x, t_y, t_z, α) is obtained, which is solved by first computing α and then getting R(α) by Eq. (10). Then, a unique translation t is obtained: see Sect. 2.3 in the sequel or Ref. [21].
To compute α, under the assumption that either Δ_x = P̃_x,1 − P̃_x,2 ≠ 0 or Δ_y = P̃_y,1 − P̃_y,2 ≠ 0, the following equation can be obtained:

a cos α + b sin α + c = 0, (13)

where a, b, and c are quantities that can be computed from the measured data (see [23] for their complete expressions). At this point, using the tangent half-angle substitution τ = tan(α/2), with cos α = (1 − τ²)/(1 + τ²) and sin α = 2τ/(1 + τ²), we obtain:

(c − a) τ² + 2b τ + (a + c) = 0. (14)

Equation (14) has two solutions, denoted by α_1 and α_2, from which we can compute two pairs (R_1, t_1) and (R_2, t_2).

The case Δ_x = 0 and Δ_y = 0 happens when the two landmarks and O_xyz lie on the same line. The vector defining that line can be found if it is not aligned with the gravity. Depending on its direction, there are two possible values for α, but, if we can distinguish the pixels P̃_1 and P̃_2 (e.g., by colored landmarks), we get the correct one. If the line and the gravity vector are aligned, then the information carried by the line vector is the same provided by the IMUs, and we have infinitely many solutions for the rotation matrix.
The case Δ_x = 0 and Δ_y = 0 means that the two pixels P̃_1 and P̃_2 coincide: as the camera sees a single point, the distance between the reference frames cannot be obtained. We thus have the quite intuitive result:

Theorem 1. Using the image provided by the camera and the gravity vectors given by the IMUs, the P2P returns:
- two solutions for R and t if Δ_x ≠ 0 or Δ_y ≠ 0;
- a unique solution for R and infinitely many solutions for t when P̃_1 = P̃_2, the line between the two landmarks and O_xyz is not aligned with g, and the closest point is recognized (see Fig. 3);
- infinitely many solutions for R and t when P̃_1 = P̃_2 and the line between the two landmarks and O_xyz is aligned with g.

From this analysis, it results that the P2P with accelerometers has two solutions whenever the camera sees two distinct points. However, whenever one of the two transformation matrices maps a feature point out of the camera view, that solution can be discarded.
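The two candidate angles α_1 and α_2 stem from a trigonometric equation in α with measurable coefficients a, b, c; for an equation of the standard form a cos α + b sin α + c = 0, a generic solver can be sketched as (the specific a, b, c of the P2P are those derived in [23]):

```python
import numpy as np

def solve_a_cos_b_sin(a, b, c):
    """Solve a*cos(x) + b*sin(x) + c = 0.

    Writing a*cos(x) + b*sin(x) = r*cos(x - phi), with r = hypot(a, b) and
    phi = atan2(b, a), gives x = phi +/- arccos(-c / r): two solutions in
    general, matching the two candidate angles of the P2P.
    """
    r = np.hypot(a, b)
    if abs(c) > r:
        return []                 # no real solution
    phi = np.arctan2(b, a)
    d = np.arccos(-c / r)
    return [phi + d, phi - d]

sols = solve_a_cos_b_sin(1.0, 1.0, -1.0)
```

The two roots coincide only when |c| = r, which corresponds to a tangency condition rather than the generic two-solution case.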
The camera view is the conic combination of the vectors {v_1, v_2, v_3, v_4} going from the camera focus to the image corners. A point P will be in the camera view if and only if it can be expressed as a conic combination of one of the two triplets obtained by splitting the image rectangle into two triangles, e.g., {v_1, v_2, v_3} or {v_1, v_3, v_4}. This fact has been used to define Algorithm 1, which tries to discard one of the pairs (R, t). To evaluate how effective this way of discarding impossible configurations is, we have performed ten thousand numerical simulations using that algorithm, with random configurations of the object and the camera. The results showed that Algorithm 1 is able to discard the wrong solution in 47% of the tests.
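A minimal sketch of the visibility test underlying Algorithm 1 follows; the frustum corner directions are placeholder values, and with three linearly independent corner vectors the conic coefficients follow from a 3×3 linear solve:

```python
import numpy as np

def in_cone(P, v1, v2, v3, tol=1e-9):
    """True iff P is a conic (nonnegative) combination of v1, v2, v3."""
    lam = np.linalg.solve(np.column_stack([v1, v2, v3]), P)
    return bool(np.all(lam >= -tol))

def in_camera_view(P, corners):
    """Split the image rectangle into the triangles (v1,v2,v3) and (v1,v3,v4)."""
    v1, v2, v3, v4 = corners
    return in_cone(P, v1, v2, v3) or in_cone(P, v1, v3, v4)

# Placeholder corner directions of a symmetric frustum looking along +z.
corners = (np.array([-1.0, -1.0, 1.0]), np.array([1.0, -1.0, 1.0]),
           np.array([1.0, 1.0, 1.0]), np.array([-1.0, 1.0, 1.0]))
```

Running this test on the landmarks mapped by each candidate (R, t) discards a candidate as soon as one of its mapped points falls outside the view.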

A heuristic for the P3P
Another way to mitigate the ambiguities in the solutions of the P2P problem is to use a third feature point P_3, not collinear with P_1 and P_2. This also ensures that the feature cannot project onto a single pixel in the camera image. Assume, without loss of generality, that points P_1 and P_2 are not aligned with the camera focus. The proposed method is reported in Algorithm 2.
Although this is only a heuristic approach, in all the ten thousand simulations used to test Algorithm 1 it always identified the correct solution. We remark that, to increase robustness w.r.t. pixel noise, it is advisable to perform the P2P on the couple of points with the largest distance in the image plane.

Computing the translation vector for the P2P
Once matrix R has been estimated, the translation vector t must be computed; to this end, the method in reference [21] can be used, which solves Eq. (12) to obtain t_x, t_y, t_z in closed form (Eq. (17); see [21] for the expression of the auxiliary quantity A_z it involves). This solution, however, can be sensitive to pixel noise, as we will show in the sequel. An alternative way to compute t is the minimization of the reprojection error

E(t) = ‖P̃_1 − P̂_1(t)‖² + ‖P̃_2 − P̂_2(t)‖²,

where P̂_1(t) and P̂_2(t) are the pixels onto which points P_1 and P_2 are projected using the rotation matrix previously computed, the unknown translation vector t, and Eq. (2). The reprojection error is widely used in the literature to test the performance of PnP solutions (see [10,21,22]), but the resulting optimization problem must be solved numerically. To overcome this difficulty, we propose a different least squares index: to this end, we write Eqs. (12) in the explicit linear form Q t = s.

[Algorithm 2: P3P using inclinometers.]
This system defines the least mean squares (LMS) index E_2(t) = ‖Q t − s‖², whose minimizer is given by the pseudo-inverse [26]:

t* = Q† s. (20)

Owing to the nature of the LMS method, the solution t* is expected to be less sensitive to pixel noise than the one obtained using Eq. (17). This fact has been verified by ten thousand numerical tests, simulating a feature with two points in P_1 = [0, 0, 0] m and P_2 = [0.1, 0.1, 0] m, and a zero-mean Gaussian noise with a standard deviation of five pixels on the components of P̃_1 and P̃_2. In each test, the matrix R_t has been taken randomly. The comparison has been made through the absolute error

ε_t = ‖t − t_s‖,

where t_s is the solution from either Eq. (17) or Eq. (20). As expected (see Fig. 4), the errors obtained using the pseudo-inverse are much smaller than the ones obtained by Eq. (17).

Remark. The indexes E and E_2 satisfy E = ‖M‖_2 E_2, where M is a matrix depending on the problem data. For this reason, under the assumption t_z ≫ r_3^T (P_1 − P_2), the optimal E_2 can be taken as an estimate of the optimal E. Figure 5 shows, for some of the ten thousand tests, the values of ε_t computed using both the vector t* from Eq. (20) and the vector t obtained by the (computationally expensive) numerical optimization of E. No estimation advantage is evident in using the latter.
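The pseudo-inverse computation and its noise behavior can be illustrated on a toy overdetermined system (arbitrary synthetic numbers; `np.linalg.pinv` computes Q†):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined system Q t = s with a known ground truth (synthetic values).
t_true = np.array([0.5, -0.2, 3.0])
Q = rng.standard_normal((8, 3))            # 8 equations, 3 unknowns
s = Q @ t_true

# Perturb the right-hand side, playing the role of pixel noise.
s_noisy = s + 0.01 * rng.standard_normal(8)

# LMS solution via the pseudo-inverse, t* = Q† s.
t_star = np.linalg.pinv(Q) @ s_noisy
err = np.linalg.norm(t_star - t_true)
```

With noise-free data the recovery is exact; with noisy data the LMS solution averages the perturbation over all the equations, which is the robustness property exploited above.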

Computing the translation vector for the PnP
In this section, we consider again the case where the rotation matrix is assumed known, and we generalize the results presented in the previous subsection to the case n ≥ 3, assuming that at least three feature points are not collinear. As before, the problem of estimating the vector t is formulated as a set of linear equations: given the n correspondences between landmarks and pixels, the matrix Q and the vector s defined before now contain 2n rows (two for each correspondence), obtained by stacking the two projection equations (12) of each landmark (Eq. (22)), and again the LMS solution is given by t* = Q† s. If matrix R has not yet been estimated, both Q and s depend on the unknown parameter α, i.e., Q = Q(α) and s = s(α); as a consequence, t = t(α) as well. This will be exploited in the next section to solve the general PnP problem.
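A sketch of how such a stacked system can be formed under the pinhole model of the problem statement follows; this is one consistent way to write the 2n equations, not necessarily the paper's exact Eq. (22), and the intrinsics are the simulation values f/dpx = 800, C_I = (320, 240):

```python
import numpy as np

def build_Q_s(points, pixels, R, k=800.0, c=(320.0, 240.0)):
    """Two equations per correspondence, linear in t = (tx, ty, tz).

    From the pinhole model, (x_i - x_CI)(r3.P_i + tz) = k (r1.P_i + tx),
    and similarly for y, so each landmark yields two rows of Q and s.
    """
    Q, s = [], []
    for P, (x, y) in zip(points, pixels):
        u, v = x - c[0], y - c[1]
        r1P, r2P, r3P = R @ P
        Q.append([k, 0.0, -u]); s.append(u * r3P - k * r1P)
        Q.append([0.0, k, -v]); s.append(v * r3P - k * r2P)
    return np.array(Q), np.array(s)

# Synthetic check: project three landmarks with a known pose, then recover t.
R = np.eye(3)
t_true = np.array([0.05, -0.02, 2.0])
pts = [np.array([0.0, 0.0, 0.0]), np.array([0.1, 0.1, 0.0]), np.array([0.1, 0.0, 0.0])]
pix = []
for P in pts:
    q = R @ P + t_true
    pix.append((320.0 + 800.0 * q[0] / q[2], 240.0 + 800.0 * q[1] / q[2]))
Q, s = build_Q_s(pts, pix, R)
t_star = np.linalg.pinv(Q) @ s
```

With exact pixels the 2n × 3 system is consistent and the pseudo-inverse returns the true translation; with noisy pixels it returns the least squares estimate.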

The general PnP using accelerometers only
To solve the PnP problem in our setting, we first have to find the angle α and then substitute it into the rotation matrix R(α) and the translation vector t = t(α), given the knowledge of the n feature points P_i in the object reference frame and the related pixels P̃_i. Also in this case, a possible solution is to define the reprojection error

J(α, t) = Σ_{i=1}^{n} ‖P̃_i − P̂_i(α, t)‖²,

P̂_i(α, t) being the pixel onto which feature point P_i is mapped by the transformation matrix R_t(α), and to solve the problem

(α*, t*) = arg min_{α,t} J(α, t).

Once the optimal pair (α*, t*) has been estimated, the related transformation matrix R_t(α*) will be the solution of the PnP problem. Again, the alternative LMS cost function

J_2(α, t) = ‖Q(α) t − s(α)‖²

is proposed, where Q(α) and s(α) are given by Eq. (22) with R = R(α).
The same considerations made for the P2P case about the robustness of the LMS solution w.r.t. pixel noise apply here. The LMS problem associated with the PnP becomes

(α*, t*) = arg min_{α,t} J_2(α, t), (26)

and this four-variable problem can be reformulated as a single-variable one: by Eq. (20), the translation vector t depends on α only. Thus, t* = t(α*), and Problem (26) translates to

α* = arg min_α J_2(α, t(α)). (27)

The advantage of this formulation is that Problem (27) is a single-variable, unconstrained optimization problem, which can be solved very efficiently, e.g., using the Nelder-Mead method [27]. A good initial guess for the solution can be obtained by solving a P3P problem on three arbitrary feature points.
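Putting the pieces together, the one-variable reduction can be sketched end-to-end in pure NumPy; a grid search stands in for the Nelder-Mead step, the choice of the orthonormal helpers is arbitrary, and all numerical values are synthetic placeholders:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def ortho(g):
    e = np.array([0.0, 1.0, 0.0]) if abs(g[0]) > 0.9 else np.array([1.0, 0.0, 0.0])
    return unit(np.cross(g, e))

def R_of_alpha(g_c, g_o, a):
    """Gravity-consistent rotation parametrized by the North angle alpha."""
    m_c = ortho(g_c)
    m1 = ortho(g_o)
    m_o = m1 * np.cos(a) + np.cross(g_o, m1) * np.sin(a)
    R_hc = np.column_stack([g_c, m_c, np.cross(g_c, m_c)])
    R_ho = np.column_stack([g_o, m_o, np.cross(g_o, m_o)])
    return R_hc @ R_ho.T

def Q_s(points, pixels, R, k=800.0, c=(320.0, 240.0)):
    """Pinhole equations, linear in t once R is fixed."""
    Q, s = [], []
    for P, (x, y) in zip(points, pixels):
        u, v = x - c[0], y - c[1]
        r1P, r2P, r3P = R @ P
        Q.append([k, 0.0, -u]); s.append(u * r3P - k * r1P)
        Q.append([0.0, k, -v]); s.append(v * r3P - k * r2P)
    return np.array(Q), np.array(s)

def alpha_pnp(points, pixels, g_c, g_o, n_grid=3600):
    """Minimize J2(alpha) = ||Q(alpha) t(alpha) - s(alpha)||^2 over a grid."""
    best = (np.inf, None, None)
    for a in np.linspace(-np.pi, np.pi, n_grid, endpoint=False):
        R = R_of_alpha(g_c, g_o, a)
        Q, s = Q_s(points, pixels, R)
        t = np.linalg.pinv(Q) @ s          # t(alpha), as in Eq. (20)
        J2 = np.sum((Q @ t - s) ** 2)
        if J2 < best[0]:
            best = (J2, R, t)
    return best[1], best[2]

# Synthetic experiment: the ground truth is gravity-consistent by construction.
g_o = unit(np.array([0.2, -0.1, -0.95]))
g_c = unit(np.array([-0.1, 0.3, -0.94]))
R_true = R_of_alpha(g_c, g_o, 0.9)
t_true = np.array([0.05, -0.02, 2.0])
pts = [np.array(p, dtype=float) for p in
       [(0, 0, 0), (0.1, 0.1, 0), (0.1, 0, 0), (0, 0.1, 0.05)]]
pix = []
for P in pts:
    q = R_true @ P + t_true
    pix.append((320.0 + 800.0 * q[0] / q[2], 240.0 + 800.0 * q[1] / q[2]))
R_hat, t_hat = alpha_pnp(pts, pix, g_c, g_o)
```

With noise-free data and four non-coplanar landmarks, J_2 vanishes at the true α, so the grid minimizer recovers the pose up to the grid resolution; a derivative-free local search around that minimizer is the natural refinement.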
In Fig. 6, the shape of function J 2 (α, t(α)) is shown, with α ∈ [−π, π), for some of the experimental tests. In all cases, the function J 2 looks very much like a sinusoid, and the two slightly different minima always lie one in [−π, 0) and the other one in [0, π).
The results have been compared using the two indexes suggested in [28,31]:
- rotation error: e_rot = max_{k∈{1,2,3}} arccos(c_k^T c*_k) · 180/π, where c_k and c*_k are the k-th columns of R and of R(α*), respectively;
- relative translation error: e_t = ‖t − t*‖ / ‖t‖ · 100%, where t and t* are the real translation and t(α*), respectively.
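Assuming the indexes take the usual column-angle and relative-norm forms of the PnP benchmarking literature, they can be coded as:

```python
import numpy as np

def rotation_error_deg(R, R_est):
    """Largest angle between corresponding columns c_k and c*_k, in degrees."""
    cosines = np.clip(np.sum(R * R_est, axis=0), -1.0, 1.0)
    return np.degrees(np.max(np.arccos(cosines)))

def translation_error_pct(t, t_est):
    """Relative translation error ||t - t*|| / ||t||, in percent."""
    return 100.0 * np.linalg.norm(t - t_est) / np.linalg.norm(t)

# A 1-degree rotation about z, compared against the identity.
th = np.radians(1.0)
Rz = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th),  np.cos(th), 0.0],
               [0.0,         0.0,        1.0]])
```

The clipping guards against arccos receiving values marginally outside [−1, 1] due to floating-point round-off.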

Numerical simulations
In the numerical simulations, the feature points have been taken as P_1 = (0, 0, 0), P_2 = (0.1, 0.1, 0), P_3 = (0.1, 0, 0), and the remaining ones taken randomly in a given box. For each test, a random relative rotation and translation between the camera and the object have been generated, with roll, pitch and yaw angles in the range (−π, π]. A camera with ratio f/dpx = 800 pixels, a resolution of 640 × 480 pixels, and center C_I = (320, 240) has been simulated, and the data have been corrupted by zero-mean Gaussian noise.

It is evident that the α-PnP algorithm has the best accuracy, although some of the others are close, in particular for a small number of feature points. Even more evident is the CPU time performance of the α-PnP: it is up to two orders of magnitude faster than all the others. It is also interesting to observe that, beyond about 30 feature points, adding more points to the feature does not noticeably improve the accuracy of the solution.

Real experiments-static
In the experimental setting, we have used the following hardware. To mitigate the systematic errors due to the experimental setup, all the tests have been performed according to the following procedure:

1. the camera and the object have been set in an unknown configuration, and an estimate R̂_t of the transformation matrix R_t has been obtained by the various algorithms;
2. the object has been rotated and translated by a given matrix R̄_t, and an estimate of such a matrix has been obtained; to enforce accurate placement, the rotation has been obtained by repeated bisections of an initial π/2 angle;
3. an estimate of the displacement matrix between the two configurations has been computed from the two estimates above.

The performance of the eight algorithms reported in [28] and of the α-PnP has been compared in a set of fifty experiments, according to the above procedure. Each picture has been used to run the various algorithms with n = {4, 8, 12, 16}. The average orientation and translation errors, computed over the estimated matrices, are shown in Figs. 11 and 12.¹ We see that even in the tests using real data the α-PnP algorithm exhibits a sound performance (the best one in estimating the translation and the third best in estimating the rotation), with a computational cost (see Fig. 13) at least twenty times smaller than that of the two best algorithms for rotation estimation.

Real experiments-dynamic
To further investigate the performance of the proposed technique, a second experiment with real data has been carried out. The four-square feature shown in Fig. 10 and one IMU have been placed on the mobile robot Khepera III [39], and the robot has been programmed to perform, in 40 s, a rectangular trajectory of 1.5 × 1 m with a trapezoidal speed profile.
The camera, equipped with the second IMU, has been placed on the top of the motion plane at a distance of 3 m; sixteen equally time-spaced pictures have been taken during the journey; and the rotation and translation errors have been computed and averaged on those pictures.
The overall results with n = {4, 8, 12, 16} feature points are reported in Figs. 14 to 16. Also in this case, there is not a clear winner among the various algorithms in terms of estimation accuracy. However, the proposed method again performs computationally much better than the competitors, as it is 100 times faster than the DLS and 20 times faster than the LHM. In some sense, it guarantees a good 'balanced performance' between translation and rotation estimation, and it obtains the estimates at a very high speed.

¹ The results from the PPnP method have not been reported in these figures, to increase readability, as they were much worse than all the others.

Discussion
The simulation and experimental results have shown that fusing the measurements given by a camera and two IMUs makes it possible to obtain a fast and accurate estimate of the relative position and orientation between two objects.
In both the simulations and the experimental tests, the proposed α-PnP algorithm outperforms most of the methods listed in [28] in terms of estimation error. Only the LHM, and occasionally the recent SRPnP, have given a better estimation of the rotation, at the price of a much higher computational effort (about 30 times or more).
The CPU time required by the α-PnP is from 20 to 100 times smaller than that of all the other methods, making it suitable for real-time applications. Moreover, the accuracy of the estimates it provides is practically independent of the number of feature points used.
We also emphasize that, to the authors' knowledge, the results presented here are the first ones in which the experimental results are evaluated in terms of rotation and translation errors. Usually, when dealing with experimental data, only the reprojection index is used to test the performance. However, such an index is not a robust evaluation method, owing to the strong nonlinearities and ambiguities embedded in the pinhole reprojection model.
Considering the low cost of IMUs and the standard linear algebra techniques employed, we believe that the proposed method is a useful tool for facing the pose estimation problem.