1 Introduction

The pose estimation problem, or perspective-n-point (PnP) problem, was first introduced in [1] as:

Given a set of ‘landmarks’ (control points), whose locations are known in some coordinate frame, determine the location (relative to the coordinate frame of the landmarks) of that point in space from which an image of the landmarks was obtained.

In other words, the problem consists in obtaining the relative position and orientation of an object (for example, a drone, or a robotic hand) with respect to a camera, using an image provided by the camera and the knowledge of a set of n landmarks on the object. Figure 1, where landmarks are displayed in blue, shows a typical physical configuration. This problem is relevant in computer vision [2, 3], augmented reality [4], photogrammetry [5], and robotics [6,7,8].

It is worth noticing that the terms ‘landmarks’, ‘landmark points’ and ‘control points’ are used interchangeably in the literature, and the set of those points is often referred to as a ‘feature’ on the object. Because of that, the landmarks are often referred to as ‘feature points’.

Although the PnP problem dates back to the 1980s, several recent papers witness the enduring interest in this topic and its applications. In [9], a PnP module is used in a vision-based SLAM algorithm to recover the camera pose after a tracking failure, and the EPnP method presented in [10] is used to solve that problem. In [11], the problem of an unmanned aerial vehicle (UAV) landing on a runway is addressed, and it is shown that a PnP algorithm using landmark points on the runway outperforms a landing procedure based on GPS only. There, the PnP algorithm presented in [12] is used to compute the solution. In [13], a PnP framework is used to solve the perspective-n-lines (PnL) problem, which is a generalization of the former. The similarity between the two problems, notably with the approach of [10], gives rise to many PnL algorithms “inspired by their PnP counterparts”.

Also, the consumer electronic industry nowadays makes wide use of PnP modules. As an example, Isaac, NVIDIA’s open platform for intelligent robots, has a PnP module for navigation and manipulation [14].

1.1 Standard PnP problem

Owing to its relevance in applications, the PnP problem has been deeply investigated. It has been shown that a two-point feature (P2P) leads to infinitely many solutions for the position and the orientation, and that a three-point one (P3P) leads to a finite number of solutions [1]. In [15], it is shown that the P3P has at most four solutions, and conditions for the uniqueness of the solution are provided. For \(n = 4\), a unique solution is obtained if the points are coplanar and at most two of them lie on the same line [16].

Because of the uniqueness of its solution, the P4P problem has been thoroughly investigated. Paper [1] introduced the RANSAC algorithm, which addresses the P4P by solving the P3P problem for each group of three points out of the four and then taking the intersection of their solutions.

PnP problems with \(n > 4\) have also been considered, to obtain a solution less sensitive to measurement noise. This has been done, e.g., in [10, 18], using linear algebra techniques, and in [12] using an iterative optimization method.

1.2 PnP problems and IMUs

Standard PnP methods only use the pictures taken by the camera; as of today, however, many ‘intelligent objects’ (e.g., mobile robots and drones) and cameras are equipped with inertial measurement units (IMUs), which are able to measure the Earth’s gravity and magnetic vectors. This information can be used to improve the accuracy of the relative pose estimation. For instance, in [19, 20] an extended Kalman filter is used to fuse the data coming from the camera and the IMUs and to obtain the relative pose, whilst in [21] the orientation provided by the IMUs is used to compute the translation vector between the object and the camera.

In this paper, we show how the measurements taken by the IMUs and the camera can be fused to solve the PnP problem without using the compass data from the IMUs, which are usually noisy; thanks to the high resolution of the IMUs’ accelerometers, an accurate solution can be obtained.

An earlier work along these lines is [22], but there the authors assume that the coordinates of two landmarks in the absolute reference frame are known; then, using the roll and pitch provided by the IMU on the camera, they solve the P2P.

The present paper extends the preliminary results given in [23, 24]; as in [22], we assume that both the camera and the object are equipped with IMUs, but no hypothesis on the landmarks’ coordinates in the absolute frame is made. First, we address the P2P problem using inclinometers only: we prove that it admits two solutions and propose a geometrical test to mitigate the ambiguity between them. We also show that singular configurations and ambiguities can be avoided using three non-collinear landmarks (P3P). Then, we address the general PnP, formulated as the minimization of a least squares index.

The performance of the proposed algorithm will be analyzed by both numerical simulations and experimental tests, and it will be compared with the most recent methods from the literature.

1.3 Problem statement

Consider an object in the field of view of a camera and define two reference frames: one for the camera, denoted by \(O_{xyz}\), whose origin is in the camera focus, and one for the object, denoted by \(O'_{uvw}\).

Moreover, let \(R=[r_1^{T}; r_2^{T}; r_3^{T} ]\) and \(t = [t_x, t_y, t_z]^{T}\) be the rotation matrix and the translation vector from the object reference frame to the camera reference frame, respectively (\(r_i^{T}\) denotes the i-th row of R).

Assume that the object is provided with n landmarks \(P_{i}\), \(i=1, \dots , n\), known a priori in the object reference frame; the coordinates \(\pi _i\) of the landmark \(P_i\) in the camera reference frame \(O_{xyz}\) are

$$\begin{aligned} \pi _i = [\pi _{x,i},\pi _{y,i},\pi _{z,i}]^T = R P_i + t. \end{aligned}$$
(1)

Both the object and the camera are equipped with IMUs, which are able to measure the gravity and the magnetic unit vectors

$$\begin{aligned} g_{o} = [g_u,g_v,g_w]^T, \quad m_{o}=[m_u,m_v,m_w]^T, \end{aligned}$$

in the object reference frame \(O'_{uvw}\), and

$$\begin{aligned} g_c = [g_x,g_y,g_z]^T, \quad m_{c}=[m_x,m_y,m_z]^T, \end{aligned}$$

in the camera reference frame \(O_{xyz}\), respectively; the couples \((g_{c},m_{c})\) and \((g_{o},m_{o})\) denote the same vectors in the different coordinate frames, i.e., \(g_c=R\, g_o\) and \(m_c=R\, m_o\).

Fig. 1: Camera, image and object reference frames

We model the camera as a pinhole camera [25], defined by the distance-per-pixel dpx and the focal length f. The bi-dimensional image reference frame is denoted by \(O^c_{x_cy_c}\), the image center is \(C_I = (x_{C_I},y_{C_I})\) and the image plane is \(z=f\). Accordingly, each 3-D feature point \(P_i\) is projected onto the 2-D pixel

$$\begin{aligned} P^*_i = \left[ \frac{f}{dpx}\,\frac{\pi _{x,i}}{\pi _{z,i}} + x_{C_I}, \frac{f}{dpx}\,\frac{\pi _{y,i}}{\pi _{z,i}} + y_{C_I}\right] , \end{aligned}$$
(2)

which is the intersection between the image plane and the line connecting \(P_i\) and the camera focus: see Fig. 1.
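As a concrete illustration of Eqs. (1) and (2), the following MATLAB sketch projects a landmark onto the image plane; the function name and variable names are ours, and the inputs are assumed to be a 3-D column vector P (object frame), the pose (R, t) and the camera parameters.

% Pinhole projection of a landmark P (object frame) onto the image plane.
% R, t: rotation and translation from the object frame to the camera frame;
% f, dpx: focal length and distance-per-pixel; (xCI, yCI): image center.
function Pstar = project_landmark(P, R, t, f, dpx, xCI, yCI)
    piC = R*P + t;                              % Eq. (1): landmark in the camera frame
    Pstar = [ (f/dpx)*piC(1)/piC(3) + xCI, ...  % Eq. (2): pixel coordinates
              (f/dpx)*piC(2)/piC(3) + yCI ];
end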

Our goal is to solve the PnP using data \(g_o\) and \(g_c\) given by the IMUs, the landmarks’ coordinates in the object reference frame and their projections on the image plane, thus obtaining the transformation matrix \(R_t\) from \(O'_{uvw}\) to \(O_{xyz}\):

$$\begin{aligned} R_t = \left[ \begin{array}{c|c} R & t\\ 0\;\;0\;\;0 & 1 \end{array}\right] . \end{aligned}$$

2 The method

If the whole information from the IMUs were used, the rotation matrix R would be known immediately; however, we wish to use only the information coming from the IMUs’ accelerometers (\(g_c\) and \(g_o\)), neglecting \(m_c\) and \(m_o\) because of the noise issue discussed above. To obtain the matrix \(R_t\) in this setting, we parametrize R in terms of a single unknown parameter that accounts for the neglected data.

Let a reference frame \(O^h\) be given. If the rotation matrix \(R^h_{c}\) from this reference frame to the camera frame, and the rotation matrix \(R^h_{o}\) from this frame to the object frame, were known, then the overall rotation matrix, independently of \(O^h\), would be:

$$\begin{aligned} R = R^o_c = R^h_c \,R^o_h. \end{aligned}$$
(3)

To obtain the rotation matrix from one frame to another, we must represent the basis of the first frame in the second one. Thus, let us define an intermediate, artificial reference frame \(O^h\), denoted as North-East-Down (NED), where the Down vector is the gravity and the North vector is arbitrary. Using the \(O^h\) frame, the rotation matrices \(R^h_{c}\) and \(R^h_{o}\) will be found and R obtained by Eq. (3).

The first column of \(R^h_{c}\) will be the vector \(g_{c}\). As for the second column, an artificial North vector \(\hat{m}_{c}\) in the camera reference frame is taken, orthogonal to \(g_{c}\):

$$\begin{aligned} \hat{m}_c = \left[ g_z, \; 0, \; -g_x\right] ^T \big / \; (g_x^2 + g_z^2)^{1/2}. \end{aligned}$$
(4)

The third column of \(R^{h}_c\) will be \(\hat{n}_{c}=g_{c}\times \hat{m}_{c}\), i.e.,

$$\begin{aligned} \hat{n}_c = \left[ -g_x g_y, \; g_x^2 + g_z^2, \; -g_y g_z\right] ^T \big / \; (g_x^2 + g_z^2)^{1/2}, \end{aligned}$$
(5)

and thus the rotation matrix from \(O^h\) to \(O_{xyz}\) will be

$$\begin{aligned} R^{h}_c = \left[ g_{c}\;\; \hat{m}_{c}\;\; \hat{n}_{c} \right] . \end{aligned}$$
(6)

Now we build the rotation matrix \(R^h_o = [R^o_h]^T\) from \(O^h\) to the object’s reference frame. The first column of \(R^h_o\) will be \(g_{o}\); the second column will contain \(\hat{m}_{o}\), the artificial North in the object’s reference frame.

Fig. 2: Unit vectors \(g_{o}\) and \(\hat{m}_{o}\)

The coordinates of \(\hat{m}_{o}\) in \(O'_{uvw}\) are unknown, but it has to lie on the intersection between the plane orthogonal to \(g_{o}\) and a unit sphere: see Fig. 2. This gives a parametrization:

$$\begin{aligned} \hat{m}_{o}=\hat{m}_{o}(\alpha )=\hat{m}_1\sin \alpha +\hat{m}_2\cos \alpha , \end{aligned}$$
(7)

where \(\{\hat{m}_1,\hat{m}_2\}\) is an orthonormal basis of the plane orthogonal to \(g_{o}\) and \(\alpha \) is the unknown angle characterizing the artificial North vector.

For the sake of simplicity, \(\hat{m}_1\) is chosen as

$$\begin{aligned} \hat{m}_1 = \left[ g_w, \; 0, \; -g_u\right] ^T \big /\; (g_u^2+g_w^2)^{1/2}, \end{aligned}$$

and thus \(\hat{m}_2\) is given by \(\hat{m}_2 = g_{o} \times \hat{m}_1\), which implies

$$\begin{aligned} \hat{m}_2 = \left[ -g_u g_v, \ g_u^2+g_w^2, -g_v g_w\right] ^T \big / \; (g_u^2+g_w^2)^{1/2}. \end{aligned}$$

The third vector is \(\hat{n}_{o}(\alpha ) = g_{o} \times \hat{m}_{o}(\alpha )\):

$$\begin{aligned} \hat{n}_{o}(\alpha ) = \left[ \begin{array}{c} -(g_w\cos \alpha +g_ug_v\sin \alpha )\\ (g_u^2+g_w^2)\sin \alpha \\ g_u\cos \alpha -g_vg_w\sin \alpha \end{array}\right] \big / \; (g_u^2+g_w^2)^{1/2}, \end{aligned}$$
(8)

and the rotation matrix from the \(O^h\) reference frame to \(O'_{uvw}\) is

$$\begin{aligned} R^h_o(\alpha ) = \left[ g_{o}\; \hat{m}_{o}(\alpha ) \; \hat{n}_{o}(\alpha ) \right] . \end{aligned}$$
(9)

Finally, given \(R^h_o(\alpha )\) and \(R^h_c\), we have

$$\begin{aligned} R = R(\alpha ) = R^h_c \, [R^h_o(\alpha )]^T, \end{aligned}$$
(10)

and the whole transformation matrix will be:

$$\begin{aligned} R_t(\alpha )=\left[ \begin{array}{c|c} R(\alpha ) & t\\ 0\;\;0\;\;0 & 1 \end{array}\right] . \end{aligned}$$
(11)
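The construction of \(R(\alpha )\) and \(R_t(\alpha )\) in Eqs. (4)–(11) can be sketched in MATLAB as follows; the function and variable names are ours, and the gravity measurements gc and go are assumed to be unit column vectors.

% R(alpha) and Rt(alpha) from the gravity unit vectors gc (camera frame)
% and go (object frame), for a given angle alpha and translation t.
function [R, Rt] = rotation_from_gravity(gc, go, alpha, t)
    % Eqs. (4)-(6): artificial NED frame expressed in the camera frame
    mc = [gc(3); 0; -gc(1)] / sqrt(gc(1)^2 + gc(3)^2);
    nc = cross(gc, mc);
    Rhc = [gc, mc, nc];
    % Eqs. (7)-(9): artificial NED frame expressed in the object frame,
    % parametrized by the unknown angle alpha
    m1 = [go(3); 0; -go(1)] / sqrt(go(1)^2 + go(3)^2);
    m2 = cross(go, m1);
    mo = m1*sin(alpha) + m2*cos(alpha);
    no = cross(go, mo);
    Rho = [go, mo, no];
    % Eqs. (10)-(11): overall rotation and homogeneous transformation
    R  = Rhc * Rho';
    Rt = [R, t; 0 0 0 1];
end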

2.1 P2P and P3P using accelerometers only

As for the P2P, let \(P_i = (P_{u,i},P_{v,i},P_{w,i})\), \(i =1,2\), be the (known) coordinates of the two landmarks in \(O'_{uvw}\). From Eq. (2), each point is projected onto the image plane into the pixel \(P^*_i\). Defining \(\tilde{P}_i = (P^*_i - C_I)\, dpx /{f}\), by Eqs. (1) and (2), the two pixels are

$$\begin{aligned} \tilde{P}_i = (\tilde{P}_{x,i},\tilde{P}_{y,i}) =\left( \displaystyle \frac{r_1^T P_i + t_x}{r_3^T P_i + t_z}, \displaystyle \frac{r_2^T P_i + t_y}{r_3^T P_i + t_z}\right) . \end{aligned}$$
(12)

Using \(R(\alpha )\) in Eq. (12), a system in the four unknowns (\(t_x,t_y,t_z,\alpha \)) is obtained, which is solved by first computing \(\alpha \) and then getting \(R(\alpha )\) by Eq. (10). Then, a unique translation t is obtained: see Sect. 2.3 in the sequel or Ref. [21].

To compute \(\alpha \), under the assumption that either \(\varDelta _x =\tilde{P}_{x,1} - \tilde{P}_{x,2} \ne 0\) or \(\varDelta _y =\tilde{P}_{y,1} - \tilde{P}_{y,2} \ne 0\), the following equation can be obtained:

$$\begin{aligned} a\sin \alpha +b\cos \alpha +c=0 \end{aligned}$$
(13)

where

$$\begin{aligned}&a=a(P_1,P_2,\tilde{P}_{x,1},\tilde{P}_{y,1},\tilde{P}_{x,2},\tilde{P}_{y,2},g_{c},g_{o}),\\&b=b(P_1,P_2,\tilde{P}_{x,1},\tilde{P}_{y,1},\tilde{P}_{x,2},\tilde{P}_{y,2},g_{c},g_{o}),\\&c=c(P_1,P_2,\tilde{P}_{x,1},\tilde{P}_{y,1},\tilde{P}_{x,2},\tilde{P}_{y,2},g_{c},g_{o}), \end{aligned}$$

are quantities that can be computed from the measured data (see [23] for the complete expressions of a, b, and c). At this point, using

$$\begin{aligned} a\sin \alpha + b\cos \alpha = M\cos (\alpha - \beta ), \end{aligned}$$

where

$$\begin{aligned} M = \sqrt{a^2+b^2}, \quad \beta = \arctan (a/b), \end{aligned}$$

we obtain:

$$\begin{aligned} \alpha = \arccos (-c/M) + \beta . \end{aligned}$$
(14)

Equation (14) has two solutions, denoted by \(\alpha _1\) and \(\alpha _2\), from which we can compute two couples: \((R_1,t_1)\) and \((R_2,t_2)\).
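For completeness, a minimal MATLAB sketch of this step is given below; the coefficients a, b and c (whose full expressions are in [23]) are assumed to be already available.

% Two candidate angles solving a*sin(alpha) + b*cos(alpha) + c = 0, Eq. (13)
M      = sqrt(a^2 + b^2);
beta   = atan2(a, b);            % arctan(a/b), resolved on the proper branch
alpha1 =  acos(-c/M) + beta;     % Eq. (14), first solution
alpha2 = -acos(-c/M) + beta;     % second solution (other branch of arccos)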

The case \(\varDelta _x=0\) and \(\varDelta _y=0\) happens when the two landmarks and the camera focus lie on the same line. The vector defining that line can be found if it is not aligned with the gravity. Depending on its sense, there are two possible values for \(\alpha \), but, if we can distinguish pixels \(\tilde{P}_{1}\) and \(\tilde{P}_{2}\) (e.g., by colored landmarks), we get the correct one. If the line and gravity vectors are aligned, then the information carried by the line vector is the same provided by the IMUs, and we have infinitely many solutions for the rotation matrix.

Fig. 3: P2P with \(\varDelta _x = 0\), \(\varDelta _y = 0\): line vector not aligned with gravity

In this case, moreover, the two pixels \(\tilde{P}_{1}\) and \(\tilde{P}_{2}\) coincide: as the camera sees a single point, the distance between the reference frames cannot be obtained. We thus have the quite intuitive result:

Theorem 1

Using the image provided by the camera and the gravity vectors given by the IMUs, the P2P returns:

  • two solutions for R and t if \(\varDelta _x\ne 0\) or \(\varDelta _y\ne 0\);

  • a unique solution for R and infinitely many solutions for t when \(\tilde{P}_{1}=\tilde{P}_{2}\), the line between the two landmarks and \(O_{xyz}\) is not aligned with g, and the closest point is recognized (see Fig. 3);

  • infinitely many solutions for R and t when \(\tilde{P}_{1} = \tilde{P}_{2}\) and the line between the two landmarks and \(O_{xyz}\) is aligned with g.

This analysis shows that the P2P with accelerometers has two solutions whenever the camera sees two distinct points. However, whenever one of the two candidate transformation matrices maps a feature point out of the camera view, that solution can be discarded.

The camera view is the set of conic combinations of the vectors \(\{v_1,v_2,v_3,v_4\}\) going from the camera focus to the image corners. A point P will be in the camera view if and only if it can be expressed as a conic combination of either \(\{v_1,v_2,v_3\}\) or \(\{v_1,v_3,v_4\}\), i.e., if and only if either

$$\begin{aligned} \exists \; [k_1, k_2, k_3] \ge 0: [v_1\; v_2\; v_3]\,[k_1, k_2, k_3]^T = P \end{aligned}$$

or

$$\begin{aligned} \exists \; [k_1, k_3, k_4] \ge 0: [v_1\; v_3\; v_4]\, [k_1, k_3, k_4]^T = P. \end{aligned}$$

Since \(\{v_1,v_2,v_3\}\) and \(\{v_1,v_3,v_4\}\) are two sets of three linearly independent vectors of \({\mathbb {R}}^3\), such conditions are equivalent to

$$\begin{aligned} \left[ v_1\; v_2\; v_3 \right] ^{-1}\,P \ge 0 \end{aligned}$$
(15)

or

$$\begin{aligned} \left[ v_1\; v_3\; v_4 \right] ^{-1}\,P \ge 0. \end{aligned}$$
(16)

This fact has been used to define Algorithm 1, which tries to discard one of the two couples (R, t). To evaluate how effective this test is at discarding impossible configurations, we have performed ten thousand numerical simulations using that algorithm, with random configurations of the object and the camera. The results showed that Algorithm 1 is able to discard the wrong solution in \(47\%\) of the tests.

Algorithm 1
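A minimal MATLAB sketch of the camera-view test of Eqs. (15)–(16), which underlies Algorithm 1, is given below; the function name is ours, and the vectors v1, ..., v4 from the camera focus to the image corners must be supplied by the caller. In Algorithm 1, each candidate couple (R, t) is used to map the landmarks into the camera frame; if some landmark of a candidate fails this test, that candidate is discarded.

% True if the 3-D point P (camera frame) lies inside the camera view, i.e.,
% if it is a conic combination of {v1,v2,v3} or of {v1,v3,v4}.
function inside = in_camera_view(P, v1, v2, v3, v4)
    k123 = [v1, v2, v3] \ P;      % coefficients of Eq. (15)
    k134 = [v1, v3, v4] \ P;      % coefficients of Eq. (16)
    inside = all(k123 >= 0) || all(k134 >= 0);
end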

2.2 A heuristic for the P3P

Another way to mitigate the ambiguity in the solutions of the P2P problem is to use a third feature point \(P_3\), not collinear with \(P_1\) and \(P_2\). This also rules out the degenerate case in which the camera sees a single pixel. Assume, without loss of generality, that points \(P_1\) and \(P_2\) are not aligned with the camera focus. The proposed method is reported in Algorithm 2.

Algorithm 2

Although this is only a heuristic approach, in all the ten thousand simulations used to test Algorithm 1 it always identified the correct solution. We remark that, to increase robustness w.r.t. pixel noise, it is advisable to perform the P2P on the couple of points with the largest distance in the image plane.
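Since Algorithm 2 is reported only as a figure, the sketch below shows one plausible selection rule consistent with the discussion above; it is our own reading, not necessarily the exact rule of Algorithm 2: solve the P2P on two landmarks and keep the candidate couple that reprojects the third landmark with the smaller error.

% Heuristic P3P disambiguation (a plausible reading of Algorithm 2).
% cand(1) and cand(2) are the two (R,t) couples returned by the P2P on P1, P2;
% P3 is the third landmark and P3til its measured normalized pixel, as in Eq. (12).
err = zeros(1, 2);
for k = 1:2
    pc = cand(k).R * P3 + cand(k).t;          % third landmark in the camera frame
    reproj = [pc(1)/pc(3); pc(2)/pc(3)];      % its normalized reprojection
    err(k) = norm(reproj - P3til);            % reprojection error of P3
end
[~, kbest] = min(err);                        % keep the candidate with the smaller error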

2.3 Computing the translation vector for the P2P

Once matrix R has been estimated, the translation vector t must be computed; to this end, the method in reference [21] can be used. This method solves Eq. (12) to obtain \(t_x, t_y, t_z\):

$$\begin{aligned}&t_x = A_z \tilde{P}_{x,1} - r_1^T P_1 \nonumber \\&t_y = A_z \tilde{P}_{y,1} - r_2^T P_1 ,\nonumber \\&t_z = A_z - r_3^T P_1 \end{aligned}$$
(17)

where \(A_z\) is computed as follows:

$$\begin{aligned}&\mathbf{if} \;\; \tilde{P}_{x,2}\ne \tilde{P}_{x,1} \;\; \mathbf{then} \;\; A_z=\displaystyle \frac{(r_1^T-\tilde{P}_{x,2}r_3^T) \, (P_1-P_2)}{\tilde{P}_{x,1}-\tilde{P}_{x,2}}, \\&\mathbf{otherwise} \;\; A_z=\displaystyle \frac{(r_2^T-\tilde{P}_{y,2}r_3^T) (P_1-P_2)}{\tilde{P}_{y,1}-\tilde{P}_{y,2}}. \end{aligned}$$
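A direct MATLAB transcription of Eq. (17) reads as follows (a sketch with our variable names: r1, r2, r3 are the rows of R stored as column vectors, P1 and P2 the landmarks, Pt1 and Pt2 the normalized pixels \(\tilde{P}_1\), \(\tilde{P}_2\)).

% Translation vector from Eq. (17), given R and two correspondences.
if Pt1(1) ~= Pt2(1)
    Az = (r1' - Pt2(1)*r3') * (P1 - P2) / (Pt1(1) - Pt2(1));
else
    Az = (r2' - Pt2(2)*r3') * (P1 - P2) / (Pt1(2) - Pt2(2));
end
t = [ Az*Pt1(1) - r1'*P1;
      Az*Pt1(2) - r2'*P1;
      Az         - r3'*P1 ];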

This solution, however, can be sensitive to pixel noise, as we will show in the sequel. An alternative way to compute t is the minimization of the reprojection error

$$\begin{aligned} \mathbb {E}(t) = ||\tilde{P}_{1}(t) - \tilde{P}_{1}||^2 + ||\tilde{P}_{2}(t) -\tilde{P}_{2}||^2, \end{aligned}$$
(18)

where \(\tilde{P}_{1}(t)\) and \(\tilde{P}_{2}(t)\) are the pixels where points \(P_1\) and \(P_2\) are projected using the rotation matrix previously computed, the unknown translation vector t, and Eq. (2).

The reprojection error is widely used in the literature to test the performance of PnP solutions (see [10, 21, 22]), but the resulting optimization problem must be solved numerically. To overcome this difficulty, we propose a different least square index. To this end, we write Eqs. (12) in the explicit form \(Q\,t = s\), where

$$\begin{aligned} Q=\left[ \begin{array}{rrr} -1 & 0 & \tilde{P}_{x,1}\\ 0 & -1 & \tilde{P}_{y,1}\\ -1 & 0 & \tilde{P}_{x,2}\\ 0 & -1 & \tilde{P}_{y,2} \end{array}\right] ; \quad s = \left[ \begin{array}{l} r_1^T P_1-r_3^T P_1\tilde{P}_{x,1}\\ r_2^T P_1-r_3^T P_1\tilde{P}_{y,1}\\ r_1^T P_2-r_3^T P_2\tilde{P}_{x,2}\\ r_2^T P_2-r_3^T P_2\tilde{P}_{y,2} \end{array}\right] , \end{aligned}$$

and we solve for t minimizing an index alternative to (18):

$$\begin{aligned} \mathbb {E}_2(t) = ||Q\,t - s||^2, \end{aligned}$$
(19)

which defines the least mean square (LMS) solution of \(Q\,t = s\); this is given by the pseudo-inverse [26]:

$$\begin{aligned} t^{*} = Q^{\dag } s. \end{aligned}$$
(20)
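In MATLAB, the LMS solution of Eq. (20) for the two-point case is immediate (a sketch, with the same naming as above).

% Build Q and s as defined above and solve Eq. (19) by the pseudo-inverse, Eq. (20).
Q = [ -1  0  Pt1(1);
       0 -1  Pt1(2);
      -1  0  Pt2(1);
       0 -1  Pt2(2) ];
s = [ r1'*P1 - (r3'*P1)*Pt1(1);
      r2'*P1 - (r3'*P1)*Pt1(2);
      r1'*P2 - (r3'*P2)*Pt2(1);
      r2'*P2 - (r3'*P2)*Pt2(2) ];
tstar = pinv(Q) * s;              % LMS translation t*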

Owing to the nature of the LMS method, the solution \(t^*\) is expected to be less sensitive to pixel noise than the one obtained using Eq. (17). This has been verified by ten thousand numerical tests, simulating a feature with two points at \(P_1 = [0,0,0]\) m and \(P_2=[0.1,0.1,0]\) m, and a zero-mean Gaussian noise with a standard deviation of five pixels on the components of \(\tilde{P}_1\) and \(\tilde{P}_2\). In each test, the matrix \(R_t\) has been taken randomly. The comparison has been made through the absolute error

$$\begin{aligned} \varepsilon _t =||t - t_s||, \end{aligned}$$
(21)

where \(t_s\) is the solution from either Eq. (17) or Eq. (20).

Fig. 4: Values of index \(\varepsilon _t\) in some of the performed tests

As expected (see Fig. 4), the errors obtained using the pseudo-inverse are much smaller than the ones obtained by Eq. (17): an average error of 0.029 m results from Eq. (20), versus an average error of 0.24 m from Eq. (17).

Remark

Indexes \(\mathbb {E}\) and \(\mathbb {E}_2\) are related by \(\mathbb {E}(t)=||M\,(Q\,t - s)||^2\), where

$$\begin{aligned} M = \mathrm{diag}\left\{ \frac{1}{r_3^T P_1+t_z}, \frac{1}{r_3^T P_1+t_z}, \frac{1}{r_3^T P_2+t_z}, \frac{1}{r_3^T P_2+t_z} \right\} . \end{aligned}$$

For this reason, under the assumption \(t_z \gg r_3^T (P_1-P_2)\), the matrix M is approximately a multiple of the identity, and the minimizer of \(\mathbb {E}_2\) can be taken as an estimate of the minimizer of \(\mathbb {E}\).

Fig. 5: Values of index \(\varepsilon _t\) when minimizing \(\mathbb {E}\) and \(\mathbb {E}_2\)

Figure 5 shows, for some of the ten thousand tests, the values of \(\varepsilon _t\) computed using both the vector \(t^*\) given by Eq. (20) and the vector t obtained by the (computationally expensive) numerical optimization of \(\mathbb {E}\). No estimation advantage is evident in using \(\mathbb {E}\); on the other hand, \(\mathbb {E}_2\) outperforms \(\mathbb {E}\) in terms of computational time, with an average CPU time of 0.002 s against 0.556 s.

2.4 Computing the translation vector for the PnP

In this section, we consider again the case where the rotation matrix is assumed known, and we generalize the results presented in the previous subsection to the case \(n \ge 3\), assuming that at least three feature points are not collinear. As before, the problem of estimating the vector t will be formulated as a set of linear equations. Given the n correspondences between landmarks and pixels, the matrix Q and the vector s defined before now contain 2n rows (two for each correspondence):

$$\begin{aligned}&Q_{i,1}=[-1, \ 0, \tilde{P}_{x,i}], \quad s_{i,1}=r_1^TP_i - r_3^TP_i\tilde{P}_{x,i}\\&Q_{i,2}=[ 0, \ -1, \tilde{P}_{y,i}], \quad s_{i,2}=r_2^TP_i - r_3^TP_i\tilde{P}_{y,i}, \end{aligned}$$

so that the matrix Q and the vector s take the form:

$$\begin{aligned} Q = \left[ \begin{array}{c} Q_{1,1}\\ Q_{1,2}\\ \vdots \\ Q_{n,1}\\ Q_{n,2}\\ \end{array}\right] ,\quad s =\left[ \begin{array}{c} s_{1,1}\\ s_{1,2}\\ \vdots \\ s_{n,1}\\ s_{n,2}\\ \end{array}\right] , \end{aligned}$$
(22)

and again the LMS solution will be given by \(t^* = Q^{\dag } s\).
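A sketch of the stacking in Eq. (22) for a generic n is given below (our naming: Pts is a 3×n matrix whose columns are the landmarks, Ptil a 2×n matrix of the corresponding normalized pixels).

% Build the 2n x 3 matrix Q and the 2n x 1 vector s of Eq. (22).
function [Q, s] = build_Qs(R, Pts, Ptil)
    n = size(Pts, 2);
    Q = zeros(2*n, 3);  s = zeros(2*n, 1);
    for i = 1:n
        Pi = Pts(:, i);
        Q(2*i-1, :) = [-1,  0, Ptil(1, i)];
        Q(2*i,   :) = [ 0, -1, Ptil(2, i)];
        s(2*i-1) = R(1,:)*Pi - (R(3,:)*Pi)*Ptil(1, i);
        s(2*i)   = R(2,:)*Pi - (R(3,:)*Pi)*Ptil(2, i);
    end
end
% The LMS translation is then tstar = pinv(Q) * s.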

If matrix R has not yet been estimated, both Q and s depend on the unknown parameter \(\alpha \), i.e., \(Q = Q(\alpha )\) and \(s = s(\alpha )\); as a consequence, also \(t = t(\alpha )\). This will be exploited in the next section to solve the general PnP Problem.

2.5 The general PnP using accelerometers only

To solve the PnP problem in our setting, we first have to find the angle \(\alpha \) and then to substitute it in the rotation matrix \(R(\alpha )\) and in the translation vector \(t = t(\alpha )\), given the knowledge of the n feature points \(P_i\) in the object reference frame and the related pixels \(\tilde{P}_i\).

Also in this case, a possible solution is to define the reprojection error

$$\begin{aligned} J(\alpha ,t) = \sum _{i=1}^n||\tilde{P}_{i}(\alpha ,t)-\tilde{P}_{i}||^2, \end{aligned}$$
(23)

\(\tilde{P}_{i}(\alpha ,t)\) being the pixel where feature point \(P_i\) is mapped by the transformation matrix \(R_t(\alpha )\), and to solve the problem

$$\begin{aligned} \{\alpha ^*,t^*\} = \arg \min _{\alpha ,t}J(\alpha ,t). \end{aligned}$$
(24)

Once the optimal pair \((\alpha ^*,t^*)\) has been estimated, the related transformation matrix \(R_t(\alpha ^*)\) will be the solution of the PnP problem.

Again, the alternative LMS cost function is proposed:

$$\begin{aligned} J_2(\alpha ,t) = ||Q(\alpha )\,t - s(\alpha )||^2, \end{aligned}$$
(25)

where \(Q(\alpha )\) and \(s(\alpha )\) are given by Eq. (22) with \(R = R(\alpha )\). The same considerations about the robustness of the LMS solution w.r.t. pixel noise made for the P2P case apply.

The LMS problem associated with the PnP becomes

$$\begin{aligned} \{\alpha ^*,t^*\} = \arg \min _{\alpha ,t }J_2(\alpha ,t), \end{aligned}$$
(26)

and this four-variable problem is reformulated as a single-variable one: by Eq. (20), the translation vector t depends on \(\alpha \) only. Thus, \(t^* = t(\alpha ^*)\) and Problem (26) translates to:

$$\begin{aligned} \alpha ^* = \arg \min _{\alpha }J_2(\alpha ,t(\alpha )). \end{aligned}$$
(27)

The advantage of this formulation is that Problem (27) is a single-variable, unconstrained optimization problem, which can be solved very efficiently, e.g., using Nelder–Mead’s method [27]. A good initial guess for the solution can be obtained by solving a P3P problem on three arbitrary feature points.
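Putting the pieces together, the \(\alpha \)-PnP solution of Problem (27) can be sketched in MATLAB as follows; the naming is ours, rotation_from_gravity and build_Qs are the sketches given earlier, and alpha0 is the initial guess (e.g., from a P3P on three arbitrary feature points).

% alpha-PnP: single-variable minimization of J2(alpha, t(alpha)), Eq. (27).
alphaStar = fminsearch(@(a) cost_alpha(a, gc, go, Pts, Ptil), alpha0);  % Nelder-Mead
[R, ~]    = rotation_from_gravity(gc, go, alphaStar, zeros(3,1));
[Q, s]    = build_Qs(R, Pts, Ptil);
tStar     = pinv(Q) * s;                    % t(alpha*), Eq. (20)
RtStar    = [R, tStar; 0 0 0 1];            % transformation matrix, Eq. (11)

function J = cost_alpha(alpha, gc, go, Pts, Ptil)
    % J2(alpha, t(alpha)) of Eq. (25) with t eliminated through Eq. (20)
    [R, ~] = rotation_from_gravity(gc, go, alpha, zeros(3,1));
    [Q, s] = build_Qs(R, Pts, Ptil);
    t = pinv(Q) * s;
    J = norm(Q*t - s)^2;
end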

In Fig. 6, the shape of function \(J_2(\alpha ,t(\alpha ))\) is shown, with \(\alpha \in [-\pi ,\pi )\), for some of the experimental tests. In all cases, the function \(J_2\) looks very much like a sinusoid, and the two slightly different minima always lie one in \([-\pi ,0)\) and the other one in \([0,\pi )\).

3 Experimental results

To assess the performance of the method, we have performed numerical simulations and experiments with real data. In particular, the eight algorithms reported in [28]: DLS [29], ASPnP [30], OPnP [31], RPnP [32], PPnP [33], EPnP [10], LHM [12] and finally EPPnP [28], all implemented in the REPPnP Toolbox [35], have been compared to the solution obtained by solving Problem (27), denoted by \(\alpha \)-PnP and computed using the MATLAB function fminsearch, which implements Nelder–Mead’s method. In the experiments using real data, the recent method from [34], denoted by SRPnP, has been added to the previous eight.

The results have been compared using the two indexes suggested in [28, 31] (a short computational sketch follows the list):

  • rotation error:

    $$\begin{aligned} \epsilon _{rot} = 180 \max _{k = 1, 2, 3}\arccos (c_k^T c^*_k)\,/\pi , \end{aligned}$$

    where \(c_k\) and \(c^*_k\) are the k-th columns of R and of \(R(\alpha ^*)\);

  • relative translation error:

    $$\begin{aligned} \epsilon _t(\%) = \displaystyle 100 \, ||t - t^*||/||t||, \end{aligned}$$

    where t and \(t^*\) are the real t and \(t(\alpha ^*)\), respectively.
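A small MATLAB sketch of these two indexes is the following (our naming: Rtrue, tTrue are the ground truth, Rest, tEst the estimates).

% Rotation error (degrees) and relative translation error (percent).
angles = zeros(1, 3);
for k = 1:3
    angles(k) = acos( Rtrue(:, k)' * Rest(:, k) );   % angle between k-th columns
end
eps_rot = 180 * max(angles) / pi;
eps_t   = 100 * norm(tTrue - tEst) / norm(tTrue);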

Fig. 6: Shape of function \(J_2(\alpha ,t(\alpha ))\) from some real data

3.1 Numerical simulations

In the numerical simulations, feature points have been taken as \(P_1 = (0,0,0)\), \(P_2 = (0.1,0.1,0)\), \(P_3 = (0.1,0,0)\), and the remaining ones taken randomly in the box \(B_f = [-0.2,0.2]^3\).

For each test, random relative rotation and translation between the camera and the object have been generated, with roll, pitch and yaw angles in the range \((-\pi ,\pi ]\), and translations in \(B_t =[-0.5,0.5] \times [-0.5,0.5] \times [0.5,2.5]\).

A camera with ratio \({f}/{dpx} = 800\) pixels, a resolution of \(640 \times 480\) pixels and center \(C_I=(320,240)\) has been simulated, and the data have been corrupted by zero-mean Gaussian noise (a generation sketch follows the list) characterized by:

  • \(\varSigma _g = \sigma _g\, I_3\), \(\sigma _g=0.001\) m, for \(\hat{g}_{c}\) and \(\hat{g}_{o}\);

  • \(\sigma _{p} = 4\) pixels in the image.
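The random configurations described above can be generated as in the following sketch; the details (how the noise is applied to the gravity unit vectors, the choice of the true gravity direction) are our assumptions, and project_landmark is the projection sketch of Sect. 1.3.

% One random test configuration (a sketch of the simulation protocol).
rpy = -pi + 2*pi*rand(1, 3);                         % roll, pitch, yaw in (-pi, pi]
Rx = [1 0 0; 0 cos(rpy(1)) -sin(rpy(1)); 0 sin(rpy(1)) cos(rpy(1))];
Ry = [cos(rpy(2)) 0 sin(rpy(2)); 0 1 0; -sin(rpy(2)) 0 cos(rpy(2))];
Rz = [cos(rpy(3)) -sin(rpy(3)) 0; sin(rpy(3)) cos(rpy(3)) 0; 0 0 1];
Rtrue = Rz*Ry*Rx;                                    % random rotation
tTrue = [-0.5 + rand(2, 1); 0.5 + 2*rand];           % random translation in B_t
gcTrue = randn(3, 1);  gcTrue = gcTrue/norm(gcTrue); % gravity direction in the camera frame
goTrue = Rtrue' * gcTrue;                            % consistent with g_c = R g_o
gc = gcTrue + 0.001*randn(3, 1);                     % noisy IMU measurements
go = goTrue + 0.001*randn(3, 1);
Pstar = project_landmark(P1, Rtrue, tTrue, 800, 1, 320, 240) + 4*randn(1, 2);  % noisy pixel, f/dpx = 800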

Fig. 7: Mean rotation error (degrees) versus number of feature points: simulative results

Fig. 8: Mean translation error (percent) versus number of feature points: simulative results

The performance of the various algorithms has been assessed using n-point features, with \(n \in \{10, 30, 50, 70, 90\}\), in 200 random configurations for each value of n. Figures 7 and 8 show the average values of \(\epsilon _{rot}\) and \(\epsilon _{t}(\%)\) versus n, and Fig. 9 shows the mean CPU time. The time performance has been estimated using MATLAB tic and toc functions.

It is evident that the \(\alpha \)-PnP algorithm gives the best results in terms of accuracy, although some others are close, in particular for a small number of feature points. Even more evident is the \(\alpha \)-PnP CPU time performance: it is up to two orders of magnitude faster than all the others. It is also interesting to observe that, beyond about 30 feature points, adding more points to the feature does not noticeably improve the accuracy of the solution.

3.2 Real experiments—static

In the experimental setting, we have used the following hardware:

  • a Logitech C310 HD camera; the parameters of the camera have been estimated using the CC Toolbox [37];

  • two ArduIMU v3 IMUs [38];

  • the four-square feature shown in Figure 10. Each corner of the squares can be used as a feature point.

Fig. 9: Mean CPU time (milliseconds) versus number of feature points: simulative results

Fig. 10: The four-square feature used in the real experiments

To mitigate the systematic errors due to the experimental setup, all the tests have been performed according to the following procedure:

  1. the camera and the object have been set in an unknown configuration, and an estimate \(\hat{R}'_t\) of the transformation matrix \(R'_t\) has been obtained by the various algorithms;

  2. the object has been rotated and translated with a given matrix \(R''_t\), and an estimate \(\hat{R}''_t\) of such matrix has been obtained. To enforce accurate placement, the rotation has been taken by repeated bisections of an initial \({\pi }/{2}\) angle;

  3. an estimate \(\hat{R}_t\) of the displacement matrix between the two configurations has been computed as \(\hat{R}_t =[\hat{R}'_t]^{-1} \hat{R}''_t\).

The performance of the eight algorithms reported in [28] and of the \(\alpha \)-PnP has been compared in a set of fifty experiments, performed according to the above procedure. Each picture has been used to run the various algorithms with \(n \in \{4, 8, 12, 16\}\).

The average orientation and translation errors computed over \(R_t\) and \(\hat{R}_t\) are shown in Figs. 11 and 12. We see that, even in the tests using real data, the \(\alpha \)-PnP algorithm exhibits a sound performance (the best one in estimating the translation and the third best in estimating the rotation), with a computational cost, see Fig. 13, at least twenty times smaller than those of the two best methods in estimating the rotation.

3.3 Real experiments—dynamic

To further investigate the performance of the proposed technique, a second experiment with real data has been carried out. The four-square feature shown in Fig. 10 and one IMU have been placed on the mobile robot Khepera III [39], and the robot has been programmed to perform, in 40 s, a rectangular trajectory of \(1.5 \times 1\) m with a trapezoidal speed profile.

The camera, equipped with the second IMU, has been placed on top of the motion plane at a distance of 3 m; sixteen equally time-spaced pictures have been taken during the journey, and the rotation and translation errors have been computed and averaged over those pictures.

Fig. 11: Mean rotation error (degrees) versus number of feature points: real experiment

Fig. 12: Mean translation error (percent) versus number of feature points: real experiment

Fig. 13: Mean CPU time (milliseconds) versus number of feature points: real experiment

Fig. 14: Mean rotation error (degrees) versus number of feature points: dynamic experiment

Fig. 15: Mean translation error (percent) versus number of feature points: dynamic experiment

Fig. 16: Mean CPU time (milliseconds) versus number of feature points: dynamic experiment

The overall results with \(n \in \{4, 8, 12, 16\}\) feature points are reported in Figs. 14 to 16. Also in this case, there is no clear winner among the various algorithms in terms of estimation performance (the LHM wins on the rotation estimation and the DLS wins on the translation estimation). However, the proposed method again performs computationally much better than its competitors, as it is 100 times faster than the DLS and 20 times faster than the LHM. In some sense, it guarantees a good, balanced performance between translation and rotation estimation, and it obtains the estimates at a very high speed.

4 Discussion

The simulative and experimental results have shown that fusing the measurements given by a camera and two IMUs makes it possible to obtain a fast and accurate estimate of the relative position and orientation between two objects.

In both the simulations and the experimental tests, the proposed \(\alpha \)-PnP algorithm outperforms most of the methods listed in [28] in terms of estimation error. Only the LHM, and occasionally the recent SRPnP, have given a better estimation of the rotation, at the price of a much higher computational effort (about 30 times or more).

The CPU time required by the \(\alpha \)-PnP is from 20 to 100 times smaller than that of all the other methods, making it suitable for real-time applications. Moreover, the accuracy of the estimates it provides is practically independent of the number of feature points used.

We also emphasize that, to the authors’ knowledge, the results presented here are the first ones where the experimental results are evaluated in terms of rotation and translation errors. Usually, when dealing with experimental data, only the reprojection index is used to test the performance. However, such an index is not a robust evaluation criterion, owing to the strong nonlinearities and ambiguities embedded in the pinhole reprojection model.

Considering the low cost of IMUs and the standard linear algebra techniques involved, we believe that the proposed method is a useful tool for facing the pose estimation problem.