An Epipolar Geometry-Based Approach for Vision-Based Indoor Localization

  • Yinan Liu
  • Lin Ma
  • Xuedong Wang
  • Weixiao Meng
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 463)


Indoor positioning is attracting growing research attention. We propose an epipolar geometry-based method for vision-based indoor localization. The method requires a single image captured at the position to be localized. SURF is used to extract feature points from this query image, and the points are screened so that only reliable ones are kept. The retained feature points are matched against the feature points in a database, which are extracted from images whose positions are already known. The matched feature points are used to calculate the essential matrix, which contains the translation and rotation information. The localization is then completed from the relationship between the query image and the images in the database. In addition, the database stores feature points rather than the images themselves, which reduces storage space and speeds up localization.


Indoor localization, Epipolar geometry, SURF, Essential matrix

1 Introduction

A satellite-based localization system (an aggregation of interconnected devices used to determine spatial coordinates) ensures that four satellites can be observed simultaneously at any time and from anywhere, and uses them to obtain the longitude and latitude of the observation point for navigation. This technology allows cars, ships, airplanes and people to reach their destinations safely and accurately along a planned route.

Nowadays, people rely heavily on positioning services. Most positioning systems target outdoor localization; however, indoor positioning has become an active research topic because of the large demand.

According to scientific investigations, 80% of external spatial information is related to vision, and human beings have their own way of digesting this huge amount of visual information. In fact, the cerebral cortex processes and analyzes the information as soon as it is collected by the eyes. Photoreceptor cells translate the visual information into neural impulses, which are passed to the cortex through nerve fibers; finally, the useful information is selected. With the development of image processing technology, computers have been given the function of human eyes and are used to process visual information, and so the frontier technology of computer vision has emerged.

One popular application of computer vision is image-based localization. The proposed methods can be classified into two groups. In the first, researchers take advantage of landmarks (such as logos) present in the environment to estimate the camera matrix and extract the query location [3, 4]. These methods can only be applied in environments where highly detectable logos are present and their 3D coordinates can be measured. The second group includes works that use a stored image database annotated with the position information of the cameras, i.e. image fingerprinting-based methods such as [9]. Upon receiving the query image, feature extraction and matching (using features such as SURF [10], SIFT, corners, etc.) are performed between the query image and all the database images.

Calculating the essential matrix is of vital importance in the system. First, we calculate the fundamental matrix (the unnormalized essential matrix) from the matched feature points. Among the various algorithms for estimating the fundamental matrix, the linear ones are the simplest. One of the most famous linear algorithms is the ‘eight-point algorithm’ proposed by Longuet-Higgins in 1981. The ‘eight-point algorithm’ is fast, but it is very sensitive to noise and is therefore difficult to apply in practice. Hence, a non-linear algorithm was given by Faugeras in 1992, which was more stable and precise than the ‘eight-point algorithm’. After Faugeras, Hartley improved the ‘eight-point algorithm’ into the ‘adjust eight-point algorithm’ (commonly known as the normalized eight-point algorithm), which normalizes the pre-matched points before applying the ‘eight-point algorithm’ to calculate the fundamental matrix. The ‘adjust eight-point algorithm’ has been widely applied because it improves robustness and reduces sensitivity to noise. Additionally, the ‘seven-point algorithm’ estimates the fundamental matrix from seven pairs of matching points. This algorithm reduces the amount of computation but is also sensitive to noise. After this comparison, we prefer the ‘adjust eight-point algorithm’.

2 Selecting and Matching Interest Points

There are several algorithms for selecting and matching feature points between the query image and the images in the database, such as SIFT and SURF.

SIFT (Scale-Invariant Feature Transform) is an algorithm used to detect local features in an image. It describes an image by selecting interest points together with their scale and orientation descriptors, and it works well. The algorithm is not only scale invariant; it still gives good detection results when the rotation angle, the image brightness or the camera viewpoint changes.

SURF (Speeded-Up Robust Features) builds on SIFT; it not only improves computing speed but is also more stable and robust. The extraction of feature points depends on the properties of the acquired image and on the feature point matching method. Commonly used features include corner features (such as the Harris operator), line features (image edge detection), local regions (blobs) and invariant features (such as scale-invariant features). Given indoor factors such as complex backgrounds and illumination, this paper adopts SURF, the fast scale-invariant feature extraction algorithm proposed by Bay. It is more robust to illumination and image changes when extracting feature points, has better scale invariance than Harris, and has lower time complexity than SIFT [8, 9, 10].

2.1 SURF Interest Points’ Selecting and Matching

SURF uses the Hessian matrix \( H\left( {x,\sigma } \right) \) to search for interest points, where \( x \) is the image coordinate and \( \sigma \) is the scale. \( L_{xx} \left( {x,\sigma } \right) \) is the convolution of the query image with the Gaussian second derivative \( \frac{{\partial^{2} }}{{\partial x^{2} }}g\left( \sigma \right) \), where \( g\left( \sigma \right) \) is the Gaussian function; \( L_{xy} \left( {x,\sigma } \right) \) and \( L_{yy} \left( {x,\sigma } \right) \) have similar meanings. SURF mainly takes advantage of the integral image, with which the sum over any rectangular region needs only three additions and four memory accesses; this greatly reduces the computational complexity and improves the running speed. The Hessian matrix is given by Eq. 1:
$$ H\left( {x,\sigma } \right) = \left[ {\begin{array}{*{20}c} {L_{xx} \left( {x,\sigma } \right)} & {L_{xy} \left( {x,\sigma } \right)} \\ {L_{xy} \left( {x,\sigma } \right)} & {L_{yy} \left( {x,\sigma } \right)} \\ \end{array} } \right] $$
To reduce the computation time, the Gaussian second derivatives are simplified to rectangular box filters. We use \( D_{xx} \), \( D_{xy} \) and \( D_{yy} \) to represent the convolutions of the box filters with the image. The determinant of the Hessian matrix can then be approximated by Eq. 2:
$$ \det \left( {H_{approx} } \right) = D_{xx} D_{yy} - \left( {0.9D_{xy} } \right)^{2} $$
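The integral-image trick and the approximated determinant can be illustrated with a minimal Python sketch. The function names (`integral_image`, `box_sum`, `hessian_response`) are hypothetical and chosen only for this illustration; the box-filter responses are assumed to be computed elsewhere from such box sums.

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero first row and column, so that
    ii[r, c] equals the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using four memory accesses
    and three additions/subtractions."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def hessian_response(d_xx, d_yy, d_xy, weight=0.9):
    """Approximated determinant of the Hessian from box-filter responses."""
    return d_xx * d_yy - (weight * d_xy) ** 2

img = np.arange(12, dtype=np.float64).reshape(3, 4)
ii = integral_image(img)
inner = box_sum(ii, 1, 1, 3, 3)  # same value as img[1:3, 1:3].sum()
```

Because every box sum has the same cost regardless of the box size, larger filters cost no more than small ones, which is what makes the box-filter approximation fast.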
The scale invariance of the SURF algorithm relies on looking for features at different scales. The scale space is divided into octaves, and each octave consists of the responses obtained by convolving the image with filter templates of increasing size. The filter sizes in the first octave are \( 9 \times 9 \), \( 15 \times 15 \), \( 21 \times 21 \) and \( 27 \times 27 \), an increment of 6; the other octaves are similar, but the increment doubles from one octave to the next (6, 12, 24). Figure 1 shows the filter templates, starting from the \( 9 \times 9 \) filter; the scale increases from the bottom to the top of the inverted pyramid.
Fig. 1. Filters in scaling

To locate the interest points, we apply non-maximum suppression in three-dimensional (scale and image) space to find the extreme points. That is, a candidate is accepted as a feature point only if its Hessian response is the extremum among 27 points (its 26 neighbors and itself). We then interpolate in scale space and image space to obtain the final feature point location and scale value. Figure 2 shows the result of image feature point extraction using SURF.
Fig. 2. SURF result
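The 27-point non-maximum suppression described above can be sketched as follows. This is only an illustration under stated assumptions: the Hessian responses are assumed to be stacked in an array of shape (scales, rows, columns), the function name `local_maxima_3d` and the `threshold` parameter are hypothetical, and the sub-pixel interpolation step is not shown.

```python
import numpy as np

def local_maxima_3d(resp, threshold=0.0):
    """Return (scale, row, col) indices whose response is the strict
    maximum of its 3 x 3 x 3 neighborhood (26 neighbors and itself)."""
    peaks = []
    n_scales, n_rows, n_cols = resp.shape
    for s in range(1, n_scales - 1):
        for r in range(1, n_rows - 1):
            for c in range(1, n_cols - 1):
                value = resp[s, r, c]
                if value <= threshold:
                    continue  # weak responses are discarded early
                cube = resp[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                # strict maximum: no other entry in the cube reaches value
                if value >= cube.max() and np.count_nonzero(cube == value) == 1:
                    peaks.append((s, r, c))
    return peaks

resp = np.zeros((3, 5, 5))
resp[1, 2, 2] = 5.0
peaks = local_maxima_3d(resp)
```

Border samples are skipped here because they do not have a full set of 26 neighbors.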

Fig. 3. The basic principle of epipolar geometry

2.2 SURF Feature Descriptors

To give the features better rotational invariance, each feature must be assigned a main direction. The concrete method is: (1) The Haar wavelet responses are calculated for the points in a circular region around the interest point whose radius is six times the scale of the interest point. (2) We sum the Haar wavelet responses dx and dy within a sliding sector window of angle \( \frac{\pi }{3} \), which yields a new vector \( \left( {m_{w} ,\theta_{w} } \right) \), where

\( m_{w} = \sum\limits_{w} {{\text{d}}x} + \sum\limits_{w} {{\text{d}}y} \), \( \theta_{w} = \arctan \left[ {\frac{{\sum\limits_{w} {{\text{d}}x} }}{{\sum\limits_{w} {{\text{d}}y} }}} \right] \). The longest vector represents the main direction. We set up a coordinate system based on the interest point and the main direction, divide the square region around the interest point into \( 4 \times 4 \) sub-regions, and compute for each sub-region the vector \( V = \left[ {\begin{array}{*{20}c} {\sum {{\text{d}}x} } & {\sum {\left| {{\text{d}}x} \right|} } & {\sum {{\text{d}}y} } & {\sum {\left| {{\text{d}}y} \right|} } \\ \end{array} } \right] \). The 16 vectors are concatenated into a 64-dimensional vector, called the feature descriptor, which is largely unaffected by rotation.
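As a rough illustration of how the 16 sub-region vectors form the 64-dimensional descriptor, the sketch below assumes that the Haar responses dx and dy have already been sampled on a 20 x 20 grid aligned with the main direction (5 x 5 samples per sub-region). The grid size, the final unit-length normalization and the name `surf_like_descriptor` are assumptions made for this example and are not stated in the text above.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Concatenate [sum dx, sum |dx|, sum dy, sum |dy|] over 4 x 4
    sub-regions of a 20 x 20 grid of Haar responses (assumed layout)."""
    assert dx.shape == dy.shape == (20, 20)
    parts = []
    for i in range(4):
        for j in range(4):
            bx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            by = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            parts.extend([bx.sum(), np.abs(bx).sum(),
                          by.sum(), np.abs(by).sum()])
    v = np.array(parts, dtype=np.float64)
    norm = np.linalg.norm(v)
    # unit-length normalization (assumed) reduces sensitivity to contrast
    return v / norm if norm > 0 else v

desc = surf_like_descriptor(np.ones((20, 20)), -np.ones((20, 20)))
```

The absolute-value sums keep information about the strength of intensity changes even when positive and negative responses cancel in the plain sums.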

2.3 Interest Points Matching

The accuracy of the feature points directly affects the matching results. In addition, mismatches can arise during image matching because of factors such as distortion and inconsistency between the images. To reduce mismatches, an appropriate similarity measure must be selected. We choose a distance-based similarity measure. Let \( n_{1} \) and \( n_{2} \) be the numbers of interest points in the two images Q and T, with descriptors \( Q_{i} \) (\( i = 1,2, \ldots ,n_{1} \)) and \( T_{j} \) (\( j = 1,2, \ldots ,n_{2} \)). The distance similarity [4] is given by Eq. 3:
$$ D(Q_{i} ,T_{j} ) = \sqrt {\sum\limits_{k = 1}^{m} {(Q_{i}^{k} - T_{j}^{k} )^{2} } } $$

In the equation, m is the dimension of the descriptor. The smaller D is, the higher the similarity. Because only 8 pairs of interest points are needed to calculate the fundamental matrix, we use the ratio of the nearest distance to the second-nearest distance to select the best 8 pairs of interest points (the smaller the ratio, the better the match).
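The screening step can be sketched in a few lines of Python. This is a minimal illustration, assuming the descriptors are stored as rows of two NumPy arrays and that T contains at least two descriptors; the function name `best_matches` and its `keep` parameter are hypothetical.

```python
import numpy as np

def best_matches(Q, T, keep=8):
    """Rank matches by the ratio of nearest to second-nearest Euclidean
    distance (Eq. 3) and return the `keep` pairs with the smallest ratio.
    Q: (n1, m) descriptors, T: (n2, m) descriptors, with n2 >= 2."""
    # pairwise Euclidean distances D(Q_i, T_j)
    d = np.linalg.norm(Q[:, None, :] - T[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    rows = np.arange(len(Q))
    nearest, second = order[:, 0], order[:, 1]
    # small constant avoids division by zero for duplicate descriptors
    ratio = d[rows, nearest] / np.maximum(d[rows, second], 1e-12)
    best = np.argsort(ratio)[:keep]
    return [(int(i), int(nearest[i]), float(ratio[i])) for i in best]

Q = np.array([[0.0, 0.0], [10.0, 10.0]])
T = np.array([[0.0, 1.0], [10.0, 10.0], [50.0, 50.0]])
matches = best_matches(Q, T, keep=2)
```

Each returned tuple holds the index in Q, the index of its nearest neighbor in T and the distance ratio, ordered from the most to the least distinctive match.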

3 The Epipolar Geometry in Localization

After searching, matching and screening the interest points, we have enough high-quality matches to compute the epipolar geometry between the query image and the images in the database. Because the locations of the database images are already known, we can finally obtain the coordinates of the query image.

3.1 A Brief Introduction of Epipolar Geometry

Epipolar geometry describes the geometric relationship between two images of the same scene. It is independent of the structure of the scene and depends only on the camera parameters. Epipolar geometry is therefore the intrinsic projective geometry between the two images. It is widely applied in domains such as image matching and three-dimensional reconstruction. In image matching, the essential purpose of the algorithm is to recover the epipolar geometry.

Stereo vision based on epipolar geometry has the same starting point and objective: giving computers and intelligent robots the function of human eyes. Its operating principle for positioning is therefore similar to that of human eyes: a picture of a specific scene is taken, and its feature points are matched with those of existing pictures in the gallery taken from different angles. Finally, the 3D geometric information is restored by calculating the position deviation (parallax) between image pixels using the triangulation principle.

Assume that a point X in three-dimensional space is projected into two cameras. The projection in the left camera is called the left view image and the projection in the right camera is called the right view image. Let C and C′ be the optical centers of the two cameras, and let x and x′ be the image points of X. The line connecting the optical centers C and C′ is called the baseline, and it crosses the two image planes at the points e and e′, known as the epipoles. X, C and C′ are coplanar, and the plane they define is called the epipolar plane \( \pi \). The intersections of the epipolar plane with the image planes, l and l′, are called the epipolar lines. Because x (x′) lies on both the epipolar plane \( \pi \) and the image plane, l (l′) must pass through x (x′). Therefore, x (x′) can be found on the epipolar line l (l′), and it is not necessary to search the whole image plane. This provides an important epipolar constraint that reduces the search space for corresponding points from two dimensions to one. When the point X moves in three-dimensional space, all the resulting epipolar lines pass through the epipoles e (e′), which are the intersection points of the baseline with the image planes.

3.2 The 8-Point Algorithm

Let X be a point whose projections onto the two image planes are x and x′. These satisfy the fundamental equation \( (x^{\prime})^{T} F x = 0 \), that is, x′ transposed multiplied by F and then by x gives 0. F is the fundamental matrix from the left image to the right image. As the formula shows, the fundamental matrix is directional: the fundamental matrix from the right image to the left image is \( F^{T} \). F has the following special properties:
  (1) The rank of F is 2.

  (2) As a \( 3 \times 3 \) matrix, F has 7 degrees of freedom.


Normally a \( 3 \times 3 \) matrix has 9 degrees of freedom. F loses 2 degrees of freedom because it is defined only up to a constant factor and its determinant is zero. In detail, if F is a fundamental matrix, kF (for any non-zero k) is also a fundamental matrix, so the fundamental matrix is not unique, which removes one degree of freedom. The zero determinant, which follows from the rank of F being 2, removes another.

Following the analysis in the introduction (Sect. 1), we choose the ‘adjust eight-point algorithm’. The advantage of the eight-point algorithm is that it is linear, easy to implement, and fast to compute.

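We use \( x_{i} \), \( x_{i}^{\prime} \) (i = 1, 2, …, 8) for the 8 pairs of interest points and \( f_{ij} (1\, \le \,i\, \le \,3,1\, \le \,j\, \le \,3) \) for the entries of F, which satisfy \( (x^{\prime})^{T} F x = 0 \). Each point can be written in normalized homogeneous form, \( x_{i} = \left( {u_{i} ,v_{i} ,1} \right) \) and \( x_{i}^{\prime} = \left( {u_{i}^{\prime} ,v_{i}^{\prime} ,1} \right) \). Then we get Eq. 4: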
$$ u^{{\prime }} uf_{11} + u^{{\prime }} vf_{12} + u^{{\prime }} f_{13} + v^{{\prime }} uf_{21} + v^{{\prime }} vf_{22} + v^{{\prime }} f_{23} + uf_{31} + vf_{32} + f_{33} = 0 $$
Stacking Eq. 4 for the 8 pairs, we build the matrix A and the vector f such that Af = 0:
$$ \left\{ {\begin{array}{*{20}l} {A = \left[ {\begin{array}{*{20}c} {u_{1}^{\prime} u_{1} } & {u_{1}^{\prime} v_{1} } & {u_{1}^{\prime} } & {v_{1}^{\prime} u_{1} } & {v_{1}^{\prime} v_{1} } & {v_{1}^{\prime} } & {u_{1} } & {v_{1} } & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {u_{8}^{\prime} u_{8} } & {u_{8}^{\prime} v_{8} } & {u_{8}^{\prime} } & {v_{8}^{\prime} u_{8} } & {v_{8}^{\prime} v_{8} } & {v_{8}^{\prime} } & {u_{8} } & {v_{8} } & 1 \\ \end{array} } \right]} \hfill \\ {f = \left( {\begin{array}{*{20}c} {f_{11} } & {f_{12} } & {f_{13} } & {f_{21} } & {f_{22} } & {f_{23} } & {f_{31} } & {f_{32} } & {f_{33} } \\ \end{array} } \right)^{T} } \hfill \\ \end{array} } \right. $$

All solutions f differ only by a constant factor, so we add the constraint \( \left\| f \right\|\,\text{ = }\,1 \). Then f is the eigenvector of \( A^{T} A \) corresponding to its minimum eigenvalue. With the singular value decomposition \( A = UDV^{T} \) and \( V = \left[ {\begin{array}{*{20}c} {v_{1} } & {v_{2} } & {v_{3} } & {v_{4} } & {v_{5} } & {v_{6} } & {v_{7} } & {v_{8} } & {v_{9} } \\ \end{array} } \right] \), where the singular values in D are sorted in descending order, we have \( f = v_{9} \). Finally, F is obtained by rearranging f into a \( 3 \times 3 \) matrix.
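The procedure can be sketched in Python with NumPy as follows. This is a minimal sketch, not the authors' implementation: the point normalization (zero mean, mean distance \( \sqrt 2 \)) follows Hartley's usual formulation, and the rank-2 enforcement is a standard extra step that is not spelled out in the text above. The function names `normalize_points` and `eight_point` are hypothetical.

```python
import numpy as np

def normalize_points(pts):
    """Translate points to zero mean and scale them so that the mean
    distance from the origin is sqrt(2). pts: (N, 2). Returns the
    normalized homogeneous points (N, 3) and the 3 x 3 transform."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return homog @ T.T, T

def eight_point(x, xp):
    """Estimate F such that xp^T F x = 0 from N >= 8 matches.
    x, xp: (N, 2) arrays of corresponding image points."""
    xn, T = normalize_points(x)
    xpn, Tp = normalize_points(xp)
    u, v = xn[:, 0], xn[:, 1]
    up, vp = xpn[:, 0], xpn[:, 1]
    # one row per match, ordered to match f = (f11, f12, ..., f33)
    A = np.column_stack([up * u, up * v, up, vp * u, vp * v, vp,
                         u, v, np.ones(len(u))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)      # right singular vector v9
    U, S, Vt2 = np.linalg.svd(F)  # enforce rank 2 (assumed extra step)
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt2
    F = Tp.T @ F @ T              # undo the normalization
    return F / np.linalg.norm(F)
```

With noise-free correspondences the residuals \( (x^{\prime})^{T} F x \) are zero up to rounding error; with real matches, using more than 8 pairs turns the same SVD into a least-squares estimate.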

3.3 Subsequent Steps

The essential matrix is the normalized fundamental matrix, as \( F = K_{\text{d}}^{{ - {\text{T}}}} EK_{\text{u}}^{ - 1} \) shows, where \( K_{\text{u}} \) and \( K_{\text{d}} \) are the calibration matrices of the query and database cameras, respectively. The essential matrix contains the relative pose between the query and database cameras:
$$ {\text{E}} = \left[ {{\text{t}}_{\text{r}} } \right]_{ \times } {\text{R}}_{\text{r}} $$
where \( {\text{t}}_{\text{r}} \) and \( {\text{R}}_{\text{r}} \) are the relative translation vector and the rotation matrix, respectively, and \( \left[ \, \right]_{ \times } \) denotes the cross-product matrix. We can find \( {\text{t}}_{\text{r}} \) and \( {\text{R}}_{\text{r}} \) with the method proposed in [1], which satisfies the chirality constraint.
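For illustration only, the sketch below enumerates the four candidate poses that the standard SVD-based decomposition of an essential matrix yields. It is not the method of [1]: selecting the single physically valid candidate requires the chirality test (triangulated points must lie in front of both cameras), which is not shown here. The names `skew` and `pose_candidates` are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def pose_candidates(E):
    """Return the four (R, t) candidates with E ~ [t]_x R (up to scale).
    The translation is recovered as a unit vector of unknown sign."""
    U, _, Vt = np.linalg.svd(E)
    # flip signs so that the candidates are proper rotations (det = +1)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

Only the direction of the translation can be recovered from E, which is why the localization step that follows works with lines rather than with absolute distances.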
In fact, \( {\text{t}}_{\text{r}} \) is the translation vector between the database camera center and the query camera center, expressed in the query camera coordinate system. In other words, if we denote the inhomogeneous 3D coordinates of a point in the database and query camera coordinate systems by X and X′ respectively, we have
$$ {\text{X}}^{{\prime }} = {\text{R}}_{\text{r}} \left( {{\text{X}} + {\text{R}}_{\text{r}}^{ - 1} {\text{t}}_{\text{r}} } \right) $$
The vector \( {\text{R}}_{\text{r}}^{ - 1} {\text{t}}_{\text{r}} \) is the translation vector in the database camera coordinate system, which we call t. If we denote the inhomogeneous 3D coordinates in the global coordinate system by \( {\text{X}}_{\text{g}} \), we have:
$$ {\text{X}} = {\text{RX}}_{\text{g}} + {\text{t}} $$
where R is the absolute rotation matrix of the database camera in the global reference coordinate system, and t is the translation vector in the database camera coordinate system. Hence, in order to find the translation vector with respect to the global coordinate system, we write
$$ {\text{X}}_{\text{g}} = {\text{R}}^{ - 1} {\text{X}} - {\text{R}}^{ - 1} {\text{t}} $$
So \( {\text{t}}_{\text{total}} = - {\text{R}}^{ - 1} {\text{R}}_{\text{r}}^{ - 1} {\text{t}}_{\text{r}} \) represents the estimated direction of the line connecting the database location to the query location [14]. Each database image gives one such line, and several lines are needed to find the point whose total distance to all the lines is minimum. Figure 4 shows an example of this process, in which the circle marks the localization result.
Fig. 4. The result of localization
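The final step can be sketched as below. Two points are assumptions made for this example: the rotation inverses are computed as transposes, and the query position is taken as the point minimizing the sum of squared distances to the lines, a common closed-form substitute for the plain sum of distances mentioned above. The names `global_direction` and `closest_point_to_lines` are hypothetical.

```python
import numpy as np

def global_direction(R, R_r, t_r):
    """t_total = -R^{-1} R_r^{-1} t_r; for rotation matrices the
    inverse equals the transpose."""
    return -R.T @ R_r.T @ t_r

def closest_point_to_lines(points, directions):
    """Point minimizing the sum of squared distances to lines, each
    given by a point on the line (a database location) and a direction."""
    points = np.asarray(points, dtype=np.float64)
    directions = np.asarray(directions, dtype=np.float64)
    dim = points.shape[1]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for p, d in zip(points, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(dim) - np.outer(d, d)  # projects onto the line's normal space
        A += P
        b += P @ p
    # singular if all lines are parallel, so at least two distinct
    # directions are needed
    return np.linalg.solve(A, b)

location = closest_point_to_lines([[0.0, 0.0], [2.0, 0.0]],
                                  [[1.0, 1.0], [-1.0, 1.0]])
```

With exactly two non-parallel lines in the plane the result is their intersection; with more lines, an erroneous line shifts the estimate less, which matches the motivation for using several database images.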

4 Experimental Results

In our experiments, the database images were taken inside a room, at the locations depicted in Fig. 4, using a cellphone camera. There are 4 lines in total, meaning that we calculate the epipolar geometry between the query image and four images in the database. At least two lines are needed to complete the localization, but with only two lines a single wrong line badly affects the result. More lines make the result more stable, at the cost of more computation.

As Fig. 5 shows, the larger the number of lines, the smaller the average error. However, each additional line reduces the average error less than the previous one, while the computation grows linearly. Choosing 4 or 5 lines gives a reasonable tradeoff between computation and average error.
Fig. 5. The influence of the number of lines on the average error

Moreover, the effect of screening the interest points in the SURF algorithm cannot be ignored. Table 1 compares random matches with screened matches. Using screened matches clearly reduces the average error in all scenarios.
Table 1. Average error in different conditions (cm)

Random matches    Screened matches

5 Conclusion

In this paper we proposed an epipolar geometry-based method for fine location estimation in vision-based localization, applicable to pose-annotated databases. We use the SURF algorithm to select and match the interest points from which the fundamental matrix is calculated, and then apply the 8-point algorithm and further steps based on epipolar geometry to obtain the position. After obtaining the positioning result, we analyzed the factors that affect the average error and made a tradeoff in choosing the number of lines. We also conclude that screening the matches in the SURF algorithm is of vital importance. Finally, the average error is kept within an acceptable range.


References

  1. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 756–770 (2004)
  2. Yang, J., Chen, L., Liang, W.: Monocular vision based robot self-localization. In: 2010 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1189–1193. IEEE (2010)
  3. Muramatsu, S., Chugo, D., Jia, S., Takase, K.: Localization for indoor service robot by using local-features of image. In: ICCAS-SICE, pp. 3251–3254. IEEE (2009)
  4. Bay, H., Ess, A., Tuytelaars, T., et al.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
  5. Nicole, R.: Title of paper with only first word capitalized. J. Name Stand. Abbrev. (in press)
  6. Yorozu, Y., Hirano, M., Oka, K., Tagawa, Y.: Electron spectroscopy studies on magneto-optical media and plastic substrate interface. IEEE Transl. J. Magn. Japan 2, 740–741 (1987). Digests 9th Annual Conference of Magnetics Japan, p. 301, 1982
  7. Young, M.: The Technical Writer’s Handbook. University Science, Mill Valley (1989)
  8. Horaud, R., Conio, B., Leboulleux, O., Lacolle, B.: An analytic solution for the perspective 4-point problem. Comput. Vision Graph. Image Proces. 47(1), 33–44 (1989)
  9. Wang, J., Zha, H., Cipolla, R.: Coarse-to-fine vision-based localization by indexing scale-invariant features. IEEE Trans. Syst. Man Cybern. Part B Cybern. 36(2), 413–422 (2006)
  10. Liqin, H., Caigan, C., Henghua, S., et al.: Adaptive registration algorithm of color images based on SURF. Measurement 66, 118–124 (2015)
  11. Harris, J.M., Nefs, H.T., Grafton, C.E.: Binocular vision and motion-in-depth. Spat. Vis. 21(6), 896–899 (2014)
  12. Tourap, A.M.: Enhanced predictive zonal search for single and multiple frame motion estimation. In: Visual Communications and Image Processing (2012)
  13. Olson, C.F., Abi-Rached, H., Ye, M., Hendrich, J.P.: Wide-baseline stereo vision for Mars rovers. In: Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 2, pp. 1302–1307, October 2003

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Communications Research Center, Harbin Institute of Technology, Harbin, China
