1 Introduction

Multi-view 3D (three-dimensional) reconstruction first takes a series of pictures from different perspectives around an object or scene, estimates the camera poses from correct feature matching results between image pairs, and then reconstructs the 3D model after 3D point cloud reconstruction and texture mapping [1]. As one of the keys to the whole process, image feature matching has a significant impact on the integrity and accuracy of the reconstructed model. It extracts feature points from image pairs, calculates the similarity between feature points, and establishes reliable correspondences between images. The technique is widely used in applications such as image registration [2], 3D reconstruction [3], target recognition [4], and visual simultaneous localization and mapping (VSLAM) [5].

To obtain complete and accurate scene estimation results, the common approach is to estimate camera pose and scene structure via triangulation, which requires correct feature matching results as its basis. Reliable correspondences between two images can be obtained by removing outlier matches from the putative match sets created by some initial algorithm. During the past few decades, many approaches have been proposed to solve this outlier elimination problem.

However, images taken from different scenes, or from different surfaces of the same object, may look similar while the global structures of their detected and matched feature points differ; for example, the basic pattern elements in the images are the same, but the elements are laid out differently or symmetrically. Such image pairs should be considered as not belonging to the same scene or the same object, yet mismatches with locally similar features are easily mistaken as correct, so the accuracy of the matching result is significantly reduced. When the feature points are matched correctly, we can recover the camera motion between two images from the correspondences of these 2D image points. This is done by solving for the essential or fundamental matrix with epipolar geometry, based on the pixel positions of the paired points, from which we can recover the relative rotation matrix R and the relative translation T between the cameras. If the paired points are matched incorrectly, the rotation and translation parameters of the inter-camera motion will be computed incorrectly, and the pose estimation of the cameras will be wrong. Incorrect estimation of the extrinsic parameters will in turn introduce errors into the 3D coordinates of the spatial points computed by triangulation, and the task of recovering the scene structure will fail.
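As a concrete illustration of this pipeline, the sketch below recovers the relative pose from matched pixel positions with OpenCV (the library used in our experiments). The point lists and the intrinsic matrix K are placeholder assumptions for illustration, not values from the paper.

```cpp
// Minimal sketch: relative camera motion from 2D-2D correspondences via
// epipolar geometry. The matched points and intrinsics K are assumed
// placeholders; a real run would take them from the filtered match set.
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

int main() {
    std::vector<cv::Point2f> pts1 = {{100,120},{230,80},{310,200},{150,300},{400,250}};
    std::vector<cv::Point2f> pts2 = {{105,118},{236,84},{318,196},{152,305},{410,248}};
    cv::Mat K = (cv::Mat_<double>(3,3) << 700,0,320, 0,700,240, 0,0,1);

    // Essential matrix from epipolar geometry; RANSAC guards residual outliers.
    cv::Mat mask;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0, mask);

    // Decompose E into the relative rotation R and translation direction t.
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t, mask);
    return 0;
}
```

With wrong correspondences in the input lists, the recovered R and t drift from the true motion, which is exactly the failure mode described above.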

To address the above challenges, this paper proposes an efficient similar-pattern-oriented feature matching filtering algorithm (SMF). The SMF algorithm runs after an initial feature matching algorithm, e.g., GMS (Grid-based Motion Statistics) [6] or LPM (Locality Preserving Matching) [7], and further judges differences or symmetries of pattern structures between image pairs. An observable fact is that if we take images around the same scene, not only do the feature point neighborhoods have local topological consistency, but the global feature point structures should also have global topological consistency. Based on this observation, we design a model that constrains the unknown inlier correspondences to have similar global topology. The experiments on mismatch removal show that our SMF can effectively obtain more accurate inlier matches by identifying the mismatches caused by considering only local matching consistency, and that it can be used to obtain more accurate camera poses. The experiments also show that the 3D models reconstructed after using SMF to remove mismatches are more complete and more accurate. The main contributions of this paper are as follows:

  • It adopts grid matching based on feature distribution, which reduces the time complexity of subsequent feature point vector matching and can be used in real-time tasks.

  • It proposes a matching confidence calculation method based on global topology consistency, which can well determine whether there are mismatches between the image pairs caused by the different layouts of the pattern elements, and can increase the accuracy of feature matching results.

  • The experiments show that the SMF algorithm proposed in this paper can effectively remove wrong matches between similar but different images and helps to obtain more accurate and more complete camera pose and 3D reconstruction results.

Section 2 describes the related work. The proposed algorithm SMF is presented in Sect. 3, including the consensus on global topology, the grid strategy, and the confidence calculator. The performance of our method in comparison with other approaches is illustrated in Sect. 4, and the concluding remarks are in Sect. 5.

2 Related work

The existing image matching methods can be classified into three categories: area-based, feature-based, and learning-based [8]. Area-based methods [9,10,11] usually do not detect features and typically refer to dense matching. Learning-based methods [12,13,14] can achieve better performance in some cases, but they still need subsequent mismatch removal steps because of the high percentages of the outliers in the putative sets. Feature-based methods first extract features and local descriptors of images and use direct or indirect ways to find out the correspondences.

Direct feature matching methods use spatial geometrical relations to get the correspondences [15, 16]. Indirect feature matching methods establish putative matches and remove false matches to establish reliable correspondences. The putative match set obtained by judging the similarity of descriptors still includes a large percentage of mismatches. The removal methods use extra local and/or geometrical constraints. They can be roughly divided into three categories: resampling-based methods, non-parametric model-based methods, and relaxed methods.

The resampling methods start with the classic RANSAC proposed by Fischler et al. [17]. It iteratively samples a minimal subset, estimates a parametric model such as the fundamental matrix, and obtains the consistent inliers by verifying the size of the consensus set. Many approaches have been developed to improve RANSAC [18,19,20,21]. The shortcomings of these resampling methods are that the runtime increases exponentially when the outlier percentage in the putative set is high, and that the estimated parametric model is less effective under more complex, non-rigid transformations.

The non-parametric model-based methods were developed as a departure from the resampling methods, and include ICF [22], BD [23], VFC [24], and MR-RPM [25, 26]. They distinguish mismatches by applying prior conditions such as motion coherence, under which corresponding features undergo slow-and-smooth motion. These methods can be applied to non-rigid transformations but are not suitable for real-time tasks because of their cubic time complexity.

The relaxed methods use coherence constraints and local neighborhood consistency. Bian et al. proposed GMS [6], which is based on the assumption that the number of correct matching points near a correctly matched feature should be greater than the number near a falsely matched one. It transforms motion smoothness constraints into statistical measures to reject false matches, proposes a grid-based score estimator, and performs well in tasks requiring real-time feature matching. LPM [7] and ANTC [27] observed that, due to physical constraints, local topological structures are usually preserved even though the absolute distances between matches change considerably under complex deformations. GLOF [28] detects outliers in a small neighborhood by using local density reachability. However, consider image pairs taken from different scenes whose small pattern elements are the same but arranged differently, or even nearly symmetrically, so that the feature structure has changed. The relaxed methods only perform match detection around local extreme points and cannot recognize the differences in the global topological structures of the patterns, so mismatches between differently laid-out image elements will be accepted as true when they should be recognized as false.

3 Method

The proposed SMF algorithm can be used to establish accurate correspondences between similar but different images. We first construct a putative match set that has local consistency by using an existing matching algorithm and then use a global constraint to remove the false matches.

3.1 Problem formulation

Image pairs \(\{ I_{a}, I_{b} \}\) have \(\{ A, B \}\) features, respectively. Suppose we have obtained a putative match set \(S = \{ (x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{i}, y_{i}), \ldots, (x_{T}, y_{T}) \}\) between \(I_{a}\) and \(I_{b}\), where \(x_{i}\) and \(y_{i}\) are the spatial positions of corresponding feature points extracted by a well-known feature detector (SIFT [29], SURF [30], or ORB [31]), and the matches in \(S\) satisfy the local constraints. \(S\) has cardinality \(|S| = T\). The goal of this work is to remove the mismatches in \(S\) caused by similar images with different pattern layouts, so as to produce more accurate matching results.

3.1.1 Formulation for locality preserving matching

LPM [7] proposed that an accurate match set can be obtained by filtering out false matches that do not meet the requirements of the spatial neighborhood structure, with the final solution found by minimizing the following cost function:

$$ C(p;S,\lambda ,\tau ) = \sum\limits_{i = 1}^{N} \frac{p_{i}}{K}\left( \sum\limits_{j|x_{j} \in N_{x_{i}}} {\text{d}}\left( y_{i} ,y_{j} \right) + \sum\limits_{j|x_{j} \in N_{x_{i}} ,y_{j} \in N_{y_{i}}} {\text{d}}\left( v_{i} ,v_{j} \right) \right) + \lambda \left( N - \sum\limits_{i = 1}^{N} p_{i} \right) $$
(1)

where \(p_{i} \in \{ 0,1\}\) represents the correctness of the match \((x_{i}, y_{i})\): \(p_{i} = 1\) indicates that \(S_{i} = (x_{i}, y_{i})\) is an inlier match, and an outlier otherwise. \(N_{x_{i}}\) is the neighborhood set of point \(x_{i}\) under the Euclidean distance, and \(K\) is the size of \(N_{x_{i}}\). \(\lambda > 0\) balances the first and the second terms. \({\text{d}}(y_{i}, y_{j})\) denotes the Euclidean distance between \(y_{i}\) and \(y_{j}\). \({\text{d}}(v_{i}, v_{j})\) represents the consistency of the local topology according to \(s(v_{i}, v_{j})\), which is calculated from the difference between \(v_{i}\) and \(v_{j}\) by the following formula:

$$ s(v_{i} ,v_{j} ) = \frac{{\min \{ |v_{i} |,|v_{j} |\} }}{{\max \{ |v_{i} |,|v_{j} |\} }} \cdot \frac{{(v_{i} ,v_{j} )}}{{|v_{i} | \cdot |v_{j} |}} $$
(2)

where \(v_{i}\) is the vector from \(x_{i}\) to \(y_{i}\), \(v_{j}\) is the vector from \(x_{j}\) to \(y_{j}\), and \((\cdot,\cdot)\) denotes the inner product. If \(s(v_{i}, v_{j})\) is larger than a predefined threshold \(\tau\), the neighborhood topology is consistent, and the distance \({\text{d}}(v_{i}, v_{j})\) can be written as follows:

$$ {\text{d}}(v_{i} ,v_{j} ) = \left\{ {\begin{array}{*{20}l} {0,} &\quad {s(v_{i} ,v_{j} ) \ge \tau } \\ {1,} &\quad {s(v_{i} ,v_{j} ) < \tau } \\ \end{array} } \right. $$
(3)
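A small sketch may help make this local consistency test concrete. The Vec2 struct and function names below are our illustrative assumptions, not part of LPM's released code; the logic is a direct transcription of Eqs. (2) and (3).

```cpp
// Sketch of Eqs. (2)-(3): length-ratio-weighted cosine similarity between
// the displacement vectors v_i, v_j, thresholded into the binary distance d.
#include <algorithm>
#include <cmath>

struct Vec2 { double x, y; };

double s(const Vec2& vi, const Vec2& vj) {
    double ni = std::hypot(vi.x, vi.y);
    double nj = std::hypot(vj.x, vj.y);
    double dot = vi.x * vj.x + vi.y * vj.y;
    // Length-ratio term times the cosine of the angle between the vectors.
    return (std::min(ni, nj) / std::max(ni, nj)) * (dot / (ni * nj));  // Eq. (2)
}

int d(const Vec2& vi, const Vec2& vj, double tau) {
    return s(vi, vj) >= tau ? 0 : 1;                                   // Eq. (3)
}
```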

3.1.2 Formulation for global topology consistency

Equation (1) considers both the distance consistency and the topology consistency of the local neighborhood. It works well when similar image pairs have the same pattern structures. However, if the layout of the pattern elements differs between the images, the same elements with different relative positions in \(I_{a}\) and \(I_{b}\) will still be matched, since they have local matching consistency, resulting in erroneous matching. We show an example in Fig. 1. Regions \(x_{i}\) and \(x_{j}\) in Fig. 1a and b are matched to regions \(y_{i}\) and \(y_{j}\), respectively, because they satisfy local distance and topology consistency. But intuitively, they should be considered correct in Fig. 1a because they represent the same scene, and wrong in Fig. 1b because they are on different surfaces of the object. From Fig. 1c, we observe that the cosine of the angle between the vectors constructed from one feature area to another in \(I_{a}\) and \(I_{b}\) is close to \(1\) because of the similar global topological structure. The cosine value in Fig. 1d is less than \(0\) because of the different global topological structures. More specifically, the global topological structures have a significant influence on the similarity of \(v_{i,j}^{x}\) and \(v_{i,j}^{y}\).

Fig. 1

Global topological structure. The corresponding regions of \(x_{i}\) and \(x_{j}\) in the other image are \(y_{i}\) and \(y_{j}\). a The pattern elements of the image pair have the same relative positions. b The pattern elements of the image pair have different relative positions. c Global topological structure corresponding to (a). d Global topological structure corresponding to (b). Region \(x_{i}\) and region \(x_{j}\) are local neighborhoods that conform to local distance consistency and local topology consistency, which can match correctly from a local perspective. If the pattern elements of the image pair have the same relative positions, the vector \(v_{i,j}^{x}\) consisting of \(x_{i}\) and \(x_{j}\) is close to the vector \(v_{i,j}^{y}\) consisting of \(y_{i}\) and \(y_{j}\). However, if the pattern elements of the image pair have different relative positions, \(v_{i,j}^{x}\) differs greatly from \(v_{i,j}^{y}\)

Motivated by this idea, we define the consensus of global topology between \(v_{i,j}^{x}\) and \(v_{i,j}^{y}\) as follows:

$$ {\text{d}}(v_{i,j}^{x} ,v_{i,j}^{y} ) = \frac{{(v_{i,j}^{x} ,v_{i,j}^{y} )}}{{|v_{i,j}^{x} | \cdot |v_{i,j}^{y} |}} $$
(4)

Obviously, the consensus of global topology \({\text{d}}(v_{i,j}^{x}, v_{i,j}^{y}) \in [-1, 1]\); the larger the value, the stronger the consistency. On the basis of Eq. 1, which constructs a locally consistent match set \(I\), we further propose a cost function for obtaining a globally topologically consistent match set \(U^{*}\):

$$ U^{*} = \arg \mathop {\min }\limits_{U} C(U;I), $$
(5)

with the cost function \(C\) defined as:

$$ C(U;I) = \sum\limits_{i \in U} \sum\limits_{j \in U, j \notin N_{x_{i}}} \left( 1 - {\text{d}}(v_{i,j}^{x} ,v_{i,j}^{y} ) \right) $$
(6)
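The sketch below evaluates the cosine consensus of Eq. (4) and accumulates the cost of Eq. (6) over a candidate set U. The container layout (vx[i][j], vy[i][j] holding \(v_{i,j}^{x}\), \(v_{i,j}^{y}\)) is a hypothetical choice of ours, and the \(j \notin N_{x_{i}}\) exclusion is omitted for brevity.

```cpp
// Sketch of Eqs. (4) and (6) under assumed container layouts.
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

double cosineConsensus(const Vec2& vx, const Vec2& vy) {        // Eq. (4)
    double dot = vx.x * vy.x + vx.y * vy.y;
    return dot / (std::hypot(vx.x, vx.y) * std::hypot(vy.x, vy.y));
}

double globalCost(const std::vector<int>& U,
                  const std::vector<std::vector<Vec2>>& vx,
                  const std::vector<std::vector<Vec2>>& vy) {   // Eq. (6)
    double c = 0.0;
    for (int i : U)
        for (int j : U)
            if (i != j)  // the j not-in-N_{x_i} test is omitted for brevity
                c += 1.0 - cosineConsensus(vx[i][j], vy[i][j]);
    return c;
}
```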

3.2 Gridding the problem

Since the neighborhood of each point is guaranteed local consistency by Eq. 1, if point \(x_{i}\) does not satisfy global topological consistency with point \(x_{j}\) because of their different positions in the other image, the other points in the neighborhoods of \(x_{i}\) and \(x_{j}\) will also fail to satisfy the global constraints. Therefore, we can address the problem with a grid approximation. This section translates the preceding analysis into an efficient grid matching algorithm.

3.2.1 Construct grid matching characteristics

We construct a grid matcher to convert the distribution characteristics of feature points into grid matching characteristics.

The image pairs \(I_{a}\) and \(I_{b}\) are each divided into \(N = 400\) non-overlapping grids, and we denote the divided results as \(G_{x}\) and \(G_{y}\), respectively. The small grids in \(G_{x}\) are marked as \(G_{x_{1}}, G_{x_{2}}, \ldots, G_{x_{i}}, \ldots, G_{x_{N}}\), and the small grids in \(G_{y}\) are marked as \(G_{y_{1}}, G_{y_{2}}, \ldots, G_{y_{j}}, \ldots, G_{y_{N}}\). If the coordinates of two points fall in the same grid area, their grid number is the same. Define the matching score of the grid pair \((G_{x_{i}}, G_{y_{j}})\) as follows:

$$ S(G_{x_{i}} ,G_{y_{j}} ) = \left| \{ S_{k} \mid S_{k} (G_{x_{i}} ,G_{y_{j}} ) \in S \} \right| $$
(7)

where \(S_{k} (G_{x_{i}}, G_{y_{j}})\) indicates that the \(k\)-th pair of feature points falls in the grids numbered \(G_{x_{i}}\) and \(G_{y_{j}}\), respectively, \(\{ S_{k} \}\) represents the set of matching pairs that satisfy this property, and \(S(G_{x_{i}}, G_{y_{j}})\) represents the total number of feature matching pairs that fall in \(G_{x_{i}}\) and \(G_{y_{j}}\).

The matching score of each grid pair is calculated by constructing a cumulative matrix of size \(N \times N\), with each matrix element initialized to zero. If a pair of feature points falls in the grid numbered \(G_{x_{i}}\) in \(G_{x}\) and the grid numbered \(G_{y_{j}}\) in \(G_{y}\), respectively, the element in row \(G_{x_{i}}\) and column \(G_{y_{j}}\) of the matrix is incremented by one. As shown in Fig. 2a, \((x_{i}, y_{i})\) is the i-th pair of feature matches; if the coordinate of \(x_{i}\) belongs to the grid numbered 3 in \(G_{x}\) and the coordinate of \(y_{i}\) belongs to the grid numbered 4 in \(G_{y}\), the (3,4) element of the cumulative matrix is incremented by one. Traversing all matching pairs yields the matching statistics of each grid in \(G_{x}\) against each grid in \(G_{y}\), as shown in Fig. 2b.

Fig. 2

Grid matching example. a Single grid pair matching. b Grid matching matrix. The coordinates of the matched feature points are in the grid numbered 3 in \(G_{x}\) and the grid numbered 4 in \(G_{y}\), respectively, as in (a); we therefore add one to the (3,4) element of the matrix, as in (b)

When multiple grids in \(G_{y}\) have a matching relationship with \(G_{x_{i}}\), non-maximum suppression is used: only the case with the highest matching score is kept, and the grid \(G_{y_{k}}\) with the highest score is found as follows:

$$ S(G_{x_{i}} ,G_{y_{k}} ) = \max (S(G_{x_{i}} ,G_{y_{1}} ),S(G_{x_{i}} ,G_{y_{2}} ), \ldots ,S(G_{x_{i}} ,G_{y_{l}} )) $$
(8)

where \(l\) represents the number of grids in \(G_{y}\) that have a matching relationship with \(G_{{x_{i} }}\).
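A sketch of this grid matcher, under assumed types, could look as follows: map each feature to one of the \(N = 400\) cells (a 20 × 20 layout is our assumption), accumulate Eq. (7) in an \(N \times N\) matrix, and apply the row-wise non-maximum suppression of Eq. (8).

```cpp
// Sketch of Sect. 3.2.1: per-cell match counting (Eq. 7) and row-wise
// non-maximum suppression (Eq. 8). Types and names are illustrative.
#include <algorithm>
#include <utility>
#include <vector>

struct Pt { double x, y; };
constexpr int kSide = 20, kN = kSide * kSide;       // N = 400 cells

int gridId(const Pt& p, double w, double h) {       // image size w x h
    int cx = std::min(kSide - 1, static_cast<int>(p.x / (w / kSide)));
    int cy = std::min(kSide - 1, static_cast<int>(p.y / (h / kSide)));
    return cy * kSide + cx;
}

std::vector<int> bestMatchPerCell(const std::vector<std::pair<Pt,Pt>>& S,
                                  double w, double h) {
    std::vector<std::vector<int>> acc(kN, std::vector<int>(kN, 0));
    for (const auto& m : S)                          // cumulative matrix, Eq. (7)
        ++acc[gridId(m.first, w, h)][gridId(m.second, w, h)];

    std::vector<int> best(kN, -1);                   // Eq. (8): keep row maximum
    for (int i = 0; i < kN; ++i) {
        auto it = std::max_element(acc[i].begin(), acc[i].end());
        if (*it > 0) best[i] = static_cast<int>(it - acc[i].begin());
    }
    return best;                                     // best[i] = matched cell in G_y
}
```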

3.2.2 Build grid vector matching sets

We design a grid vector strategy to transform grid matching into a motion smoothness constraint problem on grid vectors. We construct a grid vector set \(V_{a}\) from any two different grids \((G_{x_{i}}, G_{x_{j}})\) in \(G_{x}\) that contain matched features. If there are \(M\) grids in \(G_{x}\) that contain non-zero matches, the cardinality of \(V_{a}\) is \(\frac{M(M - 1)}{2}\). Similarly, we construct the feature grid vector set \(V_{b}\) for \(G_{y}\). From the above analysis, it can be seen that the cardinality of \(V_{b}\) is the same as that of \(V_{a}\).

Suppose that after non-maximum suppression, the corresponding grid of \(G_{x_{i}}\) is \(G_{y_{i}}\) and the corresponding grid of \(G_{x_{j}}\) is \(G_{y_{j}}\); then, the grid vector \(v_{i,j}^{x}\) in \(G_{x}\) and the grid vector \(v_{i,j}^{y}\) in \(G_{y}\) form a grid vector pair. As shown in Fig. 3, the grid vector \(V_{10,8}^{x}\) is formed by \(G_{x_{10}}\) and \(G_{x_{8}}\), the grid vector \(V_{11,8}^{y}\) is formed by \(G_{y_{11}}\) and \(G_{y_{8}}\), and they construct a grid vector pair (\(V_{10,8}^{x}\), \(V_{11,8}^{y}\)). Thus, we can obtain \(\frac{M(M - 1)}{2}\) grid vector pairs between \(V_{a}\) and \(V_{b}\).

Fig. 3

Grid vector pair. After non-maximum suppression, \(G_{x_{10}}\) is matched to \(G_{y_{11}}\), and \(G_{x_{8}}\) is matched to \(G_{y_{8}}\). The vector \(V_{10,8}^{x}\) is from \(G_{x_{10}}\) to \(G_{x_{8}}\). The vector \(V_{11,8}^{y}\) is from \(G_{y_{11}}\) to \(G_{y_{8}}\). \(V_{10,8}^{x}\) and \(V_{11,8}^{y}\) construct a grid vector pair (\(V_{10,8}^{x}\), \(V_{11,8}^{y}\))

If multiple grids in \(G_{x}\) are matched to the same grid \(G_{y_{i}}\), we divide \(G_{y_{i}}\) into \(3 \times 3\) smaller grids and use a similar procedure to construct the vector \(v_{i,j}^{y}\).

3.2.3 Global topology consistency of grids

From the analysis in Sect. 3.1, we can see that if the matching result between two images is correct, the corresponding grid vectors should have global topological consistency and should be close to each other. Note that the problem we aim to solve is the mismatch caused by different pattern layouts. Since the neighborhood of each pattern element between image pairs satisfies local consistency, we need to find the sets that do not satisfy global topological consistency and remove them from the putative matches. In Sect. 3.1 we discussed the global topology between points; with the grid approximation, Eq. 6 becomes:

$$ C({\rm P};I) = \sum\limits_{i|G_{i} \in {\rm P}} \sum\limits_{j|G_{j} \in {\rm P},G_{j} \ne G_{i}} \left( 1 - {\text{d}}\left( V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} \right) \right) $$
(9)

where P is the correct grid set that has global topological consistency; all the matches contained in P will be preserved. \(V_{G_{i}, G_{j}}^{x}\) is the vector from grid \(G_{x_{i}}\) to grid \(G_{x_{j}}\), and \(V_{G_{i}, G_{j}}^{y}\) is the vector from the corresponding grid of \(G_{x_{i}}\) to the corresponding grid of \(G_{x_{j}}\) in the other image. The set \(I\) contains the putative matches obtained by a given image matching method.

3.2.4 Compute grid vector consistency

Complex transformations will cause differences in the absolute distance between the corresponding grid vectors above. Nevertheless, the relative position between the two endpoints of each vector will be preserved: for example, if \(G_{x_{i}}\) is to the left of \(G_{x_{j}}\), then \(G_{y_{i}}\) will also be to the left of \(G_{y_{j}}\). Thus, we convert the point vector distance in Eq. 4 to a grid vector distance and quantize it into two levels:

$$ \hat{\text{d}}\left( V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} \right) = \left\{ {\begin{array}{*{20}l} {1,} &\quad {{\text{d}}\left( V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} \right) > 0} \\ {0,} &\quad {{\text{d}}\left( V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} \right) < 0} \\ \end{array} } \right. $$
(10)

We note that even if \(\hat{d} > 0\), there may be false matches. Figure 4 shows an example. From Fig. 4a and b, we can see that \(V_{G_{i}, G_{j}}^{x}\) and \(V_{G_{i}, G_{j}}^{y}\) do not have global consistency. But since the cosine of the angle between \(V_{G_{i}, G_{j}}^{x}\) and \(V_{G_{i}, G_{j}}^{y}\) is also greater than zero, it is easy to mistake them for a correct match if we judge directly according to Eq. 10. By observation, we found that the displacement between \(G_{x_{i}}\) and \(G_{y_{i}}\) is small, as is that between \(G_{x_{f}}\) and \(G_{y_{f}}\), but the displacement between \(G_{x_{j}}\) and \(G_{y_{j}}\) is large. That is to say, both the directions and the displacements of a mismatched grid vector pair differ from each other due to the different global topology.

Fig. 4

Wrong match with \(\hat{d} > 0\). a The grid \(G_{x_{j}}\) is at the top right of the grid \(G_{x_{i}}\). b The grid \(G_{y_{j}}\) is at the bottom right of the grid \(G_{y_{i}}\). c The angle \(\theta\) and the displacement formed by \(V_{G_{i}, G_{j}}^{x}\) and \(V_{G_{i}, G_{j}}^{y}\). The grids in (a) are partially matched with the grids in (b). It can be seen that although \(\hat{d}\) is greater than zero, \(G_{x_{j}}\) and \(G_{y_{j}}\) should not be considered a correctly matched grid pair

Based on the above displacement considerations, we define a distance metric that considers both displacements and angles:

$$ s(G_{i} ,G_{j} ) = \left\{ {\begin{array}{*{20}l} {0,} &\quad {\kappa \bar{p} - p_{i,j} < 0} \\ {1,} &\quad {\kappa \bar{p} - p_{i,j} > 0} \\ \end{array} } \right. $$
(11)

\(p_{i,j}\) is computed by Eq. 12 and measures the difference in displacement between the grid pair \((G_{x_{i}}, G_{y_{i}})\) and the grid pair \((G_{x_{j}}, G_{y_{j}})\); \(\bar{p}\) is computed by Eq. 13 and represents the average displacement of all grid pairs:

$$ p_{i,j} = {\text{abs}}(||G_{{x_{i} }} - G_{{y_{i} }} || - ||G_{{x_{j} }} - G_{{y_{j} }} ||) $$
(12)
$$ \bar{p} = \sum\limits_{i = 1}^{M} P_{i} ||G_{x_{i}} - G_{y_{i}} || $$
(13)

\(|| \cdot ||\) denotes the Euclidean distance, and \(P_{i}\) is the contribution of each grid. We assume that all contributions are equal, so \(P_{i} = 1/M\). The value κ = 0.3 is determined experimentally.

Combining Eqs. 10 and 11, we obtain a new consensus of global topology considering both displacement and angle between \(V_{{G_{i} ,G_{j} }}^{x}\) and \(V_{{G_{i} ,G_{j} }}^{y}\):

$$ \hat{\text{d}}(V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} ) = \left\{ {\begin{array}{*{20}l} {s(G_{i} ,G_{j} ),} &\quad {{\text{d}}(V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} ) > 0} \\ {0,} &\quad {{\text{d}}(V_{G_{i} ,G_{j}}^{x} ,V_{G_{i} ,G_{j}}^{y} ) < 0} \\ \end{array} } \right. $$
(14)

where the value of \(s(G_{i} ,G_{j} )\) is in \(\{ 0,1\}\).
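Putting Eqs. (10)-(14) together, a sketch of the combined direction-and-displacement test might look like this. We assume cx and cy hold the centers of the matched grid pairs \((G_{x_{i}}, G_{y_{i}})\) in a shared pixel coordinate frame; the struct and function names are ours.

```cpp
// Sketch of Eqs. (10)-(14): combined angle and displacement consistency
// for the grid vector pair formed by grids i and j. kappa = 0.3 as in the text.
#include <cmath>
#include <vector>

struct Pt { double x, y; };

static double dist(const Pt& a, const Pt& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

int gridVectorConsistency(const std::vector<Pt>& cx, const std::vector<Pt>& cy,
                          int i, int j, double kappa = 0.3) {
    // Direction test (Eq. 10): cosine between V^x_{Gi,Gj} and V^y_{Gi,Gj}.
    Pt vx{cx[j].x - cx[i].x, cx[j].y - cx[i].y};
    Pt vy{cy[j].x - cy[i].x, cy[j].y - cy[i].y};
    double cosv = (vx.x * vy.x + vx.y * vy.y) /
                  (std::hypot(vx.x, vx.y) * std::hypot(vy.x, vy.y));
    if (cosv <= 0.0) return 0;                       // opposite directions

    // Displacement test (Eqs. 11-13): p_{i,j} against kappa * mean displacement.
    double pij = std::fabs(dist(cx[i], cy[i]) - dist(cx[j], cy[j]));  // Eq. (12)
    double pbar = 0.0;
    const int M = static_cast<int>(cx.size());
    for (int k = 0; k < M; ++k)
        pbar += dist(cx[k], cy[k]) / M;              // Eq. (13) with P_k = 1/M
    return (kappa * pbar - pij) > 0.0 ? 1 : 0;       // s(G_i, G_j), Eqs. (11), (14)
}
```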

3.2.5 Grid matching metric

From Eq. 14, we obtain the consistency of each grid vector pair, and all vectors are thus divided into two cases: with or without consistency. Clearly, a vector pair consisting of two correctly matched grids should be considered correct, while a vector pair involving either incorrectly matched grid should be considered false.

Assume that there are \(M\) grids in total, of which \(t\) are correctly matched grids that should be retained; then the number of incorrectly matched grids that should be removed is \(M - t\). If a grid is matched correctly, the number of globally consistent vectors it generates should be \(t - 1\), and the number of globally inconsistent vectors it generates should be \(M - t\). If a grid is matched incorrectly, all the vectors connected to it will lack global consistency. That is to say, the total number of grid vectors with global consistency should be \(\frac{t(t - 1)}{2}\), marked as \(T\), and the number of vectors without global consistency should be \(t(M - t) + \frac{(M - t)(M - t - 1)}{2}\), marked as \(F\). Then, we define a global consistency score calculator for each grid as follows:

$$ S_{G_{i}} = \frac{G_{T_{i}} - G_{F_{i}}}{T - F} $$
(15)

where \(G_{T_{i}}\) is the number of globally consistent vectors in the vector set connected to \(G_{i}\), and \(G_{F_{i}}\) is the number of globally inconsistent vectors connected to it. We calculate the score of correctly matched grids, the score of mismatched grids, and the value of \((T - F)\) as the total number of grids \(M\) and the number of correctly matched grids \(t\) take different values; the cases \(M = 20\) and \(M = 40\) are summarized in Fig. 5. As we can see, when \((T - F)\) is greater than zero, the score of correctly matched grids is greater than \(2/M\), while the score of incorrectly matched grids is less than zero. When \((T - F)\) is less than zero, the score of correctly matched grids is less than \(2/M\), while the score of mismatched grids is greater than \(2/M\).

Fig. 5

The relationship between \((T - F)\) and the match score of the grids. a The results when \(M = 20\). b The results when \(M = 40\)

According to the above analysis, we can use the following formula to calculate the match confidence of \(G_{i}\):

$$ F_{{G_{i} }} = \text{sgn} \left( {\left( {T - F} \right)*\left( {S_{{G_{i} }} - \frac{2}{M}} \right)} \right) $$
(16)

In order to simplify the calculation, \(F_{{G_{i} }}\) can be represented as follows:

$$ F_{{G_{i} }} = \text{sgn} \left( {T - F} \right)*\text{sgn} \left( {S_{{G_{i} }} - \frac{2}{M}} \right) $$
(17)

\(F_{{G_{i} }}\) can be divided into the following two cases:

$$ F_{i} = \left\{ {\begin{array}{*{20}l} {1,} &\quad {F_{G_{i}} > 0} \\ {0,} &\quad {F_{G_{i}} < 0} \\ \end{array} } \right. $$
(18)

where \(F_{i} = 1\) indicates that the grid \(G_{i}\) is correctly matched between the image pair, and \(F_{i} = 0\) indicates that it is not.

According to Eqs. 9, 15, and 17, we obtain the minimization problem in Eq. 19:

$$ C({\rm P};I) = \sum\limits_{{i|G_{i} \in {\rm P}}}^{{}} {\left( {1 - \text{sgn} \left( {T - F} \right)*\text{sgn} \left( {\frac{{G_{{T_{i} }} - G_{{F_{i} }} }}{{T - F}} - \frac{2}{M}} \right)} \right)} $$
(19)
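A sketch of this grid confidence computation follows. Here we take T and F as the observed totals of consistent and inconsistent grid vector pairs (their theoretical values in terms of t are derived above, but t is not needed as an input); consistent[i][j] is assumed to hold the \(\hat{d}\) value of Eq. (14), and the function name is ours.

```cpp
// Sketch of Eqs. (15)-(18): per-grid global consistency score and match
// confidence; returns the indices of the grids judged correctly matched.
// Assumes T != F (the degenerate T == F case is not handled).
#include <vector>

std::vector<int> selectGrids(const std::vector<std::vector<int>>& consistent) {
    const int M = static_cast<int>(consistent.size());
    double T = 0, F = 0;                       // observed totals over all pairs
    for (int i = 0; i < M; ++i)
        for (int j = i + 1; j < M; ++j)
            (consistent[i][j] ? T : F) += 1;

    std::vector<int> keep;                     // grids with F_{G_i} = 1
    for (int i = 0; i < M; ++i) {
        int gt = 0, gf = 0;                    // G_{T_i}, G_{F_i}
        for (int j = 0; j < M; ++j)
            if (j != i) (consistent[i][j] ? gt : gf)++;
        double score = (gt - gf) / (T - F);    // Eq. (15)
        int sgn1 = (T - F) > 0 ? 1 : -1;       // sgn(T - F)
        int sgn2 = (score - 2.0 / M) > 0 ? 1 : -1;
        if (sgn1 * sgn2 > 0) keep.push_back(i);// Eqs. (17)-(18)
    }
    return keep;
}
```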

3.3 Implementation details

The proposed SMF method is summarized in Algorithm 1. First, a specific matching algorithm is used for pre-matching; in the experimental part, we use five algorithms separately, namely RANSAC [17], VFC [24], GMS [6], LPM [7], and GLOF [28]. Second, grid matching is performed based on the feature match distribution. Third, the grid vector matching set is constructed, and the global consistency is calculated using Eq. 14. To prevent false matching caused by image rotation, we rotate \(G_{y}\) around its center every 45 degrees and take the matching result with the smallest cost value. Finally, we obtain a grid set \(P\) that minimizes Eq. 19. After that, we keep all feature matching pairs located within the grids in \(P\), delete the rest, and obtain the final matching result \(I^{*}\). There is only one parameter, κ, in our method, which is used to decide from the displacement whether a grid vector pair is globally consistent. The overall flow is outlined below.
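The outline uses the hypothetical helper routines sketched in the previous subsections, not the authors' released code:

```cpp
// Outline of Algorithm 1 (SMF); prematch() stands for the chosen
// pre-matching algorithm (RANSAC/VFC/GMS/LPM/GLOF).
//
//   1. S    = prematch(Ia, Ib);                 // locally consistent matches
//   2. best = bestMatchPerCell(S, w, h);        // Eqs. (7)-(8), Sect. 3.2.1
//   3. for each 45-degree rotation of G_y:
//        consistent[i][j] = gridVectorConsistency(...);  // Eq. (14)
//        P = selectGrids(consistent);           // Eqs. (15)-(18)
//      keep the rotation whose P gives the smallest cost in Eq. (19)
//   4. I* = matches of S that fall in the grids of P; discard the rest.
```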

3.4 Computational complexity

The two main steps of the SMF method are calculating the number of matches between grids and calculating the global consistency score for all grid vector pairs. The time complexity of the first step is \(O(T)\), where \(T\) is the size of the putative match set. Suppose the number of grids that contain matches is \(M\), which is much smaller than \(T\); then the time and space complexity of the second step is \(O(M^{2})\). So, given the pre-matching results, our SMF has linear time complexity.

4 Experimental results

We conduct experiments on image feature matching and object reconstruction to evaluate the performance of our SMF. The features of each image are detected using ORB [31], which is efficient, robust, yields a large number of feature points, and is suitable for real-time tasks. The number of features is set to 5,000. We implemented all the algorithms in Visual Studio 2019 with OpenCV 4.5.1 and C++, without any optimization. All experiments are performed on a notebook with a 2.8 GHz Intel Core i7-1165G and 16 GB RAM.

4.1 Results on feature matching

Image pairs with four different types of similarity are used to verify the effectiveness of our SMF on mismatch removal: identical image pairs with large baselines, similar image pairs with partially identical patterns, similar image pairs with completely different pattern layouts, and similar image pairs with almost symmetric patterns. We show some examples in Fig. 6. The traditional ratio test (threshold 0.6) match result is used as the putative match set.

Fig. 6

Some examples of the matching datasets. a DTU [32] scan4. b Book has the same patterns at different relative positions. c Pencil box has almost symmetrical patterns on different surfaces. d Toothpaste has partially identical patterns with different typography

The DTU [32] dataset is used to verify the effectiveness of the SMF algorithm on the same object. We collect the other three datasets with different types of similar images. We manually checked the correctness of each feature match for each image pair, labeled it as true or false, and used it as the ground truth to ensure objectivity. The details of the datasets are described as follows:

  • DTU [32]: This dataset is mainly used for MVS (Multi-View Stereo) reconstruction. It contains many types of 3D scenes, each with a series of images taken from different perspectives. The ground truth is calculated from the camera parameters supplied by the dataset. Two scenes are selected, and we choose 50 image pairs with large baselines. The average number of putative matches is 1203.5, and the average inlier rate is 56.73%.

  • Similar image dataset No.1: This dataset contains images taken from three different scenes. The image pairs have partly identical patterns and partly differently positioned patterns. We collect 30 image pairs in total. The average number of putative matches is 1406.3, and the average inlier rate is 39.63%.

  • Similar image dataset No.2: This dataset contains images taken from 4 different objects. The different surfaces of the objects have similar pattern elements but different typography. We match every two images of each object and create 43 image pairs. The average number of putative matches is 455. Due to the completely different typography, the number of true matches should be zero.

  • Similar image dataset No.3: This dataset contains images taken from 3 different objects. The different surfaces of the objects have almost symmetrical patterns. We match every two images of each object and create 27 image pairs in total. As with dataset No.2, the number of true matches should be zero.

We test our SMF on the datasets described above and compare it with five state-of-the-art methods: RANSAC [17], VFC [24], GMS [6], LPM [7], and GLOF [28]. RANSAC [17] is a classic sampling-based approach; VFC [24] is a non-parametric-interpolation-based method; LPM [7] is a locality-neighborhood-based method; GMS [6] is based on grid motion statistics; GLOF [28] is a mismatch rejection method based on local density reachability. We implement these algorithms from publicly available code. The SMF method proposed in this paper is then tested by using these algorithms as pre-matching algorithms to evaluate the filtering effectiveness. We also test the results of using other matching filters on top of LPM [7] and VFC [24].

Figure 7a–e illustrates some representative matching results on similar image pairs included in dataset No.1, using SMF on top of RANSAC [17], VFC [24], GMS [6], LPM [7], and GLOF [28]. These image pairs contain the same patterns at different locations. We first use the five methods for pre-matching, and all lines in Fig. 7a–e are pre-matching results. Our goal is to identify the false matches caused by different typography. The matches of patterns with different positions should be identified as false and are marked with blue lines; the remaining true matches are marked with green lines. For example, the small pattern of the “Book” and “Dictionary” pairs appears at different positions, so the matches between the small patterns should be recognized as mismatches. The pattern layouts of the two surfaces of “Toothpaste” are completely different, so the matches between these two surfaces should also be false. Figure 7f shows the match results of using GMS [6] as a matching filter on LPM [7]. Figure 7g shows the match results of using GLOF [28] as a matching filter on VFC [24]. Figure 7h shows the correct matching results. From Fig. 7, we can see that SMF works with different algorithms to effectively identify the mismatches caused by the same patterns at different locations.

Fig. 7

Feature matching results of further applying our SMF, compared with the other seven methods. From left to right: Dictionary, Book, Toothpaste. From top to bottom: RANSAC [17], VFC [24], GMS [6], LPM [7], GLOF [28], LPM [7] + GMS [6], VFC [24] + GLOF [28]. The green lines indicate the matching results of each matching algorithm. In a–e, all lines are the matching results of RANSAC [17], VFC [24], GMS [6], LPM [7], and GLOF [28] used as pre-matching methods separately. Among them, the blue lines indicate the false matches identified by SMF, and the green lines indicate the matches that should be retained. f The match result of using GMS [6] as a matching filter on LPM [7]. g The match result of using GLOF [28] as a matching filter on VFC [24]. h The correct matching results

Three metrics are used to evaluate the results: precision, recall, and F-score. Given the numbers of true positive matches (TP), true negative matches (TN), false positive matches (FP), and false negative matches (FN), the precision is calculated by:

$$ P = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
(20)

The recall is obtained by:

$$ R = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
(21)

The F-score is given as follows:

$$ F = \frac{2 \times P \times R}{{P + R}} $$
(22)
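As a minimal sketch, these three metrics translate directly into code; the helpers below are ours, with tp, fp, and fn counted against the manually labeled ground truth described above.

```cpp
// Minimal helpers for Eqs. (20)-(22).
double precision(int tp, int fp) { return static_cast<double>(tp) / (tp + fp); }
double recall(int tp, int fn)    { return static_cast<double>(tp) / (tp + fn); }
double fscore(double p, double r){ return 2.0 * p * r / (p + r); }
```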

The quantitative comparisons of the average precision, recall, and F-score for each object are summarized in Table 1. From the results, we see that when the original algorithms are used alone, RANSAC [17] obtains relatively better metrics because of its matrix transformation principle, which can deal with the differing matrices caused by a few identical patterns at different locations. The other original algorithms have low precision because they identify correct matches only from local feature matching or local topological consistency. After applying our SMF algorithm, the mismatches caused by different global topologies are effectively identified. The metrics of using GMS [6] as the filter on LPM [7] and of using GLOF [28] as the filter on VFC [24] indicate that using SMF as the filter yields more correct matching results. A more intuitive evaluation is shown in Fig. 8 for a more straightforward comparison.

Table 1 Precision, recall, and F-score metrics of the image pairs included in dataset No.1 based on different algorithms
Fig. 8

Average quantitative performance of different algorithms on the three datasets in Fig. 7. The blue columns show the metrics by using RANSAC [17], VFC [24], GMS [6], LPM [7], and GLOF [28]. The orange columns show the metrics after using SMF. From left to right: precision, F-score

Figure 9a illustrates some examples of the matching results for similar image pairs with totally different pattern layouts (dataset No.2) and similar image pairs with almost symmetric patterns (dataset No.3), obtained by applying SMF to the results of each individually executed algorithm mentioned above. Figure 9b illustrates the results of the two combined algorithms. These image pairs have completely different global typography or are almost symmetrical; therefore, they should ideally have no matches at all, so that sufficiently correct camera poses can be obtained in subsequent work such as 3D modeling. From the results, it can be seen that where the other algorithms identify the matches between similar images with different typography as correct, our SMF identifies them as belonging to different scenes and considers them wrong, which benefits subsequent camera pose estimation. This verifies the effectiveness of the proposed global topology strategy. Thus, SMF can be used as a follow-up step to any matching algorithm for removing mismatches between differently typeset images.

Fig. 9

Feature matching results of further applying our SMF, compared with the other methods. All lines in (a) are the matching results of RANSAC [17], VFC [24], GMS [6], LPM [7], and GLOF [28] used as pre-matching methods separately. Among them, the blue lines indicate the false matches identified by SMF, and the green lines indicate the matches that should be retained. b The match results of using GMS [6] as the matching filter on LPM [7] and of using GLOF [28] as the matching filter on VFC [24]

Figure 10 shows the average run time of the aforementioned approaches on the four datasets. The maximum image resolution does not exceed 1500 × 1200. GMS [6] has the fastest running time due to its grid strategy. Compared with the run time of the original algorithms, additionally running SMF takes only approximately 37.4% more time to identify the mismatches caused by different global topologies.

Fig. 10

Average run time (seconds) on the four datasets based on different algorithms

4.2 Results on object reconstruction

Open-source software such as VisualSFM [33, 34], COLMAP [35, 36], and OpenMVG + OpenMVS [37,38,39] is widely used to reconstruct scenes from unordered images. To verify the role of SMF in 3D reconstruction, we perform reconstruction tasks on three objects whose different surfaces are similar and compare the results with those of the three open-source packages.

We take 17 images from different angles around Toothpaste; the schematic is shown in Fig. 11a. Figure 11b shows the results of camera pose estimation by OpenMVG [37, 38], COLMAP [35, 36], and OpenMVG [37, 38] + SMF. It can be seen that there are obvious pose estimation errors when using COLMAP [35, 36] and OpenMVG [37, 38]. By additionally using SMF, we obtain correct camera pose estimation results, which is important for the subsequent 3D reconstruction step [40].

Fig. 11

a Schematic of the camera positions. b Results of camera pose estimation by OpenMVG [37, 38], COLMAP [35, 36], and OpenMVG [37, 38] + SMF

In addition, we continue the dense point cloud reconstruction process based on the different camera poses calculated by the three pipelines, estimating the depth of each scene taken from different positions. We calculate the depth estimation errors for OpenMVG [37, 38], COLMAP [35, 36], and OpenMVG [37, 38] + SMF. We use an Intel RealSense D435i and align the RGB images with the depth maps. Although the obtained depth maps usually contain many invalid values, such as those caused by occlusion due to the different angles of the left and right cameras, the valid depth values are relatively reliable. Figure 12 shows one example of the RGB image, the left IR view, the right IR view, and the corresponding depth map, in which black marks invalid data. We sample 20 frames for each angle and take the average depth of these 20 frames as the true depth value of each scene taken from different positions. To eliminate the effect of the different units of the depth values produced by the different algorithms, we normalize the data. We use three evaluation metrics to assess the depth estimation performance, AbsRel, MAE, and RMSE, shown in Eqs. (23), (24), and (25):

$$ {\text{AbsRel}} = \frac{\sum\nolimits_{i = 1}^{N} {\frac{|D_{i} - D_{i}^{*} |}{D_{i}^{*} }} }{N} $$
(23)
$$ {\text{MAE}} = \frac{{\sum\nolimits_{i = 1}^{N} {|D_{i} - D_{i}^{*} |} }}{N} $$
(24)
$$ {\text{RMSE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{N} ( D_{i} - D_{i}^{*} )^{2} }}{N}} $$
(25)

where \(D_{i}\) is the estimated depth of the \(i\)-th point, \(D_{i}^{*}\) is the true depth value of the \(i\)-th point, and \(N\) is the number of points with valid depth values in the scene. Table 2 shows the metrics against the true depth values for OpenMVG [37, 38], COLMAP [35, 36], and OpenMVG [37, 38] + SMF. The depth estimation process failed for the 1st to the 9th image, so we compare the error results for the 10th to the 17th image. It can be seen that the estimated depth results are effectively improved after further filtering the mismatches with the method proposed in this paper. Figure 13 shows the dense point clouds reconstructed from the three different estimated camera pose results mentioned in Fig. 11b. It can be seen that the dense point clouds based on the camera poses calculated by OpenMVG [37, 38] and COLMAP [35, 36] have contour errors and point cloud loss, while the point cloud based on the camera poses calculated with the additional use of SMF is relatively complete and correct.
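A small sketch of these error metrics follows, assuming d and g hold the normalized estimated and ground-truth depths of the valid points; the struct and function names are illustrative.

```cpp
// Sketch of Eqs. (23)-(25) over the valid depth samples.
#include <cmath>
#include <vector>

struct DepthErrors { double absRel, mae, rmse; };

DepthErrors depthErrors(const std::vector<double>& d, const std::vector<double>& g) {
    double ar = 0, mae = 0, se = 0;
    const size_t N = d.size();
    for (size_t i = 0; i < N; ++i) {
        double e = d[i] - g[i];
        ar  += std::fabs(e) / g[i];   // AbsRel term, Eq. (23)
        mae += std::fabs(e);          // MAE term, Eq. (24)
        se  += e * e;                 // squared-error term, Eq. (25)
    }
    return { ar / N, mae / N, std::sqrt(se / N) };
}
```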

Fig. 12 

One example of the RGB image, the left IR view, and the right IR view taken by Intel RealSense D435i

Table 2 The error between the true depth and the depth obtained based on the three different estimated camera pose results by using OpenMVG [37, 38], COLMAP [35, 36], and OpenMVG [37, 38] + SMF for the 9th to the 17th image
Fig. 13

The two different perspectives of the reconstructed dense point clouds based on the estimated camera pose calculated by different algorithms. From left to right: OpenMVG [37, 38], COLMAP [35, 36], OpenMVG [37, 38] + SMF. a Shows that the contour is reconstructed more correctly by adding SMF. b Shows that the point cloud is more complete by adding SMF

Figure 14 shows the reconstruction results of the different algorithms. For the Toothpaste and Garbage can datasets, the results of the other three pipelines exhibit varying degrees of damage and distortion. For Pencil box, VisualSFM [33, 34] and COLMAP [35, 36] show missing or distorted surfaces. Although the results of OpenMVG + OpenMVS [37,38,39] are more complete, they still contain distortions. It can be seen that using our SMF to further filter the outliers caused by different global topologies helps to compute more accurate camera poses, and the accuracy and completeness of the resulting 3D reconstructed models are higher.

Fig. 14

3D reconstruction models obtained by different algorithms. The first row shows the dense point clouds obtained by VisualSFM [33, 34]. The second row shows the dense point clouds obtained by COLMAP [35, 36]. The third row shows the models obtained by OpenMVG + OpenMVS [37,38,39]. The last row shows the results obtained by OpenMVG [37, 38] + SMF + OpenMVS [39]

5 Conclusion

In this paper, we present SMF, a global-typography-based method to remove mismatches from similar image pairs. It is based on the consensus that the global topological structures of the matched points of image pairs taken from the same scene or the same object should be similar. SMF can run after any matching algorithm whose output has local consistency and can then remove false matches that lack global consistency. We formulated the global topological structure of the matches into a mathematical model that robustly recovers the inliers by judging the false matches caused by different pattern layouts. The experimental results on image matching demonstrate that our method outperforms the state-of-the-art methods. Moreover, it can be used in reconstruction pipelines to obtain more accurate results.

So far, the SMF algorithm proposed in our work assumes that the cardinality of the images is equal. In future work, we will consider refining the global topology model of SMF and applying our method to more computer vision applications, such as eliminating matches between different instances or distinguishing similar environments in VSLAM. We will also compare the effectiveness of the proposed algorithm with deep-learning methods.