GMS: Grid-Based Motion Statistics for Fast, Ultra-robust Feature Correspondence

Feature matching aims at generating correspondences across images, which is widely used in many computer vision tasks. Although considerable progress has been made on feature descriptors and fast matching for initial correspondence hypotheses, selecting good ones from them is still challenging and critical to the overall performance. More importantly, existing methods often take a long computational time, limiting their use in real-time applications. This paper attempts to separate true correspondences from false ones at high speed. We term the proposed method (GMS) grid-based motion Statistics, which incorporates the smoothness constraint into a statistic framework for separation and uses a grid-based implementation for fast calculation. GMS is robust to various challenging image changes, involving in viewpoint, scale, and rotation. It is also fast, e.g., take only 1 or 2 ms in a single CPU thread, even when 50K correspondences are processed. This has important implications for real-time applications. What’s more, we show that incorporating GMS into the classic feature matching and epipolar geometry estimation pipeline can significantly boost the overall performance. Finally, we integrate GMS into the well-known ORB-SLAM system for monocular initialization, resulting in a significant improvement.

GMS matching. Although Lowe's ratio test (RT) can remove many false matches, generated by ORB (Rublee et al. 2011) features here, the results are still noisy (a) and degenerate RANSAC (Breiman 2001) based estimators in applications. To address this issue, we propose GMS, which further removes motion-inconsistent correspondences towards the high-accuracy matching (b) assumption causes that neighboring true correspondences in one image would also be close in other images, while false correspondences not. This allows us to classify a correspondence as true or false by just counting the number of its similar neighbors, the correspondences that are close to the reference correspondence in both images. Figure 2 shows an visualization for our assumption, and Sect. 3.2 presents the theoretical analysis.
The computational cost is critical to a false match removal method, since feature matching is often used in real-time applications such as Visual SLAM (Mur-Artal et al. 2015). We accelerate the calculation by proposing a grid-based framework in Sect. 3.3, where we divide images into nonoverlap cells and process data at the cell level instead of at individual correspondence level. This avoids the distance comparison between correspondences, reducing the overall complexity from the plain O(N 2 ) to O(N ). As a result, GMS takes only 1 or 2 milliseconds CPU time in a single thread to identify true correspondences, even when the number of matches reaches 50K , as shown in Fig. 11.
The basic grid framework suffers from significant image changes in scale and rotation. To address the issue, we propose multi-scale and multi-rotation solutions. Specifically, we define 5 relative scales and 8 relative rotations between image pairs for scale and rotation enhancement, respectively. Then we repeat the basic GMS algorithm at different settings and collect the best results. As no data dependence exists in different repeats, the proposed methods can be implemented using multi-thread programming. Theoretically, they can be as fast as the basic GMS algorithm, when 8 (or 5) CPU threads are available. This resource burden is affordable to a regular desktop or laptop.
We conduct a comprehensive evaluation of GMS, including the robustness to common image changes, the performance and efficiency with varying feature numbers, and the accuracy of retrieved correspondences. Moreover, we evaluate the proposed method on very recent FM-Bench  for exploring how it can contribute to epipolar geometry estimation. The results demonstrate the superiority of GMS against other state-of-the-art alternatives. Finally, we incorporate the proposed GMS into the well-known ORB-SLAM (Mur-Artal et al. 2015) for monocular initialization, and a clear improvement is shown. This paper is an extension of our preliminary version (Bian et al. 2017). We extend it in the following four aspects: (1) more straightforward and more intuitive presentation; (2) scale and rotation enhancements; (3) more comprehensive evaluation; (4) use in real-time applications.

Relation to RANSAC
The proposed method attempts to remove false correspondence. Although it is related to RANSAC (Fischler and Bolles 1981) based algorithms, which fit a model from correspondences and remove outliers, note that GMS is not an alternative to the latter. The difference includes: (1) GMS cannot fit a model as and (2) outliers are model-dependent and are not conceptually equivalent to false correspondences. For example, in the image based reconstruction problem where the static scene assumption is made, some correct correspondences would also be removed as outliers if they land in moving objects. Instead of replacing RANSAC, the goal of GMS is to provide high-quality correspondence hypotheses to the latter towards better overall performance, i.e., model fitting and inlier generation.

False Correspondence Removal
Lowe's ratio test (Lowe 2004), referred as RT, is a widely used approach. It compares the distance of two nearest neighbors for identifying distinctive correspondences. However, due to lacking more powerful constraints, many false correspondences remain under challenging scenarios. An example is shown in Fig. 1, where applying GMS can further remove false correspondences. Other methods include KVLD (Liu et al. 2012) which uses constraints in both photometry and geometry, and VFC (Ma et al. 2014) which interpolates a vector field between two point sets and estimates the consensus set. Recently, CODE ) leverages non-linear optimization for globally modeling the motion smoothness. Our method is inspired by it, but simpler and more efficient. LPM (Ma et al. 2019) explores the local structure of surrounding feature points, which makes more restrictive assumptions than GMS. LC ) uses deep neural networks to find good correspondences by fitting an essential matrix. It is like an alternative to RANSAC (Fischler and Bolles 1981). However, the authors show that the predicted model is not as good as using RANSAC. They instead suggest using the method for finding good correspondences and then applying RANSAC for model fitting.

Successful Applications that use GMS
After the conference version of GMS was published, we notice that many recent works have used our method in their applications and achieved remarkable performances. For instance, (Causo et al. 2018) use GMS in an item picking system for the Amazon Robotics Challenge, and the system can pick all target items with the shortest amount of time; ) use GMS and extend the idea to tackle the tracking problem in dynamic scenes. The resultant solution produces one order of magnitude more accurate camera trajectory than ORB-SLAM2 (Mur-Artal et al. 2015) in the TUM benchmark (Sturm et al. 2012); (Yoon et al. 2018) use GMS for point cloud triangulation in 3D their trajectory reconstruction system. Moreover, our implementation of GMS has been integrated into OpenCV library (Bradski 2000), and we encourage researchers to use and extend it in more real-time applications.

Grid-Based Motion Statistics
Given putative correspondences generated by feature detectors, descriptors, and matchers, our goal is to separate the true correspondences from false ones.

Motion Smoothness Assumption
To differentiate true and false correspondences, we assume that pixels those are spatially close in the image coordinates would move together. It is intuitive, i.e., imagine that neighboring pixels have a high probability of landing in one rigid object or structure and hence have similar motions. The assumption is not generally correct, e.g, it may be violated in image boundaries where neighboring pixels may land in different objects and these objects have independent motions. However, it suits regular pixels, which are overwhelmingly more than edges. Besides, as our goal is a set of high-quality correspondence hypotheses for further processing instead of a final correspondence solution, the assumption is sufficient for our purpose.

Motion Statistics
True correspondences are influenced by the smoothness constraints, while false correspondences are not. Therefore, true correspondences are often have more similar neighbors than false correspondences, as shown in Fig. 2, where the similar neighbors refer to the correspondences which are close to the reference correspondence in both images. We use the number of similar neighbors to identify good correspondences.
Formally, let C be all correspondences across image I 1 and I 2 , c i be one correspondence that connects the point p i and q i between two images. We define c i 's neighbors as and its similar neighbors as where d(·, ·) refers to the Euclidean distance of two points, and r 1 , r 2 are thresholds. We term |S i |, the number of elements in S i , motion support for c i . The motion support can be used as a discriminative feature to distinguish true and false correspondences. Modeling the distribution of |S i | for true and false correspondence, we get: where B(·, ·) refers to the binomial distribution. |N i | refers to the number of neighbors for c i . t and are the respective probabilities that a true and false correspondence are supported by one of its neighbors. In Eq. 3, t is dominated by the feature quality, i.e., it is near to correct rate of correspondences. is usually small because false correspondences are nearly random distributed in regular regions. Note that it would be larger in visually similar but geographically different regions, e.g, repetitive structures (Kushnir and Shimshoni 2014). Here we assume that features are sufficiently discriminating that their correspondence distribution is better than random and that caused by repetitive patterns, i.e., t is larger than .
We can derive |S i |'s expectation: and variance: This allows definition of partionability between true and false correspondences as: where P ∝ √ |N i | and if |N i | → ∞, P → ∞. This means that the separability of true and false matches based on |S i | becomes increasingly reliable as the feature numbers are sufficiently large. This occurs even if t is only slightly greater than , making it possible to obtain reliable correspondence on difficult scenes by simply increasing the number of detected features. The similar results are shown in Lin et al. (2018), where ambiguous distributions are separated through large numbers of independent trials. Besides, it shows that improving feature quality (t) can also boost the separability. Fig. 3 Grid-Based framework. We use the pre-computed grid to find similar neighbors instead of explicit distance comparison between points The distinctive attributes permit us to classify c i as true or false by simply thresholding |S i |, giving: where T and F denote true and false correspondence sets, respectively. Based on Eq. 6, we set τ i to be: where α is a hyperparameter, and we empirically find that it results in good performance when α ranges from 4 to 6.

Grid-Based Framework
The complexity of a plain implementation for computing S i is O(N ), where N = |C| is the number of all correspondences, since we need to compare c i with all other correspondences. Therefore, the overall algorithm complexity is O(N 2 ). Although approximated nearest neighbor algorithms, like FLANN (Muja and Lowe 2009), can reduce the complexity to O(Nlog(N )), we show that using the proposed grid-based framework is faster (O(N )). Figure 3 illustrates the framework, where we divide two images into non-overlap cells G 1 and G 2 , respectively. Assume c i be a correspondence that lands in the cell G a and G b , like one of the red correspondences in Fig. 3. The neighbors of c i are re-defined as: and the similar neighbors are re-defined as: where C a are correspondences those land in G a , and C ab are correspondences those land in G a and G b simultaneously. In other words, we regard correspondences those are in one cell as neighbors, and correspondences those are in one cell-pair as similar neighbors. This avoids the explicit comparison between correspondences. To obtain motion support for all correspondences, we only need to put them in cell-pairs. In this fashion, the overall complexity is reduced to O(N ). Note that correspondences those land in one cell-pair share the same motion support, so we only need classify cell-pairs instead of individual correspondences. Moreover, instead of determining all possible cell-pairs, we only check the best one cell-pair that contains most correspondences for each cell in the first image. For example, we only check G ab and discard G ac , G ad in Fig. 3. This operation significantly decreases the number of cell-pairs and early rejects a considerable amount of false correspondences.

Motion Kernel
Few neighbors would be considered, if the cell size is small. This degenerates the performance. However, if the cell size is big, inaccurate correspondences would be included. To address this issue, we set the grid size to be small for accuracy and propose motion kernel for considering more neighbors. Figure 4 shows the basic motion kernel, where we consider additional eight cell-pairs (C a 1 b 1 , . . . , C a 9 b 9 ) for classifying the original cell-pair (C ab ). Formally, let c i land in C ab again. We re-define its neighbors as where We re-define its similar neighbors as where The basic kernel assumes that the relative rotation between two images is small. For matching image pairs with significant rotation changes, we can rotate the basic kernel, as described in the next section.

Multi-scale and Multi-rotation
To deal with significant scale and rotation changes between two images, we propose the multi-scale and multi-rotation solutions in this section.  a 1 b 1 , . . . , C a 9 b 9 ) when computing the motion supports for the original cell-pair (C ab )

Fig. 5
Rotated motion kernels. We set the pattern in first image fixed, and rotate the pattern in the second image in the clockwise direction, resulting in total 8 motion kernels for simulating possible relative rotations

Multi-rotation Solution
We rotate the basic kernel for simulating different relative rotations, resulting in total 8 motion kernels, as shown in Fig. 5. The rotation is often unknown in real-world problems, so we run GMS algorithm using all motion kernels and collect the best results, i.e., we find the kernel that results in most correspondences retrieved. The efficacy of multirotation solution is demonstrated in Sect. 4.2.

Multi-scale Solution
The relative scale between two images can be simulated in our grid framework, i.e., we fix the cell size (cell numbers) of one image and vary that of the other image. Assume that both images are divided into n×n cells. We change the cell number of the second image to be (n · α) × (n · α). Here, we provide 5 candidates for α, including { 1 2 , Similarly, we run GMS using all pre-defined relative scales and collect the best results. Note that we provide only 5 relative scales to demonstrate that our solution is effective for solving the scale issue, which is efficient and sufficient for most scenes, as demonstrated in Sects. 4.2 and 4.4. However, for more significant scale changes, we can use more candidates or increase α.

Algorithm 1 Grid-based Motion Statistics
Shift the gird of the first image by half cell-width in horizontal, vertical, and both directions, and then repeat algorithm more 3 times. return X

Algorithm and Limitation
Algorithm 1 shows the GMS algorithm, which takes putative correspondences and the setting for scale and rotation as input and outputs selected matches. We use the basic motion kernel and the single equal scale for matching regular images, e.g., video frames. The multi-scale and multi-rotation solutions are used for images with significant changes in scale and rotation, respectively.

Implementation
We implement the algorithms using C++ with OpenCV library (Bradski 2000). A single CPU thread is used currently, but the multi-thread programming can be used for accelerating the multi-scale and multi-rotation solutions. We use 20×20 cells as the default mode, and we vary the number of cells in the second image when activating the multi-scale solution. α = 4 in Eq. 8 is used for thresholding. The code has been integrated into the OpenCV library.

Limitation
The limitation of GMS lies in three aspects. First, as we assume that image motion is piece-wise smooth, the performance may degenerate in areas where the assumption is violated, e.g., image boundaries. This issue is not critical because regular pixels are overwhelmingly more than boundaries. Besides, as we are not targeting a final correspondence solution but a set of high-quality hypotheses, the assumption is sufficient for our purpose. To solve this problem, we will consider using edge detection  or image segmentation (Cheng et al. 2016;Liu et al. 2018) techniques in future work. Second, the performance is limited in visually similar but geographically different image regions. This issue often occurs in scenes that have heavy repetitive patterns. We leave the problem to global geometry estimation algorithms (Kushnir and Shimshoni 2014), as only local visual information is not sufficient to address that. Third, as we process data at the cell-pair level, inaccurate correspondences that lie the correct cell-pair will remain. These correspondences are useful in many applications which are not sensitive to matching accuracy such as Object Retrieval (Philbin et al. 2007). However, for accuracy-sensitive tasks, e.g., geometry estimation, they should be excluded. Therefore, we propose to run GMS on correspondences selected by RT instead of all putative correspondences for mitigating the issue. The efficacy of using RT as pre-processing can be seen in Fig. 7 and Table 4.

Experiments
We evaluate four aspects of GMS: -Comprehensive performance characterization on correspondence selection. -Matching challenging image pairs with significant relative scale and rotation changes. -Contribution to the overall performance of feature matching and epipolar geometry estimation. -Integration in real-time computer vision applications, i.e., Visual SLAM here.

GMS for Correspondence Selection
To comprehensively evaluate the performance of GMS, we experiment with different local features and varying feature numbers. We also examine the accuracy of retrieved correspondences using varying error thresholds. Two challenging datasets (Graffiti and Wall) selected from VGG (Mikolajczyk and Schmid 2005) are used for evaluation, which are well-known for significant viewpoint changes. Each dataset contains six images, where the ground truth homography between the first image to others is provided, resulting in five pairs with increasing difficulty for testing. Recall and Precision are used as evaluation metrics, where we regard correspondences those distances to the ground truth are smaller than 10 pixels as correct and others as wrong. Results on pairs with gradually increasing viewpoint changes. "#" stands for the feature numbers. RT refers to Lowe's ratio test (Lowe 2004). GMS and RT take all correspondences as input, while RT-GMS takes the results of RT as input. Therefore, the Recall of RT is the upper bound for RT-GMS selected by Lowe's ratio test. Here SIFT is not able to provide sufficient correct correspondences on very challenging pairs due to significant viewpoint changes, while ASIFT can.

Results on Different Features
The results show that GMS gets high recall and precision on correspondences generated by ASIFT, while the recall degenerates on correspondences generated by SIFT. The reason is that ASIFT provide sufficient correspondences and GMS can translate the high feature numbers to high match quality, as indicated in Eq. 6, while SIFT correspondences are quite sparse. However, on SIFT correspondences, RT-GMS can achieve high performance, although they are sparse. Note that the Recall of RT is the upper bound of RT-GMS. This is because the performance of GMS is also related to the feature quality, as indicated in Eq. 6, and the quality of correspondences selected by RT is higher than that of initial correspondences in terms of precision. Compared with SIFT-RT, the most widely used feature matching method as we know, the proposed SIFT-RT-GMS can get similar recall (i.e., similar correspondence numbers) but higher precision. It has important implications for many computer vision applications. We show a comparison of both methods in Table 2, where the performance gap is wide on challenging widebaseline datasets in terms of epipolar geometry estimation. Figure 7 shows the results with varying error thresholds on Graffiti, where we use all correspondences in Graffiti (5 pairs) for evaluation. Note that regarding the number of correct correspondences, ALL is the upper bound for GMS, and RT is the upper bound for RT-GMS. The results show that ASIFT provides better correspondences than SIFT (see ALL for comparison). However, when RT or GMS is applied, the precision on SIFT correspondences is higher than that on ASIFT correspondences, especially when the error threshold is small, e.g., 1 or 2 pixels. This is because there are many correct but not accurate correspondences generated by ASIFT and selected by RT and GMS. These inaccurate correspondences are useful in many applications where the accuracy is not critical, e.g., Object Retrieval (Philbin et al. 2007). However, they limit the performance of accuracy-sensitive applications such as epipolar geometry estimation. Therefore, we suggest using SIFT-RT-GMS solution for high accuracy of matching. Note that using ASIFT is also possible and may lead to a higher performance, since ASIFT correspondences are generally better than SIFT correspondences. For example, Lin et al. (2017) uses ASIFT features and achieves the state-of-the-art performance in terms of camera pose estimation. However, it uses a highly complicated regression method, which consumes huge computational cost to remove inaccurate correspondences.

Performance with Varying Feature Numbers
Equation 6 indicates that the performance of GMS relies on feature numbers. To understand how feature numbers impact it, we randomly pick a subset of ASIFT features with varying maximum feature numbers for evaluation. Figure 8 shows the results on Graffiti dataset. It shows that GMS requires the most features, while the performance of RT is less sensitive to feature numbers. RT-GMS can reduce the number requirement of GMS and achieves the most accurate matching. Overall, we suggest users trying RT and RT-GMS in scenarios where feature numbers are limited. Besides, note that the performance of GMS is also related to feature quality. We show that with the same feature detector (i.e., the

Robustness to Scale and Rotation Change
We use Semper and Venice datasets for testing the robustness of GMS to rotation and scale changes, respectively. Both datasets are selected from Heinly et al. (2012), which is an extension of VGG (Mikolajczyk and Schmid 2005) with the same data organization. We also use Boat dataset for testing, where image pairs have significant relative changes in both scale and rotation. Figure 9 shows the results, where we use SIFT feature (Lowe 2004) for generating putative correspondences and run GMS variants on correspondences selected by RT. It shows that the basic GMS is sensitive to large scale and rotation changes while using the multi-scale and multi-rotation solution can improve the performance significantly. Similar to previous results, the proposed method can achieve similar recall with RT and higher precision. We show visual results of GMS in Fig. 10, where the results on the most challenging pair of each dataset is illustrated. The scale invariance is critical in the unstructured environment, as  Table 4.

Runtime of GMS
We evaluate the runtime of GMS with varying feature numbers on Graffiti dataset using ASIFT features. Figure 11 shows the results, where GMS takes about only 2 ms in a single CPU thread, even when feature numbers reach 50, 000. The multi-scale (GMS-S) and multi-rotation (GMS-R) variants repeat the basic GMS with different pre-defined patterns 5 and 8 times, respectively. Therefore, their computational costs are linearly increased. We do not show GMS-SR in the Runtime of GMS in a single CPU thread. GMS-S and GMS-R repeat the basic GMS using different settings 5 and 8 times, respectively, so the runtime is linearly increased. GMS-SR, not shown in this figure, consumes 40 times more computational cost than the basic GMS. Note that the multi-scale and multi-rotation solutions could be accelerated by using multi-threshold programming, since no data dependence exists in different repeats figure for clarity. However, its time consuming is apparent, i.e., 40 times higher than the basic GMS. Note that both multiscale and multi-rotation solutions have no data-dependent in different repeats, so they could be accelerated by using multi-thread programming. Theoretically, the multi-scale (or rotation) version could achieve the same speed with the basic GMS, when there are 5 (or 8) CPU threads available.

GMS for Epipolar Geometry Estimation
We evaluate the proposed method on FM-Bench , where correspondences selection methods are integrated into a classic feature matching and epipolar geometry estimation pipeline (i.e., SIFT, RANSAC, and the 8-point algorithm), and the overall performance is compared. The benchmark datasets, metrics, and baselines are described below.

Datasets
FM-Bench consists of 4 datasets, including TUM (Sturm et al. 2012), KITTI (Geiger and Lenz 2012), Tanks and Temples (T&T) (Knapitsch et al. 2017), and a Community Photo Collection (CPC) dataset (Wilson and Snavely 2014). The first two datasets are popular for Visual SLAM testing, which provide indoor and outdoor videos, respectively. The last two datasets are widely used for structure-from-motion evaluation, which provides wide-baseline scenarios. Especially, the  Table 1 summarizes the test dataset, and Fig. 12 shows sample images.

Baselines
We compare with three state-of-the-art correspondence selection methods, including CODE , LPM (Ma et al. 2019), LC ) as strong baselines. CODE leverages the sophisticated non-linear optimization for finding correct correspondences, and it relies on a selfimplemented GPU-ASIFT feature, which extracts several times more features than the standard ASIFT (Morel and Yu 2009). LPM explores neighborhood structures, and LC uses deeply trained neural networks. Both methods are independent of features and use SIFT (Lowe 2004) in the original paper. Note that the comparison is unfair to our method because CODE uses ASIFT features and a very complicated solution (1000× slower than GMS).

Implementation Details
We use the DoG detector and SIFT descriptor (Lowe 2004) for generating putative correspondences. The implementation is from the VLFeat library (Vedaldi and Fulkerson 2010). We use the default parameters, which results in average 1082, 1751, 8133, and 7213 detected features in four datasets (as order in Fig. 12). The correspondences are pre-processed by using ratio test (RT), with the threshold being 0.8. Then we apply the evaluate methods (LPM, LC, and GMS) for good correspondence selection. We empirically find that using correspondences by RT results in better performance than using Bold values indicate the best performance all correspondences for LC , although it uses the latter in the original paper. We use the basic GMS in the first two SLAM datasets and the multi-scale solution in the last two SfM datasets, as where images are more unstructured. The multi-rotation solution is not used since there is no significant image rotation in these datasets. For CODE , we directly evaluate the output correspondences since it is a highly integrated correspondence system. We use the publicly available implementation for all methods and use the pre-trained model by authors for LC.

Metrics
To compare the overall performance of different correspondence systems, we fed their correspondences into a RANSAC-based 8-point estimator (Fischler and Bolles 1981;Hartley 1997) to recover the FM, and then compare the estimated FM with the ground truth FM using normalized symmetric geometry distance (NSGD) . See the appendix for details. We then report the success ratio of FM estimation (%Recall), where the NSGD threshold is 0.05, and we also show the results with varying error thresholds. What's more, we report the inlier rate (%Inlier) after RANSAC-based outlier removal for match quality comparison. Here inliers refer to matches those distances to the ground truth epipolar lines are smaller than α * l, where l stands for the length of image diagonal and α = 0.003. More details can be found in the appendix and Bian et al.  Table 2 shows the recall of FM estimation. Figure 13 shows the results with varying error thresholds on two wide-baseline datasets. Table 3 reports the inlier rate. All the above results demonstrate that GMS can show better performance with LC ) and LPM (Ma et al. 2019) using the same feature correspondences as input, i.e., SIFT-RT (Lowe 2004). However, our correspondence system is not as good as the powerful CODE  system. Compares with SIFT-RT, our approach can lead to significantly better results on T&T and CPC datasets, demonstrating the efficacy of Bold values indicate the best performance  Bold and italics denote the first and second performance, respectively GMS for high-accuracy matching. Concerning the runtime, CODE requires several seconds for correspondence selection, while GMS is more 1000 times faster. As the authors reported, LC needs 13 ms on GPU (or 25 ms on CPU) to find good correspondences from 2K putative matches, and LPM can identify false matches from 1000 putative correspondences in a few milliseconds. They are both slower than the proposed method (Fig. 11).

Effects of RT and Multi-scale
To understand how our multi-scale solution and using RT before GMS can benefit the overall performance, we conduct an ablation study by removing them respectively. The results are shown in Table 4. It shows that the performance is significantly decreased without RT. The reason is that inaccurate correspondences are included, and they limit the performance of geometry estimation. Besides, the results on wide-baseline matching datasets (T&T and CPC) demonstrate that using the proposed multi-scale solution can significantly improve the performance.

Paring with Deep Features
As Eq. 6 indicates that the performance of GMS is related to feature quality, we experiment with recently proposed deep learning based features, including the HardNet (Mishchuk et al. 2017) descriptor and HessAff (Mishkin et al. 2018) detector. Specifically, we use DoG (Lowe 2004) and Hes-sAff for interest point detection, respectively. Then we use use HardNet to compute descriptors, resulting in two putative correspondence generation solutions. The powerful CODE is compared as the strong baseline. Instead of RANSAC, we use LMedS (Rousseeuw and Leroy 1987) based FM estimator here, as it shows better performance than the former in FM-Bench . Table 5 shows the results, where GMS with deep features can achieve a competitive performance with CODE . The results are remarkable and have important implications to real-time applications since these deep features are efficient, while CODE is several orders of magnitude slower.

GMS for Monocular SLAM Initialization
Monocular SLAM methods (Mur-Artal et al. 2015) have to initialize the system before tracking and mapping, where the initial 3D map is created by triangulating correspondences to recover depths. Robust initialization has important implications for a monocular SLAM system, and high-quality correspondence is the key to this step. As Visual SLAM systems have high requirements for the run-time of methods for the overall real-time performance, many advanced correspondence approaches can not be used in this scenario. Fortunately, GMS is sufficiently fast for this purpose. In this section, we show that GMS can be used in the popular ORB-SLAM system (Mur-Artal et al. 2015) for better initialization.

Integration
In the initializer of ORB-SLAM, we replace the original Bag-of-Words based matching with the brute-force nearest neighbor matching, and we apply GMS for selecting good correspondences. The selected matches are used to recover geometry and create the 3D map by triangulation. For system stability, we use the default feature detection parameters, i.e., detecting 4K well-distributed ORB features for initialization in KITTI-like images.

Dataset and Metrics
We evaluate methods on the sequence 00-10 of KITTI odometry dataset (Geiger and Lenz 2012). For each sequence, we crop it into non-overlap fragments, with each fragment containing 20 consecutive frames. The performance averaged over all fragments is reported In each fragment, we measure (1) whether the initialization is successful; (2) how quickly the system is initialized; and (3) how many 3D points are generated. Regarding (1), we compare the estimated camera pose with the ground truth, and those are recognized as successful if pose error is less than 5 degrees in both rotation and translation. Regarding (2), we report the order of the first successfully initialized image in each fragment. Table 6 shows the results, where we compare with the original initializer. It shows that the proposed initializer leads to a significantly higher success ratio, faster initialization, and denser 3D map. This has a huge impact on monocular SLAM systems, especially when they work on challenging environments where previous solutions fail to provide reliable correspondences for initialization. The proposed method can mitigate this issue and enable the use of SLAM systems in more real-world scenarios.

Conclusion
This paper presents a fast correspondence selection algorithm that we termed grid-based motion statistics (GMS). It can effectively separate true correspondences from false ones at high speed by leveraging the motion smoothness constraint. Comprehensive experimental results demonstrate its robustness in different environments. We also show that it advances the feature matching and geometry estimation. Moreover, we plug GMS into the Monocular ORB-SLAM system for initialization, demonstrating its great potential to real-time applications. The code has been released and integrated into the OpenCV library.