Deep Corner

Recent studies have shown promising results on joint learning of local feature detectors and descriptors. To address the lack of ground-truth keypoint supervision, previous methods mainly inject appropriate knowledge about keypoint attributes into the network to facilitate model learning. In this paper, inspired by traditional corner detectors, we develop an end-to-end deep network, named Deep Corner, which adds a local similarity-based keypoint measure into a plain convolutional network. Deep Corner enables finding reliable keypoints and thus benefits the learning of the distinctive descriptors. Moreover, to improve keypoint localization, we first study previous multi-level keypoint detection strategies and then develop a multi-level U-Net architecture, where the similarity of features at multiple levels can be exploited effectively. Finally, to improve the invariance of descriptors, we propose a feature self-transformation operation, which transforms the learned features adaptively according to the specific local information. The experimental results on several tasks and comprehensive ablation studies demonstrate the effectiveness of our method and the involved components.

Unlike high-level computer vision tasks, such as object detection (Girshick et al., 2014) and semantic segmentation (Long et al., 2015), it is hard to manually label the ground truth location of keypoints, which is semantically ill-defined (DeTone et al., 2018). A feasible way to learn the keypoints is using an available detector to extract potential keypoints as the pseudo-label and then training the keypoint detection model in a supervised manner. For example, TILDE (Verdie et al., 2015) exploits SIFT to detect the keypoints at multiple scales and selects positive and negative samples according to the repeatability. In comparison, SuperPoint (DeTone et al., 2018) firstly trains a detector on a synthetic dataset consisting of simple geometric shapes with no ambiguity in the keypoint locations, such as vertices of triangles. Then, the pre-trained detector is applied to the real image many times by sampling random homographies to generate pseudo-labels which are used to further adapt the detector to real data. To generate reliable pseudo keypoints, the detector is generally required to process each training image many times, such as 100 homographies in SuperPoint.
Another solution is learning the detector and descriptor directly from the training real images without an additional pseudo keypoints extraction process (Dusmanu et al., 2019;Luo et al., 2020;Tyszkiewicz et al., 2020;Revaud et al., 2019). For example, D2Net (Dusmanu et al., 2019) and ASLFeat (Luo et al., 2020) develop a joint optimization approach for detector and descriptor learning, which enables the locally distinctive pixels to get higher detection score, i.e., be potential keypoint. In comparison, DISK (Tyszkiewicz et al., 2020) defines the matching score as a reward and exploits policy gradient method to optimize the keypoint score. For this kind of solution, since there is no ground truth keypoint as the supervision for model learning, it is important to define a training objective which can implicitly guide the network to maximize the detection score of the potential keypoints. To achieve this, D2Net (Dusmanu et al., 2019) directly derives the keypoints from the deep feature maps that are considered as the detection response map in traditional approaches (Lowe, 2004). Following D2Net (Dusmanu et al., 2019), ASLFeat (Luo et al., 2020) proposes the peakiness measure at multiple scales, which benefits the accurate localization of keypoints. R2D2 (Revaud et al., 2019) proposes to jointly learn a reliability map by maximizing the local peakiness and a repeatability map by modeling the descriptor matching precision. These methods mainly study the characteristics of keypoints and devise a keypoint detection score formulation that can guide the network to learn the detector. In fact, traditional hand-crafted detectors have made great efforts to define the keypoint score, which enables us to explore whether we can combine the efficient traditional detectors with deep neural networks in this paper.
Specifically, we investigate insights from the traditional corner detectors, especially those based on either gradient (Harris et al., 1988;Shi et al., 1994;Moravec, 1977) or intensity (Trajković & Hedley, 1998). For instance, the seminal Moravec's corner detector (Moravec, 1977) defines a corner to be a point with low self-similarity, i.e., how similar a patch centered on the pixel is to nearby and overlapping patches, as shown in Fig. 1. The similarity is calculated as the sum of squared differences (SSD) between the corresponding pixels of two patches. Finally, the cornerness measure is defined as the smallest SSD between the patch and its neighbours in horizontal, vertical, and diagonal directions. Inspired by the classical corner detectors, we propose a new deep learning-based approach for joint detector and descriptor learning. In DCNNs, each point located in the learned deep feature maps corresponds to a patch in the original image. Therefore, we can consider the similarity between spatially nearby feature vectors as the similarity between corresponding patches in the original image, as shown in Fig. 1. Based on this connection, we introduce a similarity-based keypoint measure, which evaluates the similarity between each pixel and its neighbours on the CNN feature maps. Our approach can be viewed as a corner detector that utilizes deep neural networks to compute the cornerness measure and thus we term our approach as Deep Corner.
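To make the classical cornerness measure concrete, the following is a minimal sketch of a Moravec-style score on a grey-level image stored as nested lists. The function name, the 3 × 3 patch size, and the one-pixel shifts are illustrative assumptions, not settings taken from this paper.

```python
def moravec_cornerness(img, r, c, half=1):
    """Smallest SSD between the patch centred at (r, c) and its 8 shifted copies.

    img is a list of rows; (r, c) must lie at least half + 1 pixels from the border.
    """
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    best = None
    for dr, dc in shifts:
        ssd = 0.0
        for i in range(-half, half + 1):
            for j in range(-half, half + 1):
                a = img[r + i][c + j]            # pixel of the reference patch
                b = img[r + dr + i][c + dc + j]  # same pixel in the shifted patch
                ssd += (a - b) ** 2
        best = ssd if best is None else min(best, ssd)
    return best


# A 9 x 9 image containing a bright square in the bottom-right quadrant.
img = [[1.0 if (i >= 4 and j >= 4) else 0.0 for j in range(9)] for i in range(9)]
corner = moravec_cornerness(img, 4, 4)  # corner of the square
edge = moravec_cornerness(img, 6, 4)    # point on a straight edge
flat = moravec_cornerness(img, 2, 2)    # point in a flat region
```

Because the minimum over all shifts is taken, a point on a straight edge scores zero (the patch is unchanged when shifted along the edge), and only the corner receives a positive score.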
Here, we first show the superiority of our Deep Corner through a simple experiment. We train two models with the same structure on the GL3D dataset using our similarity-based measure (S.M.) and the CNN feature-based peakiness measure (F.M.) (Revaud et al., 2019; Luo et al., 2020), respectively. We report the %Rep and %MMA (higher is better) on HPatches (Balntas et al., 2017) at different training stages in Fig. 2. We find that our method performs better, even in the initialization state. Moreover, we also try to use our learned keypoint measure to guide the detector learning of F.M. using knowledge distillation (Hinton et al., 2015; Gou et al., 2020), which is referred to as F.M./S. With the supervision, F.M./S improves the performance at the beginning stage, but the overall performance does not change.
Fig. 2 Repeatability (%Rep) and Mean Matching Accuracy (%MMA) on HPatches (Balntas et al., 2017) at different training stages
The results indicate that our similarity-based measure, which is derived from the traditional corner detector, is more effective than the keypoint detector measure solely based on CNNs to find the potential keypoints.
Apart from the new similarity-based measure, we further improve the keypoint localization accuracy and the distinctiveness of descriptors by incorporating a multi-level structure and a feature self-transformation layer. Specifically, since keypoint localization is a pixel-level task, it is helpful to calculate the keypoint measure on feature maps with the original resolution. To achieve this, we design a multi-level architecture based on U-Net (Ronneberger et al., 2015), named MU-Net, which deploys the U-Net structure at multiple levels. The developed MU-Net is able to associate the high-level information with the local structure information and also preserve the local details by up-sampling feature maps at different scales to the original resolution. Moreover, in the typical CNN framework, all weights and biases are shared across all spatial locations, which might not be effective in learning invariant features robust to complex changes within image pairs. To make the network more flexible in modeling robust representations, we propose a feature self-transformation operation, which transfers the learned features into a new space by learning a scale factor and an offset factor adaptively according to the encoded content at each location. In addition, the feature maps usually contain specific information in each channel, and it is likely that the similarity between some channels is more related to the cornerness. Therefore, we further extend the similarity-based measure in Deep Corner to a multi-group version, where the feature maps are split into multiple groups and the keypoint measure is computed for each group. We conduct a series of experiments and ablation studies to analyze our method quantitatively and qualitatively. The experimental results on several benchmarks demonstrate the effectiveness of our method.
Both detectors and descriptors can be hand-crafted or learning-based. In recent years, with the advent of DCNNs, noticeable progress has been achieved in the learning-based solutions, especially for the descriptors. A key objective of deep descriptor models is learning shape-invariant features, which are insensitive to scale or view changes. One way to achieve this is exploiting data augmentation techniques, e.g., affine transformations including rotation and scaling, on the image patches (Luo et al., 2019; Tian et al., 2017; Luo et al., 2018; He et al., 2018; Tian et al., 2019). Additionally, there are some methods (Potje et al., 2021; Yi et al., 2016c, a) directly modeling the shape-aware parameters. For example, Yi et al. (2016c) and Yi et al. (2016a) attempt to learn a canonical orientation for each feature point by minimizing the feature distance between positive patches (Yi et al., 2016c) or through the Spatial Transformer (Jaderberg et al., 2015) operation (Yi et al., 2016a). To improve existing descriptors, both hand-crafted and learning-based, Wang et al. (2022d) develop a lightweight neural network with two stages, i.e., self-boosting and cross-boosting, which achieve the descriptor enhancement by exploiting geometric properties of the keypoints and mining the possible correlation between different keypoints, respectively. To address the limitation that more invariance might make descriptors less informative, Pautrat et al. (2020) and Li et al. (2022a) study invariance selection for adapting the local feature descriptors to adverse changes in images. The former achieves this by developing a meta descriptor approach to automatically select the best invariance from several learned local descriptors with multiple variance properties, while the latter adopts a similar strategy but exploits a parallel self-attention module to get the meta descriptors. To alleviate the requirement for per-pixel correspondence-level supervision, Revaud et al. (2022) devise an unsupervised learning strategy for local descriptors by explicitly integrating two matching priors (i.e., local consistency and uniqueness of the matching) in the loss objective. Recent years have witnessed that the features extracted from the original image can be exploited to recover the image appearance (Weinzaepfel et al., 2011; Mai et al., 2018), which might cause privacy disclosure. To protect sensitive information, privacy-preserving local descriptors have also been studied recently (Dusmanu et al., 2021; Ng et al., 2022).
DCNN-based detector learning focuses on repeatability. For example, Verdie et al. (2015) propose to learn an efficient piece-wise linear regressor robust to drastic illumination changes as the keypoint detector, while Barroso-Laguna et al. (2019) study the hand-crafted and learned features together in a shallow multi-scale network and extract keypoints at different scales. To detect rotation-invariant keypoints under geometric variations, a self-supervised equivariant learning strategy based on group-equivariant convolutional neural networks, with a dense orientation alignment loss, has also been developed. To achieve the spatial distribution uniformity of keypoints and then improve the high-level matching tasks, Yan et al. (2022) devise an objective function integrating uniformity and repeatability. Due to the non-differentiable property, an alternate optimization algorithm is further developed to optimize the objective efficiently. In addition, a straightforward and interesting way to take advantage of the capability of deep learning for keypoint detection is applying the traditional corner detection strategy (Harris et al., 1988) directly on the deep features extracted from a pre-trained deep model. For example, D2D (Tian et al., 2020a) introduces two terms, named absolute saliency measure and relative saliency measure, to find keypoints from a pre-trained descriptor without any additional training. D2D only focuses on keypoint detection and does not learn the detector and descriptor jointly, so it might not fully exploit the capacity of deep neural networks. In comparison, we investigate the traditional detector strategy in a describe-and-detect framework by designing suitable measures and operations for both detector and descriptor learning.
Describe-and-detect (Luo et al., 2020; Barroso-Laguna et al., 2020; Dusmanu et al., 2019; Christiansen et al., 2019; Zhang et al., 2020; DeTone et al., 2018; Revaud et al., 2019; Liu et al., 2021; Bhowmik et al., 2020; Shen et al., 2019; Suwanwimolkul et al., 2021; Zhao et al., 2022a; Tyszkiewicz et al., 2020; Wang et al., 2022c; Santellani et al., 2022; Yang et al., 2022; Siqueira et al., 2022; Sun et al., 2022b) aims to extract the keypoints and corresponding descriptors in a single stage. Dusmanu et al. (2019) propose the first solution, i.e., D2Net. D2Net couples the feature detector with the feature descriptor tightly, where the detection map and the descriptors are from the same deep feature maps. They use VGG16 pre-trained on ImageNet (Krizhevsky et al., 2012) to initialize the backbone network. However, D2Net is prone to low accuracy of keypoint localization. To address this issue, Luo et al. (2020) propose a simple strategy that fuses keypoint detection maps from multiple levels. Additionally, they exploit the deformable convolution (Zhu et al., 2019) to extract geometric-invariant features. In comparison, Revaud et al. (2019) propose to estimate a reliability map as well as a repeatability map for learning repeatable and reliable matches. To address the problem of ambiguity in the ground truth, DeTone et al. (2018) propose to train the network with two branches, one for the detector and the other for the descriptor, on synthetic data with the pseudo-ground truth using self-training. ASLFeat achieves state-of-the-art scores on multiple tasks among these methods. To make the model perform better on downstream tasks, like matching, Tyszkiewicz et al. (2020) exploit a different learning strategy by optimizing the matching reward in a reinforcement learning framework. Instead of using a series of convolutional operations as previous methods did, Wang et al. (2021) exploit the Transformer structure (Vaswani et al., 2017) to capture the long-range dependencies and then improve the feature representation. Similarly, in Wang et al. (2022b), the transformer is also exploited to capture wider spatial context to construct robust local descriptors. To fully exploit both low-level and high-level features, Sun et al. (2022b) develop an adaptive multi-level feature fusion structure for descriptor learning. Considering that it might be challenging to jointly train the detector and descriptor in the describe-and-detect pipeline, Li et al. (2022b) propose to decouple the detection stage from the description step and first learn the description network, which is then frozen while the detection network is trained. In this paper, we follow the training scheme in D2Net and ASLFeat, while aiming at studying the learning of detectors and descriptors by taking advantage of the similarity between neighbouring points.

Deep Corner
In this section, we introduce our approach in detail, including the similarity-based measure for keypoint detection and the feature self-transformation for descriptor learning. Before presenting our method, we first briefly review two previous works, i.e., D2Net (Dusmanu et al., 2019) and ASLFeat (Luo et al., 2020), which are closely related to our work.

Revisiting D2Net and ASLFeat
Let $I \in \mathbb{R}^{H \times W}$ and $X = F(I) \in \mathbb{R}^{C \times H' \times W'}$ denote the input image and the deep representation acquired from the network $F$, where $H$ ($H'$) and $W$ ($W'$) represent the height and width of $I$ ($X$), respectively, and $C$ represents the number of channels. Based on the representation $X$, D2Net (Dusmanu et al., 2019) gives the following definition of a keypoint: According to the definition, D2Net and ASLFeat design different formulations to calculate the keypoint measure. In addition, in contrast to D2Net, which only considers the last feature maps for the computation of the keypoint measure map, ASLFeat exploits the feature maps at multiple scales, and fuses the measures via the summation operation.
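As an illustration of a keypoint definition on deep feature maps, the sketch below follows the hard detection rule described in the D2-Net paper: a location is a detection if, in its most responsive channel, its response is a local maximum. The function name, the 3 × 3 neighbourhood, and the strict inequality used for ties are assumptions made for this example.

```python
def d2net_hard_detections(X):
    """X[c][i][j]: feature maps (C x H x W) as nested lists; returns detected (i, j)."""
    C, H, W = len(X), len(X[0]), len(X[0][0])
    detections = []
    for i in range(H):
        for j in range(W):
            # Channel with the strongest response at this location.
            k = max(range(C), key=lambda c: X[c][i][j])
            # 3 x 3 neighbourhood in that channel, clipped at the border.
            neighbours = [
                X[k][m][n]
                for m in range(max(0, i - 1), min(H, i + 2))
                for n in range(max(0, j - 1), min(W, j + 2))
                if (m, n) != (i, j)
            ]
            # Assumed tie handling: the response must be strictly larger.
            if all(X[k][i][j] > v for v in neighbours):
                detections.append((i, j))
    return detections


# Two 3 x 3 channels: channel 0 has a single peak, channel 1 is constant.
X = [
    [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
found = d2net_hard_detections(X)
```

This rule is not differentiable, which is why learned methods replace it with soft scores during training.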
To train the network, a set of image pairs and the correspondences between them are required. Considering an image pair $(I, I')$ and the correspondence set $O$ between them, we denote the keypoint measures by $s_o$ and $s'_o$, and the descriptors (feature vectors) by $x_o$ and $x'_o$ for each correspondence $o \in O$. Then, the loss function in D2Net and ASLFeat is written as: where $M(x, x')$ denotes the ranking loss for descriptor learning. In ASLFeat, the hardest-contrastive form (Choy et al., 2019) is exploited to implement $M(x, x')$ as follows: where $d(x, x')$ denotes the Euclidean distance, and $m_p$ and $m_n$ represent the predefined margins. In the following, we introduce our method with the notations defined above.
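The training objective can be sketched as follows. The exact equations are not reproduced here, so the code assumes the form given in the ASLFeat paper: each correspondence contributes a hardest-contrastive ranking loss, weighted by the product of its two detection scores normalised over all correspondences, and the result is averaged over the correspondence set. Function names are hypothetical, and the default margins are the values $m_p = 0.2$ and $m_n = 1.0$ quoted in the training details.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def hardest_contrastive(xs, xps, o, m_p=0.2, m_n=1.0):
    """Ranking loss M for correspondence o between descriptor lists xs and xps."""
    pos = euclidean(xs[o], xps[o])
    # Hardest negative: the closest non-matching descriptor in either direction.
    neg = min(
        min(euclidean(xs[o], xps[k]) for k in range(len(xps)) if k != o),
        min(euclidean(xs[k], xps[o]) for k in range(len(xs)) if k != o),
    )
    return max(0.0, pos - m_p) + max(0.0, m_n - neg)


def weighted_loss(xs, xps, s, sp, m_p=0.2, m_n=1.0):
    """Assumed form: mean of ranking losses weighted by normalised score products."""
    total = sum(a * b for a, b in zip(s, sp))
    n = len(xs)
    return sum(
        (s[o] * sp[o] / total) * hardest_contrastive(xs, xps, o, m_p, m_n)
        for o in range(n)
    ) / n


# Two correspondences whose descriptors are identical across the image pair.
far = weighted_loss([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [3.0, 4.0]],
                    [1.0, 1.0], [1.0, 1.0])
near = weighted_loss([[0.0, 0.0], [0.5, 0.0]], [[0.0, 0.0], [0.5, 0.0]],
                     [1.0, 1.0], [1.0, 1.0])
```

In the first case the negatives are farther than $m_n$, so the loss vanishes. In the second case the negatives are only 0.5 apart and are penalised. Because the scores weight the ranking loss, the network is encouraged to assign high detection scores to points whose descriptors match well.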

Keypoint Detection
Similarity-based keypoint measure. Traditional works on corner detection select the points with distinctive properties (Harris et al., 1988; Shi et al., 1994). Our Deep Corner, inspired by Moravec's corner detector (Moravec, 1977), defines a corner to be a point with low self-similarity. The self-similarity is defined as the similarity between the patch centered on the pixel and the nearby overlapping patches. Our method is based on the same principle but differs from Moravec's corner detector in two aspects. First, we consider the self-similarity property on the learned deep feature maps instead of the raw pixels. As a location in the deep feature maps corresponds to a patch in the original image, the similarity between deep feature vectors corresponds to measuring the similarity of the corresponding patches using CNN features, which contain richer structural information than the raw pixels. Second, based on self-similarity, we propose a new distinctiveness measure function that is differentiable and thus enables end-to-end learning of the network parameters. Specifically, we expect the network to be able to detect a point which is distinctive in the local patches and whose distinctiveness is, at the same time, a local maximum. The former condition guarantees the local uniqueness of the keypoints. However, for a region with complicated textures, it is easy to find a point that differs from its neighbours. Therefore, the network might detect keypoints with low repeatability. To alleviate this issue, we introduce the second requirement, which further constrains the local structure of the keypoints. As a result, we provide the following definition for the detector: Here, $S(x_1, x_2) = \frac{1}{\exp(\|x_1 - x_2\|_2)}$ represents the similarity between two feature vectors $x_1$ and $x_2$, $D$ denotes the local distinctiveness map, and $N(i, j)$ represents the neighbours of point $(i, j)$.
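The similarity function and a local distinctiveness map can be sketched as below. The similarity follows the expression for $S$ given above. The distinctiveness is taken here as one minus the average similarity to the neighbours, which is only one possible reading of the description in this section and should be treated as an assumption. The feature map layout (H × W × C nested lists), the function names, and the 8-neighbourhood are also illustrative.

```python
import math


def similarity(x1, x2):
    """S(x1, x2) = 1 / exp(||x1 - x2||_2); equals 1 for identical vectors."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    return 1.0 / math.exp(dist)


def distinctiveness(X, i, j, offsets):
    """Assumed form: D_ij = 1 - mean similarity to the neighbours in N(i, j)."""
    H, W = len(X), len(X[0])
    sims = [
        similarity(X[i][j], X[i + di][j + dj])
        for di, dj in offsets
        if 0 <= i + di < H and 0 <= j + dj < W  # skip neighbours outside the map
    ]
    return 1.0 - sum(sims) / len(sims)


offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

# A 3 x 3 map of 2-D features where only the centre differs from its surroundings.
X = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
X[1][1] = [10.0, 0.0]
d_centre = distinctiveness(X, 1, 1, offsets)
d_corner = distinctiveness(X, 0, 0, offsets)

# A constant map has no distinctive point.
flat = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
d_flat = distinctiveness(flat, 1, 1, offsets)
```

Every operation here is differentiable with respect to the features, which is the property that allows such a measure to be trained end to end.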
As $S(X_{ij}, X_{ij})$ is always equal to 1, we define the distinctiveness as the average distance between the pixel $(i, j)$ and its neighbours instead of $S(X_{ij}, X_{ij})$ itself. In detail, we calculate the distinctiveness $D_{ij}$ measuring the difference between the point $(i, j)$ and its neighbours as follows: Based on the local distinctiveness map $D$, we calculate the measures for the two conditions in Eq. 4, which can be optimized in a deep neural network. Specifically, for each location $(i, j)$, to achieve the first condition in Eq. 4, we define a measure: For the second condition, we calculate a measure reflecting the local maximum property of the distinctiveness of the keypoint, written as: where $\sigma$ is a non-linear activation function to enforce all measures to be positive. In our experiments, we select the SoftPlus function as the activation. The final measure $s_{ij}$ of point $(i, j)$ is obtained by:
Multi-level detection. To improve the keypoint localization accuracy, ASLFeat (Luo et al., 2020) resorts to the feature maps at multiple levels, which have different resolutions/scales. Specifically, it first gets the measure maps at different scales. Then, it up-samples the maps with low resolution to the original resolution, and a summation operation is exploited to fuse these maps. However, the fine information is lost in the low-resolution maps, and as a result, it is difficult to directly compute the keypoint detection measure from those feature maps. More specifically, a location in a low-resolution map corresponds to a large region in the original image, and adjacent points are distant from each other in the original image. Therefore, the relationship (local maximum or dissimilarity) between adjacent points cannot measure the distinctiveness well.
To address this issue, we first up-sample the feature maps to the same spatial resolution as the feature maps at the first level (original resolution), and then calculate the keypoint measure on all up-sampled feature maps. As a result, the multi-level information can be exploited without the negative impact caused by the low resolution. Although the difference is subtle, the performance gain brought by our multi-level strategy is significant, as shown in the experimental results.
To achieve this, we propose a multi-level U-Net structure (MU-Net), in which low-resolution feature maps at each level are up-sampled progressively within a U-Net architecture (Lin et al., 2017), as shown in Fig. 3. In our experiments, we consider three levels, i.e., $X_1 \in \mathbb{R}^{C_1 \times H \times W}$, where $C_*$ denotes the number of channels. We up-sample $X_2$ and $X_3$ to the same spatial resolution as $X_1$ and change the number of feature channels by using the deconvolution operation, and denote the up-sampled features by $X_1^{\uparrow} \in \mathbb{R}^{C_1 \times H \times W}$ and $X_1^{\uparrow\uparrow} \in \mathbb{R}^{C_1 \times H \times W}$, respectively. Then, we calculate the keypoint measure map according to Eqs. 5-7 from $X_1$, $X_1^{\uparrow}$ and $X_1^{\uparrow\uparrow}$, and combine the measures together through the summation operation.
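The order of operations in the multi-level strategy (up-sample first, then measure, then sum) can be sketched as follows. This is only a schematic: nearest-neighbour up-sampling stands in for the learned deconvolution, the maps are single-channel, a factor of 2 between consecutive levels is assumed, and `measure` is a placeholder for any per-map keypoint measure.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of an H x W map (stand-in for deconvolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse_levels(x1, x2, x3, measure):
    """Up-sample to full resolution first, apply `measure` there, then sum the maps."""
    levels = [x1, upsample2x(x2), upsample2x(upsample2x(x3))]
    maps = [measure(x) for x in levels]
    H, W = len(x1), len(x1[0])
    return [[sum(m[i][j] for m in maps) for j in range(W)] for i in range(H)]


# Toy maps at three resolutions; the identity is used as a dummy measure.
x1 = [[0.0] * 4 for _ in range(4)]
x2 = [[1.0, 2.0], [3.0, 4.0]]
x3 = [[5.0]]
fused = fuse_levels(x1, x2, x3, measure=lambda x: x)
```

The contrast with the ASLFeat-style ordering is that `measure` always sees a full-resolution map, so neighbouring entries correspond to neighbouring pixels of the input image.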
Multi-group detection. The deep features tend to encode specific information/concepts (Yang et al., 2020) in different channels. Therefore, it is likely that the similarity between some channels is more related to the cornerness, while others are not very related, such as those encoding the brightness. In other words, it is sufficient to consider the similarity in those channels. Motivated by this, we split the feature maps $X$ into $G$ groups along the channel dimension, each group containing $C/G$ channels. Then we compute the keypoint measure for each group, which is represented as $s^g$, and take the maximum as the final measure. Therefore, Eq. 7 can be extended to: where $*^{g}_{ij}$ denotes the measure value of point $(i, j)$ in the $g$-th group. We apply Eq. 8 to $X_1$, $X_1^{\uparrow}$ and $X_1^{\uparrow\uparrow}$, respectively. We visualize the measure maps of different groups for a better understanding in the ablations.
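The grouping step can be illustrated with the short sketch below, which splits each feature vector into $G$ channel groups, evaluates a measure per group, and keeps the maximum. The helper names and the toy measure are hypothetical, and any per-map keypoint measure could be passed as `measure`.

```python
def split_groups(x, G):
    """Split a C-dimensional feature vector into G groups of C / G channels."""
    size = len(x) // G
    return [x[g * size:(g + 1) * size] for g in range(G)]


def multi_group_measure(X, i, j, G, measure):
    """Evaluate `measure` on each channel group and keep the maximum."""
    H, W = len(X), len(X[0])
    scores = []
    for g in range(G):
        # Feature map restricted to the channels of group g.
        Xg = [[split_groups(X[m][n], G)[g] for n in range(W)] for m in range(H)]
        scores.append(measure(Xg, i, j))
    return max(scores)


# Toy measure: total absolute activation at the queried location.
def toy_measure(Xg, i, j):
    return sum(abs(v) for v in Xg[i][j])


X = [[[1.0, 1.0, 5.0, 5.0, 0.0, 0.0, 2.0, 2.0]]]  # 1 x 1 map with 8 channels
best = multi_group_measure(X, 0, 0, G=4, measure=toy_measure)
```

Taking the maximum lets a single informative group decide the score, so channels that are unrelated to cornerness cannot dilute it.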

Descriptor Learning
Our work aims not only at finding keypoints, but also at extracting a descriptor for each keypoint. In the standard CNN framework, the learned weights and biases are shared across all spatial locations. Since the images to be matched usually contain contents under different conditions, it is challenging to model representations robust to the complex changes. To provide a remedy, we propose the feature self-transformation operation, which transfers the feature representation from the original space into a new space adaptively according to the encoded local information. In detail, a scale factor and an offset factor are first learned adaptively from the feature maps to be transformed. Since the scale factor and offset factor vary from location to location, the feature extractor has a larger capacity to learn more robust invariant features in a flexible way when we apply the learned scale and offset factors to the corresponding feature maps. For the learned features $X$, the self-transformation operation is defined as follows: where $F_\phi$ and $F_\psi$ denote two convolutional operations with ReLU activation, $\hat{X}$ represents the new feature maps, and $\cdot$ denotes the element-wise multiplication. Equation 9 represents that a scale factor $F_\phi(X)$ and an offset factor $F_\psi(X)$ are first learned adaptively from the feature itself, and are then used to transform the feature into a new space.
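A minimal sketch of the self-transformation is given below. It assumes that Eq. 9 has the form "scale times feature plus offset", with both factors predicted from the feature itself, which is how the operation is described in the text. The use of 1 × 1 convolutions, the weights, and the function names are illustrative assumptions.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def conv1x1(x, weight, bias):
    """1 x 1 convolution at a single location: a linear map on the channel vector."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weight, bias)]


def self_transform(X, w_phi, b_phi, w_psi, b_psi):
    """Assumed form of Eq. 9: X_hat = F_phi(X) * X + F_psi(X), element-wise."""
    out = []
    for row in X:
        out_row = []
        for x in row:
            scale = relu(conv1x1(x, w_phi, b_phi))   # location-specific scale
            offset = relu(conv1x1(x, w_psi, b_psi))  # location-specific offset
            out_row.append([s * a + o for s, a, o in zip(scale, x, offset)])
        out.append(out_row)
    return out


# A 1 x 2 feature map with two channels; weights chosen by hand for illustration.
X = [[[1.0, 2.0], [-1.0, 0.5]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
X_hat = self_transform(X, identity, [0.0, 0.0], zeros, [1.0, 1.0])
```

Although the convolution weights are shared, the two locations receive different scale factors because those factors are computed from the local content.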
To better transform the features, we exploit the self-transformation in multiple layers. Since we aim to improve the descriptor here, the self-transformation is adopted in the three layers before the learned descriptor, as shown in Fig. 3. Moreover, in our experiments, we find that the detector information is also beneficial for improving the distinctiveness of the descriptor, as the detector can also provide some details about the local information. In detail, as shown in Fig. 3, we exploit the feature self-transformation, which follows the convolutional layer, at the last level. We first encode $X_1$, $X_1^{\uparrow}$ and $X_1^{\uparrow\uparrow}$ into a new representation $X_{new}$ by exploiting the concatenation operation and two convolutional operations. Then, for the feature maps $X$ to be transferred, the feature self-transformation in Eq. 9 can be re-written as: where $[\cdot \| \cdot]$ denotes the concatenation operation. In our experiments, we will show the capability of the proposed self-transformation for improving the invariance of the descriptor.

Implementation
Architecture. The architecture is shown in Fig. 3. The backbone network is similar to that used in L2Net (Tian et al., 2017), ASLFeat (Luo et al., 2020), and R2D2 (Revaud et al., 2019). In detail, the backbone consists of three levels. There are 2, 2, and 5 convolutional layers at the first, second, and third levels, respectively. The feature maps $X_1$ and $X_2$ are the outputs of the second convolutional layer at the first and second levels, respectively. Since we want to exploit the up-sampled feature maps to guide the feature self-transformation in the last convolutional layers, we use the output of the second convolution at the third level as $X_3$ instead of the last one, as shown in Fig. 3. For the outputs of the second to the fourth convolutional layers at the last level, we apply the proposed feature self-transformation operation. $N(i, j)$ in Eq. 5 contains 24 neighbours sampled uniformly from a 9 × 9 region centered on pixel $(i, j)$ (excluding itself). $N(i, j)$ in Eq. 6 contains 8 neighbours sampled uniformly from a 7 × 7 region centered on pixel $(i, j)$ (excluding itself), as ASLFeat does. In addition, we set the number of groups to 4. We implement our method using PyTorch.
Training details. Similar to Luo et al. (2020) and Luo et al. (2018), we train our network on around 800k image pairs from GL3D  and (Radenović et al., 2016) containing the ground truth cameras and depths. We train the network from scratch with the batch size of 2 and use the SGD optimizer with the momentum of 0.9. We first train the main network without the feature transformation for 400k iterations with the initial learning rate of 0.1. Then we train the whole network initialized with the pre-trained weights for another 200k iterations with the initial learning rate of 0.01. The training loss is the same as that used in ASLFeat (Luo et al., 2020), i.e., Eqs. 2 and 3, where m p and m n are set to 0.2 and 1.0, respectively.
Inference. Following previous works (Dusmanu et al., 2019;Luo et al., 2020), we first exploit a non-maximum suppression sized 3 to filter the keypoints that are adjacent. Then, we use the local refinement (Lowe, 2004) to improve the position of detected key points. Lastly, we extract the descriptors at the refined locations using the bilinear interpolation operation. To address the scale changes, we apply the multi-scale detection (referred to as 'MS' in our experiments) by resizing the image into different resolutions and then detecting keypoints at each scale during testing, as done in previous works (Dusmanu et al., 2019;Luo et al., 2020;Revaud et al., 2019).
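Two of the inference steps, non-maximum suppression with a window of size 3 and bilinear descriptor sampling, can be sketched as follows. The local refinement step is omitted. The function names, the score threshold, and the strict-maximum rule are assumptions made for this example.

```python
def nms3(score, threshold=0.0):
    """Keep points that are strict maxima of their 3 x 3 window and above threshold."""
    H, W = len(score), len(score[0])
    kept = []
    for i in range(H):
        for j in range(W):
            window = [
                score[m][n]
                for m in range(max(0, i - 1), min(H, i + 2))
                for n in range(max(0, j - 1), min(W, j + 2))
                if (m, n) != (i, j)
            ]
            if score[i][j] > threshold and all(score[i][j] > v for v in window):
                kept.append((i, j))
    return kept


def bilinear(desc, y, x):
    """Bilinearly interpolate an H x W x C descriptor map at a sub-pixel location.

    Assumes H, W >= 2 and 0 <= y <= H - 1, 0 <= x <= W - 1.
    """
    H, W = len(desc), len(desc[0])
    y0, x0 = min(int(y), H - 2), min(int(x), W - 2)
    dy, dx = y - y0, x - x0
    return [
        (1 - dy) * (1 - dx) * a + (1 - dy) * dx * b + dy * (1 - dx) * c + dy * dx * d
        for a, b, c, d in zip(desc[y0][x0], desc[y0][x0 + 1],
                              desc[y0 + 1][x0], desc[y0 + 1][x0 + 1])
    ]


score = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
keypoints = nms3(score)

desc = [[[0.0], [2.0]], [[4.0], [6.0]]]  # 2 x 2 map of 1-D descriptors
centre = bilinear(desc, 0.5, 0.5)
```

The point with score 4 is suppressed because it is adjacent to the stronger response 5, while the isolated response 3 survives.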

Experiments
To demonstrate the effectiveness of our method, we provide quantitative comparisons against previous related methods on three tasks, including image matching, 3D reconstruction, and visual localization. We also conduct comprehensive ablation studies to analyze our method.
In the HPatches dataset, there are 116 available image sequences, and we select 108 sequences for evaluation, as done in D2-Net (Dusmanu et al., 2019) and ASLFeat (Luo et al., 2020). Each sequence consists of 6 images, and there exists only illumination change in 52 sequences and only viewpoint change in the other 56 sequences. Following previous works (Luo et al., 2020;Revaud et al., 2019;Dusmanu et al., 2019), we use three metrics to evaluate our method, including (1) Repeatability (%Rep): the ratio of the number of possible matches found in the two images to the minimum number of detected keypoints in the shared view; (2) Matching Score (%M.S.): the ratio of the number of correct matches found in the image pair and the minimum number of detected keypoints in the shared view; (3) Mean Matching Accuracy (%MMA): the ratio of the number of correct matches to the number of matches found through applying nearest-neighbor search on the descriptors. The 'possible match' in Repeatability indicates that the point distance is below a given threshold after the homography warping. The 'correct match' in Matching Score and Mean Matching Accuracy indicates that the match found through applying a nearest-neighbor search on the descriptors is a 'possible match'.
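The three metrics can be computed for one image pair as sketched below. The sketch assumes that the keypoints of the first image have already been warped into the second image with the ground-truth homography and restricted to the shared view. The count of possible matches is a simplified, symmetric one that does not enforce a one-to-one assignment, and the function name and default threshold are illustrative.

```python
import math


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def evaluate_pair(warped_kpts1, kpts2, matches, threshold=3.0):
    """Return (%Rep, %M.S., %MMA) for one image pair.

    warped_kpts1: keypoints of image 1 warped into image 2
    kpts2:        keypoints of image 2
    matches:      (index1, index2) pairs from nearest-neighbour descriptor search
    """
    n_min = min(len(warped_kpts1), len(kpts2))
    # Possible match: a keypoint with a keypoint of the other image within threshold.
    near1 = sum(1 for p in warped_kpts1 if any(dist(p, q) < threshold for q in kpts2))
    near2 = sum(1 for q in kpts2 if any(dist(p, q) < threshold for p in warped_kpts1))
    possible = min(near1, near2)
    # Correct match: a descriptor match that is also a possible match.
    correct = sum(1 for a, b in matches if dist(warped_kpts1[a], kpts2[b]) < threshold)
    rep = 100.0 * possible / n_min
    ms = 100.0 * correct / n_min
    mma = 100.0 * correct / len(matches)
    return rep, ms, mma


warped = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0), (50.0, 50.0)]
kpts2 = [(1.0, 0.0), (10.0, 12.0), (40.0, 40.0)]
matches = [(0, 0), (1, 1), (2, 2), (3, 2)]
rep, ms, mma = evaluate_pair(warped, kpts2, matches)
```

Here two of the three keypoints in the smaller set are repeatable, and two of the four descriptor matches are correct, so %MMA is lower than %Rep.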
FM-Bench dataset contains images from four different datasets, including TUM dataset (Sturm et al., 2012), KITTI dataset (Geiger et al., 2012), Tanks and Temples dataset (T&T) (Knapitsch et al., 2017), and Community Photo Collection (CPC) (Wilson & Snavely, 2014). For evaluation, we first estimate the fundamental matrix through keypoints and descriptors extraction, matching by the plain nearest-neighbor search, bad matches rejection (e.g., Lowe's ratio test (Lowe, 2004)), and geometric verification (e.g., RANSAC (Fischler & Bolles, 1981)) successively. To measure the estimation accuracy, we compute the Normalized symmetric geometry distance (SGD) (Zhang, 1998) error and classify the estimates with the error below a certain threshold as accurate, as done in FM-Bench. Following (Bian et al., 2019;Luo et al., 2020), we use the %Recall, which indicates the ratio of accurate estimates to all estimates, for the overall performance evaluation. In addition, %Inlier/%Inlier-m is also used to show the matching performance after/before RANSAC, while the correspondence number after/before RANSAC (%Corr/%Corr-m) is also reported for analysis on the results rather than performance comparison.
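The matching stage of this pipeline can be sketched as follows: a nearest-neighbour search over descriptors, Lowe's ratio test to reject ambiguous matches, and an optional mutual check. The function names are hypothetical, and the default ratio of 0.8 is the value quoted for the 3D reconstruction experiments rather than a setting stated for FM-Bench.

```python
import math


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_two(d, pool):
    """(distance, index) of the closest and second-closest descriptors in pool."""
    ranked = sorted((l2(d, p), k) for k, p in enumerate(pool))
    return ranked[0], ranked[1]


def match(desc1, desc2, ratio=0.8, mutual=True):
    """Nearest-neighbour matching with a ratio test and an optional mutual check.

    Both descriptor lists must contain at least two entries.
    """
    matches = []
    for i, d in enumerate(desc1):
        (best, j), (second, _) = nearest_two(d, desc2)
        if best >= ratio * second:
            continue  # ambiguous: best and second-best are too close
        if mutual:
            (_, back), _ = nearest_two(desc2[j], desc1)
            if back != i:
                continue  # desc2[j] prefers a different descriptor in image 1
        matches.append((i, j))
    return matches


desc1 = [[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]]
desc2 = [[0.0, 1.0], [10.0, 1.0], [5.0, 0.0], [5.0, 10.0]]
pairs = match(desc1, desc2)

# The mutual check removes a one-sided match.
a = [[0.0, 0.0], [1.0, 0.0]]
b = [[0.9, 0.0], [100.0, 0.0]]
with_mutual = match(a, b)
without_mutual = match(a, b, mutual=False)
```

The third descriptor of `desc1` is equally close to two candidates, so the ratio test discards it. The surviving matches would then be passed to a robust estimator such as RANSAC.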
Comparisons on HPatches. We report the results of previous approaches which follow the detect-then-describe or describe-and-detect pipeline. For the former pipeline, we consider (1) Hessian Affine keypoint detector (Mikolajczyk & Schmid, 2004) + RootSIFT descriptor (Arandjelović & …); (2) a learned shape estimator (HesAffNet (Mishkin et al., 2018)) and descriptor (HardNet++ (Mishchuk et al., 2017)), referred to as HAN+HN++; (3) ContextDesc (Luo et al., 2019) with SIFT detector (Lowe, 2004), referred to as SIFT+ContextDesc; (4) LF-Net (Ono et al., 2018), an end-to-end trainable network. For the latter pipeline, we consider D2-Net, … (Tyszkiewicz et al., 2020), ALIKE (Zhao et al., 2022a), and PoSFeat (Li et al., 2022b). Note that PoSFeat decouples the detector and descriptor by training them in two stages, while the others train them together. As shown in Fig. 4, our method without/with multi-scale detection (Ours/Ours MS) yields higher performance than most of the previous methods. Moreover, our methods gain the highest scores on the subset with viewpoint change, especially for matching thresholds below 5 pixels. DELF (Noh et al., 2017) outperforms all the other methods on the subset with illumination change for matching thresholds below 4 pixels. This is because DELF uses a fixed grid of keypoints without further position refinement. This design performs well when there is only illumination change, but it is not robust to viewpoint change, which is universal in real applications. By training the detector and descriptor in two stages, PoSFeat achieves the highest performance.
[Table caption] The highest score is given in bold. Our methods outperform most of the existing methods on TUM (Sturm et al., 2012) and CPC (Wilson & Snavely, 2014) while yielding close performance to ASLFeat on T&T (Knapitsch et al., 2017) and KITTI (Geiger et al., 2012). ↑ indicates higher is better.
Comparisons on FM-Bench. As shown in Table 1, our method outperforms most of the existing approaches, especially those that also follow the describe-and-detect pipeline. In comparison with ASLFeat, ours yields close or higher scores.
For evaluation, we first perform exhaustive image matching with both ratio test at 0.8 and mutual check for outlier rejection. Following the protocol defined by (Schonberger et al., 2017), we run SfM (Schonberger & Frahm, 2016) for sparse reconstruction and MVS (Schönberger et al., 2016) for dense reconstruction. For the former task, we report the number of registered images (referred to as #Reg. Images), the number of sparse points (#Sparse Poi.), the mean track length of the 3D points (Track Len.) and the mean reprojection error (Reproj. Err.). For the latter task, we report the number of dense points (#Dense Poi.). Both ASLFeat (Luo et al., 2020) and ours limit the maximum number of keypoints to 20k. We report the results obtained using single-scale detection and multi-scale detection. As shown in Table 2, our method performs better than most of the other methods. In comparison to ASLFeat (Luo et al., 2020), our model yields lower reprojection error, which demonstrates the effectiveness of our method for 3D reconstruction.

Table 3
Results on IMC benchmark (Jin et al., 2021). Bold values indicate the highest mAA score. LoFTR_v4* is the latest performance of LoFTR, while SDS*, which yields the highest score on the benchmark website, is a combination of SuperPoint, DISK, and SuperGlue. We provide these two methods for reference

Table note: Bold values indicate the best localization score among the compared methods. Note that R2D2 is trained on this dataset

Table note: Bold values indicate the best localization score among the compared methods. 'SS' indicates the combination of SuperPoint and SuperGlue, one for detecting and representing and one for matching; 'SP' indicates the combination of SuperPoint and PUMP, one for detecting and one for representing; LoFTR is an efficient end-to-end matching method

Evaluation on IMC.
The Image Matching Challenge 2020 benchmark (IMC) provides a dataset with thousands of phototourism images of 25 landmarks, taken from diverse viewpoints, with different cameras, and in different illumination and weather conditions. For evaluation, this benchmark provides two tasks, i.e., stereo and multiview reconstruction, where the reconstructed poses are compared to the ground truth. In the stereo task, we first extract local features across every pair of images and then use RANSAC to reconstruct the relative pose, while in the multiview task, we use COLMAP (Schonberger & Frahm, 2016) to reconstruct the poses from small subsets of 5, 10, and 25 images. According to the benchmark documentation, we consider two categories: a limited budget of 2048 keypoints and a limited budget of 8000 keypoints. We select the hyperparameters on the validation set of three scenes: "Reichstag", "Sacre Coeur", and "St. Peter's Square". Specifically, we set the ratio test threshold to 0.95 and use DEGENSAC (Chum et al., 2005) with an inlier threshold of 0.75 pixels for the stereo task. With the selected hyperparameters, we submit the features extracted from the nine test sets to the website and report the results in Table 3. We compare our method with several previous describe-and-detect methods, such as SuperPoint (DeTone et al., 2018), LF-Net (Ono et al., 2018), D2-Net (Dusmanu et al., 2019), R2D2 (Revaud et al., 2019), ASLFeat (Luo et al., 2020), ALIKE (Zhao et al., 2022a), DISK (Tyszkiewicz et al., 2020), and PoSFeat (Li et al., 2022b). We also provide the performance of two state-of-the-art methods: LoFTR (Sun et al., 2021), an end-to-end image matching method, and SDS, a combination of SuperPoint (DeTone et al., 2018), DISK (Tyszkiewicz et al., 2020), and SuperGlue (Sarlin et al., 2020). These results are taken from the benchmark website or from (Jin et al., 2021).
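The ratio test and mutual check used in the matching protocols above can be illustrated with the following sketch. It is written in plain Python for readability and is not the implementation used in the experiments: the function names are hypothetical, and applying the ratio test only in the A-to-B direction is an assumption.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_nearest(query, candidates):
    """Return (index, best, second_best) squared distances for `query`."""
    distances = sorted((squared_distance(query, c), j) for j, c in enumerate(candidates))
    best, index = distances[0]
    second = distances[1][0] if len(distances) > 1 else float("inf")
    return index, best, second


def match_descriptors(desc_a, desc_b, ratio):
    """Mutual nearest-neighbor matches from A to B that pass the ratio test."""
    if not desc_a or not desc_b:
        return []
    nearest_in_a = [two_nearest(d, desc_a)[0] for d in desc_b]
    matches = []
    for i, descriptor in enumerate(desc_a):
        j, best, second = two_nearest(descriptor, desc_b)
        # Squared form of the ratio test: d1 < ratio * d2.
        passes_ratio = best < (ratio ** 2) * second
        # Mutual check: i must also be the nearest neighbor of j.
        if passes_ratio and nearest_in_a[j] == i:
            matches.append((i, j))
    return matches


# Hypothetical descriptors: the two candidates are almost equally close.
query_set = [[0.0, 0.0]]
candidate_set = [[1.0, 0.0], [0.0, 1.1]]
print(match_descriptors(query_set, candidate_set, ratio=0.8))
print(match_descriptors(query_set, candidate_set, ratio=0.95))
```

In this toy case the ambiguous match is rejected at a ratio of 0.8 but kept at 0.95, which shows how the looser threshold selected for IMC retains more correspondences for the subsequent robust estimator.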
The most important metric is the mean Average Accuracy (mAA) up to a 10-degree error threshold. Our method does not perform better than recent methods such as DISK and PoSFeat, but it yields the highest mAA scores among R2D2, D2-Net, and ASLFeat.

Table note: We set the threshold for 'possible match' to 3, 4, 5, 6, 7, and 8 pixels, respectively. F.T. refers to feature self-transformation
To better compare our method with ASLFeat, we also provide the results under different ratio test thresholds and inlier thresholds on the IMC validation set. As shown in Fig. 5, in both categories (2048 and 8000 features), our method performs better than ASLFeat for almost every ratio test threshold and inlier threshold.
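For reference, the mAA metric discussed above can be sketched as the accuracy averaged over a set of angular thresholds. The sketch assumes that each image pair is summarized by a single pose error in degrees and that the thresholds are the integer degrees from 1 to 10; both are assumptions made for illustration, and the benchmark's own evaluation code remains the authoritative definition.

```python
def mean_average_accuracy(pose_errors_deg, max_threshold_deg=10):
    """Average, over thresholds 1..max_threshold_deg, the fraction of errors within each."""
    if not pose_errors_deg:
        return 0.0
    accuracies = []
    for threshold in range(1, max_threshold_deg + 1):
        within = sum(1 for error in pose_errors_deg if error <= threshold)
        accuracies.append(within / len(pose_errors_deg))
    return sum(accuracies) / len(accuracies)


# Hypothetical angular pose errors (degrees) for four image pairs.
example_pose_errors = [0.5, 3.2, 12.0, 7.9]
print(mean_average_accuracy(example_pose_errors))
```

Because the accuracy is averaged over all thresholds, a method is rewarded more for poses that are accurate at small angular errors than for poses that only just fall within 10 degrees.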

Visual Localization
Here, we evaluate our method on the visual localization task using the Aachen Day-Night dataset v1.0 and v1.1 (Sattler et al., 2012; Zhang et al., 2021), where the objective is to match images with extreme day-night changes. We first use each compared method to detect and describe the keypoints, and then use the code from Sattler et al. (2012) for image registration. We limit the maximum number of features of all methods to 5000 and 10000, respectively. By submitting the results to the benchmark, we obtain the percentages of successfully localized night-time images within three given error bounds, i.e., (0.25 m, 2°), (0.5 m, 5°), and (5 m, 10°). We make comparisons against R2D2 (Revaud et al., 2019), D2-Net (Dusmanu et al., 2019), and ASLFeat (Luo et al., 2020), whose code is publicly available. We repeat the experiments three times and report the average results. Due to the extreme illumination changes, it is challenging to match the night images to the day images, and none of the methods addresses this well. As shown in Table 4, our method yields scores comparable to those of these three previous methods, especially ASLFeat.
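The reported localization percentages can be understood through the following sketch, which counts the query images whose position and orientation errors both fall within each bound. The benchmark server computes these numbers itself, so the function name, the input format, and the inclusive comparisons here are assumptions for illustration only.

```python
# Error bounds from the text: (position in metres, orientation in degrees).
ERROR_BOUNDS = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]


def localized_percentages(pose_errors, bounds=ERROR_BOUNDS):
    """Percentage of queries within each (position, orientation) error bound."""
    if not pose_errors:
        return [0.0 for _ in bounds]
    percentages = []
    for max_position, max_orientation in bounds:
        localized = sum(
            1
            for position_error, orientation_error in pose_errors
            if position_error <= max_position and orientation_error <= max_orientation
        )
        percentages.append(100.0 * localized / len(pose_errors))
    return percentages


# Hypothetical (position error in m, orientation error in degrees) per query.
example_queries = [(0.1, 1.0), (0.4, 3.0), (2.0, 8.0), (9.0, 20.0)]
print(localized_percentages(example_queries))
```

A query counts as localized under a bound only if both its position error and its orientation error satisfy it, so the three percentages are non-decreasing from the tightest bound to the loosest.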

Ablation Studies
Here, we analyze the main components of our method, including the similarity-based measure, the multi-level U-Net structure, the feature self-transformation, and the multi-group keypoint measure, on the HPatches dataset (Balntas et al., 2017). We report %Rep, %M.S., and %MMA in Table 6. We select up to 5000 keypoints with a keypoint measure over 0.5 and set the threshold for 'possible match' to three pixels.

Similarity-based Measure. First, we train the baseline network (no feature self-transformation and no MU-Net) using the feature-based measure (abbr. F.M.) in ASLFeat (Luo et al., 2020) and the proposed similarity-based measure (abbr. S.M.), respectively. Specifically, as done in (Luo et al., 2020), we compute the detection map on three feature maps coming from three levels. As shown in Table 6, S.M. yields higher scores than F.M. w.r.t. all three metrics, which shows the superiority of the proposed similarity-based keypoint measure. We also evaluate the two requirements of the detector, i.e., local distinctiveness (α) and local maximum of distinctiveness (β). The comparisons between S.M. (α), S.M. (β), S.M., and F.M. show that either requirement alone (α or β) performs better than the feature-based peakiness measure, and that modelling them jointly brings further improvements.
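The keypoint selection rule stated in the ablation setup (at most 5000 keypoints with a measure over 0.5) can be sketched as a threshold followed by a top-k selection. The sketch assumes that the score map is a plain 2D list of keypoint measures and that any local-maximum filtering has already been applied; the function name and the example scores are hypothetical.

```python
def select_keypoints(score_map, max_keypoints=5000, min_score=0.5):
    """Return up to `max_keypoints` (row, col) positions with score above `min_score`."""
    candidates = [
        (score, (row_index, col_index))
        for row_index, row in enumerate(score_map)
        for col_index, score in enumerate(row)
        if score > min_score
    ]
    # Keep the highest-scoring candidates first.
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [position for _, position in candidates[:max_keypoints]]


# Hypothetical 2x3 score map of keypoint measures.
example_scores = [[0.1, 0.9, 0.4], [0.7, 0.5, 0.95]]
print(select_keypoints(example_scores, max_keypoints=2))
print(select_keypoints(example_scores))
```

Thresholding before the top-k step means that an image with few confident responses yields fewer keypoints than the budget, instead of being padded with low-scoring locations.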

Fig. 8
Detection results on simple scenes. The first is from skimage; the last three come from Shui and Zhang (2013). We limit the maximum number of keypoints to 100

Fig. 9
Detection results on indoor scenes. We limit the maximum number of keypoints to 2000

Multi-level Detection. We study different multi-level detection strategies, including U-Net (only up-sampling feature maps at the last layer), the multi-level detection in ASLFeat, and our MU-Net. First, we provide the performance of single-level detection by calculating the measures at the first level (i.e., the original resolution) and the third level, respectively. As shown in Table 6, both F.M. and S.M. perform better at the original scale (marked by 1st) than at the third level (3rd), and even better than the multi-level detection proposed in ASLFeat, which also supports the analysis. Therefore, to exploit the multi-level information and avoid losing detailed information, we introduce a multi-level U-Net architecture to better achieve multi-level detection. The performance improvements brought by MU-Net on both F.M. and S.M. show the effectiveness of the proposed multi-level detection strategy.

Feature self-transformation. Based on the network with the multi-level U-Net, i.e., S.M. (MU-Net), we add the proposed feature self-transformation operation (referred to as F.T.), as shown in Fig. 3. The comparison between F.T. and S.M. (MU-Net) demonstrates that the feature transformation is able to improve the discriminative power of the descriptors. In our experiments, we deploy the feature transformation in the last three convolution layers. We also try deploying the transformation in only the last two layers (marked by L2) and only the last layer (L1), but observe performance drops. Nevertheless, all three models, i.e., F.T., F.T. (L2), and F.T. (L1), perform better than S.M. (MU-Net), which demonstrates the effectiveness of the proposed feature self-transformation.
Considering that the improvements brought by the transformation might result from the increase in convolutional parameters, we replace the transformation operation with (1) two successive convolutions and (2) a residual block with the same number of parameters as the transformation. We do not find any meaningful improvements in comparison with S.M. (MU-Net). For example, with the first replacement, we only observe a 0.02% improvement (76.43% to 76.45%) in %MMA, while our F.T. improves S.M. (MU-Net) by 0.9%. Furthermore, as the HPatches (Balntas et al., 2017) dataset consists of two subsets, one containing illumination changes and one containing viewpoint changes, we report the improvements brought by the proposed feature self-transformation for the two scenarios separately. Table 7 shows that the proposed feature transformation improves %MMA for both viewpoint change and illumination change, in comparison with the baseline without feature self-transformation. Since most of the illumination changes in the HPatches dataset are not severe, previous methods yield higher performance on the illumination subset than on the viewpoint subset, as shown in Fig. 4. Compared with these methods, our model brings smaller improvements on the illumination subset than on the viewpoint subset but still performs better than previous methods on both subsets.
We also provide some visualization examples in Fig. 6. Both methods do not perform well when there is a remarkable illumination change. In addition, Table 7 shows that lower %Rep is obtained on the illumination subset, since the illumination change might degrade the image quality, which is a challenging scenario to be addressed in the future.

Multi-group Measure. We set the number of groups (G) in the S.M. (MU-Net) with feature self-transformation to 1 (i.e., F.T.), 2, and 4, respectively. As shown in Table 6, increasing the number of groups brings a slight improvement. We also provide a visualization example to illustrate the learned score maps at different levels and groups. As shown in Fig. 7, where 'Level 1', 'Level 2', and 'Level 3' correspond to X1, X1↑, and X1↑↑ in Fig. 3, respectively, the distinctiveness map varies with the level and the group. For example, at the first level, the similarity in the first group is more important, while the similarity in the last group provides more information for the detector at the third level.

Fig. 10
Detection results on outdoor scenes. We limit the maximum number of keypoints to 5000. We highlight the main differences with a red rectangle. Zoom in for best view (Color figure online)
According to the analysis above, we can conclude that (1) the proposed keypoint measure and multi-level detection boost %Rep and %MMA remarkably; (2) the feature self-transformation mainly improves the discriminative capability of the descriptors (%MMA); (3) the multi-group measure further refines %Rep and %MMA while increasing the number of learned parameters only slightly. Lastly, we also make a comparison against ASLFeat (Luo et al., 2020). As shown in Table 6, our model outperforms ASLFeat w.r.t. all three metrics, especially %Rep and %MMA.
In addition, relying on traditional keypoint detection strategies, a previous method, D2D (Tian et al., 2020a), also defines a keypoint measure containing two terms, one for absolute saliency and the other for relative saliency. However, it applies the defined measure directly to the deep feature maps extracted from a pre-trained descriptor model without extra training. The results reported in that paper show that the performance depends greatly on the pre-trained descriptor model. In contrast, we learn the detector and descriptor jointly in an end-to-end manner. We also replace our keypoint measure with the measure in D2D and report the results in Table 8. The D2D measure, which relies on a pre-trained descriptor, yields lower scores than ours.

Visualization Examples
In this part, we first visualize the keypoints detected by our method, DISK (Tyszkiewicz et al., 2020), ASLFeat (Luo et al., 2020), and D2-Net (Dusmanu et al., 2019), respectively. In Fig. 8, we show four samples with simple contents. Compared with the other three methods, ours finds the corner points better. In Figs. 9 and 10, we provide several examples from (Bian et al., 2019) with more complex contents. We can observe that D2-Net generates worse detection results than the others. The regions marked by the red rectangles in Fig. 10 illustrate that our method yields more accurate corners than ASLFeat, especially for the windows in the first and third examples. Surprisingly, DISK is able to detect the points that are very useful for image matching while ignoring the useless ones. For example, in the first and fourth samples, DISK does not detect keypoints in the regions of pedestrians, which can be considered needless information for matching the building, while the other three methods do. We attribute this advantage to its efficient learning strategy.
We then provide several matching examples with view changes in Fig. 11 and several examples with both view and illumination changes in Fig. 12. In each example, we only … (Sattler et al., 2012). Our method is capable of handling the view changes effectively. However, when there are severe illumination changes that degrade the image quality, it is quite challenging for our method to find enough correct matches.

Conclusion and Discussion
This paper aims at learning the local feature detector and descriptor jointly, following the describe-and-detect pipeline. To achieve that, we propose a new method called Deep Corner, which is inspired by traditional corner detection methods. Specifically, we first propose the similarity-based measure for keypoint detection, which is able to select repeatable keypoints effectively and is thus beneficial for the learning of the descriptor. Additionally, to improve the keypoint localization accuracy, we further design a MU-Net structure for multi-level detection and extend the proposed measure into a multi-group version. Finally, we propose a feature self-transformation operation to improve the invariance of the descriptors. Experimental comparisons with previous related methods and ablation studies demonstrate the effectiveness of our method.
Limitation: According to the results in Fig. 4 and Table 4, the performance advantage of our method over previous works is smaller for illumination change than for viewpoint change. In fact, severe illumination change is hard to address, as analyzed in the previous sections. For example, the night images have low quality compared with the day images, and it is therefore hard to detect keypoints and extract distinguishable representations. In future work, this issue might be addressed by exploiting image translation techniques and/or improving the normalization operation.

Data Availability
The datasets generated during and/or analysed during the current study are available publicly:
1. GL3D: https://github.com/lzx551402/GL3D
2. HPatches (Balntas et al., 2017): http://icvl.ee.ic.ac.uk/vbalnt/hpatches/
3. FM-Bench (Bian et al., 2019): https://github.com/JiawangBian/FM-Bench
4. ETH Benchmark (Schonberger et al., 2017): http://landmark.cs.cornell.edu/projects/1dsfm/images.Tower_of_London.tar; http://landmark.cs.cornell.edu/projects/1dsfm/images.Gendarmenmarkt.tar; http://landmark.cs.cornell.edu/projects/1dsfm/images.Madrid_Metropolis.tar
5. IMC Benchmark (Jin et al., 2021): https://www.cs.ubc.ca/research/image-matching-challenge/2020/data/
6. Aachen Day-Night (Sattler et al., 2012; Zhang et al., 2021): https://www.visuallocalization.net/datasets/

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.