Learning Accurate Template Matching with Differentiable Coarse-to-Fine Correspondence Refinement

Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle the challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network using references and aligned images to obtain sub-pixel level correspondences which are used to give the final geometric transformation. Extensive evaluation shows that our method is significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.


Introduction
Manuscript received: 2022-01-01; accepted: 2022-01-01
Template matching aims to find given templates in captured (source) images, and is a fundamental technique for many computer vision tasks, including object detection, visual localization, and pose estimation. Numerous approaches have been proposed to overcome the difficulties and challenges in template matching, and the problem may seem to be solved. Nevertheless, it remains a critical step in automatic processing on industrial lines, and in real scenarios various challenges persist, including the domain gap, size variance, and pose differences between the template and the source image. These challenges motivate our approach: accurate template matching based on differentiable correspondence refinement.
Classic methods of template matching [1-4] generally calculate a similarity score between the template and a candidate image patch. Linemod-2D [1] utilizes gradient spreading and gradient orientation similarity measures, achieving real-time detection with high accuracy and robustness for untextured objects, and is widely used in industry. However, its performance degrades significantly in the presence of cluttered backgrounds, image blurring, or non-rigid deformation between template and source; these are all common in real-world applications.
Deep learning has shown great potential to overcome these difficulties. Recent works [10-13] have made remarkable progress in deep feature matching; e.g., LoFTR [13] uses transformers for this task and omits the feature detection step. However, there are limitations when applying LoFTR directly to template matching. Firstly, it tends to fail on cross-modal images, where mask images and grayscale images lie in very different feature spaces. Secondly, the structural consistency of templates and images is not exploited, yet it is critical for accurate matching. More importantly, LoFTR cannot provide sufficiently accurate and reliable correspondences for untextured regions, or when large deformations exist, as in the cross-modal template matching problem.
Motivated by these challenges, we propose a differentiable structure-aware template matching pipeline. To address the modality difference between the template and the source image, we use a translation module to convert both of them to edge maps. We believe structural information is particularly important for fast and robust template matching: the template mask has a specific structure (shape), and correct correspondences between the template and the image should satisfy a specific transformation relationship. Therefore, we fully exploit template contour information and consider the compatibility of angles and distances between correspondences. Specifically, we apply three strategies in our model to better use the structural information of templates and images. Firstly, to focus the network on valid areas, we only sample contour regions of the template as input. Secondly, a transformer [14] with relative positional encoding [15] is used to explicitly capture relative distance information. Thirdly, a method based on distance-and-angle consistency performs soft outlier rejection.
In pursuit of high-quality template matches, the transformation between the template and the source image is estimated in a coarse-to-fine style. In the coarse-level stage, we use transformers [14] to encode local features extracted by the convolutional backbone, and then establish feature correspondences using a differentiable matching layer. By assigning confidences to these coarse-level matches based on feature similarity and spatial consistency, we obtain a coarse estimate of the geometric transformation, a homography. This coarse matching overcomes differences in scale and large deformations between the source and template image, which is critical for accurate matching at the fine level. We apply the coarse spatial transform [16] to coarsely align the source image, which then provides an updated source image for the fine level. A refinement module is used at the fine level to obtain global semantic features and to aggregate features at different scales. We then adopt a correlation-based approach to determine accurate dense matches at the sub-pixel level. These final correspondences are more accurate, and no outlier rejection is needed; all correspondences are used to calculate the final homography. Compared to other recent matching methods [10-13], our correspondences have many fewer outliers, allowing our method to provide robust and accurate template matching without relying on RANSAC [17].
We use a linear transformer [18] in our pipeline to reduce computational complexity. Farthest point sampling (FPS) is applied to the template image to reduce the input data while retaining its structure. To address the shortage of training data, GauGAN [19] is adopted to generate synthetic images of industrial parts for network training.
We have evaluated the proposed method on three datasets, including two newly-collected industrial datasets and a dataset based on COCO [20].Our approach provides significantly improved homography estimates compared to the best baseline, as we show later.
Our main contributions can be summarized as:
• An accurate template matching method, robust in challenging scenarios including cross-modality images, cluttered backgrounds, and untextured objects.

Related Work

Template Matching
Traditional methods of template matching mostly rely on comparing similarities and distances between the template and a candidate image patch, using approaches such as the sum of squared differences (SSD), normalized cross-correlation (NCC), sum of absolute differences (SAD), and gradient-based measures. Linemod-2D and the generalized Hough transform (GHT) [2] are widely applied in industry. Such approaches degrade significantly in the presence of cluttered backgrounds, image blurring, or large deformations. Deep learning-based template matching algorithms [5-9] can handle more complex deformations between the template and source image. They usually adopt trainable layers to mimic the functionality of template matching: feature encoding layers extract features from both inputs, and these deep feature encoders dramatically improve template matching results. However, these methods still rely on rich textures in the input images, are prone to fail with cross-modal input, and are typically unable to provide an accurate pose for the target object.
Motivated by these challenges, our method predicts a homography transformation, and uses an edge-aware module to eliminate the domain gap between the mask template and the grayscale image for robust matching.

Homography Estimation
Classical homography estimation methods usually comprise three steps: keypoint detection (using e.g. SIFT [21], SURF [22], or ORB [23]), feature matching (feature correlation), and robust homography estimation (using e.g. RANSAC [17] or MAGSAC [24]). However, RANSAC-like approaches are non-differentiable. Furthermore, differentiable RANSAC algorithms [25, 26] hinder generalization to other datasets. Other methods, such as the seminal Lucas-Kanade algorithm [27], can directly estimate the homography matrix without detecting features. The first deep learning-based homography estimation model was proposed in [28]. Its network regresses the four corner displacement vectors of the source image in a supervised manner and yields the homography using a direct linear transform (DLT) [29]. Many unsupervised approaches [30-32] have been proposed to minimize the pixel-wise photometric error or feature difference between the template and source image.
These methods are likely to fail under large viewpoint changes, when textures are lacking, and for differing input modalities. Our work uses the template's structural (shape) properties and samples valid region features in the template to learn the correlation with the source image. An edge-aware module is used to translate the source image and template mask, bypassing the effect of modality differences between the two inputs.

Feature Matching
Before the era of deep learning, hand-crafted local features such as SIFT, SURF, and ORB were widely adopted. Deep learning-based methods [33-35] significantly improve feature representation, especially under significant viewpoint and illumination changes. SuperPoint [33], D2-Net [34], and ASLFeat [36] propose joint learning of feature descriptors and detectors; most computations of the two tasks are shared for fast inference within a unified framework. A significant improvement in feature matching was achieved by SuperGlue [12], which accepts two sets of keypoints with their descriptors and updates their representations with an attentional graph neural network (GNN). Drawing inspiration from GNNs, further methods [37-40] improve the accuracy and efficiency of graph-based feature matching. Recently, several works [11, 13, 41] have adopted transformers to model the relationships between features, providing impressive results. In this work, we build on the success of transformers and learn accurate template matching with coarse-to-fine correspondence refinement.

Vision Transformers
Transformers [14] were initially proposed for natural language processing (NLP). Vision transformers [42] have attracted attention due to their simplicity and computational efficiency for image sequence modeling, and many variants [18, 43-45] have been proposed for more efficient message passing. In our work, we utilize self and cross attention to establish larger receptive fields and capture structural information from the inputs. In particular, linear transformers [18] with relative positional encoding are adopted to ensure low computational cost and more efficient message passing.

Overview
In industrial template matching, the template is usually represented as a binary mask indicating only the shape of the source object, while the source image is often grayscale. Thus, before feature extraction, we first use an edge-aware translation module to eliminate the domain difference between these two images: see Sec. 4.2.1. We propose a differentiable feature extraction and aggregation network with transformers in Secs. 4.2.2-4.2.3. The whole matching pipeline operates in a coarse-to-fine style. At the coarse level, to estimate the homography from matched features, we combine spatial compatibility and feature similarity for soft outlier filtering (see Sec. 4.3); this is RANSAC-free and differentiable. A coarse homography is obtained from the coarse correspondences, and we then apply the spatial transform [16] to the source image to provide a coarsely-aligned image. At the fine level, we combine global semantics and local features to achieve sub-pixel dense matching and obtain an accurate homography estimate, as explained in Sec. 4.4. The final correspondences between the template mask and source image are precise, ensuring a plausible template matching result. Inspired by LoFTR, we adopt a coarse-to-fine matching pipeline, as shown in Fig. 2. Note that unlike LoFTR, our approach takes full advantage of the geometric properties of the template and the spatial consistency between the template and the object. In addition, our coarse-to-fine matching process is fully differentiable via a spatial transform connection, while LoFTR's coarse-to-fine strategy only enhances correspondence accuracy and is not fully differentiable.

Task
Given a binary template image T and a source search image I, our method aims to estimate a homography transformation between T and I to provide the precise position and pose of the object in the image I. For applications whose scenes contain multiple objects and multiple candidate templates, the coarse stage of our method may be performed first to estimate an initial homography for selecting the correct template for each object. We then use the refinement stage to obtain the precise position and pose.
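To make the task concrete: once a 3 × 3 homography H is known, every template point can be projected into the source image, which directly yields the object's position. A minimal numpy sketch (the helper name warp_points is ours, not from the paper):

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of template points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    warped = pts_h @ H.T
    return warped[:, :2] / warped[:, 2:3]             # back to Cartesian

# A pure translation homography moves every template point by (tx, ty) = (5, 2).
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 1.0]])
corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
warped_corners = warp_points(H, corners)
```

Warping the template's boundary points this way gives the object outline in the source image.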

Edge translation
Unlike some other template matching and homography estimation tasks, the case considered here has a domain difference between the template T and the source image I. The former is a binary mask, and the latter is a grayscale image; their features are too different for common image matching approaches. Grayscale images may furthermore exhibit strong reflections if the material is glossy, so matching based on photometric similarity is not applicable. To ensure domain consistency of the template and source image, and to avoid complications from reflections, we adopt a translation network to convert both into edge maps. In this step, we use PiDiNet [46], a lightweight and robust edge detector. This conversion is crucial to permit later feature matching.

Feature extraction
We use a standard convolutional architecture similar to SuperPoint to extract features at different scales from both images after translation. SuperPoint has a VGG-style [47] encoder trained by self-supervision and shows leading performance in many vision tasks [12, 48, 49]. We retain only the encoder architecture of SuperPoint as our local feature extraction network. Given an input image of size H × W, our feature extraction network produces feature maps at four resolutions; we keep the second-layer feature map (F̂ ∈ R^(H/2 × W/2 × D)) and the last-layer feature map (F ∈ R^(H/8 × W/8 × C)). Thus F^T and F^I are the coarse-level features, while F̂^T and F̂^I are the fine-level features.

Feature aggregation with transformers
Since edge images are not richly textured, the features extracted by the local convolutional neural network are inadequate for robust feature matching; structural and geometric features are more significant [50]. Therefore, we adopt transformer blocks [14] to encode F^T and F^I into more global, position-aware features, denoted F^T_tr and F^I_tr. A transformer block consists of a self-attention layer to aggregate global context and a cross-attention layer to exchange information between the two feature sets.
Patch sampling. Unlike previous work [11, 13] which passes all patches of the image into the transformer module, we only keep meaningful feature map patches in F^T. Specifically, we use farthest point sampling [51] to sample N_p patches in which edge pixels exist, both to reduce computational cost and to increase the efficiency of message passing. F^T henceforth denotes the feature map after sampling. We do not drop any patches of the source image I, since every location in I could be a potential match. Experiments showing the effect of patch sampling with various numbers of patches are given in Sec. 5.5.1.
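Farthest point sampling has a simple greedy form: repeatedly add the point farthest from the set chosen so far. The sketch below is an illustrative numpy implementation over 2D patch centers, not the paper's code:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    points = np.asarray(points, dtype=float)
    chosen = [0]                                       # arbitrary start point
    dist = np.linalg.norm(points - points[0], axis=1)  # distance to chosen set
    for _ in range(n_samples - 1):
        idx = int(np.argmax(dist))                     # farthest remaining point
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

Run on edge-patch centers, this keeps a structurally representative subset of the template while discarding redundant neighbors.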

Positional encoding.
In transformers, all inputs are fed in simultaneously and, unlike RNNs, the model does not encode any information about input ordering. We must therefore encode positional information for the input tokens to make the order of the sequences available. Previous feature matching work using transformers [11, 13] uses a 2D extension of the standard absolute positional encoding, following DETR [42]. In contrast, [15, 52] showed that relative positional encoding better captures the positional relationships between input tokens. We employ rotary position embedding [53], proposed in natural language processing, which has recently been successfully adopted for point cloud processing [52]. We apply it to 2D images, as it can express a relative position in the form of an absolute position encoding. Furthermore, it can be perfectly incorporated into linear attention [18] at almost no extra cost. To obtain the relative positional relationship of the local features between the template and image, we thus use relative positional encoding in a linear transformer. For a given 2D location n = (x, y) ∈ R^2 and its feature f_n ∈ R^C, the relative positional encoding is defined as

PE(n, f_n) = Θ(n) f_n,

where Θ(n) ∈ R^(C×C) is a block-diagonal matrix of 2 × 2 rotation matrices whose rotation angles are proportional to the coordinates x and y, and C is the number of feature channels.
Rotary position embedding satisfies

⟨Θ(m) f_m, Θ(n) f_n⟩ = ⟨f_m, Θ(n − m) f_n⟩, with Θ(n − m) = Θ(m)^T Θ(n).

Thus, relative position information between features f_n and f_m is explicitly revealed by the dot product in the attention layer. This position encoding is more suitable in our application than absolute positional encoding, since the relative positional relationship between template T and image I is crucial. Θ(·) is an orthogonal operation on features: it only changes the directions, not the lengths, of feature vectors. Therefore, rotary position embedding stabilizes and accelerates the training process [52], facilitating downstream feature matching tasks. An experimental comparison with absolute positional encoding can be found in Sec. 5.5.1.
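The rotation property above can be checked numerically. The sketch below applies a rotary embedding to 2D positions by rotating channel pairs, with half the channels driven by x and half by y; the frequency schedule (base 10000 here) is an assumption borrowed from RoPE, not necessarily the paper's exact choice:

```python
import numpy as np

def rope_2d(feat, pos, base=10000.0):
    """Rotate channel pairs of `feat` by angles proportional to (x, y)."""
    C = feat.shape[-1]
    half = C // 2
    out = feat.astype(float).copy()
    for coord, sl in ((pos[0], slice(0, half)), (pos[1], slice(half, C))):
        f = out[sl].reshape(-1, 2)                 # consecutive channel pairs
        k = np.arange(f.shape[0])
        theta = coord / base ** (2 * k / half)     # per-pair rotation angle
        cos, sin = np.cos(theta), np.sin(theta)
        f0, f1 = f[:, 0].copy(), f[:, 1].copy()
        f[:, 0] = f0 * cos - f1 * sin              # 2x2 rotation per pair
        f[:, 1] = f0 * sin + f1 * cos
        out[sl] = f.reshape(-1)
    return out
```

Because each block is an orthogonal rotation linear in position, the dot product of two embedded features depends only on their relative displacement, which is exactly the identity stated above.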

Self-attention and cross-attention layers.
The key to the transformer model is attention; we use self and cross attention alternately in our pipeline. The inputs to an attention layer are a query vector Q, a key vector K, and a value vector V, and a basic attention layer is given by

Attention(Q, K, V) = softmax(Q K^T / √C) V.

Suppose Q and K have length N and feature dimensionality C. The computational cost of the transformer then grows as the square of the input length, and the length of the source image I's token sequence makes a basic transformer impractical for local feature matching. Following [13], we adopt a more efficient variant of the attention layer, the linear transformer [18]. We use a kernel function sim(Q, K) = φ(Q) φ(K)^T to replace the softmax calculation, where φ(·) = elu(·) + 1. The computational cost is reduced from O(N^2) to O(N) when C ≪ N. Following RoFormer [53], we do not inject rotary position embedding into the denominator, to avoid the risk of dividing by zero. Differing from [52, 53], not only the query Q and key K but also the value V is multiplied by Θ(·), since we consider position information to be important auxiliary information for the value V. Experiments justifying this choice are described in Sec. 5.5.1.
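The kernel trick can be sanity-checked by computing the same attention both ways. This sketch (illustrative, unbatched, without positional encoding) shows that the linearized computation, which never materializes the N × M similarity matrix, matches the quadratic formulation exactly:

```python
import numpy as np

def elu_feature(x):
    """phi(x) = elu(x) + 1, guaranteeing positive kernel features."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention: reorder phi(Q) (phi(K)^T V) so no N x M matrix is built."""
    q, k = elu_feature(Q), elu_feature(K)
    kv = k.T @ V                       # (C, D) summary of all keys and values
    z = q @ k.sum(axis=0)              # per-query normalizer
    return (q @ kv) / (z[:, None] + eps)

def quadratic_attention(Q, K, V, eps=1e-6):
    """Same result computed the O(N*M) way, for comparison."""
    sim = elu_feature(Q) @ elu_feature(K).T      # (N, M) kernel scores
    return (sim / (sim.sum(axis=1, keepdims=True) + eps)) @ V
```

The reordering (φ(K)^T V first) is what turns the quadratic cost into a linear one.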
Overall, the i-th output token of a linear transformer with relative positional encoding is given by

Attention(Q, K, V)_i = ( Σ_j [(Θ(i) φ(Q_i))^T (Θ(j) φ(K_j))] Θ(j) V_j ) / ( Σ_j φ(Q_i)^T φ(K_j) ).

Establishing coarse matches
We establish coarse matches using the transformed features F^T_tr and F^I_tr. An optimal transport (OT) layer is adopted as our differentiable matching layer. We first calculate a score matrix S using the dot-product similarity of the transformed features:

S(i, j) = (1/τ) ⟨F^T_tr(i), F^I_tr(j)⟩,

where τ is a temperature parameter. This score matrix S is used as the cost matrix of a partial assignment problem, following [12, 13]. This optimization problem can be efficiently solved with the Sinkhorn algorithm [54] to obtain the confidence assignment matrix C.
To obtain more reliable matches, the mutual nearest neighbor (MNN) criterion is enforced, and only matching pairs with confidence higher than a threshold θ_c are preserved. The set of coarse-level matches M_c is thus

M_c = {(i, j) | (i, j) ∈ MNN(C), C(i, j) ≥ θ_c}.

An alternative matching layer is based on dual-softmax (DS) [55, 56], which applies softmax to both dimensions of S to obtain the probability of a mutual nearest neighbor match. A comparison of the OT and DS methods can be found in Sec. 5.5.1.
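For the dual-softmax alternative, match extraction reduces to a few lines. A hedged numpy sketch (the threshold name theta_c follows the text; the temperature is assumed folded into S):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax_matches(S, theta_c=0.2):
    """Matching probability = row softmax * column softmax of the score
    matrix, kept only for mutual nearest neighbors above the threshold."""
    P = softmax(S, axis=0) * softmax(S, axis=1)
    best_j = P.argmax(axis=1)              # best column for each row
    best_i = P.argmax(axis=0)              # best row for each column
    matches = [(i, j) for i, j in enumerate(best_j)
               if best_i[j] == i and P[i, j] >= theta_c]
    return matches, P
```

The mutual-argmax check is the MNN criterion; the product of the two softmaxes suppresses matches that are strong in one direction only.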

Confidence weights based on spatial consistency
The differentiable matching layer provides a tentative match set M_c based on feature dot-product similarity, so two irrelevant points may be regarded as a matching pair due to similarity of appearance. To prevent this, we add a new constraint, based on an essential property of template matching: correct correspondences (inliers) share a similar geometric transformation, while the transformations of outliers are random.
RANSAC and its variants [57, 58] are widely adopted for outlier rejection. However, such methods are slow to converge and may fail at high outlier ratios. In contrast, spectral matching (SM) [59] and its variants [60-63] significantly improve results for rigid point cloud registration by constructing a compatibility graph which preserves angle or distance invariance between point pairs. Unlike the rigid setting, however, our model assumes a non-rigid deformation in which pairwise distances between distant points are more likely to vary than those between closer ones. We thus extend SM and propose a method based on distance-and-angle consistency for outlier rejection under non-rigid deformation.
Let β denote the distance compatibility term measuring the change in lengths of matched pairs. To allow for scale differences, we first normalize the distances between matching points on the template and image separately. Then, for two coarse matches a = (i, i′) and b = (j, j′), β is defined as

β(a, b) = [1 − (d_ij − d_i′j′)^2 / σ_d^2]_+,

where d_ij is the normalized pairwise distance between i and j, [·]_+ means max(·, 0), and σ_d is a distance parameter controlling sensitivity to changes in relative length. Changes in direction are also penalized, using a triplet-wise angle.
Inspired by [64], we compute angular compatibility from triplets of coarse feature points. For a point pair (i, j) with positions p_i and p_j, we first select the k nearest neighbors N_i of p_i. For each p_x ∈ N_i, we compute the angle c^x_ij = ∠(v_ix, v_ij), where v_ij = p_i − p_j. To improve robustness, we take the maximum value c_ij over the k nearest neighbors as the angle property of the pair (i, j). As for the distance compatibility β, we formulate the angular compatibility α as

α(a, b) = [1 − (c_ij − c_i′j′)^2 / σ_α^2]_+,

where σ_α is an angular parameter controlling the sensitivity to changes in angle. Fig. 4 illustrates the computation of distance and angular consistency.
The final spatial compatibility of matches a and b is defined as

E(a, b) = β(a, b) + λ_c α(a, b),

where λ_c is a control weight. E(a, b) is large only if the two correspondences a and b are highly spatially compatible. Following [59, 60], the leading eigenvector e of the compatibility matrix E is regarded as giving the inlier probability of each match; we use the power iteration algorithm [65] to compute e.

Fig. 4 Given two matching pairs a = (i, i′) and b = (j, j′), we calculate both their distance compatibility and their angular compatibility. Green nodes represent k-nearest neighbors.
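The compatibility-based scoring can be sketched in numpy. Below, distance_compat implements the β term as reconstructed above, and power iteration recovers the leading eigenvector of a toy compatibility matrix in which match 3 is incompatible with the other three matches (all helper names are ours):

```python
import numpy as np

def distance_compat(d_ab_template, d_ab_image, sigma_d=0.4):
    """beta term: penalize change in normalized pairwise length."""
    return max(1.0 - (d_ab_template - d_ab_image) ** 2 / sigma_d ** 2, 0.0)

def leading_eigenvector(E, n_iters=50):
    """Power iteration: the dominant eigenvector of the non-negative
    compatibility matrix scores how consistent each match is with the rest."""
    e = np.ones(E.shape[0]) / np.sqrt(E.shape[0])
    for _ in range(n_iters):
        e = E @ e
        e = e / (np.linalg.norm(e) + 1e-12)
    return e

# Toy compatibility matrix: matches 0-2 mutually compatible, match 3 is not.
E = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
scores = leading_eigenvector(E)
```

The incompatible match receives the lowest score, which is the behavior the text relies on when treating e as an inlier probability.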

Initial homography estimation
Naturally, the inlier probability e and the feature score s must be combined to give the final overall inlier probability, where s is the corresponding element of the feature confidence matrix C. We simply compute w_k = s_k · e_k: intuitively, w_k takes into account both how similar the feature descriptors are (s_k) and how consistent the spatial arrangement is (e_k) for a matching pair k. Finally, we use the confidence w_k as a weight to estimate the homography transformation H_c using the DLT formulation [29], finding a weighted least squares solution to the linear system. These matches-with-confidence make our coarse-to-fine network differentiable and RANSAC-free, enabling end-to-end training. The effectiveness of confidence weights is explored in Sec. 5.5.1.
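Confidence-weighted homography fitting via the DLT reduces to a weighted least-squares/SVD problem. An illustrative sketch (not the paper's implementation; scaling each correspondence's two rows by its weight w_i realizes the weighting):

```python
import numpy as np

def weighted_dlt_homography(src, dst, w):
    """Weighted DLT: each correspondence (x, y) -> (u, v) contributes two
    rows of A, scaled by its confidence; H is the null-space direction of A
    (smallest right singular vector), reshaped to 3x3."""
    rows = []
    for (x, y), (u, v), wi in zip(src, dst, w):
        rows.append(wi * np.array([-x, -y, -1, 0, 0, 0, u * x, u * y, u], float))
        rows.append(wi * np.array([0, 0, 0, -x, -y, -1, v * x, v * y, v], float))
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                    # fix the arbitrary scale
```

Because the weights enter only as row scalings, the solution stays differentiable with respect to them, which is the property the text uses for end-to-end training.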

Coarse-level training losses
Following [13], we use a negative log-likelihood loss over the confidence matrix C returned by either the optimal transport layer or the dual-softmax operation to supervise the coarse-level network. The ground-truth coarse matches M_c^gt are estimated from the ground-truth relative transformations (homographies). Using an optimal transport layer, the loss is

L_c = − (1/|M_c^gt|) Σ_{(i,j) ∈ M_c^gt} log C(i, j) − (1/|I|) Σ_{(i,j) ∈ I} log C(i, j),

where (i, j) ∈ I means that i or j has no reprojection in the other image (and is therefore assigned to the dustbin). With the dual-softmax operation, we instead minimize the negative log-likelihood loss over M_c^gt only.

Fine-level Matching
A coarse-to-fine scheme is adopted in our pipeline; such schemes have been successfully applied in many vision tasks [10, 13, 66-69]. We apply the obtained coarse homography H_c to the source image I to generate a coarsely-aligned image I_w. Having roughly aligned the two images, we use a refinement network to obtain sub-pixel accurate matches, from which a better-estimated transformation matrix is produced.

Fine-level matching network
For a given pair of coarsely aligned images (warped image I_w and template T), sub-pixel level matches are calculated by our fine-level matching network to further enhance the initial alignment. Although [10, 56] claim that local features significantly improve matching accuracy during refinement, we find that local features alone are insufficient for robust and accurate matching in untextured cases. Instead, we combine a global transformer and a local transformer for feature aggregation to improve fine-level matching, as shown in Fig. 2. The global transformer is first adopted to aggregate coarse-level features as priors. In detail, for every sampled patch pair (ĩ, j̃) at the same location on template T and warped image I_w, the corresponding coarse-level features are denoted F^T(ĩ) and F^Iw(j̃), respectively. A global transformer module with N_f self- and cross-attention layers operates on these coarse-level features to produce the transformed features (F^T_tr(ĩ), F^Iw_tr(j̃)). Note that, for efficiency, we only consider those patches sampled during coarse matching. To deeply integrate global and local features, F^T_tr(ĩ) and F^Iw_tr(j̃) are upsampled and concatenated with the corresponding local (fine-level) features F̂^T(ĩ) and F̂^Iw(j̃), respectively. The concatenated features are then passed to a 2-layer MLP to reduce the channel dimensionality to that of the original local features, yielding the fused features. The effectiveness of this module is demonstrated in Sec. 5.5.1.
For every patch pair (ĩ, j̃), we then locate all their finer positions (i, j) where i lies on an edge. On the fused feature maps, we crop two sets of local windows of size w × w centered at (i, j) respectively. A local transformer module operates N_f times within each window to generate the final features (F̂^T(i), F̂^Iw(j)). Following [13, 49], the center vector of F̂^T(i) is correlated with all vectors in F̂^Iw(j), giving a heatmap that represents the probability of each pixel centered on j matching i. Computing the expectation over this matching probability distribution with a 2D softmax, we obtain the final position j′ matching i with sub-pixel accuracy. The final set of fine-level matches M_f aggregates all matches (i, j′).
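The expectation-over-heatmap step can be sketched as follows: correlate the query descriptor with a w × w window of descriptors, normalize with a softmax, and take the probability-weighted mean of the pixel coordinates. All names here are illustrative, and the temperature is an assumed parameter:

```python
import numpy as np

def subpixel_match(query_feat, window_feats, temperature=0.1):
    """Correlate one query descriptor (C,) with a (w, w, C) window of
    descriptors, then return the softmax-weighted expectation of (x, y)."""
    w = window_feats.shape[0]
    scores = np.einsum('c,ijc->ij', query_feat, window_feats) / temperature
    heat = np.exp(scores - scores.max())
    heat = heat / heat.sum()               # matching probability map
    ys, xs = np.mgrid[0:w, 0:w]
    return float((heat * xs).sum()), float((heat * ys).sum())
```

Because the output is a continuous expectation rather than an argmax, the predicted position is sub-pixel and differentiable, which is what allows supervision through this step.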

Fine-level homography estimation
For each match (i, j′) in M_f, we use the inverse transformation of H_c to warp j′ to its original position in image I. After coarse-to-fine refinement, the correspondences are accurate without obvious outliers (see the last column of Fig. 5). We obtain the final homography H by weighted least squares using the DLT formulation, based on all matching pairs. The final homography H gives the transformation from the template T to the source image I, precisely locating the template object.

Fine-level training losses
While training the fine-level module, the coarse-level module is fine-tuned at the same time. The total training loss is L = λ L_c + L_f. In L_f, we use ground-truth supervision and self-supervision together for better robustness. For ground-truth supervision, we use the variance-weighted loss function from [49]; for self-supervision, we use an L2 similarity loss [70, 71] to minimize the difference between local appearances of the warped image I_w and template T. L_f is formulated as

L_f = (1/|M_f|) Σ_{(i, j′) ∈ M_f} [ (1/σ^2(i)) ‖j′ − j′_gt‖^2 + m_i ‖P^T_i − P^Iw_j′‖^2 ],

where, for each query point i, σ^2(i) is the total variance of the corresponding heatmap, j′_gt is the ground-truth match position, P^T_i denotes a local window cropped from template image T centered at i, P^Iw_j′ is the corresponding window in I_w, and m_i is a local area mask indicating the presence of an edge pixel. Experiments on the L2 similarity loss are presented in Sec. 5.5.1.

Fig. 5 Qualitative matching results for the three test datasets (Mechanical Parts, Assembly Holes, and COCO). Compared to SuperGlue, COTR and LoFTR, our method consistently obtains a higher inlier ratio, successfully coping with large viewpoint change, small objects, and non-rigid deformation. Red indicates a reprojection error beyond 3 pixels for the Mechanical Parts and Assembly Holes datasets and 5 pixels for the COCO dataset. Further qualitative results can be found in Sec. 7.1.

Experiments
After introducing the datasets used in our experiments (Sec. 5.1) and implementation details (Sec. 5.2), estimated homographies are compared for our proposed method and baselines in Sec. 5.3. Applications of our approach on industrial lines are shown in Sec. 5.4, while Sec. 5.5 considers the effectiveness of the components of our strategy. Further experimental details can be found in the Appendices.

Datasets
Here we outline the datasets used for testing. Further details are given in Appendix A.3, to ensure reproducibility.

Mechanical Parts
Obtaining poses of industrial parts is essential for robotic manipulator grasping on automated industrial lines. We collected a dataset based on hundreds of varied planar mechanical parts. To enrich the dataset while avoiding laborious annotation, we used GauGAN to generate an extra 40k pairs of matching data for training. The test dataset, consisting of 800 samples, was collected from an industrial workshop with human-labeled ground truth. It was used to quantitatively evaluate our method for single-template, single-object scenes, and to visually demonstrate the application of our approach to multi-template, multi-object scenes.

Assembly Holes
Locating and matching assembly holes can help determine whether product parts have been machined in the correct position. We thus collected data for dozens of different assembly holes in vehicle battery boxes, giving about 45k image pairs. Each sample contains a binarized template image, a grayscale image to be matched, and a human-labeled mask. To simulate a real industrial scenario, we randomly scaled the template size and perturbed the image corners to simulate possible hole deformation or camera disturbance. We randomly selected 700 image pairs covering all hole types for testing, and used the remainder for training and validation.

COCO
Going beyond industrial scenarios, we also performed tests using the well-known computer vision dataset COCO [20], which contains common objects in natural scenes. Since COCO was not devised for template matching, we generated image and template pairs by selecting one instance mask and applying various transformations, including scaling, rotation, and corner perturbation. We randomly selected 50k and 500 images for training and testing from the COCO training and validation sets, respectively.
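Pair generation by corner perturbation can be sketched as follows: perturb the four image corners and solve the exact 4-point DLT system for the homography relating the original and perturbed frames. The perturbation range and helper names are our assumptions, not the dataset's actual parameters:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the exact 8x8 DLT system (h33 fixed to 1) for 4 corner pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def random_pair_homography(h, w, max_shift=16, rng=None):
    """Perturb the four image corners by up to max_shift pixels, as one way
    to synthesize a template/source training pair from a single mask."""
    rng = np.random.default_rng() if rng is None else rng
    src = np.array([[0, 0], [w, 0], [w, h], [0, h]], float)
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2))
    return homography_from_corners(src, dst)
```

Composing such a random homography with random scaling and rotation gives a ground-truth transformation for free, which is what makes this style of synthetic supervision attractive.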

Implementation Details
For training and testing, all images were resized to 480 × 640. We use Kornia [72] for homography warping in the coarse alignment stage. Parameters were set as follows: window size w = 8, numbers of transformer layers N_c = 4 and N_f = 2, match selection threshold θ_c = 0.2, loss weight λ = 10, maximum number of template patches N_p = 128, spatial consistency distance parameter σ_d = 0.4, angular consistency parameter σ_α = 1.0, compatibility control weight λ_c = 0.5, and number of neighbors k = 3.

Evaluation Metrics
Following [12, 13, 31], we compute the reprojection error of specific measurement points between the images warped with the estimated and the ground-truth homography. We then report the area under the cumulative error curve (AUC) up to thresholds of [3, 5, 10] pixels for the industrial datasets, and [5, 10, 20] pixels for the COCO dataset. To ensure a fair comparison, we sampled 20 points uniformly on each template boundary as measurement points for use throughout the experiments.
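Concretely, this metric can be computed as follows. The sketch assumes the common practice of integrating the recall-vs-error curve up to each threshold (as in SuperGlue-style pose AUC); the paper's exact integration scheme may differ.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:]

def reprojection_error(H_est, H_gt, measurement_pts):
    """Mean distance between measurement points warped by the estimated
    and by the ground-truth homography (one scalar per image pair)."""
    d = warp_points(H_est, measurement_pts) - warp_points(H_gt, measurement_pts)
    return float(np.linalg.norm(d, axis=1).mean())

def error_auc(errors, thresholds):
    """Area under the cumulative (recall vs. error) curve over a set of
    image pairs, normalized by each pixel threshold."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])   # curve starts at origin
    recall = np.concatenate([[0.0], recall])
    aucs = {}
    for t in thresholds:
        idx = np.searchsorted(errors, t)
        e = np.concatenate([errors[:idx], [t]])
        r = np.concatenate([recall[:idx], [recall[idx - 1]]])
        # trapezoidal integration of recall over error, normalized by t
        aucs[t] = float(np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2) / t)
    return aucs
```

A perfect estimator (all errors zero) gives an AUC of 1 at every threshold, while errors beyond the threshold contribute nothing.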

Baselines
We compared our method to three kinds of methods, based on: (i) overall similarity measures, including Linemod-2D and the generalized Hough transform (GHT), which are widely used in industrial scenes; (ii) keypoint detection with mutual nearest-neighbor (MNN) search, including SURF, D2Net, ASLFeat and SuperPoint; and (iii) learned matching, including SuperGlue, COTR [11] and LoFTR (state-of-the-art feature matching networks).
Overall similarity measure-based methods cannot deal with perspective transformation, so we apply a more tolerant evaluation strategy to them. Specifically, we generate templates at multiple scales (step size 0.01) and orientations (step size 1°) for matching. We use the centroids of the generated templates as measurement points and select the template with the best score as the final result. For SURF, we use the PiDiNet edge detector to preprocess the input images. For SuperGlue, we choose SuperPoint for keypoint detection and descriptor extraction. All learning-based baselines were fine-tuned on each dataset until convergence, starting from the parameters of the source model. Further details of the training setup are provided in Appendix A.2. We adopted RANSAC and MAGSAC for outlier rejection for all correspondence-based baselines when estimating the homography transformation, following [31]. Our method instead applies the direct linear transform (DLT) in a differentiable manner, since its matches have high inlier rates and trustworthy confidence weights.
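The weighted, differentiable DLT referred to above can be sketched as follows (NumPy for brevity; the paper's version runs inside the network, and this sketch omits coordinate normalization for conditioning):

```python
import numpy as np

def weighted_dlt_homography(src, dst, weights):
    """Estimate a homography from weighted correspondences by solving the
    DLT system with each equation scaled by its confidence weight. Every
    step (matrix build, SVD) is differentiable, unlike RANSAC's hard
    inlier selection."""
    A, W = [], []
    for (x, y), (u, v), wi in zip(src, dst, weights):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
        W.extend([wi, wi])
    A = np.asarray(A, dtype=np.float64) * np.asarray(W)[:, None]
    _, _, Vt = np.linalg.svd(A)           # nullspace = homography vector
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
```

Confidence weights act as a soft inlier selection: a correspondence with weight near zero contributes almost nothing to the least-squares system, while accurate matches dominate it.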

Qualitative Comparison
We provide qualitative results in Figs. 5 and 6. In both figures, the first three rows use the Mechanical Parts dataset, the next three the Assembly Holes dataset, and the last three COCO. Fig. 5 shows that, compared to SuperGlue, COTR and LoFTR, the correspondences found by our method are more accurate and reliable. While the correspondences predicted by SuperGlue and COTR, like ours, lie on the contour of the object, they contain more outliers. LoFTR yields more correspondences, even in blank areas; however, these matching pairs tend to become inaccurate further from the object. In contrast, our method effectively uses contour information by focusing the matching points on the contour. With more correct matches and fewer mismatches, our approach does not need RANSAC or its variants for post-processing, which are essential for the other methods. The second example from the COCO dataset demonstrates our method's superior ability to stably match small target objects.
In Fig. 6, we qualitatively compare our registration results to those of a classic template matching method, Linemod-2D, and three deep feature matching methods. Linemod-2D is susceptible to cluttered backgrounds. Learning-based matching baselines perform better but are prone to unstable results, especially for small objects. Our method produces a warped template with more pixels aligned in all these scenarios. Fig. 7 shows that our approach provides much more accurate registration when examined in fine detail.

Fig. 6 Qualitative registration results for the three test datasets (top to bottom: Mechanical Parts, Assembly Holes, COCO). The green area represents the template mask placed in the input image using the estimated homography. For Linemod-2D, we selected the template with the best match from the set of templates. MAGSAC was used for outlier rejection for SuperGlue, COTR and LoFTR.

Baselines using Edge Maps
As extracting edges from the input images may reduce the impact of modality differences on initial feature extraction, we performed further experiments on the Mechanical Parts dataset to evaluate the competitive learning-based baselines with edge detection as pre-processing. For a fair comparison, we use PiDiNet to extract edge maps from the template and source images for all methods. Training settings remained the same as without edge extraction. As Fig. 8 shows, edge detection preprocessing worsens the results of these baseline methods, especially SuperGlue and COTR. We note that these methods tend to provide less accurate correspondences in low-texture scenes, and edge detection yields images with little texture.

Application
We now describe a challenging application of our method to real industrial lines, illustrated in Fig. 9. For each batch of industrial parts, the task is to select the correct template from a set of candidate templates for each part and to compute its accurate pose. This is an N-to-N template matching problem. We first pre-process the original scene using a real-time object detection network [73] to roughly locate each part and crop it into a separate image. For each candidate template, we first conduct coarse matching to select the optimal template: we use the correspondences with confidence weights obtained by coarse matching to get an initial homography. Based on that homography, the template containing the most inlier correspondences is regarded as optimal. We then apply fine matching to accurately obtain the pose of the object using the selected optimal template. To quantitatively evaluate our algorithm in multi-template scenarios, besides the correct template, we randomly add 9 extra noisy templates to the candidate set. We tested 284 scenarios with 2445 test samples and achieved a recognition accuracy of 98.8%, taking an inlier rate of more than 80% under the estimated homography as correct recognition. In addition, our method runs at a competitive speed since we use only coarse matching for template selection. Further details of runtimes are presented in Appendix A.1.

Fig. 7 Close-ups of registration results from SuperGlue, COTR, LoFTR and our method. Our method accurately focuses on the contours of objects.

Fig. 8 Accuracy of various methods with (solid lines) and without (dashed lines) edge detection preprocessing of the input images. The homography estimation accuracy is reported for pixel thresholds [3, 5, 10].
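The template-selection step of this pipeline can be sketched as follows. The helper names are hypothetical, not part of the paper's published code: `match_fn` returns coarse correspondences with confidences for one candidate template, and `estimate_fn` returns a 3×3 homography from them.

```python
import numpy as np

def select_template(candidates, match_fn, estimate_fn, inlier_px=3.0):
    """Pick the candidate template whose coarse matches agree best with
    their own estimated homography (highest inlier rate), as used for
    N-to-N matching. candidates maps a template id to a template."""
    best, best_rate = None, -1.0
    for tid, template in candidates.items():
        src, dst, conf = match_fn(template)
        H = estimate_fn(src, dst, conf)
        # project src through H and count matches within inlier_px of dst
        ph = np.hstack([src, np.ones((len(src), 1))]) @ H.T
        proj = ph[:, :2] / ph[:, 2:]
        rate = float(np.mean(np.linalg.norm(proj - dst, axis=1) < inlier_px))
        if rate > best_rate:
            best, best_rate = tid, rate
    return best, best_rate
```

Only the cheap coarse stage runs per candidate; fine matching is then applied once, to the winning template.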
We further note that our model generalizes well to unseen real scenarios after training only on synthetic data. A link to video demonstrations is provided in the declarations (Section 7.1).

Design Study
To better understand our proposed method, we conducted seven comparative experiments on different modules, using the Mechanical Parts dataset. The quantitative results in Tabs. 4 and 5 validate our design decisions and show that they have a significant effect on performance. The choices considered are:
• Matching layer: dual-softmax vs. optimal transport. The dual-softmax operator and optimal transport achieve similar scores, and either provides an effective matching layer in our method.
• Positional encoding: absolute vs. relative. Replacing relative positional encoding with absolute positional encoding results in a significant drop in AUC; relative position information is important in template matching.
• Homography estimation: RANSAC vs. consistency confidence. Since our method provides high-quality correspondences with confidence weights based on consistency, the differentiable DLT outperforms MAGSAC. Fig. 10 shows an example demonstrating the advantages of DLT with consistency weights over RANSAC. The RANSAC estimator makes a hard inlier/outlier decision, so each correspondence is either fully adopted or discarded when estimating the final transformation matrix. In contrast, our consistency module provides soft confidence weights, and we observe that the estimated weights are consistent with the ground-truth reprojection errors: more accurate correspondences receive higher weights while outliers are suppressed. Therefore, given high-quality matches, our consistency module uses correspondence information more efficiently and outperforms RANSAC.
• Value & position: multiplying the value token by the positional embedding in the transformer module provides better results.
• Translation module: Canny [74] vs. translation network. Accuracy using the translation network is better than using Canny edge detection.
• Feature fusion: in the refinement stage, deep fusion of local and global features leads to a noticeable performance improvement.
• One stage vs. coarse-to-fine: the coarse-to-fine module contributes significantly to estimation accuracy by finding more matches and refining them to sub-pixel level.
• Self-supervision loss: using a self-supervision loss (L2 similarity loss) brings a significant performance boost in fine-level training.
• Maximum number of sampled patches: see Tab. 5. As the maximum number of samples along the contour increases, the accuracy of our method tends to improve. However, without sampling, using the entire template image as input, performance is somewhat lower than that achieved by sampling with the best number of patches. We believe that edge-based sampling allows our method to perceive the template structure and aggregate local features more efficiently. We set the maximum number of patches to 128 as a trade-off between accuracy and runtime.

Understanding Attention
To better understand the role of attention in our method, we visualize the transformed features with t-SNE [75], and the self- and cross-attention weights, in Fig. 11. The visualization shows that our method learns a position-aware feature representation. The visualized attention weights reveal that the query point can aggregate global information dynamically and focus on meaningful locations. Self-attention may focus anywhere in the same image, especially on regions with obvious differences, while cross-attention focuses on regions with a similar appearance in the other image.

Limitations and Future Work
Our method utilizes an existing edge detection network to eliminate the domain gap between templates and images, which is convenient for our approach. However, we believe that jointly training the translation network is a promising avenue for further improving performance. Another interesting follow-up is to design a one-to-many template matching algorithm that does not rely on any pre-processing.

Conclusions
We have presented a differentiable pipeline for accurate correspondence refinement for industrial template matching.
With efficient feature extraction and feature aggregation by transformers, we obtain high-quality feature correspondences between the template mask and the grayscale image in a coarse-to-fine manner. The correspondences are then used to obtain a precise pose or transformation for the target object.
To eliminate the domain gap between the template mask and the grayscale image, we exploit a translation network. Based on the properties of the cross-modal template matching problem, we design a structure-aware strategy to improve robustness and efficiency. Furthermore, we have collected two valuable datasets from industrial scenarios, which we expect to benefit future work on industrial template matching.
Our experiments show that our method significantly improves the accuracy and robustness of template matching relative to multiple state-of-the-art methods and baselines. Video demos of N-to-N template matching on real industrial lines show the effectiveness and good generalization of our method.

Availability of data and materials
The well-known COCO dataset is available from https://cocodataset.org/. Our two industrial datasets can be freely downloaded from https://drive.google.com/drive/folders/1Mu9QdnM5WsLccFp0Ygf7ES7mLV-64wRL?usp=sharing. The video demos are available at https://github.com/zhirui-gao/Deep-Template-Matching.

Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements
We thank Lintao Zheng and Jun Li for their help with dataset preparation and discussions.

A.1 Speed
We tested the runtime of our method and the baselines on the Assembly Holes dataset, and report average values using an NVIDIA RTX 3080Ti. Coarse matching in our method takes 63 ms to match one pair; full matching takes 105 ms. LoFTR takes 87 ms, while COTR is much slower at 17 s. GHT and Linemod-2D take 4.9 s and 2.2 s respectively: using multiple templates for different scales and poses is time-consuming. For a scene with 10 objects and 10 candidate templates, our method takes about 6.7 s to locate and identify all objects and provide accurate poses.

A.2 Training Details
Our network was trained on 2 NVIDIA RTX 3090 GPUs with a batch size of 16. Although end-to-end training is feasible, we found that a two-stage training strategy yielded better results. The first stage trained coarse-level matching using the loss term L_c, until the validation loss converged.
The second stage trained the whole pipeline using both L_c and L_f until the validation loss converged. For the Mechanical Parts / Assembly Holes / COCO datasets, we trained our network for 15 / 15 / 30 epochs respectively in the first stage using Adam with an initial learning rate of 10^−3, and for 18 / 15 / 12 epochs in the second stage using Adam with an initial learning rate of 10^−4. We loaded pre-trained weights for the translation network and local feature CNN provided by [33, 46], and fixed the local feature CNN parameters in the second stage. We also loaded pre-trained parameters for the other learning-based baselines and retrained them until the validation loss converged. The numbers of training epochs used for the different learning-based baselines are shown in Tab. 6 for each dataset. For better performance with the keypoint methods (D2Net, ASLFeat and SuperPoint), we only used the edge points on the template to construct the ground-truth matching pairs when training the network. For COTR, we followed its three-stage training strategy to fine-tune the network. Since there is no recommended training procedure for SuperGlue, we trained it based on the code at https://github.com/gouthamvgk/SuperGlue_training.

A.3.1 Mechanical Parts Dataset
We used GauGAN to generate further image pairs for parts of various shapes. The manually adjusted image pairs provided

Fig. 1
Fig. 1 Our template matching method. (a) Template T and image I. (b) Coarse matching. (c) Matching refinement. (d) Template warped into the image using the estimated geometric transformation.

Fig. 9
Fig. 9 Application of our method on an industrial line. Above left: best matching template for an object. Above right: set of candidate templates. Below: final matches to selected templates. The coarse matching inlier rate for each template is used as the basis for template selection.

Fig. 10
Fig. 10 Comparison of using RANSAC, or consistency weights, for homography estimation. Above: correspondences provided by coarse matching. Below: template registration results. Confidence is indicated by line colour, from green (1) to red (0). For RANSAC, inliers have a confidence of 1 and outliers 0. For the ground truth, the reprojection error represents confidence.

Table 1
Homography estimation on the Mechanical Parts dataset. The AUC of the measurement point error is reported as a percentage. SuperGlue−− and SuperGlue use the pre-trained SuperPoint and our fine-tuned SuperPoint for keypoint detection, respectively.

Table 2
Homography estimation on the Assembly Holes dataset. The AUC of the measurement point error is reported as a percentage. SuperGlue−− and SuperGlue use the pre-trained SuperPoint and our fine-tuned SuperPoint for keypoint detection, respectively.

Table 3
Homography estimation on the COCO dataset. The AUC of the measurement point error is reported as a percentage. SuperGlue−− and SuperGlue use the pre-trained SuperPoint and our fine-tuned SuperPoint for keypoint detection, respectively.

Table 4
Evaluation of design choices using the Mechanical Parts dataset. The strategies adopted in our method are marked.

Table 5
Effects on speed and accuracy of varying the number of patches sampled; the number used in our method is marked.
This paper was supported in part by the National Key Research and Development Program of China (2018AAA0102200), the National Natural Science Foundation of China (62132021, 62002375, 62002376), the Natural Science Foundation of Hunan Province of China (2021JJ40696), the Huxiang Youth Talent Support Program (2021RC3071, 2022RC1104) and NUDT Research Grants (ZK19-30, ZK22-52).