
Weakly Supervised Deep Metric Learning for Template Matching

  • Conference paper
  • Advances in Computer Vision (CVC 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 943)

Abstract

Template matching by normalized cross correlation (NCC) is widely used for finding image correspondences. NCCNet improves the robustness of this algorithm by transforming image features with siamese convolutional nets trained to maximize the contrast between NCC values of true and false matches. The main technical contribution is a weakly supervised learning algorithm for training. Unlike fully supervised approaches to metric learning, the method can improve upon vanilla NCC without receiving locations of true matches during training. The improvement is quantified on patches of brain images from serial section electron microscopy. Relative to a parameter-tuned bandpass filter, siamese convolutional nets significantly reduce false matches. The improved accuracy of the method could be essential for connectomics, because emerging petascale datasets may require billions of template matches during assembly. Our method is also expected to generalize to other computer vision applications that use template matching to find image correspondences.



Acknowledgment

This work has been supported by an AWS Machine Learning Award and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

Author information

Correspondence to Davit Buniatyan.

A Appendix

1.1 A.1 Training Set

Our training set consists of 10,000 source-template image pairs drawn from consecutive slices of an affine-aligned image stack that still contained non-affine deformations. We include only source-template pairs for which vanilla NCC gives \(r_{\mathrm{delta}} < 0.2\). This rejects over 90% of candidate pairs and raises the fraction of difficult examples, an analogue of hard example mining applied at data collection. During training we alternate between a batch of eight source-template pairs and the same batch with randomly permuted source-template pairings. We randomly crop the source and template images so that the position of the peak is randomly distributed, and we randomly rotate or flip both channel inputs by \(90^{\circ}, 180^{\circ}, 270^{\circ}\).
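The pair selection and augmentation steps can be sketched as follows. This is a minimal Python/NumPy sketch assuming grayscale image arrays; `compute_r_delta` (the vanilla-NCC peak gap) and the helper names are hypothetical placeholders, not our actual pipeline code.

```python
import random
import numpy as np

def keep_hard_pairs(candidate_pairs, compute_r_delta, max_r_delta=0.2):
    """Hard-example filtering: keep only pairs whose vanilla-NCC r_delta
    falls below 0.2; this rejects over 90% of candidate pairs."""
    return [(src, tpl) for (src, tpl) in candidate_pairs
            if compute_r_delta(src, tpl) < max_r_delta]

def augment_pair(source, template, rng):
    """Apply the same random 90/180/270-degree rotation and optional flip
    to both inputs of a source-template pair."""
    k = rng.choice([0, 1, 2, 3])                 # number of quarter turns
    source, template = np.rot90(source, k), np.rot90(template, k)
    if rng.random() < 0.5:                       # random horizontal flip
        source, template = np.fliplr(source), np.fliplr(template)
    return source, template

def permuted_batch(batch, rng):
    """Mismatched batch: same images, but the source-template pairings are
    shuffled; alternated with the matched batch during training."""
    sources, templates = map(list, zip(*batch))
    rng.shuffle(templates)
    return list(zip(sources, templates))
```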

1.2 A.2 Ground Truth Annotation

For training data, each source image is \(512 \times 512\)px and each template is \(224 \times 224\)px. These pairs are sampled from 1040 superresolution slices, each 33,000 \(\times\) 33,000px. We validated our model on 95 serial images, distinct from the training set, from an unpublished EM dataset with a resolution of \(7 \times 7 \times 40\,\mathrm{nm}^3\). Each image was 15,000 \(\times\) 15,000px and had been roughly aligned with an affine model, but still contained considerable non-affine distortions of up to 250px (full resolution).

Fig. 8. (Left) Matches on the raw images; (middle) matches on images filtered with a Gaussian bandpass filter; (right) NCCNet: matches on images processed by the convolutional net. The displacement vector fields visualize the output of template matching on a regular triangular grid (edge length 400px at full resolution) in each slice: each node marks the centerpoint of a template image used in the matching procedure, and each vector shows the displacement of that template to its matching location in the source image. Matches shown use a 224px template size on across (next-nearest neighbor) sections. See also Figs. 10, 11 and 12.

Fig. 9. Manual inspection difficulties. (a) The vector field around a match (circled in red) that differs prominently from its neighbors. (b) The template for the match, showing many neurites parallel to the sectioning plane. (c) A false-color overlay of the template (green) over the source image (red) at the matched location, establishing the match as true.

In each experiment, both the template and the source images were downsampled by a factor of 3 before NCC, so that 160px and 224px templates were 480px and 672px at full resolution, while the source image was fixed at 512px downsampled (1,536px at full resolution). Template matches were taken on a triangular grid covering the image, with an edge length of 400px at full resolution (Fig. 8 shows the locations of template matches across an image).
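As a reference point, the downsample-then-NCC step can be written compactly with scikit-image. This is a sketch under the stated 3× factor, assuming float grayscale inputs; it stands in for whichever NCC implementation is actually used, and in the paper's pipeline the bandpass filter or NCCNet would transform the images before this step.

```python
import numpy as np
from skimage.feature import match_template
from skimage.transform import downscale_local_mean

def ncc_match(source, template, factor=3):
    """Downsample both images by `factor`, compute the normalized
    cross-correlation map, and return the peak location scaled back to
    full-resolution pixels, along with the correlogram itself."""
    src = downscale_local_mean(source, (factor, factor))
    tpl = downscale_local_mean(template, (factor, factor))
    corr = match_template(src, tpl)              # NCC correlogram
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(factor * int(p) for p in peak), corr
```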

Our first method for evaluating performance was to compare error rates. Errors were detected manually with a tool that let human annotators inspect the template matching inputs and outputs. The tool is based on visualizing the displacement vectors that result from each template match across a section, as shown in Fig. 8. Any match that differed significantly (by over 50px) from its neighbors was rejected, and matches that differed from their neighbors less significantly were individually inspected for correctness by visualizing a false-color overlay of the template over the source at the match location. The latter step was needed because many true matches deviated prominently from their neighbors: the template patch could contain neurites or other features parallel to the sectioning plane, producing large motions of specific features in a random direction that need not be consistent with the movement of the larger area around the template (see Fig. 9 for an example of this behavior). Table 1 summarizes the error counts in each experiment.
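The first-pass neighbor check can be approximated as below. The exact comparison rule of our annotation tool is not specified beyond the 50px figure, so comparing each displacement against the mean of its grid neighbors is an assumption of this sketch.

```python
import numpy as np

def flag_suspect_matches(displacements, neighbors, threshold_px=50.0):
    """Flag matches whose displacement vector deviates from the mean of
    their triangular-grid neighbors by more than `threshold_px`.

    displacements: (N, 2) array of per-node displacement vectors.
    neighbors: list of neighbor-index lists, one per grid node.
    Flagged matches go on to manual overlay inspection.
    """
    flags = np.zeros(len(displacements), dtype=bool)
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        deviation = displacements[i] - displacements[nbrs].mean(axis=0)
        flags[i] = np.linalg.norm(deviation) > threshold_px
    return flags
```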

Table 2. Image parameters for training and testing. Unless otherwise noted, resolutions are given after 3× downsampling, where 1px represents \(21 \times 21\,\mathrm{nm}\).
Table 3. Breakdown of the true-match set between the bandpass filter and NCCNet: counts of true matches per category. Total possible adjacent matches: 144,500; total possible across matches: 72,306.

1.3 A.3 Match Filtering

To assess how easily false matches could be removed, we evaluated matches with the following criteria:

  • norm: the Euclidean norm of the displacement required to move the template image to its match location in the source image, at full resolution.

  • \(r_{\max}\): the height of the first peak of the correlogram, serving as a proxy for confidence in the match.

  • \(r_{\mathrm{delta}}\): the difference between the first and second peaks of the correlogram (after removing a 5px square window surrounding the first peak). It estimates the certainty that no other likely match exists in the source image, and is the criterion the NCCNet was trained to optimize.

These criteria can serve as useful heuristics for accepting or rejecting matches: each one approximates the unknown partition into true and erroneous correspondences. The less the actual true and erroneous distributions overlap when projected onto a criterion dimension, the more useful that criterion is. Figure 3 plots these three criteria across the three image conditions. The improvement in rejection efficiency also generalized well across experiments, as evident in Fig. 15. Achieving a 0.1% error rate on the most difficult task we tested (across sections, 160px template size) required rejecting 20% of the true matches with the bandpass filter, while rejecting less than 1% of the true matches sufficed with the convolutional nets.
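A sketch of how \(r_{\max}\) and \(r_{\mathrm{delta}}\) can be read off a correlogram, assuming the 5px exclusion window is a square centered on the first peak:

```python
import numpy as np

def match_criteria(corr, window=5):
    """Compute r_max and r_delta from an NCC correlogram.

    r_max is the height of the global (first) peak; r_delta is the gap
    between the first peak and the second peak found after masking a
    `window`-px square around the first peak.
    """
    y0, x0 = np.unravel_index(np.argmax(corr), corr.shape)
    r_max = corr[y0, x0]
    masked = corr.copy()
    h = window // 2
    masked[max(0, y0 - h):y0 + h + 1, max(0, x0 - h):x0 + h + 1] = -np.inf
    r_delta = r_max - masked.max()
    return float(r_max), float(r_delta)
```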

Table 4. Residual block architecture. i is the number of input channels; n ranges over {8, 16, 32, 64} on downsampling levels and {32, 16, 8} on upsampling levels.
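For concreteness, a residual block consistent with Table 4 might look like the following PyTorch sketch. The 3×3 kernels, ReLU nonlinearity, and the 1×1 projection on the skip path are assumptions, since the table fixes only the channel counts i and n per level.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Sketch of a residual block with i input and n output channels."""

    def __init__(self, i: int, n: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(i, n, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n, n, kernel_size=3, padding=1),
        )
        # Project the skip path with a 1x1 conv when channel counts differ.
        self.skip = nn.Conv2d(i, n, kernel_size=1) if i != n else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))
```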

False matches can be rejected efficiently with these match criteria. The NCCNet shifted the true-match distributions for \(r_{\max}\) and \(r_{\mathrm{delta}}\) to be more left-skewed, while the erroneous-match distribution for \(r_{\mathrm{delta}}\) remained at lower values (see Fig. 3a), yielding distributions more amenable to accurate error rejection. For adjacent sections with 224px templates, we can remove every error in the NCCNet output by rejecting matches with \(r_{\mathrm{delta}}\) below 0.05, which discards only 0.12% of the true matches. The same threshold also removes all false matches in the bandpass outputs, but discards 0.40% of the true matches (see Fig. 3b). This 3.5x improvement in rejection efficiency is critical for balancing the trade-off between eliminating all false matches and retaining as many true matches as possible.
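Given \(r_{\mathrm{delta}}\) values for manually labeled true and false matches, the threshold trade-off quoted above can be checked in a few lines; the array names here are placeholders for such labeled data.

```python
import numpy as np

def rejection_tradeoff(r_delta_true, r_delta_false):
    """Find the smallest threshold that rejects every false match, and
    report the fraction of true matches sacrificed at that threshold."""
    threshold = float(np.max(r_delta_false))   # reject r_delta at or below this
    true_lost = float(np.mean(np.asarray(r_delta_true) <= threshold))
    return threshold, true_lost
```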

Table 5. Architecture of each channel.
Fig. 10. Bandpass vector field has no false matches.

The NCCNet produced matches in the vast majority of cases where the bandpass filter produced matches. It introduced some false matches that the bandpass did not, but it correctly identified 3–20 times as many additional true matches (see Table 3).

Fig. 11. Comparable quality.

Fig. 12. NCCNet output is slightly better.

Fig. 13. Match criteria for adjacent, 160px.

The majority of the false matches in the convnet output were also present in the bandpass case, which establishes the NCCNet as superior to, and not merely different from, the bandpass filter.

Whether a patch pair counts as a true match depends on the tolerance for spatial localization error. Our method makes this tolerance explicit through the definition of the secondary peak. Explicitness is required by our weakly supervised approach, and is preferable in any case: blindly accepting a hidden tolerance parameter in the training set could produce suboptimal results in some applications (Table 2).

Fig. 14. Match criteria for across, 160px.

Fig. 15. Match criteria for across, 224px.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Buniatyan, D., Popovych, S., Ih, D., Macrina, T., Zung, J., Seung, H.S. (2020). Weakly Supervised Deep Metric Learning for Template Matching. In: Arai, K., Kapoor, S. (eds) Advances in Computer Vision. CVC 2019. Advances in Intelligent Systems and Computing, vol 943. Springer, Cham. https://doi.org/10.1007/978-3-030-17795-9_4