
Weakly Supervised Deep Metric Learning for Template Matching

  • Conference paper
  • Advances in Computer Vision (CVC 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 943)

Abstract

Template matching by normalized cross correlation (NCC) is widely used for finding image correspondences. NCCNet improves the robustness of this algorithm by transforming image features with siamese convolutional nets trained to maximize the contrast between NCC values of true and false matches. The main technical contribution is a weakly supervised learning algorithm for training. Unlike fully supervised approaches to metric learning, the method can improve upon vanilla NCC without receiving locations of true matches during training. The improvement is quantified on patches of brain images from serial section electron microscopy. Relative to a parameter-tuned bandpass filter, siamese convolutional nets significantly reduce false matches. The improved accuracy of the method could be essential for connectomics, because emerging petascale datasets may require billions of template matches during assembly. Our method is also expected to generalize to other computer vision applications that use template matching to find image correspondences.



Acknowledgment

This work has been supported by an AWS Machine Learning Award and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

Author information

Correspondence to Davit Buniatyan.

A Appendix

1.1 A.1 Training Set

Our training set consists of 10,000 source-template image pairs drawn from consecutive slices of an affine-aligned image stack that still contained non-affine deformations. We include only source-template pairs for which vanilla NCC gives \(r_{\mathrm{delta}} < 0.2\). This rejects over 90% of candidate pairs and raises the fraction of difficult examples, an analogue of hard example mining applied at data collection. During training we alternate between a batch of eight source-template pairs and the same batch with randomly permuted source-template pairings. We randomly crop the source and template images so that the position of the peak is randomly distributed, and we randomly rotate or flip both channel inputs by \(90^{\circ}, 180^{\circ}, 270^{\circ}\).
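The pair selection and augmentation steps can be sketched as follows. This is a minimal Python/NumPy sketch assuming grayscale image arrays; `compute_r_delta` (the vanilla-NCC peak gap) and the helper names are hypothetical placeholders, not our actual pipeline code.

```python
import random
import numpy as np

def keep_hard_pairs(candidate_pairs, compute_r_delta, max_r_delta=0.2):
    """Hard-example filtering: keep only pairs whose vanilla-NCC r_delta
    falls below 0.2; this rejects over 90% of candidate pairs."""
    return [(src, tpl) for (src, tpl) in candidate_pairs
            if compute_r_delta(src, tpl) < max_r_delta]

def augment_pair(source, template, rng):
    """Apply the same random 90/180/270-degree rotation and optional flip
    to both inputs of a source-template pair."""
    k = rng.choice([0, 1, 2, 3])                 # number of quarter turns
    source, template = np.rot90(source, k), np.rot90(template, k)
    if rng.random() < 0.5:                       # random horizontal flip
        source, template = np.fliplr(source), np.fliplr(template)
    return source, template

def permuted_batch(batch, rng):
    """Mismatched batch: same images, but the source-template pairings are
    shuffled; alternated with the matched batch during training."""
    sources, templates = map(list, zip(*batch))
    rng.shuffle(templates)
    return list(zip(sources, templates))
```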

1.2 A.2 Ground Truth Annotation

For training data, each source image is \(512 \times 512\)px and each template is \(224 \times 224\)px. These pairs are sampled from 1040 superresolution slices, each 33,000 \(\times\) 33,000px. We validated our model on 95 serial images, distinct from the training set, from an unpublished EM dataset with a resolution of \(7 \times 7 \times 40\,\mathrm{nm}^3\). Each image was 15,000 \(\times\) 15,000px and had been roughly aligned with an affine model, but still contained considerable non-affine distortions of up to 250px (full resolution).

Fig. 8. (Left) Matches on the raw images; (middle) matches on images filtered with a Gaussian bandpass filter; (right) NCCNet: matches on images processed by the convolutional net. The displacement vector fields visualize the output of template matching on a regular triangular grid (edge length 400px at full resolution) in each slice: each node marks the centerpoint of a template image used in the matching procedure, and each vector shows the displacement of that template to its matching location in the source image. Matches shown use a 224px template size on across (next-nearest neighbor) sections. See also Figs. 10, 11 and 12.

Fig. 9. Manual inspection difficulties. (a) The vector field around a match (circled in red) that differs prominently from its neighbors. (b) The template for the match, showing many neurites parallel to the sectioning plane. (c) A false-color overlay of the template (green) over the source image (red) at the matched location, establishing the match as true.

In each experiment, both the template and the source images were downsampled by a factor of 3 before NCC, so that 160px and 224px templates were 480px and 672px at full resolution, while the source image was fixed at 512px downsampled (1,536px at full resolution). Template matches were taken on a triangular grid covering the image, with an edge length of 400px at full resolution (Fig. 8 shows the locations of template matches across an image).
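As a reference point, the downsample-then-NCC step can be written compactly with scikit-image. This is a sketch under the stated 3× factor, assuming float grayscale inputs; it stands in for whichever NCC implementation is actually used, and in the paper's pipeline the bandpass filter or NCCNet would transform the images before this step.

```python
import numpy as np
from skimage.feature import match_template
from skimage.transform import downscale_local_mean

def ncc_match(source, template, factor=3):
    """Downsample both images by `factor`, compute the normalized
    cross-correlation map, and return the peak location scaled back to
    full-resolution pixels, along with the correlogram itself."""
    src = downscale_local_mean(source, (factor, factor))
    tpl = downscale_local_mean(template, (factor, factor))
    corr = match_template(src, tpl)              # NCC correlogram
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(factor * int(p) for p in peak), corr
```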

Our first method for evaluating performance was to compare error rates. Errors were detected manually with a tool that let human annotators inspect the template matching inputs and outputs. The tool is based on visualizing the displacement vectors that result from each template match across a section, as shown in Fig. 8. Any match that differed significantly (by over 50px) from its neighbors was rejected, and matches that differed from their neighbors less significantly were individually inspected for correctness by visualizing a false-color overlay of the template over the source at the match location. The latter step was needed because many true matches deviated prominently from their neighbors: the template patch could contain neurites or other features parallel to the sectioning plane, producing large motions of specific features in a random direction that need not be consistent with the movement of the larger area around the template (see Fig. 9 for an example of this behavior). Table 1 summarizes the error counts in each experiment.
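The first-pass neighbor check can be approximated as below. The exact comparison rule of our annotation tool is not specified beyond the 50px figure, so comparing each displacement against the mean of its grid neighbors is an assumption of this sketch.

```python
import numpy as np

def flag_suspect_matches(displacements, neighbors, threshold_px=50.0):
    """Flag matches whose displacement vector deviates from the mean of
    their triangular-grid neighbors by more than `threshold_px`.

    displacements: (N, 2) array of per-node displacement vectors.
    neighbors: list of neighbor-index lists, one per grid node.
    Flagged matches go on to manual overlay inspection.
    """
    flags = np.zeros(len(displacements), dtype=bool)
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        deviation = displacements[i] - displacements[nbrs].mean(axis=0)
        flags[i] = np.linalg.norm(deviation) > threshold_px
    return flags
```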

Table 2. Image parameters for training and testing. Unless otherwise noted, resolutions are given after 3× downsampling, where 1px represents \(21 \times 21\,\mathrm{nm}\).
Table 3. Breakdown of the true-match set between the bandpass filter and NCCNet: counts of true matches per category. Total possible adjacent matches: 144,500; total possible across matches: 72,306.

1.3 A.3 Match Filtering

To assess how easily false matches could be removed, we evaluated matches with the following criteria:

  • norm: the Euclidean norm of the displacement required to move the template image to its match location in the source image, at full resolution.

  • \(r_{\max}\): the height of the first peak of the correlogram, serving as a proxy for confidence in the match.

  • \(r_{\mathrm{delta}}\): the difference between the first and second peaks of the correlogram (after removing a 5px square window surrounding the first peak). It estimates the certainty that no other likely match exists in the source image, and is the criterion the NCCNet was trained to optimize.

These criteria can serve as useful heuristics for accepting or rejecting matches: each one approximates the unknown partition into true and erroneous correspondences. The less the actual true and erroneous distributions overlap when projected onto a criterion dimension, the more useful that criterion is. Figure 3 plots these three criteria across the three image conditions. The improvement in rejection efficiency also generalized well across experiments, as evident in Fig. 15. Achieving a 0.1% error rate on the most difficult task we tested (across sections, 160px template size) required rejecting 20% of the true matches with the bandpass filter, while rejecting less than 1% of the true matches sufficed with the convolutional nets.
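A sketch of how \(r_{\max}\) and \(r_{\mathrm{delta}}\) can be read off a correlogram, assuming the 5px exclusion window is a square centered on the first peak:

```python
import numpy as np

def match_criteria(corr, window=5):
    """Compute r_max and r_delta from an NCC correlogram.

    r_max is the height of the global (first) peak; r_delta is the gap
    between the first peak and the second peak found after masking a
    `window`-px square around the first peak.
    """
    y0, x0 = np.unravel_index(np.argmax(corr), corr.shape)
    r_max = corr[y0, x0]
    masked = corr.copy()
    h = window // 2
    masked[max(0, y0 - h):y0 + h + 1, max(0, x0 - h):x0 + h + 1] = -np.inf
    r_delta = r_max - masked.max()
    return float(r_max), float(r_delta)
```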

Table 4. Residual block architecture. i is the number of input channels; n ranges over {8, 16, 32, 64} on downsampling levels and {32, 16, 8} on upsampling levels.
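For concreteness, a residual block consistent with Table 4 might look like the following PyTorch sketch. The 3×3 kernels, ReLU nonlinearity, and the 1×1 projection on the skip path are assumptions, since the table fixes only the channel counts i and n per level.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Sketch of a residual block with i input and n output channels."""

    def __init__(self, i: int, n: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(i, n, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n, n, kernel_size=3, padding=1),
        )
        # Project the skip path with a 1x1 conv when channel counts differ.
        self.skip = nn.Conv2d(i, n, kernel_size=1) if i != n else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))
```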

False matches can be rejected efficiently with these match criteria. The NCCNet shifted the true-match distributions for \(r_{\max}\) and \(r_{\mathrm{delta}}\) to be more left-skewed, while the erroneous-match distribution for \(r_{\mathrm{delta}}\) remained at lower values (see Fig. 3a), yielding distributions more amenable to accurate error rejection. For adjacent sections with 224px templates, we can remove every error in the NCCNet output by rejecting matches with \(r_{\mathrm{delta}}\) below 0.05, which discards only 0.12% of the true matches. The same threshold also removes all false matches in the bandpass outputs, but discards 0.40% of the true matches (see Fig. 3b). This 3.5x improvement in rejection efficiency is critical for balancing the trade-off between eliminating all false matches and retaining as many true matches as possible.
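Given \(r_{\mathrm{delta}}\) values for manually labeled true and false matches, the threshold trade-off quoted above can be checked in a few lines; the array names here are placeholders for such labeled data.

```python
import numpy as np

def rejection_tradeoff(r_delta_true, r_delta_false):
    """Find the smallest threshold that rejects every false match, and
    report the fraction of true matches sacrificed at that threshold."""
    threshold = float(np.max(r_delta_false))   # reject r_delta at or below this
    true_lost = float(np.mean(np.asarray(r_delta_true) <= threshold))
    return threshold, true_lost
```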

Table 5. Architecture of each channel.
Fig. 10. Bandpass vector field has no false matches.

The NCCNet produced matches in the vast majority of cases where the bandpass filter produced matches. It introduced some false matches that the bandpass did not, but it correctly identified 3–20 times as many additional true matches (see Table 3).

Fig. 11. Comparable quality.

Fig. 12. NCCNet output is slightly better.

Fig. 13. Match criteria for adjacent, 160px.

The majority of the false matches in the convnet output were also present in the bandpass case, which establishes the NCCNet as superior to, and not merely different from, the bandpass filter.

Whether a patch pair counts as a true match depends on the tolerance for spatial localization error. Our method makes this tolerance explicit through the definition of the secondary peak. Explicitness is required by our weakly supervised approach, and is preferable in any case: blindly accepting a hidden tolerance parameter in the training set could produce suboptimal results in some applications (Table 2).

Fig. 14. Match criteria for across, 160px.

Fig. 15. Match criteria for across, 224px.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Buniatyan, D., Popovych, S., Ih, D., Macrina, T., Zung, J., Seung, H.S. (2020). Weakly Supervised Deep Metric Learning for Template Matching. In: Arai, K., Kapoor, S. (eds) Advances in Computer Vision. CVC 2019. Advances in Intelligent Systems and Computing, vol 943. Springer, Cham. https://doi.org/10.1007/978-3-030-17795-9_4