Abstract
This paper introduces a self-supervised approach to writer retrieval that uses vision transformers trained with knowledge distillation. We propose morphological operations as a general data-augmentation method for handwriting images to learn discriminative features that are independent of the pen. Our method operates on binarized \(224\times 224\) patches extracted from the documents’ writing regions; we generate two different views based on randomly sampled kernels for erosion and dilation to learn a representative embedding space that is invariant to different pens. Our evaluation shows that morphological operations outperform the data augmentations commonly used in retrieval tasks, e.g., flipping, rotation, and translation, by up to 8%. Additionally, we compare our data-augmentation strategy with existing approaches such as networks trained with a triplet loss. We achieve a mean average precision of 66.4% on the Historical-WI dataset, competitive with methods that use algorithms such as SIFT for patch extraction or computationally expensive encodings, e.g., mVLAD, NetVLAD, or E-SVM. Finally, by visualizing the attention mechanism, we show that the heads of the vision transformer focus on different parts of the handwriting, e.g., loops or specific characters, enhancing the explainability of our writer-retrieval approach.
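The augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `morphological_views`, the square structuring elements, the kernel-size range, and the fixed erode/dilate pairing of the two views are all assumptions; the paper only states that kernels for erosion and dilation are randomly sampled on binarized patches.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def morphological_views(patch, rng, max_kernel=4):
    """Generate two augmented views of a binarized handwriting patch.

    One view is eroded (thinner strokes), the other dilated (thicker
    strokes), each with an independently sampled square kernel, so that
    an embedding trained to match the two views becomes invariant to
    stroke width, i.e., to the pen used. Kernel-size range and the
    square structuring element are illustrative assumptions.
    """
    k1 = int(rng.integers(1, max_kernel + 1))  # erosion kernel size
    k2 = int(rng.integers(1, max_kernel + 1))  # dilation kernel size
    ink = patch.astype(bool)  # True where ink is present
    view_a = binary_erosion(ink, structure=np.ones((k1, k1)))
    view_b = binary_dilation(ink, structure=np.ones((k2, k2)))
    return view_a, view_b

# Toy usage: a 224x224 patch with a single horizontal "stroke".
rng = np.random.default_rng(0)
patch = np.zeros((224, 224), dtype=np.uint8)
patch[100:110, 50:150] = 1
view_a, view_b = morphological_views(patch, rng)
```

In a self-supervised setup in the style of knowledge distillation (e.g., DINO [19]), the two views would then be fed to the student and teacher networks, so the loss rewards embeddings that agree despite the differing stroke widths.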
Acknowledgment
The project has been funded by the Austrian security research programme KIRAS of the Federal Ministry of Agriculture, Regions and Tourism (BMLRT) under the Grant Agreement 879687.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Peer, M., Kleber, F., Sablatnig, R. (2022). Self-supervised Vision Transformers with Data Augmentation Strategies Using Morphological Operations for Writer Retrieval. In: Porwal, U., Fornés, A., Shafait, F. (eds) Frontiers in Handwriting Recognition. ICFHR 2022. Lecture Notes in Computer Science, vol 13639. Springer, Cham. https://doi.org/10.1007/978-3-031-21648-0_9
DOI: https://doi.org/10.1007/978-3-031-21648-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21647-3
Online ISBN: 978-3-031-21648-0