Skip to main content

Generating Synthetic Handwritten Historical Documents with OCR Constrained GANs

  • Conference paper
  • First Online:
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12823))

Included in the following conference series:

Abstract

We present a framework to generate synthetic historical documents with precise ground truth using nothing more than a collection of unlabeled historical images. Obtaining large labeled datasets is often the limiting factor to effectively use supervised deep learning methods for Document Image Analysis (DIA). Prior approaches towards synthetic data generation either require human expertise or result in poor accuracy in the synthetic documents. To achieve high precision transformations without requiring expertise, we tackle the problem in two steps. First, we create template documents with user-specified content and structure. Second, we transfer the style of a collection of unlabeled historical images to these template documents while preserving their text and layout. We evaluate the use of our synthetic historical documents in a pre-training setting and find that we outperform the baselines (randomly initialized and pre-trained). Additionally, with visual examples, we demonstrate a high-quality synthesis that makes it possible to generate large labeled historical document datasets with precise ground truth.

L. Vögtlin and M. Drazyk—Both authors contributed equally to this work.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://github.com/DIVA-DIA/Generating-Synthetic-Handwritten-Historical-Documents.

  2. 2.

    This can be done with any other word processing tool such as MS Word.

References

  1. Alberti, M., Seuret, M., Ingold, R., Liwicki, M.: A pitfall of unsupervised pre-training (2017). arXiv: 1703.04332

  2. Alberti, M., Vögtlin, L., Pondenkandath, V., Seuret, M., Ingold, R., Liwicki, M.: Labeling, cutting, grouping: an efficient text line segmentation method for medieval manuscripts. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1200–1206. IEEE (2019)

    Google Scholar 

  3. Baird, H.S.: Document Image Defect Models. In: Baird, H.S., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis, pp. 546–556. Springer, Heidelberg (1992). https://doi.org/10.1007/978-3-642-77281-8_26

  4. Bluche, T., Louradour, J., Messina, R.: Scan, attend and read: end-to-end handwritten paragraph recognition with MDLSTM attention. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 1050–1055 (2017)

    Google Scholar 

  5. Capobianco, S., Marinai, S.: DocEmul: a toolkit to generate structured historical documents. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 1186–1191 (2017)

    Google Scholar 

  6. Chu, C., Zhmoginov, A., Sandler, M.: CycleGAN, a master of steganography (2017)

    Google Scholar 

  7. Clausner, C., Pletschacher, S., Antonacopoulos, A.: Aletheia - an advanced document layout and text ground-truthing system for production environments. In: 2011 International Conference on Document Analysis and Recognition, pp. 48–52 (2011)

    Google Scholar 

  8. Edwards, H.J.: Caesar: The Gallic War. Harvard University Press Cambridge, Cambridge (1917)

    Google Scholar 

  9. Fischer, A., Frinken, V., Fornés, A., Bunke, H.: Transcription alignment of Latin manuscripts using hidden Markov models. In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, HIP 2011, pp. 29–36. Association for Computing Machinery (2011)

    Google Scholar 

  10. Goodfellow, I.J., et al.: Generative Adversarial Networks (2014)

    Google Scholar 

  11. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 369–376. Association for Computing Machinery (2006)

    Google Scholar 

  12. Guan, M., Ding, H., Chen, K., Huo, Q.: Improving handwritten OCR with augmented text line images synthesized from online handwriting samples by style-conditioned GAN. In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 151–156 (2020)

    Google Scholar 

  13. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

    Google Scholar 

  14. Journet, N., Visani, M., Mansencal, B., Van-Cuong, K., Billy, A.: DocCreator: a new software for creating synthetic ground-truthed document images. J. Imaging 3(4), 62 (2017)

    Google Scholar 

  15. Kang, L., Riba, P., Wang, Y., Rusiñol, M., Fornés, A., Villegas, M.: GANwriting: content-conditioned generation of styled handwritten word images. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 273–289. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_17

    Chapter  Google Scholar 

  16. Kieu, V.C., Visani, M., Journet, N., Domenger, J.P., Mullot, R.: A character degradation model for grayscale ancient document images. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 685–688 (2012)

    Google Scholar 

  17. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization (2017)

    Google Scholar 

  18. Li, H., Wang, W.: Reinterpreting CTC training as iterative fitting. Pattern Recog. 105, 107392 (2020)

    Article  Google Scholar 

  19. Marti, U.V., Bunke, H.: The IAM-database: an English sentence database for offline handwriting recognition. IJDAR 5(1), 39–46 (2002)

    Article  Google Scholar 

  20. Masci, J., Meier, U., Cireşan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for hierarchical feature extraction. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 52–59. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21735-7_7

    Chapter  Google Scholar 

  21. Mehri, M., Héroux, P., Mullot, R., Moreux, J.P., Coüasnon, B., Barrett, B.: HBA 1.0: a pixel-based annotated dataset for historical book analysis. In: Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, HIP 2017, pp. 107–112. Association for Computing Machinery (2017)

    Google Scholar 

  22. Märgner, V., Abed, H.E.: Tools and metrics for document analysis systems evaluation. In: Doermann, D., Tombre, K. (eds.) Handbook of Document Image Processing and Recognition, pp. 1011–1036. Springer, London (2014). https://doi.org/10.1007/978-0-85729-859-1_33

    Chapter  Google Scholar 

  23. Pondenkandath, V., Alberti, M., Diatta, M., Ingold, R., Liwicki, M.: Historical document synthesis with generative adversarial networks. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 5, pp. 146–151 (2019)

    Google Scholar 

  24. Scius-Bertrand, A., Voegtlin, L., Alberti, M., Fischer, A., Bui, M.: Layout analysis and text column segmentation for historical Vietnamese steles. In: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, HIP 2019, pp. 84–89. , Association for Computing Machinery (2019)

    Google Scholar 

  25. Seuret, M., Chen, K., Eichenbergery, N., Liwicki, M., Ingold, R.: Gradient-domain degradations for improving historical documents images layout analysis. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1006–1010 (2015)

    Google Scholar 

  26. Strauß, T., Leifert, G., Labahn, R., Hodel, T., Mühlberger, G.: ICFHR2018 competition on automated text recognition on a READ dataset. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 477–482 (2018)

    Google Scholar 

  27. Studer, L., et al.: A comprehensive study of imagenet pre-training for historical document image analysis. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 720–725 (2019)

    Google Scholar 

  28. Taigman, Y., Polyak, A., Wolf, L.: Unsupervised Cross-Domain Image Generation (2016)

    Google Scholar 

  29. Tensmeyer, C., Brodie, M., Saunders, D., Martinez, T.: Generating realistic binarization data with generative adversarial networks. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 172–177 (2019)

    Google Scholar 

  30. Touvron, H., Douze, M., Cord, M., Jégou, H.: Powers of layers for image-to-image translation (2020). arXiv:2008.05763

  31. Zhang, K.A., Cuesta-Infante, A., Xu, L., Veeramachaneni, K.: SteganoGAN: high capacity image steganography with GANs (2019). arXiv:1901.03892

  32. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

    Google Scholar 

Download references

Acknowledgment

The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number 205120_169618. A big thanks to our co-workers Paul Maergner and Linda Studer for their support and advice.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lars Vögtlin .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Vögtlin, L., Drazyk, M., Pondenkandath, V., Alberti, M., Ingold, R. (2021). Generating Synthetic Handwritten Historical Documents with OCR Constrained GANs. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science(), vol 12823. Springer, Cham. https://doi.org/10.1007/978-3-030-86334-0_40

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-86334-0_40

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86333-3

  • Online ISBN: 978-3-030-86334-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics