Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early \(20^{th}\) Century Paris Census

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13237)

Abstract

We aim to build a large database (up to 9 million individuals) from the handwritten nominal census tables of Paris for 1926, 1931 and 1936, each comprising about 100,000 handwritten single pages in tabular format. We built a complete pipeline that goes from scans of double pages to text prediction while minimizing the need for segmentation labels. We describe how weighted finite-state transducers (WFSTs), writer specialization and self-training further improved our results. We also introduce two annotated datasets for handwriting recognition, now publicly available, and an open-source toolkit for applying WFSTs to CTC lattices.
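The abstract names CTC outputs and self-training without giving details, so the following is only a minimal sketch of the general ideas, not the authors' pipeline or toolkit. It shows greedy (best-path) CTC decoding, the simple baseline that lattice-based WFST decoding is meant to improve on, followed by a confidence filter of the kind often used to pick pseudo-labels for self-training. The names `ctc_best_path`, `select_pseudo_labels`, the blank index `BLANK = 0` and the confidence measure are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_best_path(
    frame_probs: Sequence[Sequence[float]], alphabet: Dict[int, str]
) -> Tuple[str, float]:
    """Greedy CTC decoding.

    Takes the most likely symbol in each frame, merges consecutive repeats,
    and drops blanks. Returns the decoded text and the product of the
    per-frame maxima, used here as a crude confidence score.
    """
    chars: List[str] = []
    confidence = 1.0
    previous = BLANK
    for probs in frame_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        confidence *= probs[best]
        # Emit a character only when it is not blank and not a repeat
        # of the symbol chosen in the previous frame.
        if best != BLANK and best != previous:
            chars.append(alphabet[best])
        previous = best
    return "".join(chars), confidence


def select_pseudo_labels(
    predictions: Sequence[Tuple[str, str, float]], threshold: float
) -> List[Tuple[str, str]]:
    """Keep (line_id, text) pairs whose confidence reaches the threshold.

    In a self-training loop, the retained pairs would be added to the
    training set before the recognizer is retrained.
    """
    return [(line_id, text) for line_id, text, conf in predictions if conf >= threshold]


if __name__ == "__main__":
    alphabet = {1: "a", 2: "b"}
    frames = [
        [0.1, 0.8, 0.1],    # a
        [0.1, 0.7, 0.2],    # a again (merged with the previous frame)
        [0.9, 0.05, 0.05],  # blank (separates the two a's)
        [0.2, 0.7, 0.1],    # a
        [0.1, 0.1, 0.8],    # b
    ]
    text, conf = ctc_best_path(frames, alphabet)
    print(text, conf)
```

A WFST-based decoder would replace the per-frame argmax with a search over the whole lattice constrained by a lexicon or language model, which is why it can recover from frame-level errors that the greedy path cannot.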

Keywords

  • Handwriting recognition
  • Document layout analysis
  • Self-training
  • Table analysis
  • WFST
  • Semi-supervised learning

Project supported by CollEx-Persée (AAP19_20), with the financial collaboration of the TGIR Progedo and the Grand Équipement Documentaire Campus Condorcet.

Notes

  1. https://gitlab.com/projet-popp/sigra/.

  2. https://github.com/Shulk97/POPP-datasets/.

  3. The complete Belleville census was written by three writers, but their writing styles are very similar and can therefore be considered a single writing style.

  4. https://github.com/Shulk97/POPP-datasets/.

  5. https://github.com/dhlab-epfl/dhSegment.

  6. https://github.com/FactoDeepLearning/VerticalAttentionOCR.

  7. Deceased people database since 1970 (INSEE, in French): https://www.insee.fr/fr/information/4190491.

  8. http://www.toponymiefrancophone.org/divfranco/Bougainville/Liste_generale.aspx?nom=liste_pays.

  9. https://gitlab.com/projet-popp/sigra/.

Corresponding author

Correspondence to Thomas Constum.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Constum, T. et al. (2022). Recognition and Information Extraction in Historical Handwritten Tables: Toward Understanding Early \(20^{th}\) Century Paris Census. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_10

  • DOI: https://doi.org/10.1007/978-3-031-06555-2_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06554-5

  • Online ISBN: 978-3-031-06555-2

  • eBook Packages: Computer Science, Computer Science (R0)