Abstract
This paper presents a complete workflow for extracting information from handwritten Quebec parish registers. The acts in these documents contain individual and family information that is highly valuable for genetic, demographic and social studies of the Quebec population. Given an image of a parish record, our workflow identifies the acts and extracts personal information. The workflow is divided into successive steps: page classification, text line detection, handwritten text recognition, named entity recognition, and act detection and classification. For each of these steps, different machine learning models are compared. Once the information is extracted, validation rules designed by experts are applied to standardize it and ensure its consistency with the type of act (birth, marriage or death). This validation step rejects records that are considered invalid or merged. The full workflow has been used to process over two million pages of Quebec parish registers from the 19th and 20th centuries. On a sample comprising 65% of the registers, 3.2 million acts were recognized. Verification of the birth and death acts from this sample shows that 74% of them are considered complete and valid. These records will be integrated into the BALSAC database and linked together to recreate family and genealogical relations at a large scale.
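As an illustration of the validation step described above, the following is a minimal sketch, not the authors' implementation. The abstract does not give the expert rules, so the field names, the `REQUIRED_FIELDS` table, the `standardize` normalization and the "possibly merged" heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical required fields per act type (assumption: the actual
# expert-designed rules are not listed in the abstract).
REQUIRED_FIELDS = {
    "birth": {"child_name", "father_name", "mother_name", "date"},
    "marriage": {"groom_name", "bride_name", "date"},
    "death": {"deceased_name", "date"},
}


@dataclass
class Act:
    act_type: str
    # Each entity label maps to every value recognized for it in the act text.
    entities: Dict[str, List[str]] = field(default_factory=dict)


def standardize(value: str) -> str:
    """Collapse whitespace and title-case a value (illustrative only)."""
    return " ".join(value.split()).title()


def validate(act: Act) -> Tuple[bool, str]:
    """Return (is_valid, reason) for an extracted act."""
    required = REQUIRED_FIELDS.get(act.act_type)
    if required is None:
        return False, "unknown act type"

    # Rule 1: every required field must have at least one recognized value.
    present = {label for label, values in act.entities.items() if values}
    missing = sorted(required - present)
    if missing:
        return False, "missing: " + ", ".join(missing)

    # Rule 2 (assumed heuristic): several distinct values for a field that
    # should be unique suggests that two acts were merged into one record.
    repeated = sorted(
        label
        for label in required
        if len({standardize(v) for v in act.entities[label]}) > 1
    )
    if repeated:
        return False, "possibly merged: " + ", ".join(repeated)

    return True, "valid"
```

Under these assumptions, a death act with a single name and date passes, while a record carrying two different deceased names is rejected as possibly merged.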
Acknowledgements
The i-BALSAC project was supported by the Canada Foundation for Innovation through its Cyberinfrastructure Initiative. Mélodie Boillet is partly funded by the CIFRE ANRT grant No. 2020/0390.
About this article
Cite this article
Tarride, S., Maarand, M., Boillet, M. et al. Large-scale genealogical information extraction from handwritten Quebec parish records. IJDAR 26, 255–272 (2023). https://doi.org/10.1007/s10032-023-00427-w
DOI: https://doi.org/10.1007/s10032-023-00427-w