Information Extraction in Handwritten Marriage Licenses Books Using the MGGI Methodology

  • Verónica Romero
  • Alicia Fornés
  • Enrique Vidal
  • Joan Andreu Sánchez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10255)

Abstract

Historical records of daily activities provide intriguing insights into the lives of our ancestors and are useful for demographic and genealogical research. For example, marriage license books have been used for centuries by ecclesiastical and secular institutions to register marriages. The records in these books follow a simple textual structure, but their vocabulary evolves: it is composed mainly of proper names that change over time. This distinctive vocabulary makes automatic transcription and semantic information extraction difficult tasks. In previous work we studied the use of category-based language models and how a Grammatical Inference technique known as MGGI could improve the accuracy of these tasks. In this work we analyze the main causes of the semantic errors observed in those earlier results and apply a better implementation of the MGGI technique to address them. Using the resulting language model, we carried out transcription and information extraction experiments, and the results support the proposed approach.

Keywords

Handwritten text recognition, Information extraction, Language modeling, MGGI, Categories-based language model

Notes

Acknowledgment

This work has been partially supported through the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), the European project ERC-2010-AdG-20100407-269796, the MINECO/FEDER, UE projects TIN2015-70924-C2-1-R and TIN2015-70924-C2-2-R, and the Ramon y Cajal Fellowship RYC-2014-16831.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Verónica Romero (1)
  • Alicia Fornés (2)
  • Enrique Vidal (1)
  • Joan Andreu Sánchez (1)

  1. PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
  2. Department of Computer Science, Computer Vision Center, Universitat Autònoma de Barcelona, Bellaterra, Spain
