Linguistically-Enhanced Search over an Open Diachronic Corpus

  • Rafael C. Carrasco
  • Isabel Martínez-Sempere
  • Enrique Mollá-Gandía
  • Felipe Sánchez-Martínez
  • Gustavo Candela Romero
  • Maria Pilar Escobar Esteban
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9022)

Abstract

The BVC section of the impact-es diachronic corpus of historical Spanish compiles 86 books containing approximately 2 million words. About 27% of the words, which provide representative coverage of the most frequent word forms, have been annotated with their lemma, part of speech, and modern equivalent, following the Text Encoding Initiative guidelines. We describe how this type of annotation can be exploited to provide linguistically-enhanced search over historical documents. The advanced search supports queries whose search terms can combine surface forms, lemmata, parts of speech, and modern forms of historical variants.
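
The kind of query described above can be illustrated with a minimal sketch. It is not the search engine presented in the paper: the `Token` and `Term` classes, the `search` function, the part-of-speech tags, and the sample tokens are all hypothetical and chosen only for illustration. The sketch assumes each annotated word carries four layers (surface form, lemma, part of speech, modern form) and that a query is a sequence of terms, each constraining any subset of those layers.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Token:
    surface: str  # form as printed in the historical text
    lemma: str    # dictionary headword
    pos: str      # part-of-speech tag (hypothetical tag set)
    modern: str   # modern spelling of the historical variant


@dataclass(frozen=True)
class Term:
    # None means "no constraint on this layer".
    surface: Optional[str] = None
    lemma: Optional[str] = None
    pos: Optional[str] = None
    modern: Optional[str] = None

    def matches(self, token: Token) -> bool:
        """True if the token satisfies every layer this term constrains."""
        for layer in ("surface", "lemma", "pos", "modern"):
            wanted = getattr(self, layer)
            if wanted is not None and wanted.lower() != getattr(token, layer).lower():
                return False
        return True


def search(tokens: Sequence[Token], terms: Sequence[Term]) -> List[int]:
    """Return start offsets where consecutive tokens match consecutive terms."""
    if not terms:
        return []
    hits = []
    for start in range(len(tokens) - len(terms) + 1):
        if all(term.matches(tokens[start + i]) for i, term in enumerate(terms)):
            hits.append(start)
    return hits


# Invented sample data, not taken from the corpus.
tokens = [
    Token("dixo", "decir", "VERB", "dijo"),
    Token("la", "el", "DET", "la"),
    Token("muger", "mujer", "NOUN", "mujer"),
]

by_lemma = search(tokens, [Term(lemma="decir")])
mixed = search(tokens, [Term(pos="DET"), Term(modern="mujer")])
by_surface = search(tokens, [Term(surface="mujer")])
```

In this sketch, a lemma query for "decir" finds the historical spelling "dixo", and a two-term query mixing a part of speech with a modern form finds "la muger", whereas a surface-form query for "mujer" finds nothing because the text spells the word differently. A linear scan is used only for brevity; a real system over 2 million words would presumably rely on an index.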

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Rafael C. Carrasco (1)
  • Isabel Martínez-Sempere (1)
  • Enrique Mollá-Gandía (1)
  • Felipe Sánchez-Martínez (1)
  • Gustavo Candela Romero (1)
  • Maria Pilar Escobar Esteban (1)
  1. Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Alacant, Spain
