Language Resources and Evaluation, Volume 47, Issue 4, pp 1327–1342

An open diachronic corpus of historical Spanish

  • Felipe Sánchez-Martínez
  • Isabel Martínez-Sempere
  • Xavier Ivars-Ribes
  • Rafael C. Carrasco
Project Note

Abstract

The impact-es diachronic corpus of historical Spanish compiles over one hundred books—containing approximately 8 million words—in addition to a complementary lexicon which links more than 10,000 lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an open license (Creative Commons by-nc-sa) in order to permit their intensive exploitation in linguistic research. Approximately 7 % of the words in the corpus (a selection aimed at enhancing the coverage of the most frequent word forms) have been annotated with their lemma, part of speech, and modern equivalent. This paper describes the annotation criteria followed and the standards, based on the Text Encoding Initiative recommendations, used to represent the texts in digital form.

Keywords

Diachronic corpus · Historical Spanish · Linguistic annotation · TEI


Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Felipe Sánchez-Martínez (1)
  • Isabel Martínez-Sempere (1)
  • Xavier Ivars-Ribes (1)
  • Rafael C. Carrasco (1)

  1. Dep. de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Alacant, Spain
