Processing Quechua and Guarani Historical Texts Query Expansion at Character and Word Level for Information Retrieval

  • Johanna Cordova
  • Capucine Boidin
  • César Itier
  • Marie-Anne Moreaux
  • Damien Nouvel
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 898)


The LANGAS project provides an online database of historical texts (16th–19th centuries) in Quechua, Guarani and Tupi for sociolinguistic studies. Querying texts in such low-resource languages raises several challenges. Among them, our work addresses word variation (diacritization, typographic variations), handled as an optional query expansion mechanism of the search engine. Such processing must take into account the peculiarities of the languages considered. This paper describes the morphology of these languages, the linguistic resources collected, the modules implemented (regular expressions, stemming, word clusters) and some preliminary evaluations. Our work will be an opportunity to release resources for those languages. We plan to deepen this work in the near future and hope it will be useful to other researchers working on these languages.
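To make the idea of character-level query expansion concrete, the following is a minimal sketch of how a query word could be turned into a regular expression that tolerates diacritic and spelling variation. It is not the LANGAS implementation: the equivalence classes in `VARIANT_CLASSES` and the function names `strip_diacritics`, `expand_query` and `matches` are assumptions chosen for illustration, and a real system would derive its rules from the linguistic resources described in the paper.

```python
import re
import unicodedata

# Hypothetical spelling-equivalence classes, for illustration only.
# They are NOT the rules used by the LANGAS search engine.
VARIANT_CLASSES = [
    ("qu", "k", "c"),
    ("hu", "w"),
    ("i", "e"),
    ("u", "o"),
]

# Map every variant to its class, and try longer variants first when scanning.
VARIANT_OF = {v: cls for cls in VARIANT_CLASSES for v in cls}
KEYS = sorted(VARIANT_OF, key=len, reverse=True)


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_query(word):
    """Compile a regex that matches the assumed spelling variants of `word`."""
    base = strip_diacritics(word.lower())
    parts = []
    i = 0
    while i < len(base):
        for key in KEYS:
            if base.startswith(key, i):
                # Replace the variant by an alternation over its whole class.
                alternatives = "|".join(re.escape(v) for v in VARIANT_OF[key])
                parts.append(f"(?:{alternatives})")
                i += len(key)
                break
        else:
            # No class applies: match the character literally.
            parts.append(re.escape(base[i]))
            i += 1
    return re.compile("".join(parts))


def matches(query, token):
    """True if `token` is a variant of `query`, ignoring case and diacritics."""
    normalized = strip_diacritics(token.lower())
    return expand_query(query).fullmatch(normalized) is not None
```

With these assumed classes, `matches("wasi", "Huasi")` and `matches("tupa", "Tupã")` both hold, while `matches("wasi", "wasa")` does not. Keeping the expansion optional, as the abstract describes, lets users fall back to exact matching when the broader pattern returns too many unrelated forms.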


Under-resourced languages; Query expansion; Historical spelling variations



The LANGAS project was funded by the French National Research Agency (ANR). This work benefited from the support of Université Sorbonne Paris Cité (USPC) and the National Institute for Oriental Languages and Civilizations (INALCO). Many thanks to Joséphine Castaing and Elégant Mateus, who developed the site and database.




Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Johanna Cordova (1, 2), email author
  • Capucine Boidin (2)
  • César Itier (3)
  • Marie-Anne Moreaux (1)
  • Damien Nouvel (1)

  1. INALCO ERTIM, Paris, France
  2. Paris 3 IHEAL, Paris, France
  3. INALCO CERLOM, Paris, France
