Language Identification with Scarce Data: A Case Study from Peru

  • Alexandra Espichán-Linares
  • Arturo Oncevay-Marcos
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 795)


Language identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual settings. However, the task still needs to be extended to scenarios with scarce data and many classes, and the best-known methods need to be compared to determine which fits best. Peru, as a multicultural and multilingual country, poses a considerable challenge in this respect. This study therefore follows three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, with 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
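The abstract does not say which standard model was fitted, so the following is only an illustrative sketch of one classic corpus-based approach: the character n-gram rank profiles of Cavnar and Trenkle [3]. It is not the authors' implementation. The function names, the parameters `n_max` and `top_k`, and the two toy training sentences (English and Spanish placeholders, not taken from the Peruvian corpus) are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max), most frequent first."""
    # Normalise whitespace and pad so that word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, for a deterministic ranking.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy placeholder data (an assumption): one short sentence per language.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAIN.items()}

if __name__ == "__main__":
    print(identify("the dog sleeps in the house", PROFILES))
    print(identify("el gato duerme en la casa", PROFILES))
```

A profile-based method like this needs no training beyond counting n-grams, which is one reason such approaches are often tried first when only a small amount of text per class is available. With 49 closely related classes, a real system would need far more text per language than this toy example uses.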



The authors thank J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to private books written in indigenous languages [5, 22]. They likewise appreciate the collaboration of Dr. Roberto Zariquiey, linguistics professor at PUCP, who allowed the use of his own corpus for the Panoan family [24].

Furthermore, the authors acknowledge the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT.


  1. Bjerva, J.: Byte-based language identification with deep convolutional networks. arXiv preprint arXiv:1609.09004
  2. Botha, G.R., Barnard, E.: Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 26(5), 307–320 (2012)
  3. Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–169 (1994)
  4. Christodouloupoulos, C., Steedman, M.: A massively parallel corpus: the Bible in 100 languages. Lang. Resour. Eval. 49(2), 375–395 (2015)
  5. Díaz, D.P. (ed.): Relatos de Nopoki. Universidad Católica Sedes Sapientiae (2012)
  6. Forcada, M.: Open source machine translation: an opportunity for minor languages. In: Proceedings of the Workshop “Strategies for Developing Machine Translation for Minority Languages”, LREC, vol. 6, pp. 1–6. Citeseer (2006)
  7. Grothe, L., De Luca, E.W., Nürnberger, A.: A comparative study on language identification methods. In: LREC (2008)
  8. Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: Advances in Neural Information Processing Systems, pp. 473–479 (1997)
  9. Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.A.: Hierarchical character-word models for language identification (2016). arXiv preprint arXiv:1608.03030
  10. Kocmi, T., Bojar, O.: LanideNN: Multilingual language identification on character window (2017). arXiv preprint arXiv:1701.03338
  11. Koller, D., Sahami, M.: Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab (1997)
  12. Malmasi, S., Dras, M.: Automatic language identification for Persian and Dari texts. In: Proceedings of PACLING, pp. 59–64 (2015)
  13. Martins, B., Silva, M.J.: Language identification in web pages. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 764–768. ACM (2005)
  14. Mathur, P., Misra, A., Budur, E.: LIDE: Language identification from text documents (2017). arXiv preprint arXiv:1701.03682
  15. McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: ICML, vol. 98, pp. 359–367 (1998)
  16. Ministerio de Educación, Perú: Documento nacional de lenguas originarias del Perú (2013)
  17. Pienaar, W., Snyman, D.: Spelling checker-based language identification for the eleven official South African languages. In: Proceedings of the 21st Annual Symposium of Pattern Recognition of SA, Stellenbosch, South Africa, pp. 213–216 (2011)
  18. Prager, J.M.: Linguini: Language identification for multilingual documents. J. Manage. Inf. Syst. 16(3), 71–101 (1999)
  19. Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, vol. 17991802, p. 21 (2006)
  20. Rios, A.: A Basic Language Technology Toolkit for Quechua (2016)
  21. Selamat, A., Akosu, N.: Word-length algorithm for language identification of under-resourced languages. J. King Saud Univ. Comput. Inf. Sci. 28(4), 457–469 (2016)
  22. Universidad Católica Sedes Sapientiae: Relatos Matsigenkas. Universidad Católica Sedes Sapientiae (2015)
  23. Valenzuela, P.: Transitivity in Shipibo-Konibo grammar. Ph.D. thesis, University of Oregon (2003)
  24. Zariquiey Biondi, R.: A grammar of Kashibo-Kakataibo. Ph.D. thesis, La Trobe University (2011)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Research Group on Artificial Intelligence (IA-PUCP), Pontificia Universidad Católica del Perú, Lima, Peru
