Language Identification with Scarce Data: A Case Study from Peru

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 795)

Abstract

Language identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual settings. However, the task still needs to be extended to scenarios with scarce data and many classes, which calls for an analysis of which well-known method fits best. Peru, as a multicultural and multilingual country, offers a demanding test case. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, reaching 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
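The abstract does not say which standard model was fitted. As a rough illustration of what a corpus-based identifier can look like, the sketch below follows the rank-ordered character n-gram profiles of Cavnar and Trenkle [3]. It should be read as an assumption-laden example, not as the authors' method: the function names, the `top_k` cutoff, and the English and Spanish toy sentences are all hypothetical and do not come from the corpus described here.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def build_profile(texts, top_k=300):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language profile cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles, top_k=300):
    """Return the label whose profile is closest to the text's own profile."""
    doc_profile = build_profile([text], top_k)
    return min(
        profiles,
        key=lambda lang: out_of_place(doc_profile, profiles[lang], top_k),
    )


if __name__ == "__main__":
    # Hypothetical toy training data; a real setup would use many sentences per class.
    training = {
        "eng": ["the house is near the river", "we walk to the village every day"],
        "spa": ["la casa está cerca del río", "caminamos al pueblo todos los días"],
    }
    profiles = {lang: build_profile(texts) for lang, texts in training.items()}
    print(identify("the river is near the village", profiles))
```

With scarce data, profiles of this kind are cheap to build per class, which is one reason such models remain a common baseline. With only a handful of sentences per language, however, the ranks are noisy and results on unseen text should not be trusted.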

Notes

  1. libenchant: https://github.com/AbiWord/enchant.

  2. chana.inf.pucp.edu.pe/resources/multi-lang-corpus.

References

  1. Bjerva, J.: Byte-based language identification with deep convolutional networks (2016). arXiv preprint arXiv:1609.09004

  2. Botha, G.R., Barnard, E.: Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 26(5), 307–320 (2012)

  3. Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–169 (1994)

  4. Christodouloupoulos, C., Steedman, M.: A massively parallel corpus: the Bible in 100 languages. Lang. Resour. Eval. 49(2), 375–395 (2015)

  5. Díaz, D.P. (ed.): Relatos de Nopoki. Universidad Católica Sedes Sapientiae (2012)

  6. Forcada, M.: Open source machine translation: an opportunity for minor languages. In: Proceedings of the Workshop “Strategies for Developing Machine Translation for Minority Languages”, LREC, vol. 6, pp. 1–6. Citeseer (2006)

  7. Grothe, L., De Luca, E.W., Nürnberger, A.: A comparative study on language identification methods. In: LREC (2008)

  8. Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: Advances in Neural Information Processing Systems, pp. 473–479 (1997)

  9. Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.A.: Hierarchical character-word models for language identification (2016). arXiv preprint arXiv:1608.03030

  10. Kocmi, T., Bojar, O.: LanideNN: Multilingual language identification on character window (2017). arXiv preprint arXiv:1701.03338

  11. Koller, D., Sahami, M.: Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab (1997)

  12. Malmasi, S., Dras, M.: Automatic language identification for Persian and Dari texts. In: Proceedings of PACLING, pp. 59–64 (2015)

  13. Martins, B., Silva, M.J.: Language identification in web pages. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 764–768. ACM (2005)

  14. Mathur, P., Misra, A., Budur, E.: LIDE: Language identification from text documents (2017). arXiv preprint arXiv:1701.03682

  15. McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: ICML, vol. 98, pp. 359–367 (1998)

  16. Ministerio de Educación, Perú: Documento nacional de lenguas originarias del Perú (2013). http://repositorio.minedu.gob.pe/handle/123456789/3549

  17. Pienaar, W., Snyman, D.: Spelling checker-based language identification for the eleven official South African languages. In: Proceedings of the 21st Annual Symposium of Pattern Recognition of SA, Stellenbosch, South Africa, pp. 213–216 (2011)

  18. Prager, J.M.: Linguini: Language identification for multilingual documents. J. Manage. Inf. Syst. 16(3), 71–101 (1999)

  19. Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, vol. 17991802, p. 21 (2006)

  20. Rios, A.: A Basic Language Technology Toolkit for Quechua (2016)

  21. Selamat, A., Akosu, N.: Word-length algorithm for language identification of under-resourced languages. J. King Saud Univ. Comput. Inf. Sci. 28(4), 457–469 (2016)

  22. Universidad Católica Sedes Sapientiae: Relatos Matsigenkas. Universidad Católica Sedes Sapientiae (2015)

  23. Valenzuela, P.: Transitivity in Shipibo-Konibo grammar. Ph.D. thesis, University of Oregon (2003)

  24. Zariquiey Biondi, R.: A grammar of Kashibo-Kakataibo. Ph.D. thesis, La Trobe University (2011)

Acknowledgements

The authors thank J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to some private books written in indigenous languages [5, 22]. They also thank Dr. Roberto Zariquiey, linguistics professor at PUCP, for allowing the use of his own corpus for the Panoan family [24].

Furthermore, the authors acknowledge the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT.

Author information

Corresponding author

Correspondence to Arturo Oncevay-Marcos.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Espichán-Linares, A., Oncevay-Marcos, A. (2018). Language Identification with Scarce Data: A Case Study from Peru. In: Lossio-Ventura, J., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2017. Communications in Computer and Information Science, vol 795. Springer, Cham. https://doi.org/10.1007/978-3-319-90596-9_7

  • DOI: https://doi.org/10.1007/978-3-319-90596-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-90595-2

  • Online ISBN: 978-3-319-90596-9

  • eBook Packages: Computer Science (R0)