Language Identification with Scarce Data: A Case Study from Peru

Espichán-Linares, Alexandra; Oncevay-Marcos, Arturo

doi:10.1007/978-3-319-90596-9_7

Conference paper
First Online: 21 April 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 795)

Abstract

Language identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual settings. However, the task still needs to be extended to scenarios with scarce data and many classes, and it remains to be determined which of the best-known methods fits them best. Peru, a multicultural and multilingual country, offers a demanding test case. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both a standard and a deep learning approach to language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, reaching 95.9% average precision. Both the corpus and the model will be useful inputs for more complex tasks in the future.
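As an illustration of step (2), a corpus-based identifier can be sketched in a few lines. The abstract does not say which "standard" model was used, so the sketch below assumes a character trigram Naive Bayes classifier, in the spirit of the n-gram methods in the reference list. The class name `NgramNaiveBayes`, the helper `macro_precision`, and the English/Spanish toy sentences are hypothetical and are not taken from the paper or its corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Assumption: this stands in for the "standard" approach; the paper's
    actual model and features are not given in the abstract.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training sentences
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, sentences, labels):
        for text, lang in zip(sentences, labels):
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.vocab.update(grams)
            self.docs[lang] += 1
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        grams = char_ngrams(text, self.n)
        best_lang, best_score = None, -math.inf
        for lang, gram_counts in self.counts.items():
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            denom = sum(gram_counts.values()) + len(self.vocab)
            score = math.log(self.docs[lang] / total_docs)
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


def macro_precision(gold, predicted):
    """Average the per-class precision over every class that appears in gold."""
    classes = sorted(set(gold))
    per_class = []
    for c in classes:
        gold_of_predicted_c = [g for g, p in zip(gold, predicted) if p == c]
        hits = gold_of_predicted_c.count(c)
        per_class.append(hits / len(gold_of_predicted_c) if gold_of_predicted_c else 0.0)
    return sum(per_class) / len(classes)


# Toy placeholder data (English/Spanish), not sentences from the Peruvian corpus.
train_sentences = [
    "the cat is on the table",
    "this is the house",
    "el gato está en la mesa",
    "esta es la casa",
]
train_labels = ["en", "en", "es", "es"]

model = NgramNaiveBayes(n=3).fit(train_sentences, train_labels)
print(model.predict("the house is big"))
print(model.predict("la casa es grande"))
```

Character n-grams are a common choice when data are scarce because they need no tokenizer or dictionary. The macro-averaged precision helper shows one plausible reading of "average precision" over many classes; whether the paper averages this way is an assumption.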

Notes

1. libenchant: https://github.com/AbiWord/enchant
2. chana.inf.pucp.edu.pe/resources/multi-lang-corpus

References

1. Bjerva, J.: Byte-based language identification with deep convolutional networks (2016). arXiv preprint arXiv:1609.09004
2. Botha, G.R., Barnard, E.: Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 26(5), 307–320 (2012)
3. Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–169 (1994)
4. Christodoulopoulos, C., Steedman, M.: A massively parallel corpus: the Bible in 100 languages. Lang. Resour. Eval. 49(2), 375–395 (2015)
5. Díaz, D.P. (ed.): Relatos de Nopoki. Universidad Católica Sedes Sapientiae (2012)
6. Forcada, M.: Open source machine translation: an opportunity for minor languages. In: Proceedings of the Workshop “Strategies for Developing Machine Translation for Minority Languages”, LREC, vol. 6, pp. 1–6. Citeseer (2006)
7. Grothe, L., De Luca, E.W., Nürnberger, A.: A comparative study on language identification methods. In: LREC (2008)
8. Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: Advances in Neural Information Processing Systems, pp. 473–479 (1997)
9. Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.A.: Hierarchical character-word models for language identification (2016). arXiv preprint arXiv:1608.03030
10. Kocmi, T., Bojar, O.: LanideNN: Multilingual language identification on character window (2017). arXiv preprint arXiv:1701.03338
11. Koller, D., Sahami, M.: Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab (1997)
12. Malmasi, S., Dras, M.: Automatic language identification for Persian and Dari texts. In: Proceedings of PACLING, pp. 59–64 (2015)
13. Martins, B., Silva, M.J.: Language identification in web pages. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 764–768. ACM (2005)
14. Mathur, P., Misra, A., Budur, E.: LIDE: Language identification from text documents (2017). arXiv preprint arXiv:1701.03682
15. McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: ICML, vol. 98, pp. 359–367 (1998)
16. Ministerio de Educación, Perú: Documento nacional de lenguas originarias del Perú (2013). http://repositorio.minedu.gob.pe/handle/123456789/3549
17. Pienaar, W., Snyman, D.: Spelling checker-based language identification for the eleven official South African languages. In: Proceedings of the 21st Annual Symposium of Pattern Recognition of SA, Stellenbosch, South Africa, pp. 213–216 (2011)
18. Prager, J.M.: Linguini: Language identification for multilingual documents. J. Manage. Inf. Syst. 16(3), 71–101 (1999)
19. Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, vol. 17991802, p. 21 (2006)
20. Rios, A.: A Basic Language Technology Toolkit for Quechua (2016)
21. Selamat, A., Akosu, N.: Word-length algorithm for language identification of under-resourced languages. J. King Saud Univ. Comput. Inf. Sci. 28(4), 457–469 (2016)
22. Universidad Católica Sedes Sapientiae: Relatos Matsigenkas. Universidad Católica Sedes Sapientiae (2015)
23. Valenzuela, P.: Transitivity in Shipibo-Konibo grammar. Ph.D. thesis, University of Oregon (2003)
24. Zariquiey Biondi, R.: A grammar of Kashibo-Kakataibo. Ph.D. thesis, La Trobe University (2011)

Acknowledgements

The authors thank J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to some private books written in indigenous languages [5, 22]. They also appreciate the collaboration of Dr. Roberto Zariquiey, linguistics professor at PUCP, who allowed the use of his own corpus for the Panoan family [24].

The authors further acknowledge the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT.

Author information

Authors and Affiliations

Research Group on Artificial Intelligence (IA-PUCP), Pontificia Universidad Católica del Perú, Lima, Peru
Alexandra Espichán-Linares & Arturo Oncevay-Marcos

Authors

Alexandra Espichán-Linares
Arturo Oncevay-Marcos

Corresponding author

Correspondence to Arturo Oncevay-Marcos.

Editor information

Editors and Affiliations

University of Florida, Gainesville, Florida, USA
Juan Antonio Lossio-Ventura
Universidad del Pacífico, Lima, Peru
Hugo Alatrista-Salas

About this paper

Cite this paper

Espichán-Linares, A., Oncevay-Marcos, A. (2018). Language Identification with Scarce Data: A Case Study from Peru. In: Lossio-Ventura, J., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2017. Communications in Computer and Information Science, vol 795. Springer, Cham. https://doi.org/10.1007/978-3-319-90596-9_7

DOI: https://doi.org/10.1007/978-3-319-90596-9_7
Published: 21 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-90595-2
Online ISBN: 978-3-319-90596-9
eBook Packages: Computer Science, Computer Science (R0)
