Abstract
Language identification is a fundamental task in natural language processing, where corpus-based methods achieve state-of-the-art results in multilingual settings. However, the task still needs to be extended to scenarios with scarce data and many classes, and it remains to be determined which of the best-known methods fits them best. Peru, a multicultural and multilingual country, offers a demanding test case. This study therefore proceeds in three steps: (1) building from scratch a digital annotated corpus for 49 Peruvian indigenous languages and dialects, (2) fitting both standard and deep learning approaches for language identification, and (3) statistically comparing the results obtained. As expected, the standard model outperforms the deep learning one, reaching 95.9% average precision, and both the corpus and the model will be useful inputs for more complex tasks in the future.
References
Bjerva, J.: Byte-based language identification with deep convolutional networks (2016). arXiv preprint arXiv:1609.09004
Botha, G.R., Barnard, E.: Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 26(5), 307–320 (2012)
Cavnar, W.B., Trenkle, J.M.: N-gram-based text categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–169 (1994)
Christodoulopoulos, C., Steedman, M.: A massively parallel corpus: the Bible in 100 languages. Lang. Resour. Eval. 49(2), 375–395 (2015)
Díaz, D.P. (ed.): Relatos de Nopoki. Universidad Católica Sedes Sapientiae (2012)
Forcada, M.: Open source machine translation: an opportunity for minor languages. In: Proceedings of the Workshop “Strategies for Developing Machine Translation for Minority Languages”, LREC, vol. 6, pp. 1–6. Citeseer (2006)
Grothe, L., De Luca, E.W., Nürnberger, A.: A comparative study on language identification methods. In: LREC (2008)
Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: Advances in Neural Information Processing Systems, pp. 473–479 (1997)
Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.A.: Hierarchical character-word models for language identification (2016). arXiv preprint arXiv:1608.03030
Kocmi, T., Bojar, O.: LanideNN: Multilingual language identification on character window (2017). arXiv preprint arXiv:1701.03338
Koller, D., Sahami, M.: Hierarchically classifying documents using very few words. Technical report, Stanford InfoLab (1997)
Malmasi, S., Dras, M.: Automatic language identification for Persian and Dari texts. In: Proceedings of PACLING, pp. 59–64 (2015)
Martins, B., Silva, M.J.: Language identification in web pages. In: Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 764–768. ACM (2005)
Mathur, P., Misra, A., Budur, E.: LIDE: Language identification from text documents (2017). arXiv preprint arXiv:1701.03682
McCallum, A., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: ICML, vol. 98, pp. 359–367 (1998)
Ministerio de Educación, Perú: Documento nacional de lenguas originarias del Perú (2013). http://repositorio.minedu.gob.pe/handle/123456789/3549
Pienaar, W., Snyman, D.: Spelling checker-based language identification for the eleven official South African languages. In: Proceedings of the 21st Annual Symposium of Pattern Recognition of SA, Stellenbosch, South Africa, pp. 213–216 (2011)
Prager, J.M.: Linguini: Language identification for multilingual documents. J. Manage. Inf. Syst. 16(3), 71–101 (1999)
Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, pp. 1799–1802 (2006)
Rios, A.: A Basic Language Technology Toolkit for Quechua (2016)
Selamat, A., Akosu, N.: Word-length algorithm for language identification of under-resourced languages. J. King Saud Univ. Comput. Inf. Sci. 28(4), 457–469 (2016)
Universidad Católica Sedes Sapientiae: Relatos Matsigenkas. Universidad Católica Sedes Sapientiae (2015)
Valenzuela, P.: Transitivity in Shipibo-Konibo grammar. Ph.D. thesis, University of Oregon (2003)
Zariquiey Biondi, R.: A grammar of Kashibo-Kakataibo. Ph.D. thesis, La Trobe University (2011)
Acknowledgements
The authors thank J. Rubén Ruiz, bilingual education professor at NOPOKI, for providing access to some private books written in indigenous languages [5, 22]. They also appreciate the collaboration of Dr. Roberto Zariquiey, linguistics professor at PUCP, who allowed the use of his own corpus for the Panoan family [24].
Furthermore, the authors acknowledge the support of the “Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica” (CONCYTEC Perú) under contract 225-2015-FONDECYT.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Espichán-Linares, A., Oncevay-Marcos, A. (2018). Language Identification with Scarce Data: A Case Study from Peru. In: Lossio-Ventura, J., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2017. Communications in Computer and Information Science, vol 795. Springer, Cham. https://doi.org/10.1007/978-3-319-90596-9_7
DOI: https://doi.org/10.1007/978-3-319-90596-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-90595-2
Online ISBN: 978-3-319-90596-9
eBook Packages: Computer Science, Computer Science (R0)