Language Identification Using Deep Convolutional Recurrent Neural Networks

Bartz, Christian; Herold, Tom; Yang, Haojin; Meinel, Christoph

doi:10.1007/978-3-319-70136-3_93

Christian Bartz¹⁸,
Tom Herold¹⁸,
Haojin Yang¹⁸ &
…
Christoph Meinel¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 10639))

Included in the following conference series:

International Conference on Neural Information Processing

3779 Accesses
55 Citations

Abstract

Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. We use a hybrid Convolutional Recurrent Neural Network (CRNN) that operates on spectrogram images of the provided audio snippets. In extensive experiments we show, that our model is applicable to a range of noisy scenarios and can easily be extended to previously unknown languages, while maintaining its classification accuracy. We release our code and a large scale training set for LID systems to the community.

C.Bartz and T. Herold—Equal contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems arXiv:1603.04467 (2016)
Blackman, R.B., Tukey, J.W.: The measurement of power spectra from the point of view of communications engineering-part I. Bell Labs Tech. J. 37(1), 185–282 (1958)
Article Google Scholar
Chollet, F.: keras: deep learning library for python. Runs on TensorFlow, Theano or CNTK (2017). https://github.com/fchollet/keras
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)
Article Google Scholar
Gelly, G., Gauvain, J.L., Le, V., Messaoudi, A.: A divide-and-conquer approach for language identification based on recurrent neural networks. In: INTERSPEECH 2016, pp. 3231–3235 (2016)
Google Scholar
Gelly, G., Gauvain, J.L., Lamel, L., Laurent, A., Le, V.B., Messaoudi, A.: Language Recognition for Dialects and Closely Related Languages. Odyssey, Bilbao (2016)
Book Google Scholar
Gonzalez-Dominguez, J., Lopez-Moreno, I., Moreno, P.J., Gonzalez-Rodriguez, J.: Frame-by-frame language identification in short utterances using deep neural networks. Neural Netw. 64, 49–58 (2015)
Article Google Scholar
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456 (2015)
Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, San Diego (2015)
Google Scholar
Lozano-Dez, A., Zazo Candil, R., Gonzlez Domnguez, J., Toledano, D.T., Gonzlez-Rodrguez, J.: An end-to-end approach to language identification in short utterances using convolutional neural networks. International Speech and Communication Association (2015)
Google Scholar
Martnez, D., Plchot, O., Burget, L., Glembek, O., Matjka, P.: Language recognition in ivectors space. In: Twelfth Annual Conference of the International Speech Communication Association (2011)
Google Scholar
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010)
Google Scholar
Plchot, O., Matejka, P., Glembek, O., Fer, R., Novotny, O., Pesan, J., Burget, L., Brummer, N., Cumani, S.: Bat system description for NIST LRE 2015. Odyssey 2016, pp. 166–173 (2016)
Google Scholar
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Google Scholar
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 91–99. Curran Associates Inc., New York (2015)
Google Scholar
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2298–2304 (2016)
Article Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Google Scholar
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision, pp. 2818–2826 (2016)
Google Scholar
Zazo, R., Lozano-Diez, A., Gonzalez-Dominguez, J., Toledano, D.T., Gonzalez-Rodriguez, J.: Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS ONE 11(1), e0146917 (2016)
Article Google Scholar
Zissman, M.A., et al.: Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4(1), 31 (1996)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Christian Bartz, Tom Herold, Haojin Yang & Christoph Meinel

Authors

Christian Bartz
View author publications
You can also search for this author in PubMed Google Scholar
Tom Herold
View author publications
You can also search for this author in PubMed Google Scholar
Haojin Yang
View author publications
You can also search for this author in PubMed Google Scholar
Christoph Meinel
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christian Bartz .

Editor information

Editors and Affiliations

Guangdong University of Technology, Guangzhou, China
Derong Liu
Guangdong University of Technology, Guangzhou, China
Shengli Xie
South China University of Technology, Guangzhou, China
Yuanqing Li
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Dongbin Zhao
King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
El-Sayed M. El-Alfy

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bartz, C., Herold, T., Yang, H., Meinel, C. (2017). Language Identification Using Deep Convolutional Recurrent Neural Networks. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science(), vol 10639. Springer, Cham. https://doi.org/10.1007/978-3-319-70136-3_93

Download citation

DOI: https://doi.org/10.1007/978-3-319-70136-3_93
Published: 26 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70135-6
Online ISBN: 978-3-319-70136-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics