Language Identification Using Deep Convolutional Recurrent Neural Networks

  • Christian Bartz
  • Tom Herold
  • Haojin Yang
  • Christoph Meinel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10639)

Abstract

Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. We use a hybrid Convolutional Recurrent Neural Network (CRNN) that operates on spectrogram images of the provided audio snippets. In extensive experiments we show that our model is applicable to a range of noisy scenarios and can easily be extended to previously unknown languages while maintaining its classification accuracy. We release our code and a large-scale training set for LID systems to the community.
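
As a rough illustration of what "solving the problem in the image domain" can mean, the sketch below turns a one-dimensional audio signal into a grayscale spectrogram image of the kind a CRNN could take as input. It is not the authors' preprocessing pipeline: the function name `spectrogram_image`, the frame length, the hop size, the Hann window, and the log scaling are all assumptions chosen for the example, and only NumPy is used.

```python
import numpy as np

def spectrogram_image(signal, frame_len=256, hop=128):
    """Return a log-magnitude spectrogram scaled to 8-bit grayscale.

    Rows are frequency bins and columns are time frames.
    frame_len and hop are illustrative values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the signal into overlapping, Hann-windowed frames.
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])

    # Magnitude spectrum per frame, then log compression.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    log_mag = np.log1p(magnitude).T  # shape: (frequency, time)

    # Min-max scale to the 0..255 range of a grayscale image.
    span = log_mag.max() - log_mag.min()
    if span > 0:
        scaled = (log_mag - log_mag.min()) / span
    else:
        scaled = np.zeros_like(log_mag)
    return (scaled * 255).astype(np.uint8)

# Example input: one second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
image = spectrogram_image(np.sin(2 * np.pi * 440 * t))
```

The resulting two-dimensional array can be treated like any single-channel image: in a hybrid model of the kind the abstract describes, convolutional layers would extract local features from it and a recurrent layer would then read those features along the time axis.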

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Christian Bartz (1)
  • Tom Herold (1)
  • Haojin Yang (1)
  • Christoph Meinel (1)

  1. Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
