The Representation of Speech in Deep Neural Networks

  • Odette Scharenborg
  • Nikki van der Gouw
  • Martha Larson (corresponding author)
  • Elena Marchiori
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)

Abstract

In this paper, we investigate the connection between how people understand speech and how speech is understood by a deep neural network (DNN). A naïve, general feed-forward DNN was trained for the task of vowel/consonant classification. Subsequently, the representations of the speech signal in the different hidden layers of the DNN were visualized. These visualizations allow us to study the distances between the representations of different types of input frames and to observe the clustering structures that these representations form. In the different visualizations, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. We investigate whether the DNN clusters speech representations in a way that corresponds to these linguistic categories, and we find evidence that, without being explicitly trained to do so, the DNN does indeed appear to learn structures that humans use to understand speech.
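
To make the pipeline concrete, the sketch below shows one way such an analysis can be set up: a small feed-forward classifier over acoustic frames whose hidden activations are projected to two dimensions with t-SNE and coloured by a linguistic category. This is a minimal illustration in PyTorch; the network shape, the 39-dimensional input features, and the names FrameClassifier and plot_hidden_layers are assumptions for illustration, not the authors' actual configuration.

    # Minimal sketch of the pipeline described above: a plain feed-forward
    # classifier over acoustic frames, plus a t-SNE projection of each hidden
    # layer. Layer sizes, the 39-dimensional input, and all names here are
    # illustrative assumptions, not the authors' actual setup.
    import torch
    import torch.nn as nn
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    class FrameClassifier(nn.Module):
        """Feed-forward net mapping one acoustic frame to vowel/consonant."""
        def __init__(self, n_features=39, n_hidden=256, n_classes=2):
            super().__init__()
            self.h1 = nn.Linear(n_features, n_hidden)
            self.h2 = nn.Linear(n_hidden, n_hidden)
            self.out = nn.Linear(n_hidden, n_classes)

        def forward(self, x, return_hidden=False):
            a1 = torch.relu(self.h1(x))
            a2 = torch.relu(self.h2(a1))
            logits = self.out(a2)
            return (logits, [a1, a2]) if return_hidden else logits

    def plot_hidden_layers(model, frames, category_labels):
        """Project each hidden layer to 2-D with t-SNE and colour points by
        a linguistic category (integer-coded: phoneme class, manner, or
        place of articulation)."""
        model.eval()
        with torch.no_grad():
            _, hidden = model(frames, return_hidden=True)
        for i, acts in enumerate(hidden, start=1):
            coords = TSNE(n_components=2, random_state=0).fit_transform(acts.numpy())
            plt.figure()
            plt.scatter(coords[:, 0], coords[:, 1], c=category_labels, s=4)
            plt.title(f"Hidden layer {i}")
        plt.show()

After training, calling plot_hidden_layers three times with the same frames but different label vectors (phoneme class, manner of articulation, place of articulation) yields the kind of comparison described above; fixing random_state keeps the t-SNE layouts comparable across the three colourings.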

Keywords

Deep neural networks · Speech representations · Visualizations

Acknowledgements

This work was carried out by the second author as part of a thesis project under the supervision of the first, third, and fourth authors. The first author was supported by a Vidi-grant from NWO (grant number: 276-89-003).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Odette Scharenborg (1, 2)
  • Nikki van der Gouw (2)
  • Martha Larson (1, 2), corresponding author
  • Elena Marchiori (2)

  1. Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands
  2. Radboud University, Nijmegen, The Netherlands
