The Representation of Speech in Deep Neural Networks
In this paper, we investigate the connection between how people understand speech and how a deep neural network understands it. A naïve, general feed-forward deep neural network (DNN) was trained on the task of vowel/consonant classification, after which the representations of the speech signal in the DNN's hidden layers were visualized. These visualizations allow us to study the distances between the representations of different types of input frames and to observe the clustering structures those representations form. In the different visualizations, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. We investigate whether the DNN clusters speech representations in a way that corresponds to these linguistic categories, and we observe evidence that the DNN does indeed appear to learn structures that humans use to understand speech, without being explicitly trained to do so.
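The analysis described above can be sketched as follows: pass speech frames through a feed-forward network, collect the activations of a hidden layer, and project them to two dimensions for visualization. The paper does not specify the architecture or the visualization method, so the layer sizes, the feature dimensionality, the random (untrained) weights, and the use of PCA in place of a method such as t-SNE are all illustrative assumptions; this is a minimal sketch of the extraction pipeline, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Toy "acoustic frames": 200 frames of 39 MFCC-like features
# (dimensionality assumed for illustration).
frames = rng.standard_normal((200, 39))

# Two hidden layers with random, untrained weights, standing in for a
# trained vowel/consonant classifier; here we only illustrate how
# hidden representations are extracted.
W1 = rng.standard_normal((39, 64)) * 0.1
W2 = rng.standard_normal((64, 32)) * 0.1

h1 = relu(frames @ W1)   # representations at hidden layer 1
h2 = relu(h1 @ W2)       # representations at hidden layer 2

def pca_2d(acts):
    """Project activations to 2-D via PCA (SVD on centered data)."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# One 2-D point per input frame; coloring these points by phoneme
# class, manner, or place of articulation yields the kind of labeled
# visualization the paper describes.
coords = pca_2d(h2)
print(coords.shape)  # (200, 2)
```

Distances between these projected points can then be inspected for clustering by linguistic category, as the paper does across layers.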
Keywords: Deep neural networks · Speech representations · Visualizations
This work was carried out by the second author as part of a thesis project under the supervision of the first, third, and fourth authors. The first author was supported by a Vidi-grant from NWO (grant number: 276-89-003).