
The Representation of Speech and Its Processing in the Human Brain and Deep Neural Networks

  • Conference paper
  • Published in: Speech and Computer (SPECOM 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)


Abstract

For most of the world's languages, and for speech that deviates from standard pronunciation, not enough (annotated) speech data is available to train an automatic speech recognition (ASR) system. Moreover, human intervention is needed to adapt an ASR system to a new language or type of speech. Human listeners, on the other hand, are able to quickly adapt to non-standard speech and can learn the sound categories of a new language without being explicitly taught to do so. In this paper, I present comparisons between human speech processing and deep neural network (DNN)-based ASR, and argue that cross-fertilisation of the two research fields can provide valuable information for the development of ASR systems that can flexibly adapt to any type of speech in any language. Specifically, I present the results of several experiments, carried out on both human listeners and DNN-based ASR systems, on the representation of speech and on lexically-guided perceptual learning, i.e., the ability to adapt a sound category on the basis of new incoming information, resulting in improved processing of subsequent speech. The results showed that DNNs appear to learn, without being explicitly trained to do so, structures that humans use to process speech, and that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labelled examples. These results are first steps towards building ASR systems inspired by human speech processing that, like human listeners, can adjust flexibly and quickly to all kinds of new speech.
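To make the two strands of experiments concrete, two brief, hedged sketches in Python follow. Neither reproduces the paper's actual pipeline; all model names, dimensions, and data below are hypothetical stand-ins.

First, a common way to inspect the speech representation a DNN has learned, in the spirit of the representation experiments: project hidden-layer activations to two dimensions with t-SNE and colour each frame by its phone category. Clusters that align with phone labels suggest the network has learned phone-like structure without being told to. The `activations` and `phone_labels` arrays are random placeholders for what a trained phone classifier and an annotated corpus would provide.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data: in practice, `activations` would hold hidden-layer
# outputs of a trained phone classifier for N speech frames (N x D), and
# `phone_labels` the annotated phone identity of each frame (N,).
rng = np.random.default_rng(0)
activations = rng.normal(size=(500, 64))
phone_labels = rng.integers(0, 5, size=500)

# Project the high-dimensional activations to 2-D. If the network encodes
# phone-like structure, frames of the same category form visible clusters.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(activations)

plt.scatter(embedding[:, 0], embedding[:, 1], c=phone_labels, s=4, cmap="tab10")
plt.title("t-SNE of hidden-layer activations, coloured by phone category")
plt.show()
```

Second, one way to frame lexically-guided perceptual learning computationally: fine-tune only the output layer of a phone classifier on a handful of ambiguous tokens whose labels are supplied by lexical context (a listener hearing a sound ambiguous between [f] and [s] at the end of "gira_" can only be hearing /f/). This is a minimal sketch of few-shot boundary adaptation under these assumptions, not the paper's training recipe; the features and dimensions are again placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for the final layer of a trained phone classifier deciding
# between two categories, e.g. /f/ (class 0) and /s/ (class 1).
feat_dim = 40
classifier = nn.Linear(feat_dim, 2)

# A few tokens of one speaker's ambiguous sound. The labels come from
# lexical context, just as they do for human listeners.
torch.manual_seed(0)
ambiguous_feats = torch.randn(5, feat_dim)          # placeholder acoustic features
lexical_labels = torch.zeros(5, dtype=torch.long)   # lexicon says: all /f/

optimizer = torch.optim.SGD(classifier.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

# A few gradient steps on the output layer shift the category boundary
# toward this speaker; the rest of the network would stay frozen so the
# adaptation remains fast and speaker-specific.
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(classifier(ambiguous_feats), lexical_labels)
    loss.backward()
    optimizer.step()

# After adaptation, the ambiguous tokens should fall on the /f/ side.
print(classifier(ambiguous_feats).argmax(dim=1))    # expected: tensor of zeros
```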



Acknowledgments

I would like to thank Junrui Ni for carrying out the experiments described in Sect. 3.2 and Mark Hasegawa-Johnson for fruitful discussions on the experiments in Sect. 3.2.

Author information

Correspondence to Odette Scharenborg.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Scharenborg, O. (2019). The Representation of Speech and Its Processing in the Human Brain and Deep Neural Networks. In: Salah, A., Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol. 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_1

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science (R0)
