Generalized Structure of Active Speech Perception Based on Multiagent Intelligence

  • Conference paper
Biologically Inspired Cognitive Architectures 2021 (BICA 2021)

Abstract

Recent progress in speech technology is undeniable. Developers at Microsoft and IBM have reported automatic speech recognition systems that reach human-level performance in transcribing conversational telephone speech; according to various estimates, their word error rate (WER) is now about 5.8–5.1%. However, the most challenging problems in speech recognition, diarization and noise cancellation, remain open. A comparative analysis of the most frequent errors made by systems and by people in the recognition task shows that the errors are, in general, similar. Human errors, however, are much less critical: they seldom distort the meaning of a statement. In other words, these errors are not semantic. That is why the mechanisms of human speech perception are the most promising area of research. This paper proposes a model of the general structure of active auditory perception and presents the neurobiological basis of the hypothesis put forward. The proposed concept serves as a basic platform for a general multiagent architecture. We assume that speech recognition is guided by attention even in its early stages, that is, the early auditory code changes depending on context and experience. The model simulates the involuntary attention that children use in mastering their native language, which is based on an emotional assessment of perceptually significant auditory information. The multiagent internal dynamics of auditory speech coding may provide new insights into how hearing impairment can be treated. The formal description of the structure of speech perception can serve as a general theoretical basis for developing universal automatic speech recognition systems that are highly effective in noisy conditions and cocktail-party situations. Multiagent systems are the formal means for the software implementation of the present model.
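The WER figures quoted in the abstract can be made concrete with a minimal sketch of the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not material from the paper. The two invented hypotheses below receive the same score although only the second changes the meaning, which is one way to read the abstract's distinction between semantic and non-semantic errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words (Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, invented for illustration only.
reference = "do not open the door"
harmless = "do not open a door"     # one substitution, meaning preserved
critical = "do now open the door"   # one substitution, meaning reversed

print(word_error_rate(reference, harmless))
print(word_error_rate(reference, critical))
```

Both calls count a single substitution over five reference words, so WER alone cannot separate the harmless error from the critical one.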

The research was supported by the Russian Foundation for Basic Research, grant No. 19-01-00648.

Author information

Corresponding author

Correspondence to Irina Gurtueva.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nagoev, Z., Gurtueva, I., Anchekov, M. (2022). Generalized Structure of Active Speech Perception Based on Multiagent Intelligence. In: Klimov, V.V., Kelley, D.J. (eds) Biologically Inspired Cognitive Architectures 2021. BICA 2021. Studies in Computational Intelligence, vol 1032. Springer, Cham. https://doi.org/10.1007/978-3-030-96993-6_35
