Abstract
Recent progress in speech technology is undeniable. Developers from Microsoft and IBM have reported that automatic speech recognition systems reach human-level performance in transcribing conversational telephone speech. According to various estimates, their word error rate (WER) is now about 5.1–5.8%. However, the most challenging problems in speech recognition (diarization and noise cancellation) are still open. A comparative analysis of the most frequent errors made by systems and by people in the recognition task shows that, in general, the errors are similar. However, the errors made by humans are much less critical: they seldom distort the meaning of a statement. In other words, these errors are not semantic. That is why the mechanisms of human speech perception are the most promising area of research. This paper proposes a model of the general structure of active auditory perception and presents the neurobiological basis for the hypothesis put forward. The proposed concept is a basic platform for a general multiagent architecture. We assume that speech recognition is guided by attention even in its early stages, through changes in the early auditory code that are determined by context and experience. The model simulates the involuntary attention that children use in mastering their native language, which is based on an emotional assessment of perceptually significant auditory information. The multiagent internal dynamics of auditory speech coding can provide new insights into how hearing impairment can be treated. The formal description of the structure of speech perception can serve as a general theoretical basis for developing universal automatic speech recognition systems that are highly effective in noisy conditions and cocktail-party situations. Multiagent systems are the formal means for implementing the model in software.
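For readers unfamiliar with the metric quoted above, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring procedure used in the cited evaluations; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or text normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that this score weights every word error equally, so it does not distinguish errors that change the meaning of a statement from those that do not, which is the distinction the abstract draws between machine and human errors.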
The research was supported by the Russian Foundation for Basic Research, grant No. 19-01-00648.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nagoev, Z., Gurtueva, I., Anchekov, M. (2022). Generalized Structure of Active Speech Perception Based on Multiagent Intelligence. In: Klimov, V.V., Kelley, D.J. (eds) Biologically Inspired Cognitive Architectures 2021. BICA 2021. Studies in Computational Intelligence, vol 1032. Springer, Cham. https://doi.org/10.1007/978-3-030-96993-6_35
DOI: https://doi.org/10.1007/978-3-030-96993-6_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96992-9
Online ISBN: 978-3-030-96993-6
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)