Automatic Recognition of Sound Categories from Their Vocal Imitation Using Audio Primitives Automatically Found by SI-PLCA and HMM

  • Enrico Marchetto
  • Geoffroy Peeters
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11265)


In this paper we study the automatic recognition of sound categories (such as fridge, mixer or sawing sounds) from their vocal imitations. A vocal imitation is a succession over time of sounds produced with vocal mechanisms that can differ largely from those used in speech. We develop a recognition approach inspired by automatic-speech-recognition systems, with an acoustic model (which maps the audio signal to a set of probabilities over “phonemes”) and a language model (which represents the expected succession of “phonemes” for each sound category). Since the underlying “phonemes” of vocal imitations are not known, we propose to estimate them automatically using Shift-Invariant Probabilistic Latent Component Analysis (SI-PLCA) applied to a dataset of vocal imitations. The kernel distributions of the SI-PLCA are taken as the “phonemes” of vocal imitation, and its impulse distributions are used to compute the emission probabilities of the states of a set of Hidden Markov Models (HMMs). We evaluate the proposal on the task of automatically recognizing 12 sound categories from their vocal imitations.
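The recognition step can be illustrated with a minimal sketch, under assumptions that are not taken from the paper: one HMM per sound category, each HMM state tied to one SI-PLCA kernel, and the per-frame emission score of a state set to the normalized activation (impulse distribution) of its kernel. The function names `log_forward` and `classify`, the two toy categories "a" and "b", and all numeric values are hypothetical and serve only to show how a category could be chosen by comparing HMM log-likelihoods.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an implementation convenience, not from the paper


def log_forward(log_pi, log_A, log_B):
    """Forward algorithm in the log domain.

    log_pi: (K,) initial state log-probabilities
    log_A:  (K, K) transition log-probabilities, rows = previous state
    log_B:  (T, K) per-frame emission log-scores of each state
    Returns the log-score of the whole sequence under the model.
    """
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        m = alpha[:, None] + log_A          # (previous state, next state)
        mx = m.max(axis=0)                  # stabilizes the log-sum-exp
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0)) + log_B[t]
    mx = alpha.max()
    return float(mx + np.log(np.exp(alpha - mx).sum()))


def classify(activations, models):
    """Pick the category whose HMM gives the highest log-score.

    activations: (T, K) non-negative kernel activations per frame
                 (assumed here to stand in for the emission scores)
    models: dict mapping category name -> (pi, A)
    """
    act = np.asarray(activations, dtype=float)
    frame_scores = act / (act.sum(axis=1, keepdims=True) + EPS)
    log_B = np.log(frame_scores + EPS)
    scores = {}
    for name, (pi, A) in models.items():
        log_pi = np.log(np.asarray(pi, dtype=float) + EPS)
        log_A = np.log(np.asarray(A, dtype=float) + EPS)
        scores[name] = log_forward(log_pi, log_A, log_B)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy models: "a" expects kernel 0 then kernel 1, "b" the reverse.
models = {
    "a": ([0.9, 0.1], [[0.8, 0.2], [0.05, 0.95]]),
    "b": ([0.1, 0.9], [[0.95, 0.05], [0.2, 0.8]]),
}
# Toy activations: kernel 0 dominates the first frames, kernel 1 the last ones.
activations = [[0.9, 0.1], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.1, 0.9]]

label, scores = classify(activations, models)
print(label, scores)
```

In this sketch the transition matrices play the role of the "language model" (the expected succession of "phonemes" per category), while the normalized activations play the role of the acoustic model output. How the paper actually derives emission probabilities from the impulse distributions is not specified in the abstract, so the tying used here is only one plausible reading.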


Keywords: Vocal imitation · Sound design · Sound recognition · Shift-invariant probabilistic latent component analysis · Hidden Markov model



This work was supported by the Seventh Framework Programme (FP7) of the European Union (FP7-ICT-2013-C FET-Future Emerging Technologies) under grant agreement 618067 (SkAT-VG project).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. UMR STMS 9912 (IRCAM – CNRS – Sorbonne-University), Paris, France
