Hesitations in Spontaneous Speech: Acoustic Analysis and Detection

  • Vasilisa Verkhodanova
  • Vladimir Shapranov
  • Irina Kipyatkova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


Spontaneous speech differs from other types of speech in many ways, with speech disfluencies being its most prominent feature. These phenomena play an important role in communication but also cause problems for automatic speech processing. In this study we present the results of an acoustic analysis of the most frequent disfluencies, voiced hesitations (filled pauses and lengthenings), across different speaking styles in spontaneous Russian speech, as well as the results of experiments on their detection using an SVM classifier on a joint Russian and English spontaneous speech corpus. The acoustic analysis showed significant differences in the fundamental frequency and energy distribution ratios of hesitations and their contexts across speaking styles in Russian: compared to dialogues, speakers in monologues exhibit more prosodic cues for hesitations and their adjacent context. Experiments on the detection of voiced hesitations with an SVM on the mixed-language, mixed-style corpus yielded an F1-score of 0.48 (0.55 on the Russian data alone).
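The F1-score quoted above is the harmonic mean of precision and recall. As a minimal sketch of how such a score is computed, assuming binary labels per evaluation unit (1 for a voiced hesitation, 0 for other speech), the function below uses only the Python standard library. The unit of evaluation used in the study is not stated in this abstract, and the function name `f1_score` and the toy labels are hypothetical illustrations rather than the authors' code or data.

```python
def f1_score(reference, predicted):
    """Return the F1-score for binary labels (1 = hesitation, 0 = other)."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(reference, predicted))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    if tp == 0:
        # No correct detections: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, not data from the study.
reference = [0, 1, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 0, 0, 1, 1, 0, 1]
score = f1_score(reference, predicted)
```

Because F1 ignores true negatives, it suits detection tasks like this one, where hesitations are likely to make up only a small share of the speech signal.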


Speech disfluencies, Hesitations, Filled pauses, Lengthenings, Speech processing, Support vector machines



This research is supported by a grant from the Russian Foundation for Basic Research (project No. 15-06-04465) and by the Council for Grants of the President of the Russian Federation (project No. MK-1000.2017.8).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Vasilisa Verkhodanova (1)
  • Vladimir Shapranov (1)
  • Irina Kipyatkova (1)
  1. SPIIRAS, St. Petersburg, Russia
