International Journal of Speech Technology

Volume 15, Issue 3, pp 407–417

Analysis and detection of mimicked speech based on prosodic features

  • Leena Mary
  • K. K. Anish Babu
  • Aju Joseph

Abstract

This paper describes work aimed at understanding how professional mimicry artists imitate the speech characteristics of known persons, and explores the possibility of detecting whether a given speech sample is genuine or an impostor's. The study follows a systematic approach of collecting three categories of speech data—the original speech of the mimicry artists, their speech while mimicking chosen celebrities, and the original speech of those celebrities—to analyze the variations in prosodic features. A method is described for automatically extracting relevant prosodic features in order to model speaker characteristics. Speech is first segmented into intonation phrases using speech/nonspeech classification, and further segmented at valleys in the energy contour. Intonation, duration and energy features are extracted for each of these segments. The intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration and change in energy. The prosodic features extracted from the original speech of the celebrities and the mimicry artists are used to build speaker models with a Support Vector Machine (SVM), and detection of a given speech sample as genuine or impostor is attempted within a speaker verification framework based on these SVM models.
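The Legendre-polynomial approximation of the intonation curve mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the F0 values, the segment length and the polynomial degree are all assumptions chosen for demonstration.

```python
import numpy as np
from numpy.polynomial import legendre as leg

# Hypothetical F0 (pitch) samples for one intonation-phrase segment, in Hz.
f0 = np.array([120.0, 132.0, 145.0, 150.0, 148.0, 140.0, 128.0, 115.0])

# Map the segment's time axis onto [-1, 1], the natural domain of
# Legendre polynomials, so segments of different lengths are comparable.
t = np.linspace(-1.0, 1.0, len(f0))

# Least-squares fit of a low-order Legendre series; the few coefficients
# give a compact parameterisation of the intonation curve (degree 3 is an
# illustrative choice, not necessarily the paper's setting).
coeffs = leg.legfit(t, f0, deg=3)

# Reconstruct the smoothed contour from the coefficients to check the fit.
approx = leg.legval(t, coeffs)
print(coeffs)                  # 4 coefficients: the contour-shape features
print(np.round(approx, 1))    # smoothed approximation of the F0 contour
```

The coefficient vector, rather than the raw F0 samples, would then serve as part of the per-segment feature vector fed to the SVM speaker models.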

Keywords

Speaker recognition · Analysis of mimicked speech · Prosodic features · Intonation · Support vector machines

Acknowledgement

The authors would like to thank the Kerala State Council for Science, Technology and Environment, India, for providing financial support for the study described in this paper.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Government Engineering College, Barton Hill, Trivandrum, Kerala, India
  2. Rajiv Gandhi Institute of Technology, Kottayam, Kerala, India