Abstract
This chapter discusses the effectiveness of syllable-based prosodic features for speaker recognition. The term prosody refers to a collection of characteristics such as intonation, stress and timing, expressed primarily through variations in pitch, energy and duration at various levels of speech. Prosody reflects the learned or acquired speaking habits of a person and hence contributes to speaker recognition. Because prosodic features are less affected by channel mismatch and noise, they are particularly well suited to speaker forensics, a field that demands accurate identification of suspects with as few mitigating conditions as possible. The author describes a method for extracting prosodic features directly from the speech signal. In this method, speech is segmented into syllable-like regions using vowel onset points (VOPs). The locations of the VOPs serve as reference points for the extraction and representation of prosodic features. The effectiveness of the prosodic features for speaker recognition is demonstrated on the extended task of the NIST speaker recognition evaluation 2003. Combining evidence from spectral features with that of the proposed prosodic features improves overall speaker recognition accuracy.
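The segmentation idea can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes the VOP locations and frame-level pitch and energy contours are already available, and the function name `syllable_prosody`, the 10 ms frame shift, and the particular summary statistics are hypothetical choices made for illustration only.

```python
from statistics import mean


def syllable_prosody(f0, energy, vops, frame_shift=0.01):
    """Summarise pitch, energy and duration for each VOP-delimited region.

    f0, energy  : per-frame values; f0 == 0 marks an unvoiced frame (assumption)
    vops        : sorted frame indices of vowel onset points (assumed given)
    frame_shift : seconds per frame (assumed 10 ms)
    """
    if len(f0) != len(energy):
        raise ValueError("f0 and energy must have the same number of frames")

    # Each region runs from one VOP to the next; the last one runs to the end.
    bounds = list(vops) + [len(f0)]
    features = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end <= start:
            continue  # skip empty or out-of-order regions
        voiced = [v for v in f0[start:end] if v > 0]
        features.append({
            "duration_s": (end - start) * frame_shift,
            "mean_f0_hz": mean(voiced) if voiced else 0.0,
            # Pitch change across the region: last voiced frame minus first.
            "delta_f0_hz": voiced[-1] - voiced[0] if voiced else 0.0,
            "mean_energy": mean(energy[start:end]),
        })
    return features


if __name__ == "__main__":
    # Toy contours with two syllable-like regions starting at frames 2 and 6.
    f0 = [0, 0, 120, 125, 130, 0, 110, 105, 100, 0]
    energy = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.6, 0.7, 0.5, 0.1]
    for region in syllable_prosody(f0, energy, vops=[2, 6]):
        print(region)
```

Each region yields one small feature vector, so an utterance becomes a sequence of syllable-level prosodic descriptors anchored at the VOPs. Detecting the VOPs themselves is a separate step that this sketch does not attempt.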
Acknowledgement
The author would like to thank Prof. B. Yegnanarayana and the members of the Speech and Vision Laboratory, IIT Madras, India, during 2002–2006 for their support in carrying out the study described in this chapter.
Copyright information
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Mary, L. (2012). Prosodic Features for Speaker Recognition. In: Neustein, A., Patil, H. (eds) Forensic Speaker Recognition. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0263-3_13
DOI: https://doi.org/10.1007/978-1-4614-0263-3_13
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-0262-6
Online ISBN: 978-1-4614-0263-3
eBook Packages: Engineering, Engineering (R0)