Abstract
Speech analysis is traditionally performed using short-time analysis to extract features in the time and frequency domains. The window size for the analysis is fixed somewhat arbitrarily, mainly to account for the time-varying vocal tract system during production. However, in its primary mode of excitation, speech is produced by impulse-like excitation in each glottal cycle. Anchoring the speech analysis around the glottal closure instants (epochs) yields significant benefits. Epoch-based analysis not only helps to segment speech signals according to speech production characteristics, but also supports accurate analysis of speech. It allows extraction of important acoustic-phonetic features such as glottal vibrations, formants, and instantaneous fundamental frequency. The epoch sequence is useful for manipulating prosody in speech synthesis applications. Accurate estimation of epochs helps in characterizing voice quality features. Epoch extraction also helps in speech enhancement and multispeaker separation. In this tutorial article, the importance of epochs for speech analysis is discussed, and methods to extract epoch information are reviewed. The use of epoch extraction in several speech applications is demonstrated.
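To make epoch extraction and the derivation of instantaneous fundamental frequency from the epoch sequence concrete, here is a minimal NumPy sketch in the spirit of the zero-frequency filtering (ZFF) approach of Murty and Yegnanarayana. It differences the signal, passes it through two cascaded resonators at 0 Hz, removes the slowly varying trend by subtracting a local mean, and takes positive-going zero crossings as epochs. It is an illustration under stated assumptions, not the exact procedure reviewed in this article. The function names `zff_epochs` and `instantaneous_f0` are hypothetical, as are the defaults: a 10 ms trend-removal half-window and three trend-removal passes. It also assumes a speech polarity in which glottal closure appears as a negative-going impulse; for an inverted recording, negate the input first.

```python
import numpy as np


def zff_epochs(s, fs, avg_pitch_ms=10.0, passes=3):
    """Return estimated epoch locations (sample indices) via zero-frequency filtering.

    Hypothetical helper. Assumed defaults: trend-removal half-window N of
    avg_pitch_ms milliseconds, `passes` trend-removal passes, and a speech
    polarity in which glottal closure appears as a negative-going impulse.
    """
    s = np.asarray(s, dtype=float)
    n_half = max(1, int(round(avg_pitch_ms * 1e-3 * fs)))
    if len(s) <= 2 * n_half * passes:
        raise ValueError("signal too short for the trend-removal window")

    # Difference the signal to remove any DC offset: x[n] = s[n] - s[n-1].
    x = np.diff(s, prepend=s[0])

    # Two cascaded 0-Hz resonators, each y[n] = 2*y[n-1] - y[n-2] + x[n],
    # have transfer function 1/(1 - z^-1)^4, i.e. four cumulative sums.
    y = x
    for _ in range(4):
        y = np.cumsum(y)

    # Remove the polynomially growing trend by subtracting a (2N+1)-sample
    # moving average. 'valid' mode avoids zero-padding bias at the edges,
    # so y is trimmed by N samples per side on every pass.
    kernel = np.ones(2 * n_half + 1) / (2 * n_half + 1)
    offset = 0
    for _ in range(passes):
        y = y[n_half:-n_half] - np.convolve(y, kernel, mode="valid")
        offset += n_half

    # Epochs: positive-going zero crossings, y[k-1] < 0 <= y[k].
    k = np.nonzero((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
    return k + offset  # map back to indices of the original signal


def instantaneous_f0(epochs, fs):
    """Instantaneous F0 in Hz: sampling rate divided by each epoch interval."""
    return fs / np.diff(epochs)


if __name__ == "__main__":
    fs = 8000
    s = np.zeros(fs // 2)   # 0.5 s synthetic excitation
    s[40::80] = -1.0        # negative impulses every 10 ms (100 Hz)
    ep = zff_epochs(s, fs)
    print(instantaneous_f0(ep, fs)[:3])  # [100. 100. 100.]
```

Replacing each resonator with cumulative sums keeps the sketch free of dependencies beyond NumPy. Because the resonator output grows polynomially with signal length, long recordings can lose floating-point precision. The sketch does not address this, and it discards N samples per trend-removal pass at each end of the signal.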
Cite this article
YEGNANARAYANA, B., GANGASHETTY, S.V. Epoch-based analysis of speech signals. Sadhana 36, 651–697 (2011). https://doi.org/10.1007/s12046-011-0046-0