This chapter is concerned with feature extraction and back-end speech reconstruction, with particular emphasis on distributed speech recognition (DSR) and the work carried out by the ETSI Aurora group. Feature extraction is examined first, beginning with a basic implementation of mel-frequency cepstral coefficients (MFCCs). Additional processing, in the form of noise and channel compensation, is then explained, with the aim of increasing speech recognition accuracy in real-world environments. Source and channel coding issues relevant to DSR are also briefly discussed. Back-end speech reconstruction using a sinusoidal model is explained, showing how reconstruction is made possible by transmitting additional source information (voicing and fundamental frequency) from the terminal device. An alternative method of back-end speech reconstruction is then described, in which voicing and fundamental frequency are predicted from the received MFCC vectors. This enables speech to be reconstructed solely from the MFCC vector stream, with no explicit transmission of voicing or fundamental frequency.
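To make the basic MFCC front-end concrete, the following is a minimal single-frame sketch of the standard pipeline (windowing, magnitude spectrum, triangular mel filterbank, log, DCT-II). It is an illustrative NumPy implementation only, not the ETSI Aurora reference code; the frame length, filter count and coefficient count are assumed values for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=8000, n_filters=23, n_ceps=13):
    """Compute one MFCC vector from a single speech frame (illustrative sketch)."""
    n = len(frame)
    # Magnitude spectrum of the Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(n)))
    n_bins = len(spec)
    # Triangular filters equally spaced on the mel scale from 0 to fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_pts = np.floor((n + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Log filterbank energies, floored to avoid log(0)
    log_e = np.log(np.maximum(fbank @ spec, 1e-10))
    # DCT-II decorrelates the log energies into cepstral coefficients
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return dct @ log_e
```

In a full front-end this function would be applied to overlapping frames of the input signal, and the resulting vector stream augmented with delta and acceleration coefficients before compression and transmission.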


Keywords: Fundamental frequency · Discrete cosine transform · Speech recognition · Speech signal · Gaussian mixture model






Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Ben Milner, School of Computing Sciences, University of East Anglia, Norwich, UK
