Speech Feature Extraction and Reconstruction
This chapter is concerned with feature extraction and back-end speech reconstruction, with particular emphasis on distributed speech recognition (DSR) and the work carried out by the ETSI Aurora group. Feature extraction is examined first, beginning with a basic implementation of mel-frequency cepstral coefficients (MFCCs). Additional processing, in the form of noise and channel compensation, is then explained; its aim is to increase speech recognition accuracy in real-world environments. Source and channel coding issues relevant to DSR are also briefly discussed. Back-end speech reconstruction using a sinusoidal model is explained, and it is shown how this becomes possible by transmitting additional source information (voicing and fundamental frequency) from the terminal device. An alternative method of back-end speech reconstruction is then described, in which voicing and fundamental frequency are predicted from the received MFCC vectors. This enables speech to be reconstructed solely from the MFCC vector stream, with no explicit transmission of voicing or fundamental frequency.
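The MFCC front-end outlined above can be sketched as follows. This is a minimal illustration of the standard pipeline (windowing, magnitude spectrum, mel filterbank, log, DCT), not the full ETSI ES 201 108 algorithm, which additionally specifies pre-emphasis, framing, log-energy and compression details. The parameter choices here (8 kHz sampling, 23 mel filters, 13 cepstral coefficients) follow the Aurora configuration; the function names are illustrative only.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2)
    hz_points = inv_mel(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(frame, sample_rate=8000, n_filters=23, n_ceps=13, n_fft=256):
    """MFCC vector for a single speech frame."""
    # Window the frame and take the magnitude spectrum
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    # Mel filterbank energies, floored to avoid log(0)
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_e = np.log(np.maximum(energies, 1e-10))
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_e
```

For example, a 25 ms frame at 8 kHz (200 samples) yields a 13-dimensional cepstral vector, to which velocity and acceleration features are typically appended before recognition.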
Keywords: Fundamental frequency · Discrete cosine transform · Speech recognition · Speech signal · Gaussian mixture model
- ETSI Standard ES 201 108 (2003a) Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms, version 1.1.3, September 23rd 2003.
- ETSI Standard ES 202 211 (2003b) Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Extended front-end feature extraction algorithm; Compression algorithms; Back-end speech reconstruction algorithm, version 1.1.1, November 14th 2003.
- ETSI Standard ES 202 050 (2007) Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms, version 1.1.5, January 11th 2007.
- George, E.B. and Smith, M.J.T. (1992) An analysis-by-synthesis approach to sinusoidal modelling applied to the analysis and synthesis of musical tones. J. Audio Eng. Soc., vol. 40, pp. 467-516.
- Hanson, B.A. and Applebaum, T.H. (1990) Robust speaker-independent word features using static, dynamic and acceleration features. In Proc. ICASSP, pp. 857-860.
- HTK (2007) Hidden Markov model toolkit, http://htk.eng.cam.ac.uk/
- Rabiner, L.R. and Schafer, R.W. (1978) Digital processing of speech signals, Prentice Hall, New Jersey, USA.
- Rosenberg, A.E., Lee, C.-H. and Soong, F.K. (1994) Cepstral channel normalization techniques for HMM-based speaker verification. In Proc. ICSLP, pp. 1835-1838.
- Shao, X. and Milner, B.P. (2004) Pitch prediction from MFCC vectors for speech reconstruction. In Proc. ICASSP.
- Vaseghi, S.V. (2006) Advanced digital signal processing and noise reduction, John Wiley & Sons.
- Wu, K. and Chen, P. (2001) Efficient speech enhancement using spectral subtraction for car hands-free application. Int. Conf. Consum. Electron., vol. 2, pp. 220-221.