Springer Handbook of Speech Processing, pp. 283-306
Principles of Speech Coding
Abstract
Speech coding is the art of reducing the bit rate required to describe a speech signal. In this chapter, we discuss the attributes of speech coders as well as the underlying principles that determine their behavior and their architecture. The ubiquitous class of linear-prediction-based coders is used as an illustration. Speech is generally modeled as a sequence of stationary signal segments, each having unique statistics. Segments are encoded using a two-step procedure: (1) find a model describing the speech segment, (2) encode the segment assuming it is generated by the model. We show that the bit allocation for the model (the predictor parameters) is independent of overall rate and of perception, which is consistent with existing experimental results. The modeling of perception is an important aspect of efficient coding and we discuss how various perceptual distortion measures can be integrated into speech coders.
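The two-step procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's coder: step (1) fits an autoregressive model with the autocorrelation method and the Levinson-Durbin recursion, and step (2) uses a simple closed-loop predictive quantizer with a uniform step size as a stand-in for the linear-prediction-based coders the chapter covers. All function names, the step size, and the synthetic test signal are hypothetical, and the predictor parameters are left unquantized for brevity.

```python
import random


def autocorrelation(segment, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] for one segment."""
    n = len(segment)
    return [
        sum(segment[t] * segment[t - lag] for t in range(lag, n)) / n
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Step 1: solve the Yule-Walker equations for the AR model.

    Returns (a, err), where the prediction of x[t] is
    sum_k a[k] * x[t-1-k] and err is the prediction-error variance.
    """
    a, err = [], r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1.
        k_m = (r[m + 1] - sum(a[k] * r[m - k] for k in range(m))) / err
        a = [a[k] - k_m * a[m - 1 - k] for k in range(m)] + [k_m]
        err *= 1.0 - k_m * k_m
    return a, err


def predict(a, history):
    """Linear prediction from past samples (most recent sample last)."""
    return sum(c * history[-1 - k] for k, c in enumerate(a[:len(history)]))


def encode_segment(segment, a, step):
    """Step 2: quantize the prediction error, given the model.

    Assumption: closed-loop scheme, so the encoder predicts from the
    same reconstructed samples the decoder will have.
    """
    recon, indices = [], []
    for x in segment:
        pred = predict(a, recon)
        idx = round((x - pred) / step)  # uniform scalar quantizer
        indices.append(idx)
        recon.append(pred + idx * step)
    return indices


def decode_segment(indices, a, step):
    """Rebuild the segment from quantizer indices and the model."""
    recon = []
    for idx in indices:
        recon.append(predict(a, recon) + idx * step)
    return recon


def synthetic_ar_segment(n, coeffs, seed=0):
    """Hypothetical test signal: an AR process driven by white noise."""
    rng = random.Random(seed)
    x = []
    for _ in range(n):
        x.append(predict(coeffs, x) + rng.gauss(0.0, 1.0))
    return x


if __name__ == "__main__":
    segment = synthetic_ar_segment(2000, [1.3, -0.6], seed=1)
    r = autocorrelation(segment, 2)
    a, err = levinson_durbin(r, 2)
    indices = encode_segment(segment, a, step=0.25)
    decoded = decode_segment(indices, a, step=0.25)
    worst = max(abs(x - y) for x, y in zip(segment, decoded))
    print("estimated predictor:", a)
    print("prediction-error variance:", err, "signal variance:", r[0])
    print("largest reconstruction error:", worst)
```

Because the encoder predicts from reconstructed samples, the reconstruction error of each sample equals the quantization error of its prediction residual, so it stays within half a step. A practical coder would also quantize and transmit the predictor parameters, typically in a representation such as line spectral frequencies, and would replace the plain squared-error behaviour of this quantizer with a perceptual distortion measure.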
Keywords
Speech Signal, Autoregressive Model, Transmission Control Protocol, Mean Opinion Score, Speech Code
Abbreviations
- AMR-WB: wide-band AMR speech coder
- AR: autoregressive
- DCT: discrete cosine transform
- DFT: discrete Fourier transform
- ERB: equivalent rectangular bandwidth
- IP: internet protocol
- JND: just-noticeable difference
- LSF: line spectral frequency
- MOS: mean opinion score
- OSI: open systems interconnection reference model
- TCP: transmission control protocol
- UDP: user datagram protocol