Principles of Speech Coding

Part of the Springer Handbooks book series (SPRINGERHAND)

Abstract

Speech coding is the art of reducing the bit rate required to describe a speech signal. In this chapter, we discuss the attributes of speech coders as well as the underlying principles that determine their behavior and their architecture. The ubiquitous class of linear-prediction-based coders is used as an illustration. Speech is generally modeled as a sequence of stationary signal segments, each having unique statistics. Segments are encoded using a two-step procedure: (1) find a model describing the speech segment, and (2) encode the segment under the assumption that it was generated by that model. We show that the bit allocation for the model (the predictor parameters) is independent of the overall rate and of perception, which is consistent with existing experimental results. The modeling of perception is an important aspect of efficient coding, and we discuss how various perceptual distortion measures can be integrated into speech coders.
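
As a rough illustration of the two-step procedure described in the abstract, the sketch below fits an autoregressive (linear-prediction) model to one segment and then encodes the segment through that model by quantizing the prediction residual. Everything specific here is an assumption made for illustration and not taken from the chapter: the function names, the choice of the autocorrelation method with the Levinson-Durbin recursion, the open-loop uniform scalar quantizer, and the requirement that the segment is not all zeros.

```python
import numpy as np

def lpc_autocorrelation(segment, order):
    """Step 1 (model): estimate AR predictor coefficients for one segment.

    Uses the autocorrelation method solved by the Levinson-Durbin recursion.
    Returns (a, err) where the predictor is
        x_hat[n] = sum_k a[k] * x[n - 1 - k]
    and err is the residual energy predicted by the recursion.
    Assumes the segment has nonzero energy.
    """
    x = np.asarray(segment, dtype=float)
    n = len(x)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a = a_new
        err *= 1.0 - k * k
    return a, err

def prediction_residual(segment, a):
    """Inverse-filter the segment with A(z) = 1 - sum_k a[k] z^-(k+1)."""
    x = np.asarray(segment, dtype=float)
    return np.convolve(x, np.concatenate(([1.0], -a)))[: len(x)]

def quantize_uniform(values, step):
    """Step 2 (encode): hypothetical uniform scalar quantizer."""
    return np.round(np.asarray(values, dtype=float) / step) * step

def synthesize(residual, a):
    """Decoder side: run the (quantized) residual through 1 / A(z)."""
    y = np.zeros(len(residual))
    for n in range(len(residual)):
        past = y[max(0, n - len(a)): n][::-1]  # y[n-1], y[n-2], ...
        y[n] = residual[n] + np.dot(a[: len(past)], past)
    return y
```

A typical use would call `lpc_autocorrelation` once per segment, transmit the (quantized) predictor parameters together with `quantize_uniform(prediction_residual(x, a), step)`, and rebuild the segment with `synthesize`. Practical linear-prediction coders usually select the excitation in a closed loop (analysis-by-synthesis) under a perceptual distortion measure instead of quantizing the residual open-loop as done here.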

Keywords

Speech Signal; Autoregressive Model; Transmission Control Protocol; Mean Opinion Score; Speech Code

Abbreviations

AMR-WB: wide-band AMR speech coder
AR: autoregressive
DCT: discrete cosine transform
DFT: discrete Fourier transform
ERB: equivalent rectangular bandwidth
IP: internet protocol
JND: just-noticeable difference
LSF: line spectral frequency
MOS: mean opinion score
OSI: open systems interconnection reference model
TCP: transmission control protocol
UDP: user datagram protocol

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  1. School of Electrical Engineering, Sound and Image Processing Lab, Royal Institute of Technology (KTH), Stockholm, Sweden
