Le traitement du signal vocal (Voice signal processing)

Annales Des Télécommunications

Résumé

Le traitement de la parole a connu ces dernières années un formidable développement lié aux avancées technologiques des composants de traitement numérique des signaux et à la numérisation grandissante des réseaux. Cet article fournit une analyse des principales techniques qui se sont imposées récemment dans les domaines du codage, de la reconnaissance et de la synthèse de la parole. En compression de débit, l'accent est mis sur le codage par analyse/synthèse excité par code (code-excited linear prediction, CELP) qui domine les recherches actuelles dans une gamme de débits allant de 4 à 16 kbit/s. En reconnaissance de parole, on insiste sur l'adaptation aux lignes téléphoniques, le rejet des entrées parasites et la détection de mots-clés, trois éléments essentiels pour augmenter la robustesse des systèmes. En synthèse de la parole à partir du texte, la technique PSOLA (pitch synchronous overlap and add), qui a donné naissance à une nouvelle génération de systèmes de synthèse au timbre très naturel, est détaillée. L'analyse des tendances actuelles permet de dégager quelques axes prometteurs pour de futures recherches.

Abstract

Speech processing has advanced rapidly in recent years, spurred by major progress in VLSI technologies and by the increasing digitization of networks. This paper gives an overview of the main techniques that have been the focus of recent research and development in speech coding, recognition and synthesis. For speech compression, the emphasis is on the family of techniques known as code-excited linear prediction (CELP), which dominates current studies for rates in the range of 4 to 16 kbit/s. For speech recognition, particular emphasis is placed on three elements that are essential to making systems more robust: telephone line adaptation, rejection of parasitic noise and out-of-vocabulary words, and keyword spotting. For text-to-speech synthesis, the PSOLA (pitch synchronous overlap and add) technique is described; it has given rise to a new generation of synthesis systems that produce speech with a very natural timbre. An analysis of current trends in each area suggests promising directions for future research.

Cite this article

Combescure, P., Le Guyader, A., Jouvet, D. et al. Le traitement du signal vocal (Voice signal processing). Ann. Télécommun. 50, 142–164 (1995). https://doi.org/10.1007/BF03000774

  • DOI: https://doi.org/10.1007/BF03000774
