Summary
Speech processing has developed remarkably in recent years, driven by technological advances in digital signal processing components and by the growing digitization of networks. This article analyzes the main techniques that have recently become established in speech coding, recognition, and synthesis. For bit-rate compression, the emphasis is on code-excited linear prediction (CELP), an analysis-by-synthesis coding technique that dominates current research at rates from 4 to 16 kbit/s. For speech recognition, the focus is on adaptation to telephone lines, rejection of spurious inputs, and keyword spotting, three elements essential to making systems more robust. For text-to-speech synthesis, the article details the PSOLA (pitch synchronous overlap and add) technique, which gave rise to a new generation of synthesis systems with a very natural timbre. An analysis of current trends points to a few promising directions for future research.
Abstract
Speech processing research has advanced rapidly in recent years, spurred by major progress in VLSI technology and in the digitization of networks. This paper gives an overview of the techniques that have attracted the most recent research and development in speech coding, recognition, and synthesis. For speech compression, the emphasis is on a family of techniques called code-excited linear prediction (CELP), which dominates current studies at rates from 4 to 16 kbit/s. For speech recognition, particular emphasis is placed on three elements that are essential to increasing system robustness: telephone line adaptation, rejection of parasitic noise and out-of-vocabulary words, and keyword spotting. For text-to-speech synthesis, the PSOLA (pitch synchronous overlap and add) technique is outlined; it has given rise to a new generation of synthesis systems that produce speech with a very natural timbre. An analysis of current trends in each area suggests promising directions for future research.
Cite this article
Combescure, P., Le Guyader, A., Jouvet, D. et al. Le traitement du signal vocal (Voice signal processing). Ann. Télécommun. 50, 142–164 (1995). https://doi.org/10.1007/BF03000774
DOI: https://doi.org/10.1007/BF03000774
Keywords
- Speech processing
- Speech coding
- Speech recognition
- Speech synthesis
- Review article
- Bandwidth compression
- Telecommunication application
- History
- State of the art
- Predictive coding