International Journal of Speech Technology, Volume 2, Issue 2, pp 145–153

Speech recognition and synthesis technology development at NTT for telecommunications services

  • Kazuo Hakoda
  • Mikio Kitai
  • Shigeki Sagayama


This paper describes recent developments at NTT in the areas of speech recognition, speech synthesis, and interactive voice systems as they relate to telecommunications applications. Speaker-independent large-vocabulary speech recognition based on context-dependent phone models and an LR parser, and high-quality text-to-speech (TTS) conversion using the waveform concatenation method, both realized as software, have enabled interactive voice systems for fast and easy prototyping of telephone-based applications. Practical applications are discussed with examples.


speech recognition, speech synthesis, interactive voice systems





Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Kazuo Hakoda (1)
  • Mikio Kitai (1)
  • Shigeki Sagayama (1)
  1. NTT Human Interface Laboratories, Japan
