Introduction

  • K. Sreenivasa Rao
Chapter
Part of the SpringerBriefs in Electrical and Computer Engineering book series.

Abstract

This chapter discusses the significance of prosodic knowledge, both for developing speech systems by machine and for the performance of various speech tasks by human beings. The manifestation of prosodic knowledge in speech at the linguistic, articulatory, acoustic and perceptual levels is described. Some of the inherent prosodic knowledge sources present in speech, which can be analyzed automatically by machine, are also discussed. The chapter concludes with the basic objective, scope and organization of the contents of the book.

Keywords

Respiration 

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • K. Sreenivasa Rao
    School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, India