Analysis of Durations of Sound Units

Predicting Prosody from Text for Text-to-Speech Synthesis

Part of the book series: SpringerBriefs in Electrical and Computer Engineering ((BRIEFSSPEECHTECH))

Abstract

This chapter presents a detailed analysis of the durations of sound units. Syllable durations are analyzed with respect to positional and contextual factors. For a finer-grained analysis, syllables are grouped according to the size of the word they belong to and the position of that word in the utterance, and each group is analyzed separately. The analysis shows that the durations of sound units depend on several factors operating at multiple levels, which makes it difficult to derive precise rules for accurate duration estimation. This motivates the exploration of nonlinear models to capture the duration patterns of sound units from the features discussed in this chapter.
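The grouping step described above can be sketched as follows. This is a minimal illustration, not the chapter's actual procedure: the records, field names, and helper functions are hypothetical, standing in for a labeled speech corpus in which each syllable is annotated with its duration, the size of its word, and the word's position in the utterance.

```python
from statistics import mean, stdev

# Hypothetical records: (syllable, duration in ms, word size in syllables,
# word position in the utterance). A real analysis would read these from
# an annotated corpus.
records = [
    ("ka", 142.0, 2, "initial"),
    ("ma", 158.5, 2, "initial"),
    ("ra", 131.0, 3, "medial"),
    ("ta", 149.2, 3, "medial"),
    ("na", 188.7, 2, "final"),
    ("la", 201.3, 2, "final"),
]

def group_durations(records):
    """Bucket syllable durations by (word size, word position)."""
    groups = {}
    for _, dur, size, pos in records:
        groups.setdefault((size, pos), []).append(dur)
    return groups

def summarize(groups):
    """Compute per-category mean and spread of syllable duration."""
    return {
        key: (mean(durs), stdev(durs) if len(durs) > 1 else 0.0)
        for key, durs in groups.items()
    }

stats = summarize(group_durations(records))
for (size, pos), (m, s) in sorted(stats.items()):
    print(f"{size}-syllable word, {pos} position: "
          f"mean = {m:.1f} ms, sd = {s:.1f} ms")
```

Comparing the per-category means and spreads produced this way is what reveals the positional effects the chapter discusses; when no simple rule separates the categories cleanly, that is the cue to move to nonlinear models.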



    Google Scholar 

  161. T. V. Ananthapadmanabha and B. Yegnanarayana, “Epoch extraction from linear prediction residual for identification of closed glottis interval,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 27, pp. 309–319, Aug. 1979.

    Google Scholar 

  162. A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, New Jersey, USA: Prentice Hall, 1975.

    MATH  Google Scholar 

  163. B. Yegnanarayana, S. R. M. Prasanna, and K. S. Rao, “Speech enhancement using excitation source information,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, (Orlando, Florida, USA), pp. 541–544, May 2002.

    Google Scholar 

  164. D. Gabor, “Theory of communication,” J. IEE, vol. 93, no. 2, pp. 429–457, 1946.

    Google Scholar 

  165. N. S. Krishna, H. A. Murthy, and T. A. Gonsalves, “Text-to-speech (tts) in indian languages,” in Int. Conf. Natural Language Processing, 2002.

    Google Scholar 

  166. S. Srikanth, S. R. R. Kumar, R. Sundar, and B. Yegnanarayana, A text-to-speech conversion system for Indian languages based on waveform concatenation model. Technical report no.11, Project VOIS, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Mar. 1989.

    Google Scholar 

  167. B. Zellner, “Fast and slow speech rate: A characterization for French,” in Proc. Int. Conf. Spoken Language Processing, (Sydney, Australia.), pp. 542–545, Dec. 1998.

    Google Scholar 

  168. S. R. M. Prasanna and J. M. Zachariah, “Detection of vowel onset point in speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Orlando, Florida, USA), May 2002.

    Google Scholar 

  169. S. V. Gangashetty, C. C. Sekhar, and B. Yegnanarayana, “Extraction of fixed dimension patterns from varying duration segments of consonant-vowel utterances,” in Proc. IEEE Int. Conf. Intelligent Sensing and Information Processing, (Chennai, India), pp. 159–164, Jan. 2004.

    Google Scholar 

  170. Database for Indian languages. Speech and Vision lab, Indian Institute of Technology Madras, India, 2001.

    Google Scholar 

  171. H. A. Murthy and B. Yegnanarayana, “Formant extraction from group delay function,” Speech Communication, vol. 10, pp. 209–221, Mar. 1991.

    Article  Google Scholar 

  172. K. S. Rao, S. R. M. Prasanna, and B. Yegnanarayana, “Determination of instants of significant excitation in speech using hilbert envelope and group delay function,” IEEE Signal Processing Letters, vol. 14, pp. 762–765, Oct. 2007.

    Article  Google Scholar 

  173. K. S. Rao, “Real time prosody modification,” Journal of Signal and Information Processing, Nov. 2010.

    Google Scholar 

  174. K. S. Rao and B. Yegnanarayana, “Neural network models for text-to-speech synthesis,” in 5th International Conference on Knowledge Based Computer Systems (KBCS-2004), (Hyderabad, India), pp. 520–530, Dec. 2004.

    Google Scholar 

  175. K. S. Rao and B. Yegnanarayana, “Duration modification using glottal closure instants and vowel onset points,” Speech Communication, vol. 51, pp. 1263–1269, Dec. 2009.

    Article  Google Scholar 

  176. K. S. Rao and B. Yegnanarayana, “Voice conversion by prosody and vocal tract modification,” in 9th Int. Conf. Information Technology, (Bhubaneswar, Orissa, India), Dec 2006.

    Google Scholar 

  177. K. S. Rao, “Voice conversion by mapping the speaker-specific features using pitch synchronous approach,” Computer Speech and Language, vol. 24, pp. 474–494, July 2010.

    Article  Google Scholar 


Copyright information

© 2012 Springer Science+Business Media New York

About this chapter

Cite this chapter

Rao, K.S. (2012). Analysis of Durations of Sound Units. In: Predicting Prosody from Text for Text-to-Speech Synthesis. SpringerBriefs in Electrical and Computer Engineering. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1338-7_3

  • DOI: https://doi.org/10.1007/978-1-4614-1338-7_3

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1337-0

  • Online ISBN: 978-1-4614-1338-7
