Abstract
This chapter presents a detailed analysis of the durations of sound units. Syllable durations are analyzed with respect to positional and contextual factors. For the detailed analysis, syllables are categorized into groups based on the size of the word and the position of the word in the utterance, and the analysis is performed separately on each category. The analysis shows that the durations of sound units depend on several factors operating at various levels, which makes it difficult to derive precise rules for accurate duration estimation. This motivates the exploration of nonlinear models to capture the duration patterns of sound units from the features discussed in this chapter.
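The per-category analysis described above can be sketched as follows. This is a minimal illustration, not the chapter's actual procedure: the record fields (syllable label, duration in milliseconds, word size in syllables, word position in the utterance) and the sample values are hypothetical, chosen only to show how durations would be grouped by word size and word position before computing per-category statistics.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (syllable, duration_ms, word_size, word_position),
# where word_size is the number of syllables in the word and word_position
# is the word's position in the utterance.
records = [
    ("ka", 145.0, 2, "initial"),
    ("ma", 160.0, 2, "initial"),
    ("ra", 120.0, 3, "medial"),
    ("tha", 180.0, 3, "final"),
    ("na", 170.0, 3, "final"),
]

# Group durations by (word size, word position), so each category
# can be analyzed separately.
groups = defaultdict(list)
for syllable, dur, size, pos in records:
    groups[(size, pos)].append(dur)

# Summarize each category with count, mean, and population std. dev.
for (size, pos), durs in sorted(groups.items()):
    print(f"{size}-syllable word, {pos}: n={len(durs)}, "
          f"mean={mean(durs):.1f} ms, sd={pstdev(durs):.1f} ms")
```

With real data, each category's distribution would then be examined against the positional and contextual factors the chapter analyzes.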
© 2012 Springer Science+Business Media New York
Cite this chapter
Rao, K.S. (2012). Analysis of Durations of Sound Units. In: Predicting Prosody from Text for Text-to-Speech Synthesis. SpringerBriefs in Electrical and Computer Engineering. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1338-7_3
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1337-0
Online ISBN: 978-1-4614-1338-7