Automatic syllabification of speech signal using short time energy and vowel onset points

International Journal of Speech Technology


This paper describes a language-independent method for automatic syllabification of speech signals. The method uses valleys in the short-time energy (STE) contour and the locations of vowel onset points (VOPs) to mark syllable boundaries. Syllabification is performed in three steps. First, long silence/pause regions are marked using speech/non-speech detection. Second, VOPs are located from the Hilbert envelope of the linear prediction (LP) residual. More than one VOP in a continuous speech region (as identified by the speech/non-speech detection of the first step) indicates syllable boundaries within that region, and the location of minimum energy in the STE contour between two consecutive VOPs is taken as the syllable boundary. Because the automatic VOP detection algorithm misses some VOPs, some syllable boundaries are missed as well. Therefore, in the third step, additional syllable boundaries are detected from the STE contour using a valley threshold equal to the mean STE of each speech region between two consecutive syllable boundaries. The method is evaluated on 50 sentences each of read, extempore, and conversational speech in Malayalam and Bengali. An overall accuracy of 80% is obtained within a ±50 ms tolerance of manually marked syllable boundaries for this database. The method also shows good accuracy on TIMIT and NTIMIT data without tuning of thresholds or other parameters. It is useful for applications that require a meaningful separation of syllables rather than exact syllable boundaries. Its application to prosody-based emotion recognition is illustrated using the Emo-DB German emotional database.
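The second and third steps, and the ±50 ms scoring, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the VOP frame indices and the speech-region limits are already available from the earlier stages, and it reads the "valley threshold" as "a local STE minimum lying below the segment mean", which is only one plausible interpretation of the abstract. All function names, the frame parameters, and the accuracy definition (fraction of reference boundaries matched) are hypothetical.

```python
import numpy as np


def short_time_energy(x, frame_len, hop):
    """Sum of squared samples per frame (frames overlap when hop < frame_len)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])


def boundaries_from_vops(ste, vop_frames):
    """Step 2: minimum-energy frame strictly between consecutive VOPs."""
    vops = sorted(vop_frames)
    bounds = []
    for a, b in zip(vops[:-1], vops[1:]):
        if b - a > 1:  # at least one frame lies between the two VOPs
            bounds.append(a + 1 + int(np.argmin(ste[a + 1:b])))
    return bounds


def extra_boundaries(ste, region, bounds):
    """Step 3 (assumed reading): local STE minima below each segment's mean.

    region is (start, end) in frames, end exclusive; bounds are the
    boundaries already found inside that region.
    """
    start, end = region
    edges = [start] + sorted(bounds) + [end]
    extra = []
    for a, b in zip(edges[:-1], edges[1:]):
        seg = ste[a:b]
        if len(seg) < 3:
            continue
        threshold = seg.mean()  # valley threshold for this segment
        for i in range(1, len(seg) - 1):
            is_valley = seg[i] < seg[i - 1] and seg[i] <= seg[i + 1]
            if is_valley and seg[i] < threshold:
                extra.append(a + i)
    return extra


def boundary_accuracy(detected, reference, tol=0.05):
    """Fraction of reference boundaries (s) with a detection within ±tol (s)."""
    detected = np.asarray(detected, dtype=float)
    if len(reference) == 0:
        return 0.0
    hits = sum(bool(np.any(np.abs(detected - r) <= tol)) for r in reference)
    return hits / len(reference)


# Toy STE contour with three energy humps; the third VOP is "missed".
ste = np.array([1, 5, 9, 5, 2, 6, 10, 6, 1.5, 4, 8, 4, 1], dtype=float)
from_vops = boundaries_from_vops(ste, [1, 5])          # [4]
recovered = extra_boundaries(ste, (0, len(ste)), from_vops)  # [8]
```

In the toy contour, the valley at frame 4 is found from the two detected VOPs, and the valley at frame 8 is recovered by the mean-threshold pass even though no VOP was supplied for the third hump. On real speech the STE contour would normally need smoothing before a local-minimum test like this is usable.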



The authors would like to thank the Kerala State Council for Science, Technology and Environment (KSCSTE), Government of Kerala, India, for its support.

Corresponding author

Correspondence to Leena Mary.

Cite this article

Mary, L., Antony, A.P., Babu, B.P. et al. Automatic syllabification of speech signal using short time energy and vowel onset points. Int J Speech Technol 21, 571–579 (2018).
