Significance of Epoch Identification Accuracy in Prosody Modification for Effective Emotion Conversion

  • S. Lakshmi Priya
  • D. Govind
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 968)


Estimating accurate pitch marks is an essential step in epoch based time and pitch scale (prosody) modification of a given speech signal. In epoch based prosody modification, the perceptual quality of the time and pitch scale modified speech depends on the accuracy with which the glottal closure instants (epochs) are estimated. The objective of the present work is to improve the perceptual quality of the prosody modified speech by accurately estimating the epoch locations. In the present work, the effectiveness of variational mode decomposition (VMD) in spectral smoothing and of the wavelet synchrosqueezing transform (WSST) in time-frequency sharpening of a given signal is exploited to refine the zero frequency filtering (ZFF) method, one of the simplest and most popular epoch extraction methods. The proposed refinements to the ZFF method are found to provide improved epoch estimation performance on emotive speech utterances, where the conventional ZFF method shows severe degradation due to rapid pitch variations. Improved mean opinion scores are obtained in subjective evaluation tests performed on speech whose prosody was modified using the epochs estimated by the refined ZFF method. The improved perceptual quality of the prosody modified speech is attributed to the better identification accuracy of the epochs estimated by the proposed method, as compared to the conventional ZFF method, on emotive speech signals.
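The ZFF pipeline that the abstract refines can be sketched as follows: difference the signal, pass it through a cascade of two zero-frequency resonators, remove the resulting polynomial trend with a moving average comparable to the average pitch period, and pick the zero crossings of the filtered signal as epochs. This is a minimal illustration only, not the authors' implementation; the function name `zff_epochs`, the 10 ms default trend window, and the three trend-removal passes are assumptions chosen for the sketch.

```python
import numpy as np

def zff_epochs(speech, fs, win_ms=10.0):
    """Minimal zero-frequency filtering (ZFF) sketch for epoch estimation.

    speech : 1-D array, fs : sampling rate (Hz),
    win_ms : trend-removal window, ideally near the average pitch period.
    Returns sample indices of positive-going zero crossings of the
    trend-removed zero-frequency filtered signal.
    """
    # Difference the signal to remove any slowly varying DC bias.
    y = np.diff(speech, prepend=speech[:1]).astype(float)

    # Cascade of two zero-frequency resonators (double pole at z = 1):
    # y[n] = 2*y[n-1] - y[n-2] + x[n]
    for _ in range(2):
        out = np.zeros(len(y) + 2)
        for n in range(len(y)):
            out[n + 2] = 2.0 * out[n + 1] - out[n] + y[n]
        y = out[2:]

    # Trend removal: subtract a moving average whose length is close to
    # the average pitch period; repeated passes (3 here, an assumption)
    # suppress the polynomial trend introduced by the resonators.
    half = max(1, int(round(0.5 * win_ms * fs / 1000.0)))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")

    # Epochs: positive-going zero crossings of the ZFF signal.
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
```

On a clean synthetic voiced segment (an impulse train through a decaying resonance), the detected crossings recur once per pitch period; the paper's point is that on emotive speech with rapid pitch variations this fixed-window version degrades, which motivates the VMD- and WSST-based refinements.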


Keywords: Zero frequency filtering · Variational mode decomposition · Wavelet synchrosqueezing transform



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
