Automatic Speech and Singing Discrimination for Audio Data Indexing

  • Wei-Ho Tsai
  • Cin-Hao Ma
Part of the International Series on Computer Entertainment and Media Technology book series (ISCEMT)


In this study, we propose a technique for automatically discriminating speech from singing voices, which is useful for indexing large audio collections. The proposed approach is based on both timbre and pitch feature analyses. For timbre features, voice recordings are converted into Mel-Frequency Cepstral Coefficients and their first derivatives, which are then analyzed using Gaussian mixture models. For pitch features, voice recordings are converted into MIDI note sequences, and bigram models are used to analyze the dynamic change information of the notes. Our experiments, conducted on a database of 600 test recordings from 20 subjects, show that the proposed system achieves 94.3% accuracy.
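The pitch-based analysis can be illustrated with a small sketch: train one note-transition (bigram) model per class, then label a test recording by comparing log-likelihoods. This is a minimal illustration only; the MIDI note sequences below are hypothetical, and add-one smoothing is an assumption, not necessarily the smoothing used in the actual system.

```python
import numpy as np

def train_bigram(sequences, n_notes=128):
    """Estimate note-transition probabilities with add-one (Laplace) smoothing."""
    counts = np.ones((n_notes, n_notes))  # smoothing: every transition starts at 1
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1

def log_likelihood(seq, trans):
    """Log-probability of a note sequence under a bigram transition model."""
    return sum(np.log(trans[a, b]) for a, b in zip(seq[:-1], seq[1:]))

# Hypothetical training data: singing tends to hold discrete pitches and
# leap between them, while speech pitch drifts in small steps.
singing_train = [[60, 60, 60, 64, 64, 67, 67, 60], [62, 62, 65, 65, 69, 69, 62]]
speech_train  = [[50, 51, 50, 49, 50, 51, 52, 51], [48, 49, 48, 47, 48, 49, 48]]

singing_model = train_bigram(singing_train)
speech_model  = train_bigram(speech_train)

test = [60, 60, 64, 64, 67, 60]  # held notes with leaps, i.e. singing-like
label = ("singing"
         if log_likelihood(test, singing_model) > log_likelihood(test, speech_model)
         else "speech")
print(label)  # the singing model assigns this sequence higher likelihood
```

In the full system, this bigram score would be combined with the GMM-based timbre score rather than used alone.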


Keywords: Discrimination · Pitch · Singing · Speech · Timbre · Voice



This work was supported in part by the National Science Council, Taiwan, under Grant NSC101-2628-E-027-001.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan
