Multimedia Tools and Applications, Volume 75, Issue 23, pp 15509–15524

Comparative study of singing voice detection methods

  • Shingchern D. You
  • Yi-Chung Wu
  • Shih-Hsien Peng


Detecting singing segments in a soundtrack is an important and useful technique in musical signal processing and retrieval. In this paper, we study the accuracy of detecting singing segments using an HMM (Hidden Markov Model) classifier with various features, including MFCC (Mel Frequency Cepstral Coefficients), LPCC (Linear Predictive Cepstral Coefficients), and LPC (Linear Prediction Coefficients). Simulation results show that detecting singing segments in a soundtrack is more difficult than detecting them among pure-instrument segments. In addition, combining MFCC and LPCC yields higher accuracy. The bootstrapping technique provides only a limited accuracy improvement when detecting all singing segments in a soundtrack. For completeness, we also conduct an experiment showing that the time to perform music identification can be reduced by more than 40% if the singing-voice detection mechanism is incorporated into the identification process.
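As a rough illustration of how two of the features named above relate, the sketch below converts LPC predictor coefficients into LPCC using the standard textbook recursion, and then joins an MFCC vector with an LPCC vector. The function names `lpc_to_lpcc` and `combine_features` are hypothetical, and frame-wise concatenation is only an assumed way of "combining" the features; the abstract does not state how the paper fuses them or which parameter values it uses.

```python
def lpc_to_lpcc(lpc, n_ceps):
    """Convert LPC predictor coefficients a_1..a_p into n_ceps cepstral coefficients.

    Assumes the predictor convention s[n] ~ sum_k a_k * s[n-k].
    The gain-dependent term c_0 is omitted.
    """
    p = len(lpc)
    ceps = []
    for m in range(1, n_ceps + 1):
        # The direct term a_m exists only while m <= p.
        c_m = lpc[m - 1] if m <= p else 0.0
        # Recursive term over the cepstral coefficients computed so far.
        for k in range(max(1, m - p), m):
            c_m += (k / m) * ceps[k - 1] * lpc[m - k - 1]
        ceps.append(c_m)
    return ceps


def combine_features(mfcc_frame, lpcc_frame):
    """Assumed fusion scheme: concatenate per-frame MFCC and LPCC vectors."""
    return list(mfcc_frame) + list(lpcc_frame)
```

For a single-pole predictor with coefficient a, the recursion reproduces the closed form c_n = a^n / n, which gives a quick sanity check on the implementation.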


MFCC, LPCC, HMM, vocal, bootstrapping, music identification



This work was supported in part by the National Science Council (NSC) and the Ministry of Science and Technology (MOST) of Taiwan through Grants NSC 101-2221-E-027-127 and MOST 103-2221-E-027-092.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Shingchern D. You (1)
  • Yi-Chung Wu (1)
  • Shih-Hsien Peng (1)

  1. Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei, Taiwan
