Artificial Intelligence Review, Volume 42, Issue 4, pp 637–661

A survey of tagging techniques for music, speech and environmental sound

  • Shufei Duan
  • Jinglan Zhang
  • Paul Roe
  • Michael Towsey


Sound tagging has been studied for years. Among all sound types, music, speech, and environmental sound are the three most active research areas. This survey provides an overview of state-of-the-art developments in these areas. We begin by discussing what tagging means in each sound area. Some examples of sound tagging applications are then introduced to illustrate the significance of this research. Typical tagging techniques include manual, automatic, and semi-automatic approaches. After reviewing work in music, speech, and environmental sound tagging, we compare the three areas and summarise research progress to date. Research gaps are identified for each area, along with the commonalities and differences among the three areas. Published datasets, tools used by researchers, and evaluation measures frequently applied in the analysis are listed. Finally, we summarise the worldwide distribution of countries that have been active in sound tagging research.


Sound tagging, Music tagging, Speech recognition, Environmental sound tagging, Manual tagging, Automatic tagging, Semi-automatic tagging





Copyright information

© Springer Science+Business Media Dordrecht 2012

Authors and Affiliations

  • Shufei Duan (1)
  • Jinglan Zhang (1)
  • Paul Roe (1)
  • Michael Towsey (1)

  1. Faculty of Science and Engineering, Queensland University of Technology, Brisbane, Australia
