Multimedia Tools and Applications, Volume 76, Issue 5, pp 7421–7444

Ensemble audio segmentation for radio and television programmes

  • Paula Lopez-Otero
  • Laura Docio-Fernandez
  • Carmen Garcia-Mateo


State-of-the-art audio segmentation strategies obtain good results on simple tasks, but their performance degrades when segmenting real-world material such as radio and television programmes; this issue can be partially solved by fusing different audio segmentation strategies. Hence, this paper presents a framework to perform decision-level fusion in the audio segmentation task. First, the class-conditional probabilities of each audio segmentation strategy are estimated from a confusion matrix obtained by performing audio segmentation on a training dataset. Performance measures are extracted from these class-conditional probabilities and are used to compute different estimates of the classifier’s reliability; specifically, reliability estimates based on precision, recall, accuracy, F-score and mutual information are proposed. These reliability estimates are used as weights in a weighted majority voting fusion strategy. The validity of the proposed fusion scheme and reliability estimates was assessed in the framework of the Albayzin 2010, 2012 and 2014 audio segmentation evaluations, which consisted of segmenting collections of radio and television programmes. The experimental results showed that this simple fusion strategy improves on the performance achieved by the individual audio segmentation strategies and by other well-known decision-level fusion strategies.
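The fusion idea summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each classifier's reliability for a class is simply its per-class precision read off a training confusion matrix (rows are true classes, columns are predicted classes), and that each frame-level decision is cast as a vote weighted by that reliability. The function names `class_precision` and `weighted_majority_vote`, and the toy confusion matrices, are hypothetical and chosen only for illustration.

```python
import numpy as np


def class_precision(confusion):
    """Per-class precision from a confusion matrix (rows: true, columns: predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    predicted_totals = confusion.sum(axis=0)
    true_positives = np.diag(confusion)
    # Classes that were never predicted get a reliability of zero.
    return np.divide(
        true_positives,
        predicted_totals,
        out=np.zeros_like(true_positives),
        where=predicted_totals > 0,
    )


def weighted_majority_vote(decisions, weights):
    """Fuse frame-level decisions.

    decisions: (n_classifiers, n_frames) integer class labels.
    weights:   (n_classifiers, n_classes) reliability of each classifier per class.
    """
    decisions = np.asarray(decisions)
    weights = np.asarray(weights, dtype=float)
    n_classifiers, n_frames = decisions.shape
    scores = np.zeros((n_frames, weights.shape[1]))
    frames = np.arange(n_frames)
    for k in range(n_classifiers):
        # Classifier k adds its reliability to the class it chose on each frame.
        np.add.at(scores, (frames, decisions[k]), weights[k, decisions[k]])
    return scores.argmax(axis=1)


# Toy training confusion matrices for three hypothetical segmentation systems.
confusions = [
    [[90, 10], [10, 90]],  # reliable system
    [[40, 60], [60, 40]],  # unreliable system
    [[40, 60], [60, 40]],  # unreliable system
]
weights = np.stack([class_precision(c) for c in confusions])

# One row of frame-level decisions per system.
decisions = [
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
]
fused = weighted_majority_vote(decisions, weights)
```

In this toy case, the first and last frames are decided by the single reliable system even though the two unreliable ones outvote it in a plain majority count, which is the behaviour a reliability-weighted vote is meant to provide. Other reliability estimates mentioned in the abstract (recall, accuracy, F-score, mutual information) could be substituted for `class_precision` under the same interface.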


Ensemble classification · Confusion matrix · Reliability estimation · Audio segmentation



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. AtlantTIC Research Center, Multimedia Technologies Group, University of Vigo, Vigo, Spain
