On the Use of Audio Events for Improving Video Scene Segmentation

  • Panagiotis Sidiropoulos
  • Vasileios Mezaris
  • Ioannis Kompatsiaris
  • Hugo Meinedo
  • Miguel Bugalho
  • Isabel Trancoso
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 158)


This work deals with the problem of automatically segmenting a video into elementary semantic units known as scenes. Its novelty lies in the use of high-level audio information, in the form of audio events, to improve scene segmentation performance. More specifically, the proposed technique builds upon a recently introduced audio-visual scene segmentation approach that constructs multiple scene transition graphs (STGs), each separately exploiting information from a different modality. In the extension of that approach presented in this work, audio event detection results are incorporated into the definition of an audio-based STG, while a visual-based STG is defined independently. The results of these two types of STGs are then combined. Experiments applying the proposed technique to broadcast videos demonstrate the usefulness of audio events for scene segmentation and highlight the importance of introducing additional high-level information into scene segmentation algorithms.
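The scene transition graph idea that this approach builds on (originally introduced by Yeung, Yeo and Liu) can be illustrated with a minimal Python sketch. It assumes each shot has already been assigned a cluster label by some modality-specific shot clustering (visual descriptors for the visual STG; for the audio STG, descriptors that might include audio event detections). How those labels are produced is not described in this abstract and is left out. The function names `stg_scene_boundaries` and `combine_boundaries` are hypothetical, and the union/intersection fusion is an illustrative stand-in, not the combination scheme used by the authors.

```python
from typing import Dict, Hashable, List, Sequence, Set


def stg_scene_boundaries(labels: Sequence[Hashable]) -> Set[int]:
    """Return the indices i such that a scene boundary lies between shot i and shot i+1.

    labels[i] is the cluster label of shot i (assumed precomputed by some
    modality-specific shot clustering). In the STG, nodes are clusters and
    each transition from shot i to shot i+1 adds an edge between their
    clusters. In this simplified sketch, a transition is treated as a scene
    boundary when no cluster occurs both in shots 0..i and in shots after i;
    the transition is then the only link between the two parts of the
    graph, so it is a cut edge.
    """
    # Last shot index at which each cluster label occurs.
    last: Dict[Hashable, int] = {}
    for idx, lab in enumerate(labels):
        last[lab] = idx

    boundaries: Set[int] = set()
    reach = -1  # furthest last occurrence among clusters seen so far
    for idx, lab in enumerate(labels[:-1]):
        reach = max(reach, last[lab])
        if reach == idx:
            # No cluster seen up to shot idx recurs later: cut here.
            boundaries.add(idx)
    return boundaries


def combine_boundaries(visual: Set[int], audio: Set[int],
                       require_both: bool = False) -> List[int]:
    """Hypothetical fusion rule, not the chapter's combination scheme:
    union of both STGs' boundaries by default, intersection if require_both."""
    merged = (visual & audio) if require_both else (visual | audio)
    return sorted(merged)


if __name__ == "__main__":
    visual = stg_scene_boundaries([0, 1, 0, 2, 3, 2, 4])  # {2, 5}
    audio = stg_scene_boundaries([0, 0, 0, 1, 1, 2, 2])   # {2, 4}
    print(combine_boundaries(visual, audio))                      # [2, 4, 5]
    print(combine_boundaries(visual, audio, require_both=True))  # [2]
```

With the visual labels above, the sketch places boundaries after shots 2 and 5, giving scenes made of shots 0 to 2, 3 to 5, and 6. The audio labels split after shots 2 and 4. A union-style fusion favours recall (more boundaries kept), while an intersection favours precision; either is only a placeholder for a real multi-STG combination.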


Keywords: Video analysis · Scene segmentation · Audio events · Scene transition graph



This work was supported by the European Commission under contracts FP6-045547 VIDI-Video and FP7-248984 GLOCAL.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Panagiotis Sidiropoulos (1, 2)
  • Vasileios Mezaris (1)
  • Ioannis Kompatsiaris (1)
  • Hugo Meinedo (3)
  • Miguel Bugalho (3, 4)
  • Isabel Trancoso (3, 4)

  1. Centre for Research and Technology Hellas, Informatics and Telematics Institute, Thermi, Greece
  2. Faculty of Engineering and Physical Sciences, Center for Vision, Speech and Signal Processing, University of Surrey, Guildford, Surrey, UK
  3. INESC-ID Lisboa, Rua Alves Redol 9, Portugal
  4. IST/UTL, Rua Alves Redol 9, Portugal
