Scene Determination Using Auditive Segmentation Models of Edited Video

  • Silvia Pfeiffer
  • Uma Srinivasan
Part of The Springer International Series in Video Computing book series (VICO, volume 4)


This chapter describes approaches that use audio features to determine scenes in edited video. It focuses on analyzing the sound track of a video to extract higher-level video structure. We define a scene in a video as a temporal interval that is semantically coherent. The semantic coherence of a scene is often constructed during the cinematic editing of a video. An example is the use of music to join several shots into a scene that describes a lengthy passage of time, such as the journey of a character. Some semantic coherence is also inherent in the unedited video material, such as the sound ambience at a specific setting or the pattern of speaker changes in a dialog. Another kind of semantic coherence is constructed from the textual content of the sound track, revealing, for example, the different stories contained in a news broadcast or documentary. This chapter explains, from a film art perspective, the types of scenes that can be constructed via audio cues. It discusses the feasibility of automatically extracting these scene types and finally presents a survey of existing approaches.


Scene determination; scene types; audio content analysis; sound classes; shot clustering; audio segmentation





Copyright information

© Springer Science+Business Media Dordrecht 2002

Authors and Affiliations

  • Silvia Pfeiffer (1)
  • Uma Srinivasan (1)
  1. CSIRO Mathematical and Information Sciences, North Ryde, Australia
