Abstract
A content-based movie analysis, indexing, and skimming system is developed in this research. It comprises three major modules: 1) an event detection module, which extracts three types of movie events from the content, namely two-speaker dialogs, multiple-speaker dialogs, and hybrid events, by integrating multiple media cues such as audio, speech, visual, and face information; 2) a speaker identification module, where an adaptive speaker identification scheme is proposed to recognize target movie cast members for content indexing. Both audio and visual sources are exploited in the identification process: the audio source is analyzed to recognize speakers with a likelihood-based approach, while the visual source is examined to locate talking faces using face detection/recognition and mouth-tracking techniques; 3) a movie skimming module, where an event-based skimming system abstracts the movie content into a short video clip for content browsing. Extensive experiments on integrating multiple media cues for movie content analysis, indexing, and skimming have yielded encouraging results.
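The likelihood-based speaker identification idea mentioned above can be illustrated with a minimal sketch: enroll each cast member with a simple statistical model of their acoustic features, then label an unknown utterance with the speaker whose model assigns it the highest likelihood. The sketch below is an assumption-laden simplification, not the chapter's implementation: it uses a single diagonal-covariance Gaussian per speaker (real systems typically use Gaussian mixture models) and random vectors standing in for MFCC features; all function and variable names are hypothetical.

```python
import numpy as np

def fit_speaker_model(features):
    # Single diagonal-covariance Gaussian per speaker -- a simplification
    # of the mixture models typically used in likelihood-based speaker ID.
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # floor the variance for stability
    return mean, var

def avg_log_likelihood(features, model):
    mean, var = model
    # Per-dimension Gaussian log densities, summed over dimensions
    # and averaged over frames.
    ll = -0.5 * (np.log(2 * np.pi * var) + (features - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def identify(features, models):
    # Pick the enrolled speaker whose model maximizes the likelihood.
    scores = {name: avg_log_likelihood(features, m) for name, m in models.items()}
    return max(scores, key=scores.get)

# Synthetic stand-ins for two speakers' 13-dimensional MFCC frames.
rng = np.random.default_rng(0)
train_a = rng.normal(0.0, 1.0, size=(200, 13))
train_b = rng.normal(3.0, 1.0, size=(200, 13))
models = {"A": fit_speaker_model(train_a), "B": fit_speaker_model(train_b)}

test_clip = rng.normal(3.0, 1.0, size=(50, 13))  # unlabeled utterance, truly "B"
print(identify(test_clip, models))
```

An adaptive scheme such as the one proposed in the chapter would additionally update the winning speaker's model with newly identified speech, so the models track the changing acoustic conditions over the course of the movie.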
© 2003 Springer Science+Business Media New York
Cite this chapter
Li, Y., Narayanan, S., Kuo, CC.J. (2003). Movie Content Analysis, Indexing and Skimming Via Multimodal Information. In: Rosenfeld, A., Doermann, D., DeMenthon, D. (eds) Video Mining. The Springer International Series in Video Computing, vol 6. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-6928-9_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5383-4
Online ISBN: 978-1-4757-6928-9