Abstract
The advent of affordable wearable devices with a video camera has established a new form of social data, lifelogs, in which people's lives are captured on video. The enormous amount of lifelog data and the need for on-site processing demand new, fast video processing methods. In this work, we experimentally investigate seven hours of lifelogs and report two novel findings: (1) audio cues are exceptionally strong for lifelog processing; (2) cascades of audio and video detectors improve accuracy and enable fast (super frame rate) processing. We first construct strong detectors using state-of-the-art audio and visual features: Mel-frequency cepstral coefficients (MFCC), colour (RGB) histograms, and local patch descriptors (SIFT). In the second stage, we construct a cascade of the trained detectors and optimise the cascade parameters. Separating the detector and cascade optimisation stages simplifies training and results in a fast and accurate processing pipeline.
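The cascade idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the stage order, thresholds, or costs, so the early-exit rule, the cheap-audio-first ordering, the cost values, and all names below (`run_cascade`, `STAGES`, the `segment` dictionary with precomputed scores) are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# One cascade stage: (score function, reject threshold, accept threshold, cost).
# All values used below are hypothetical, not taken from the paper.
Stage = Tuple[Callable[[Dict[str, float]], float], float, float, float]


def run_cascade(segment: Dict[str, float], stages: List[Stage]) -> Tuple[bool, float]:
    """Evaluate stages in order and stop as soon as one is confident.

    Returns (is_scene_change, total_cost_spent).
    """
    total_cost = 0.0
    score = 0.0
    for score_fn, reject_below, accept_above, cost in stages:
        score = score_fn(segment)
        total_cost += cost
        if score < reject_below:   # confidently "no scene change": exit early
            return False, total_cost
        if score > accept_above:   # confidently "scene change": exit early
            return True, total_cost
    # No stage was confident: fall back to the last stage's score.
    return score >= 0.5, total_cost


# Stand-in detectors that read precomputed scores in [0, 1] from the segment.
# Real detectors would compute MFCC, RGB histogram, or SIFT features instead.
STAGES: List[Stage] = [
    (lambda s: s["audio"], 0.2, 0.8, 1.0),    # assumed cheapest: audio (MFCC)
    (lambda s: s["colour"], 0.3, 0.7, 5.0),   # assumed mid cost: RGB histogram
    (lambda s: s["sift"], 0.5, 0.5, 50.0),    # assumed most expensive: SIFT
]

quiet = {"audio": 0.05, "colour": 0.9, "sift": 0.9}
ambiguous = {"audio": 0.5, "colour": 0.5, "sift": 0.6}
print(run_cascade(quiet, STAGES))      # rejected by the audio stage alone
print(run_cascade(ambiguous, STAGES))  # needs all three stages
```

In this sketch the expensive stages run only on segments that the cheaper stages cannot decide, which is the general reason a cascade can lower the average processing cost per segment.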
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Mahkonen, K., Kämäräinen, JK., Virtanen, T. (2015). Lifelog Scene Change Detection Using Cascades of Audio and Video Detectors. In: Jawahar, C., Shan, S. (eds) Computer Vision - ACCV 2014 Workshops. ACCV 2014. Lecture Notes in Computer Science, vol 9010. Springer, Cham. https://doi.org/10.1007/978-3-319-16634-6_32
DOI: https://doi.org/10.1007/978-3-319-16634-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16633-9
Online ISBN: 978-3-319-16634-6
eBook Packages: Computer Science, Computer Science (R0)