Latent semantic learning with time-series cross correlation analysis for video scene detection and classification

Cheng, Shyi-Chyi; Su, Jui-Yuan; Hsiao, Kuei-Fang; Rashvand, Habib F.

doi:10.1007/s11042-015-2548-y

Published: 19 March 2015

Volume 75, pages 12919–12940, (2016)

Multimedia Tools and Applications

Abstract

This paper presents a novel latent semantic learning method, based on the proposed time-series cross correlation analysis, that extracts a discriminative dynamic scene model for two recognition problems: video event recognition and 3D human body gesture recognition. Typical dynamic texture analysis poses the problems of modeling, learning, recognizing and synthesizing the images of dynamic scenes based on the autoregressive moving average (ARMA) model. Instead of applying the ARMA approach to capture the temporal structure of video sequences, the proposed algorithm uses the learned dynamic scene model to semantically transform video sequences into multiple scenes with lower computational effort. Generating a discriminative dynamic scene model that preserves space-time information is therefore crucial to the success of the proposed latent semantic learning. To achieve this goal, k-medoids clustering with appearance distance metrics is first used to partition all frames of the training video sequences, regardless of their scene types, to provide an initial key-frame codebook. To discover the temporal structure of the dynamic scene model, we develop a time-series cross correlation analysis (TSCCA) for the latent semantic learning, with an alternating dynamic programming (ADP) procedure that embeds the temporal relationships between the training images into the dynamic scene model. We also tackle a known problem of dynamic programming, which is thought to produce large temporal misalignment for periodic activities. Moreover, the discriminative power of the model is estimated by a deterministic projection-based learning algorithm. Finally, based on the learned dynamic scene model, a support vector machine (SVM) with a two-channel string kernel is used for video scene classification. Two test datasets, one for video event classification and the other for 3D human body gesture recognition, are used to verify the effectiveness of the proposed approach. Experimental results demonstrate that the proposed algorithm obtains good performance in terms of classification accuracy.
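
The first step described in the abstract, clustering all training frames with k-medoids to obtain an initial key-frame codebook, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (euclidean, k_medoids, encode) are hypothetical, plain Euclidean distance stands in for the paper's appearance distance metrics, the farthest-first initialization is a simplifying assumption, and the 2-D points are toy stand-ins for frame features. The sketch does not cover TSCCA, ADP, or the string-kernel SVM.

```python
def euclidean(a, b):
    """Euclidean distance; a stand-in for an appearance distance metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_medoids(points, k, dist=euclidean, max_iter=100):
    """Alternating k-medoids; returns the sorted indices of the medoids.

    Initialization is farthest-first starting from index 0 (an assumption
    made here so that the result is deterministic).
    """
    n = len(points)
    medoids = [0]
    while len(medoids) < k:
        # Pick the point farthest from its nearest already-chosen medoid.
        medoids.append(max(
            range(n),
            key=lambda i: min(dist(points[i], points[m]) for m in medoids),
        ))
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist(points[i], points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster emptied
                new_medoids.append(m)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            ))
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


def encode(frames, codebook, dist=euclidean):
    """Map each frame to the index of its nearest codebook key frame."""
    return [
        min(range(len(codebook)), key=lambda c: dist(f, codebook[c]))
        for f in frames
    ]


# Toy "frame features": two well-separated groups of 2-D points.
frames = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (12.0, 10.0)]
medoids = k_medoids(frames, k=2)
codebook = [frames[i] for i in medoids]

# A new sequence becomes a string of codeword indices.
sequence = [(0.2, 0.1), (9.0, 9.0), (10.5, 10.0), (1.0, 1.0)]
labels = encode(sequence, codebook)
print(medoids, labels)
```

Because medoids are actual training frames rather than averaged vectors, each codeword in such a codebook is itself a key frame. A sequence encoded this way is a string of key-frame indices, which is the kind of input a string kernel operates on.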

Acknowledgments

This work was supported in part by the Ministry of Science and Technology, Taiwan, under grant numbers MOST 103-2221-E-019-018-MY2 and MOST 103-2511-S-130-003. The authors thank Mr. Yun-Lun Chen for his technical support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung City, Taiwan
Shyi-Chyi Cheng & Jui-Yuan Su
Department of Information Management, Ming-Chuan University, Taipei, Taiwan
Kuei-Fang Hsiao
Advanced Communication Systems, University of Warwick, Coventry, UK
Habib F. Rashvand

Corresponding author

Correspondence to Kuei-Fang Hsiao.

About this article

Cite this article

Cheng, SC., Su, JY., Hsiao, KF. et al. Latent semantic learning with time-series cross correlation analysis for video scene detection and classification. Multimed Tools Appl 75, 12919–12940 (2016). https://doi.org/10.1007/s11042-015-2548-y

Received: 19 October 2014
Revised: 17 January 2015
Accepted: 05 March 2015
Published: 19 March 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s11042-015-2548-y
