Latent semantic learning with time-series cross correlation analysis for video scene detection and classification

Multimedia Tools and Applications

Abstract

This paper presents a novel latent semantic learning method, based on the proposed time-series cross correlation analysis, for extracting a discriminative dynamic scene model that addresses video event recognition and 3D human body gesture recognition. Typical dynamic texture analysis poses the problems of modeling, learning, recognizing, and synthesizing images of dynamic scenes using the autoregressive moving average (ARMA) model. Instead of applying the ARMA approach to capture the temporal structure of video sequences, our algorithm uses the learned dynamic scene model to semantically transform video sequences into multiple scenes at lower computational cost. Generating a discriminative dynamic scene model that preserves space-time information is therefore crucial to the success of the proposed latent semantic learning. To achieve this goal, k-medoids clustering with appearance distance metrics is first used to partition all frames of the training video sequences, regardless of their scene types, into an initial key-frame codebook. To discover the temporal structure of the dynamic scene model, we develop a time-series cross correlation analysis (TSCCA) for the latent semantic learning, with an alternating dynamic programming (ADP) scheme that embeds the temporal relationships between training images into the dynamic scene model. We also tackle a known weakness of dynamic programming, namely its tendency to produce large temporal misalignments for periodic activities. Moreover, the discriminative power of the model is estimated by a deterministic projection-based learning algorithm. Finally, based on the learned dynamic scene model, a support vector machine (SVM) with a two-channel string kernel is used for video scene classification. Two test datasets, one for video event classification and the other for 3D human body gesture recognition, are used to verify the effectiveness of the proposed approach. Experimental results demonstrate that the proposed algorithm achieves good classification accuracy.
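To make the pipeline concrete, the following Python sketch illustrates the three stages described above; it is not the authors' implementation. It uses a toy k-medoids pass to build a key-frame codebook from frame descriptors, a best-lag normalized cross correlation between codeword sequences as a crude stand-in for TSCCA and the two-channel string kernel, and an SVM trained on the resulting precomputed Gram matrix. The frame descriptors, the Euclidean appearance distance, and all parameter values are placeholder assumptions.

import numpy as np
from sklearn.svm import SVC

def k_medoids(X, k, iters=50, seed=0):
    """Toy PAM-style k-medoids over the rows of X (Euclidean distance).
    Stands in for the appearance-based clustering that builds the initial
    key-frame codebook; returns medoid indices and per-frame labels."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:  # pick the member minimizing within-cluster cost
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

def seq_similarity(a, b, k):
    """Best-lag normalized cross correlation between two codeword sequences.
    A crude stand-in for TSCCA and the two-channel string kernel: sequences
    are one-hot encoded per codeword channel, channels are cross-correlated,
    and the best lag is kept. The resulting Gram matrix is not guaranteed
    to be positive semidefinite; this is for illustration only."""
    A, B = np.eye(k)[a], np.eye(k)[b]  # (length, k) one-hot matrices
    xc = sum(np.correlate(A[:, c], B[:, c], mode="full") for c in range(k))
    return float(xc.max()) / np.sqrt(len(a) * len(b))

# Toy usage with synthetic data in place of real frame descriptors.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))        # 200 frame descriptors, 16-D
_, labels = k_medoids(frames, k=8)         # key-frame codebook of size 8

videos = [labels[i * 20:(i + 1) * 20] for i in range(10)]  # 10 "videos"
y = np.array([0, 1] * 5)                   # two invented event classes

G = np.array([[seq_similarity(u, v, 8) for v in videos] for u in videos])
clf = SVC(kernel="precomputed").fit(G, y)  # SVM on the precomputed kernel
print(clf.predict(G))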



Acknowledgments

This work was supported in part by the Ministry of Science and Technology, Taiwan, under grants MOST 103-2221-E-019-018-MY2 and MOST 103-2511-S-130-003. The authors thank Mr. Yun-Lun Chen for his technical support.

Author information

Correspondence to Kuei-Fang Hsiao.


Cite this article

Cheng, SC., Su, JY., Hsiao, KF. et al. Latent semantic learning with time-series cross correlation analysis for video scene detection and classification. Multimed Tools Appl 75, 12919–12940 (2016). https://doi.org/10.1007/s11042-015-2548-y

