Multimedia Tools and Applications, Volume 75, Issue 20, pp 12919–12940

Latent semantic learning with time-series cross correlation analysis for video scene detection and classification

  • Shyi-Chyi Cheng
  • Jui-Yuan Su
  • Kuei-Fang Hsiao
  • Habib F. Rashvand


This paper presents a novel latent semantic learning method, based on the proposed time-series cross correlation analysis, for extracting a discriminative dynamic scene model to address video event recognition and 3D human body gesture recognition. Typical dynamic texture analysis poses the problems of modeling, learning, recognizing and synthesizing the images of dynamic scenes based on the autoregressive moving average (ARMA) model. Instead of applying the ARMA approach to capture the temporal structure of video sequences, this algorithm uses the learned dynamic scene model to semantically transform video sequences into multiple scenes with lower computational effort. Generating a discriminative dynamic scene model that preserves space-time information is therefore crucial to the success of the proposed latent semantic learning. To achieve this goal, k-medoids clustering with appearance distance metrics is first used to partition all frames of the training video sequences, regardless of their scene types, to provide an initial key-frame codebook. To discover the temporal structure of the dynamic scene model, we develop a time-series cross correlation analysis (TSCCA) for the latent semantic learning, together with alternating dynamic programming (ADP), to embed the temporal relationships between the training images into the dynamic scene model. We also address a weakness of dynamic programming, which is prone to producing large temporal misalignments for periodic activities. Moreover, the discriminative power of the model is estimated by a deterministic projection-based learning algorithm. Finally, based on the learned dynamic scene model, this paper uses a support vector machine (SVM) with a two-channel string kernel for video scene classification. Two test datasets, one for video event classification and the other for 3D human body gesture recognition, are used to verify the effectiveness of the proposed approach. Experimental results demonstrate that the proposed algorithm obtains good performance in terms of classification accuracy.
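
As a rough illustration of the first step described above (partitioning training frames with k-medoids to obtain an initial key-frame codebook, then expressing a video as a sequence of codebook entries), the following Python sketch may help. It is not the authors' implementation: Euclidean distance stands in for the appearance distance metrics, which the abstract does not specify, and the function names `k_medoids` and `encode`, the toy 2D frame descriptors, and the farthest-first initialization are all assumptions made for this example. The TSCCA, ADP and string-kernel SVM stages are not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance; a stand-in for the (unspecified) appearance metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_medoids(frames: List[Vector], k: int, max_iters: int = 50) -> List[int]:
    """Return the sorted indices of k medoid frames (a key-frame codebook).

    Assumes the input holds at least k distinct frame descriptors.
    """
    n = len(frames)
    # Deterministic farthest-first initialization (an assumption of this sketch).
    medoids = [0]
    while len(medoids) < k:
        medoids.append(
            max(range(n),
                key=lambda i: min(distance(frames[i], frames[m]) for m in medoids))
        )
    for _ in range(max_iters):
        # Assignment step: each medoid keeps itself, others join the nearest medoid.
        clusters = {m: [m] for m in medoids}
        for i in range(n):
            if i in clusters:
                continue
            nearest = min(medoids, key=lambda m: distance(frames[i], frames[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        updated = [
            min(members,
                key=lambda c: sum(distance(frames[c], frames[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids)


def encode(sequence: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each frame of a video to the index of its nearest key frame."""
    return [
        min(range(len(codebook)), key=lambda c: distance(frame, codebook[c]))
        for frame in sequence
    ]


if __name__ == "__main__":
    # Toy descriptors forming three well-separated groups of frames.
    training_frames = [
        (0, 0), (1, 0), (0, 1),
        (10, 10), (11, 10), (10, 11),
        (20, 0), (21, 0), (20, 1),
    ]
    medoid_indices = k_medoids(training_frames, k=3)
    codebook = [training_frames[i] for i in medoid_indices]
    video = [(0.5, 0.2), (9.8, 10.1), (20.3, 0.4), (20.0, 0.1)]
    print(medoid_indices, encode(video, codebook))
```

Medoids, unlike k-means centroids, are always actual frames, which is what makes them usable as key frames. The resulting index sequence is the kind of symbolic representation on which a temporal analysis or a string kernel could then operate.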


Time-series cross correlation analysis; Dynamic scene model; K-medoids clustering; SVM; Dynamic programming



This work was supported in part by the Ministry of Science and Technology, Taiwan, under grant numbers MOST 103-2221-E-019-018-MY2 and MOST 103-2511-S-130-003. The authors thank Mr. Yun-Lun Chen for his technical support.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Shyi-Chyi Cheng (1)
  • Jui-Yuan Su (1)
  • Kuei-Fang Hsiao (2)
  • Habib F. Rashvand (3)

  1. Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung City, Taiwan
  2. Department of Information Management, Ming-Chuan University, Taipei, Taiwan
  3. Advanced Communication Systems, University of Warwick, Coventry, UK