
Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems

Chapter in: Emergence of Cyber Physical System and IoT in Smart Automation and Robotics

Abstract

In this chapter, we aim to aid the development of Cyber-Physical Systems (CPS) through automated understanding of events and activities in various video-surveillance applications. These events are mostly captured by drones, CCTV cameras, or untrained individuals using low-end devices. Being unconstrained in nature, such videos are highly challenging owing to a number of quality factors. We present an extensive account of the approaches taken to solve the problem over the years, ranging from early Structure from Motion (SFM) based methods to recent frameworks built on deep neural networks. We show that long-term motion patterns alone play a pivotal role in recognizing an event. Accordingly, each video is compactly represented by a fixed number of key-frames selected using a graph-based approach, and only the temporal features are exploited through a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture. The results are encouraging: they outperform standard temporal CNNs and are on par with methods that use spatial information along with motion cues. Further exploring multi-stream models, we devise a multi-tier fusion strategy for the spatial and temporal wings of a network, in which the individual prediction vectors at the video and frame levels are consolidated using a biased conflation technique. The fusion strategy yields a larger gain in precision at each stage than the state-of-the-art methods, and thus a strong consensus is achieved in classification. Results are reported on four benchmark datasets widely used in action recognition, namely Columbia Consumer Videos (CCV), Human Motion Database (HMDB), UCF-101, and Kodak's Consumer Video (KCV). We infer that better classification of video sequences leads to more robust actuation of a system designed for event surveillance and object and activity tracking.
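
As a rough, self-contained illustration of the fusion idea summarised above, the NumPy sketch below combines the softmax outputs of a spatial and a temporal stream by a weighted, renormalised product, in the spirit of a biased conflation of probability distributions. The function name `biased_conflation`, the bias value, and the four-class probabilities are hypothetical; the chapter's exact formulation may differ.

```python
import numpy as np

def biased_conflation(p_spatial, p_temporal, bias=0.6, eps=1e-12):
    """Fuse two class-probability vectors by a weighted product (illustrative).

    `bias` > 0.5 tilts the fusion towards the temporal stream.
    This is a sketch of the general conflation idea, not the chapter's
    exact method.
    """
    fused = (p_temporal ** bias) * (p_spatial ** (1.0 - bias)) + eps
    return fused / fused.sum()          # renormalise to a valid distribution

# Hypothetical softmax outputs of the two wings for a 4-class problem.
p_spatial = np.array([0.55, 0.20, 0.15, 0.10])
p_temporal = np.array([0.25, 0.50, 0.15, 0.10])

fused = biased_conflation(p_spatial, p_temporal)
print(fused)                 # fused per-class probabilities
print(int(fused.argmax()))   # consensus class label
```

A weighted arithmetic mean of the two vectors would be a simpler alternative; the multiplicative form rewards classes on which both streams agree, which matches the consensus behaviour described above.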


Author information

Corresponding author

Correspondence to Partha Pratim Mohanta.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this chapter

Bhaumik, S., Jana, P., Mohanta, P.P. (2021). Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems. In: Singh, K.K., Nayyar, A., Tanwar, S., Abouhawwash, M. (eds) Emergence of Cyber Physical System and IoT in Smart Automation and Robotics. Advances in Science, Technology & Innovation. Springer, Cham. https://doi.org/10.1007/978-3-030-66222-6_4
