Abstract
In this chapter, we aim to aid the development of Cyber-Physical Systems (CPS) for the automated understanding of events and activities in various video-surveillance applications. These events are mostly captured by drones, by CCTVs, or by novice and unskilled individuals on low-end devices. Being unconstrained in nature, these videos are immensely challenging due to a number of quality factors. We present an extensive account of the various approaches taken to solve the problem over the years, ranging from methods as early as Structure from Motion (SFM) based approaches to recent solution frameworks involving deep neural networks. We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event. Consequently, each video is compactly represented by a fixed number of key-frames selected using a graph-based approach. Only the temporal features are exploited, using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture. The results we obtain are encouraging: they outperform standard temporal CNNs and are on par with methods that use spatial information along with motion cues. Further exploring multi-stream models, we conceive a multi-tier fusion strategy for the spatial and temporal wings of a network. A consolidated representation of the respective individual prediction vectors at the video and frame levels is obtained using a biased conflation technique. The fusion strategy yields a greater gain in precision at each stage than state-of-the-art methods, and thus a powerful consensus is achieved in classification. Results are recorded on four benchmark datasets widely used in the domain of action recognition, namely Columbia Consumer Videos (CCV), Human Motion Database (HMDB), UCF-101 and Kodak’s Consumer Video (KCV).
It follows that focusing on better classification of the video sequences leads to more robust actuation of a system designed for event surveillance and for object and activity tracking.
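The biased conflation mentioned above combines the class-probability vectors produced by the spatial and temporal streams. A minimal sketch of such a fusion step is given below, assuming softmax outputs per stream; it follows the conflation of probability distributions described by Hill (2011) — the normalized element-wise product of the distributions — with per-stream exponent weights as one plausible way to introduce bias. The function name `conflate` and the weighting scheme are illustrative, not the authors' exact formulation.

```python
import numpy as np

def conflate(preds, weights=None):
    """Fuse per-stream class-probability vectors via weighted conflation:
    the normalized element-wise product of the distributions (Hill, 2011).
    `weights` act as exponents, biasing the consensus toward stronger streams."""
    preds = np.asarray(preds, dtype=float)        # shape: (n_streams, n_classes)
    if weights is None:
        weights = np.ones(len(preds))
    eps = 1e-12                                   # guard against zero probabilities
    log_p = np.log(preds + eps) * np.asarray(weights)[:, None]
    fused = np.exp(log_p.sum(axis=0))             # weighted product of distributions
    return fused / fused.sum()                    # renormalize to a distribution

# Hypothetical spatial- and temporal-stream softmax outputs over 3 event classes
spatial = [0.6, 0.3, 0.1]
temporal = [0.5, 0.4, 0.1]
fused = conflate([spatial, temporal], weights=[1.0, 1.0])
```

Because conflation multiplies the distributions, classes on which both streams agree are reinforced, while a class favored by only one stream is suppressed — which is the consensus effect the fusion strategy relies on.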
References
Alzubi, J., Nayyar, A., & Kumar, A. (2018). Machine learning from theory to algorithms: An overview. Journal of Physics: Conference Series, 1142.
Bhattacharyya, A. (1946). On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics, 401–406.
Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R. (2005). Actions as space-time shapes. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) (Vol. 2, pp. 1395–1402). IEEE.
Bobick, A. F., & Davis, J. W. (2001). The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), 257–267.
Borisyuk, F., Gordo, A., & Sivakumar, V. (2018). Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 71–79).
Cedras, C., & Shah, M. (1995). Motion-based recognition: A survey. Image and Vision Computing, 13(2), 129–155.
Chen, L., Duan, L., & Xu, D. (2013). Event recognition in videos by learning from heterogeneous web sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2666–2673).
Cherian, A., & Gould, S. (2019). Second-order temporal pooling for action recognition. International Journal of Computer Vision, 127(4), 340–362.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) (Vol. 1, pp. 886–893). IEEE.
Derpanis, K. G. (2004). The Harris corner detector (pp. 1–2). York University.
Duan, L., Xu, D., & Chang, S. F. (2012). Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1338–1345). IEEE.
Dubey, S., Singh, P., Yadav, P., & Singh, K. K. (2020). Household waste management system using IoT and machine learning. Procedia Computer Science, 167, 1950–1959.
Elgammal, A., Harwood, D., & Davis, L. (2000). Non-parametric model for background subtraction. In Proceedings of the European Conference on Computer Vision (pp. 751–767). Springer.
Feng, Y., Wu, X., Wang, H., & Liu, J. (2014). Multi-group adaptation for event recognition from videos. In Proceeding of the 22nd International Conference on Pattern Recognition (pp. 3915–3920). IEEE.
Ghosh, S., Kundu, A., & Jana, D. (2011). Implementation challenges of time synchronization in vehicular networks. In Proceedings of the IEEE Recent Advances in Intelligent Computational Systems (pp. 575–580). IEEE.
Girdhar, R., & Ramanan, D. (2017). Attentional pooling for action recognition. In Proceedings of the Advances in Neural Information Processing Systems (pp. 34–45).
Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., & Russell, B. (2017). Actionvlad: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 971–980).
Gould, K., & Shah, M. (1989). The trajectory primal sketch: A multi-scale scheme for representing motion characteristics. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 79–80). IEEE Computer Society.
Gupta, R., Tanwar, S., Al-Turjman, F., Italiya, P., Nauman, A., & Kim, S. W. (2020). Smart contract privacy protection using AI in cyber-physical systems: Tools, techniques and challenges. IEEE Access, 8, 24746–24772.
Hill, T. (2011). Conflations of probability distributions. Transactions of the American Mathematical Society, 363(6), 3351–3372.
Horn, B. K., & Schunck, B. G. (1993). Determining optical flow: A retrospective. Artificial Intelligence, 59, 81–87.
Jain, R., Nayyar, A., Bachhety, S. (2020). Factex: A practical approach to crime detection. In Data management, analytics and innovation (pp. 503–516). Springer.
Jana, D., & Bandyopadhyay, D. (2013). Efficient management of security and privacy issues in mobile cloud environment. In Proceedings of the Annual IEEE India Conference (INDICON) (pp. 1–6). IEEE.
Jana, D., & Bandyopadhyay, D. (2015). Controlled privacy in mobile cloud. In Proceedings of the IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS) (pp. 98–103). IEEE.
Jana, P., Bhaumik, S., & Mohanta, P. P. (2019). A multi-tier fusion strategy for event classification in unconstrained videos. In Proceedings of the 8th International Conference on Pattern Recognition and Machine Intelligence (PReMI) (pp. 515–524). Springer.
Jana, P., Bhaumik, S., & Mohanta, P. P. (2019). Key-frame based event recognition in unconstrained videos using temporal features. In Proceedings of the IEEE Region 10 Symposium (TENSYMP) (pp. 349–354). IEEE.
Jana, P., Ghosh, S., Sarkar, R., & Nasipuri, M. (2017). A fuzzy C-means based approach towards efficient document image binarization. In Proceedings of the 9th International Conference on Advances in Pattern Recognition (ICAPR) (pp. 332–337). IEEE.
Jiang, Y. G., Dai, Q., Xue, X., Liu, W., & Ngo, C. W. (2012). Trajectory-based modeling of human actions with motion reference points. In Proceedings of the European Conference on Computer Vision (pp. 425–438). Springer.
Jiang, Y. G., Ye, G., Chang, S. F., Ellis, D., & Loui, A. C. (2011). Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval (pp. 1–8). http://www.ee.columbia.edu/ln/dvmm/CCV/. Accessed July 2020.
Kalra, G. S., Kathuria, R. S., & Kumar, A. (2019). YouTube video classification based on title and description text. In Proceedings of the International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (pp. 74–79). IEEE.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1725–1732).
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (pp. 2556–2563). IEEE. https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/. Accessed July 2020.
Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.
Lee, J., Abu-El-Haija, S., Varadarajan, B., & Natsev, A. (2018). Collaborative deep metric learning for video understanding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 481–490).
Li, Y., Liu, C., Ji, Y., Gong, S., & Xu, H. (2020). Spatio-temporal deep residual network with hierarchical attentions for video event recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2s), 1–21.
Liu, K., Li, Y., Xu, N., & Natarajan, P. (2018). Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730.
Loui, A., Luo, J., Chang, S. F., Ellis, D., Jiang, W., Kennedy, L., Lee, K., & Yanagawa, A. (2007). Kodak’s consumer video benchmark data set: Concept definition and annotation. In Proceedings of the International Workshop on Multimedia Information Retrieval (pp. 245–254). http://www.ee.columbia.edu/ln/dvmm/consumervideo/. Accessed July 2020.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Lu, J., Hu, J., & Zhou, J. (2017). Deep metric learning for visual understanding: An overview of recent advances. IEEE Signal Processing Magazine, 34(6), 76–84.
Luo, C., Jin, L., & Sun, Z. (2019). MORAN: A multi-object rectified attention network for scene text recognition. Pattern Recognition, 90, 109–118.
Luo, M., Chang, X., Nie, L., Yang, Y., Hauptmann, A. G., & Zheng, Q. (2018). An adaptive semisupervised feature analysis for video semantic recognition. IEEE Transactions on Cybernetics, 48(2), 648–660.
Mazari, A., & Sahbi, H. (2019). Human action recognition with deep temporal pyramids. arXiv preprint arXiv:1905.00745.
Mohanta, P. P., Saha, S. K., & Chanda, B. (2011). A model-based shot boundary detection technique using frame transition parameters. IEEE Transactions on Multimedia, 14(1), 223–233.
Mukherjee, A., Jana, P., Chakraborty, S., & Saha, S. K. (2020). Two stage semantic segmentation by SEEDS and Fork Net. In Proceedings of the IEEE Calcutta Conference (CALCON) (pp. 283–287). IEEE.
Mukhopadhyay, M., Pal, S., Nayyar, A., Pramanik, P.K.D., Dasgupta, N., & Choudhury, P. (2020). Facial emotion detection to assess learner’s state of mind in an online learning system. In Proceedings of the 5th International Conference on Intelligent Information Technology (pp. 107–115).
Oron, S., Bar-Hille, A., & Avidan, S. (2014). Extended Lucas-Kanade tracking. In Proceedings of the European Conference on Computer Vision (pp. 142–156). Springer.
Padikkapparambil, J., Ncube, C., Singh, K. K., & Singh, A. (2020). Internet of things technologies for elderly health-care applications. In Emergence of Pharmaceutical Industry Growth with Industrial IoT Approach (pp. 217–243). Elsevier.
Paul, S., Chaudhuri, S., & Jana, D. (2016). Increasing the fault tolerance of NameNode: A proposal for using DataNode as a secondary backup node. International Journal of Advanced Research in Computer Science and Software Engineering, 6(6), 416–422.
Peng, Y., Ye, H., Lin, Y., Bao, Y., Zhao, Z., Qiu, H., Lu, Y., Wang, L., & Zheng, Y. (2017). Large-scale video classification with elastic streaming sequential data processing system. In Proceedings of the Workshop on Large-Scale Video Classification Challenge (pp. 1–7).
Pinar, A. J., Rice, J., Hu, L., Anderson, D. T., & Havens, T. C. (2016). Efficient multiple kernel classification using feature and decision level fusion. IEEE Transactions on Fuzzy Systems, 25(6), 1403–1416.
Polana, R., & Nelson, R. (1994). Detecting activities. Journal of Visual Communication and Image Representation, 5(2), 172–180.
Potter, M. C. (1976). Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory, 2(5).
Priyadarshni, V., Nayyar, A., Solanki, A., Anuragi, A. (2019). Human age classification system using K-NN classifier. In Proceedings of the International Conference on Advanced Informatics for Computing Research (pp. 294–311). Springer.
Rana, M. A. T. (2011). Kernel and classifier level fusion for image classification. University of Surrey. https://books.google.co.in/books?id=24udAQAACAAJ.
Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3), 309–314.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Saeed, F., Paul, A., Karthigaikumar, P., & Nayyar, A. (2019). Convolutional neural network based early fire detection. Multimedia Tools and Applications, 1–17.
Sehgal, A., Agrawal, R., Bhardwaj, R., & Singh, K. K. (2020). Reliability analysis of wireless link for IoT applications under shadow-fading conditions. Procedia Computer Science, 167, 1515–1523.
Sharma, P., Singh, A., Raheja, S., & Singh, K. K. (2019). Automatic vehicle detection using spatial time frame and object based classification. Journal of Intelligent & Fuzzy Systems, 37(6), 8147–8157.
Singh, A. K., Firoz, N., Tripathi, A., Singh, K., Choudhary, P., & Vashist, P. C. (2020). Internet of things: From hype to reality. An Industrial IoT Approach for Pharmaceutical Industry Growth, 2, 191.
Singh, M., Sachan, S., Singh, A., & Singh, K. K. (2020). Internet of things in pharma industry: Possibilities and challenges. In Emergence of pharmaceutical industry growth with industrial IoT approach (pp. 195–216). Elsevier.
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. https://www.crcv.ucf.edu/data/UCF101.php. Accessed July 2020.
Tanwar, S. (2020). Fog data analytics for IoT applications-Next generation process model with state-of-the-art technologies. Studies in Big Data, 76, 1–497.
Ukil, A., Jana, D., & De Sarkar, A. (2013). A security framework in cloud computing infrastructure. International Journal of Network Security & Its Applications, 5(5), 11–24.
Varior, R. R., Shuai, B., Lu, J., Xu, D., & Wang, G. (2016). A Siamese long short-term memory architecture for human re-identification. In Proceedings of the European Conference on Computer Vision (pp. 135–153). Springer.
Wang, H., Wu, X., & Jia, Y. (2016). Heterogeneous domain adaptation method for video annotation. IET Computer Vision, 11(2), 181–187.
Wang, H., Ullah, M. M., Klaser, A., Laptev, I., & Schmid, C. (2009, September). Evaluation of local spatio-temporal features for action recognition. In Proceedings of the British Machine Vision Conference (BMVC) (pp. 124.1–124.11). BMVA Press.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (pp. 20–36). Springer.
Wu, Z., Jiang, Y. G., Wang, X., Ye, H., Xue, X., & Wang, J. (2015). Fusing multi-stream deep networks for video classification. arXiv preprint arXiv:1509.06086.
Zang, J., Wang, L., Liu, Z., Zhang, Q., Hua, G., & Zheng, N. (2018). Attention-based temporal weighted convolutional neural network for action recognition. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 97–108). Springer.
Zhang, L., & Xiang, X. (2020). Video event classification based on two-stage neural network. Multimedia Tools and Applications, 1–16.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Bhaumik, S., Jana, P., Mohanta, P.P. (2021). Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems. In: Singh, K.K., Nayyar, A., Tanwar, S., Abouhawwash, M. (eds) Emergence of Cyber Physical System and IoT in Smart Automation and Robotics. Advances in Science, Technology & Innovation. Springer, Cham. https://doi.org/10.1007/978-3-030-66222-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66221-9
Online ISBN: 978-3-030-66222-6
eBook Packages: Earth and Environmental Science (R0)