Abstract
The growth of surveillance systems has produced a massive increase in surveillance data. The key challenge for video surveillance systems today is analyzing these large volumes of video, which creates an enormous demand for intelligent video analysis systems capable of identifying activities and events. Many researchers have emphasized the role of contextual knowledge and shown that it improves the performance of video content analysis in several ways. In this study, we therefore examine approaches that extract semantic information from video at a level approaching human perception. We also address open problems in semantics that arise in event detection and irregular activity detection. Most methods and models are too coarse to extract a complete set of information accurately. A machine-readable format is therefore needed to view, process, store, and extract meaningful information from video data. In this paper, we discuss methods for extracting low-level, mid-level, and high-level video features and for representing them using Semantic Technologies. We also provide a taxonomy of hierarchical feature generation approaches, explore evaluation metrics for assessing video activity recognition and measuring the performance of feature extraction, and thoroughly survey community-approved benchmark datasets. Together, these provide a complete framework of video research for developing an intelligent surveillance system.
Notes
https://wordnet.princeton.edu/
References
Aafaq N, Mian A, Liu W, Gilani SZ, Shah M (2019) Video description: a survey of methods, datasets, and evaluation metrics. ACM Comput Surv (CSUR) 52(6):1–37
Ahmed SA, Dogra DP, Kar S, Roy PP (2018) Trajectory-based surveillance analysis: a survey. In: IEEE Transactions on Circuits and Systems for Video Technology 29(7):1985–1997
Ahsan U, Sun C, Hays J, Essa I (2017) Complex event recognition from images with few training examples, In: Proc. of IEEE Winter Conf. Appl. Comput. Vision, WACV 2017, pp. 669–678
Akdemir U, Turaga P, Chellappa R (2008) An ontology based approach for activity recognition from video. In: ACM international conference on Multimedia, pp. 709–712
Ali H, Sharif M, Yasmin M et al (2020) A survey of feature extraction and fusion of deep learning for detection of abnormalities in video endoscopy of gastrointestinal-tract. Artif Intell Rev 53:2635–2707
Aljaloud AS, Ullah H (2021) IA-SSLM: Irregularity-Aware Semi-Supervised Deep Learning Model for Analyzing Unusual Events in Crowds. IEEE Access 9:73327–73334
Anjulan A, Canagarajah N (2009) A unified framework for object retrieval and mining. IEEE Trans Circuits Syst Video Technol 19(1):63–76
AR Z, MS Khurram Soomro (2012) UCF101: A dataset of 101 human action classes from videos in the wild
Arbeláez P, Pont-Tuset J, Barron JT, Marques F, Malik J (2014) Multiscale combinatorial grouping. In: IEEE conference on computer vision and pattern recognition, pp. 328–335
Arroyo R, Yebes JJ, Bergasa LM, Daza IG, Almazán J (2015) Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls. Expert Syst Appl
Babenko A, Slesarev A, Chigorin A, Lempitsky V (2014) Neural codes for image retrieval, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Bai L, Lao S, Jones GJF, Smeaton AF (2007) Video semantic content analysis based on ontology, in International Machine Vision and Image Processing Conference, IMVIP 2007, 2007
Baradel F, Wolf C, Mille J, Taylor GW (2018) Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 469–478
Bell S, Zitnick CL, Bala K, Girshick R (2016) Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks.In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Bellamine I, Tairi H (2014) Motion detection using the space-time interest points. J Comput Sci 10(5), 828
Bellamine I, Tairi H, (2015) Motion detection using color structure-texture image decomposition. In: Intell. Comput. Vision, ISCV, Syst, p 2015
Ben Mabrouk A, Zagrouba E (2017) Spatio-temporal feature using optical flow based distribution for violence detection, Pattern Recognit. Lett., vol. 92, pp. 62–67
Ben Mabrouk A, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91:480–491
Bermejo Nievas E, Deniz Suarez O, Bueno García G, Sukthankar R (2011) Violence detection in video using computer vision techniques. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Bernardin K, Stiefelhagen R (2008) Evaluating multiple object tracking performance: The CLEAR MOT metrics. Eurasip J Image Video Process
Bewley A, Ge Z, Ott L, Ramos F,Upcroft B (2016) Simple online and realtime tracking, Proc. - Int. Conf. Image Process. ICIP, vol. 2016-Augus, pp. 3464–3468
Bhattacharya S, Kalayeh MM, Sukthankar R, Shah M (2014) Recognition of complex events: Exploiting temporal dynamics between underlying concepts. In: IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 2243–2250
Bizer C, Heath T, Berners-Lee T (2011) Linked data: The story so far. In Semantic services, interoperability and web applications: emerging concepts (pp. 205–227). IGI Global
Bottazzi E, Ferrario R (2009) Preliminaries to a DOLCE ontology of organisations. Int J Bus Process Integr Manag 4(4):225–238
Bouindour S, Hittawe MM, Mahfouz S, Snoussi H (2018) Abnormal Event Detection Using Convolutional Neural Networks and 1-Class SVM classifier, pp. 1–6
Burl MC (2004) Mining Patterns of Activity from Video Data, In: SIAM Int. Conf. Data Min., pp. 532–536
Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields, In: 30th IEEE Conference on Computer Vision and Pattern Recognition
Carreira J, Zisserman A, Vadis Q (2017) action recognition? A new model and the kinetics dataset. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-Janua, pp. 4724–4733
Caruccio L, Polese G, Tortora G, Iannone D (2019) EDCAR: A knowledge representation framework to enhance automatic video surveillance. Expert Syst Appl
Cavaliere D, Senatore S, Vento M, Loia V (2016) Towards semantic context-Aware drones for aerial scenes understanding. In: IEEE Int. Conf. Adv. Video Signal Based Surveillance, AVSS 2016, no. August, pp. 115–121
Cong Y, Yuan J, Liu J (2013) Abnormal event detection in crowded scenes using sparse representation. In: Pattern Recognit 46(7):1851–1864
Chen L, Nugent C (2009) Ontology-based activity recognition in intelligent pervasive environments. Int J Web Inf Syst
Chen K, Zhang D, Yao L, Guo B, Yu Z, Liu Y (2021) Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges, and Opportunities. ACM Comput Surv (CSUR) 54(4):1–40
Choudhary A, Chaudhury S, Banerjee S (2008) A framework for analysis of surveillance videos. In: 2008 Sixth Indian Conf. Comput. Vision, Graph. Image Process., pp 344–351
Cisco Visual Networking Index: Forecast and Methodology (2016–2021). In: Cisco Public White Pap, pp. 2016–2021
Cortes C, Vapnik V, Support-Vector Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Crowley JL, Reignier P, Pesnel S (2005) CAVIAR Context Aware Vision using Image-based Active Recognition
Cutler R, Davis LS (2000) Robust real-time periodic motion detection, analysis, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 22(8):781–796
Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., Red Hook, NY, USA, 379–387
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05). San Diego, vol 1, pp 886–893
Dendorfer P, Rezatofighi H, Milan A, Shi J, Cremers D, Reid I, Roth S, Schindler K, Leal-Taixé L (2020) MOT20: A benchmark for multi object tracking in crowded scenes. arXiv:2003.09003[cs], (arXiv: 2003.09003)
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition
Dhiman C, Vishwakarma DK (2019) A review of state-of-the-art techniques for abnormal human activity recognition. Eng Appl Artif Intell 1(77):21–45
Du M, Yuan X (2021) A survey of competitive sports data visualization and visual analysis. J Vis 24(1):47–67
Duong TH, Nguyen NT, Truong HB, Nguyen VH (2015) A collaborative algorithm for semantic video annotation using a consensus-based social network analysis. Expert Syst Appl 42(1):246–258
Elleuch N, Zarka M, Ben Ammar A, Alimi MA (2011) A fuzzy ontology: based framework for reasoning in visual video content analysis and indexing. In: Proc. Elev. Int. Work. Multimed. Data Min., p. 1
Erhan D, Szegedy C, Toshev A, Anguelov D (2014) Scalable Object Detection Using Deep Neural Networks In: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, pp. 2155–2162
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338
Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, Zisserman A (2015) Int J Comput Vis 111(1):98–136
Fan J, Zhu X, Hacid MS, Elmagarmid AK (2002) Model-based video classification toward hierarchical representation, indexing and access. Multimed Tools Appl 17(1):97–120
Fan J, Luo H, Gao Y, Jain R (2007) Incorporating concept ontology for hierarchical video classification, annotation, and visualization. IEEE Trans. Multimed. 9(5):939–957
Felzenszwalb PF, Society IC, Girshick RB, Member S, Mcallester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645
Feng W, Zhihao H, Wei W, Junjie Y, Wanli O (2019) Multi-object tracking with multiple cues and switcher-aware classification. arXiv preprint arXiv:1901.06129
Ferryman J (2006) PETS 2006 Benchmark Data, In: Conjunction with IEEE Conference on Computer Vision and Pattern Recognition 2006 New York, USA - 18 June 2006. [Online]. Available: http://www.cvg.reading.ac.uk/PETS2006/data.html
Freund Y (1997) Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting 139:119–139
Fiaz M, Mahmood A, Jung SK (2018) Tracking noisy targets: A review of recent object tracking approaches. arXiv preprint arXiv:1802.03098
Fu CFC, Li GLG, Dai KDK (2005) A framework for video structure mining. In: 2005 Int. Conf. Mach. Learn. Cybern., vol 3, no August, pp 1524–1528
Fu CY, Liu W, Ranga A, Tyagi A, Berg AC, Dssd: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659. 2017 Jan 23
G A, A B, K C, Y L, J F, A G, A D, J Z, E G, L D, AF S, Y G, W K, Quénot G (2019) An evaluation campaign to benchmark Video Activity Detection. Video Captioning and Matching, and Video Search & retrieval, in Proceedings of TRECVID 2019
Gan C, Wang N, Yang Y, Yeung DY, Hauptmann AG (2015) DevNet: A Deep Event Network for multimedia event detection and evidence recounting. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 07-12-June, pp. 2568–2577
Gao Y, Liu H, Sun X, Wang C, Liu Y (2016) Violence detection using Oriented VIolent Flows. Image Vis Comput 48-49:37-41
García A, Bescós J, Video object segmentation based on feedback schemes guided by a low-level scene ontology. In: Proceedings of the 10th international conference on advanced concepts for intelligent vision systems, Springer, Berlin, ACIVS ’08, pp 322–333
Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Martinez-Gonzalez P, Garcia-Rodriguez J (2018) A survey on deep learning techniques for image and video semantic segmentation. Appl Soft Comput 70:41–65
Géczy P, Izumi N, Akaho S, Hasida K (2008) Advances in data mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects, vol 5077
Girshick R (2015) Fast R-CNN, In: IEEE International Conference on Computer Vision (ICCV), Santiago, pp. 1440–1448
Girshick R, Donahue J, Darrell T, Berkeley UC (2012) J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, pp 2–9
Girshick R (2015) Fast R-CNN. In: IEEE Int. Conf. Comput. Vis. pp. 1440–1448
Girshick R, Donahue J, Darrell T, Malik J (2016) R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587
Gömez-Romero J, Patricio MA, García J, Molina JM (2011) Ontology-based context representation and reasoning for object tracking and scene interpretation in video. Expert Syst Appl 38(6):7494–7510
Grassi M, Morbidoni C, Nucci M (2012) A Collaborative Video Annotation System Based on Semantic Web Technologies. Cognit Comput 4(4):497–514
Greco L, Ritrovato P, Saggese A, Vento M (2016) Abnormal Event Recognition: A Hybrid Approach Using SemanticWeb Technologies, In: IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work 1:1297–1304
Greco L, Ritrovato P, Saggese A, Vento M (2016b) Improving reliability of people tracking by adding semantic reasoning. In: IEEE international conference on advanced video and signal based surveillance (AVSS), pp 194–199
Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing. Int J Hum Comput Stud
Guntuboina C, Porwal A, Jain P, Shingrakhia H (2021) Deep Learning Based Automated Sports Video Summarization using YOLO. Electronic Letters on Computer Vision and Image Analysis 20(1):99–116
Hamid R, Maddi S, Bobick A, Essa I (2007) Structure from statistics - Unsupervised activity analysis using suffix trees, In: Proc. IEEE Int. Conf. Comput. Vis
Harikrishna N, Satheesh S, Sriram SD, Easwarakumar KS (2011) Temporal classification of events in cricket videos. In: 2011 Natl. Conf. Commun. NCC 2011, pp 14–18
Hassan MM, Ullah S, Hossain MS, Alelaiwi A (2021) An end-to-end deep learning model for human activity recognition from highly sparse body sensor data in internet of medical things environment. The Journal of Supercomputing 77:2237–2250
Hauptmann A, Yan R, Lin WH, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Trans. Multimed. 9(5):958–966
He K, Zhang X, Ren S, Sun J (2015) SppNet. IEEE Trans Pattern Anal Mach Intell
He K, Zhang X, Ren S, Sun J (2015) Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916
He K, Zhang X, Ren S, Sun J (2016) ResNet. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV), Venice, pp. 2980–2988
He D, Li F, Zhao Q, Long X, Fu Y, Wen S (2018) Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition
Himanshu R, Maheshkumar H,Kolekar, Keshav N, Mukherjee JK (2015) Trajectory based unusual human movement identification for video surveillance system. In Progress in Systems Engineering, pp. 789–794. Springer, Cham
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks, Science (80-. )
Hinton GE, Krizhevsky A, Wang SD (2011) Transforming auto-encoders. In: International conference on artificial neural networks, pp. 44–51. Springer, Berlin, Heidelberg
Hongeng S, Bremond F, Nevatia R (2000) Representation and optimal recognition of human activities. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 818–825
Huang JF, Chen SL (2014) Detection of violent crowd behavior based on statistical characteristics of the optical flow. In: 2014 11th Int. Conf. Fuzzy Syst. Knowl. Discov. FSKD 2014, pp 565–569
Huang JH, Murn L, Mrak M, Worring M, (2021) GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization. arXiv preprint arXiv:2104.12465
Hunter J (2001) Adding multimedia to the semantic web: building an MPEG-7 ontology. In: Proceedings of the First International Conference on Semantic Web Working (SWWS’01), CEUR-WS.org, Aachen, DEU, 261–283
Hussain T, Muhammad K, Ding W, Lloret J, Baik SW, de Albuquerque VHC (2021) A comprehensive survey of multi-view video summarization. In: Pattern Recognition 109:107567
Ji X, Zuo X, Wang C, Wang Y (2015) A simple human interaction recognition based on global gist feature model. International conference on intelligent robotics and applications. Springer, Cham, pp 487–498
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM international conference on Multimedia (MM’14). Association for Computing Machinery, New York, NY, USA, 675–678
Joao Carreira AZ, Noland E, Hillier C (2019) A Short Note on the Kinetics-700 Human Action Dataset
Jordan Michael I, Zoubin Ghahramani, Jaakkola Tommi S, Saul Lawrence K (1999) An introduction to variational methods for graphical models. Mach Learn 37(2):183–233
Kavukcuoglu K, Ranzato M, Fergus R, LeCun Y (2009) Learning invariant features through topographic filter maps, 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, pp. 1605-1612
Kavukcuoglu K, Sermanet P, Boureau Y, LeCun Y, Gregor K, Mathieu M (2010) Learning Convolutional Feature Hierarchies for Visual Recognition, NIPS
Kim J, Grauman K (2009) Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates. In: 2009 IEEE computer society conference on computer vision and pattern recognition workshops. CVPR Workshops 2009
Kliper-Gross O, Hassner T, Wolf L (2012) The action similarity labeling challenge. IEEE Trans Pattern Anal Mach Intell
Kompatsiaris I, Mezaris V, Strintzis MG (2005) Multimedia content indexing and retrieval using an object ontology. Multimedia content and semantic web-methods, standards and tools. Wiley, Hoboken, pp 339–371
Kong T, Yao A, Chen Y, Sun F (2016) HyperNet: Towards accurate region proposal generation and joint object detection, In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Kotsiantis S, Kanellopoulos D, Pintelas P (2004) Multimedia mining. WSEAS Trans Syst 3(10):3263–3268
Krishna R et al (2017) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int J Comput Vis 123(1):32–73
Krizhevsky A, Sutskever I (2012) Hinton GE (2012) AlexNet. Neural Inf. Process. Syst p Adv
Krizhevsky A, Sutskever I, GE H (2012) ImageNet Classification with Deep Convolutional Neural Networks, Advances in neural network.pp. 1–9
Kuehne H, Jhuang H, Stiefelhagen R, Serre Thomas T (2013) Hmdb51: a large video database for human motion recognition, in High Performance Computing in Science and Engineering 12: Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2012
Kuo W, Hariharan B, Malik J (2015) Deepbox: Learning objectness with convolutional networks. In: IEEE international conference on computer vision, pp. 2479–2487
Leach M, Baxter R, Robertson N, Sparks E (2014) Detecting social groups in crowded surveillance videos using visual attention, IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., pp. 467–473
Leal-Taixé L, Milan A, Rei I, Roth S, SchindlerK (2015) MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv:1504.01942 [cs], (arXiv: 1504.01942)
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. nature 521(7553): 436–444
Lee SC, Nevatia R (2014) Hierarchical abnormal event detection by real time and semi-real time multi-tasking video surveillance system. Mach Vis Appl 25(1):133–143
Leo M, Furnari A, Medioni GG, Trivedi M, Farinella GM (2019) Deep learning for assistive computer vision. In: Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11134 LNCS, pp. 3–14
Li Y, Huang C, Nevatia R (2009) Learning to associate: Hybridboosted multi-target tracker for crowded scene. In: IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. CVPR Work. 2009, vol. 2009 IEEE, pp. 2953–2960
Li C, Han Z, Ye Q, Jiao J (2013) Visual abnormal behavior detection based on trajectory sparse reconstruction analysis. Neurocomputing 119:94–100
Li X, Zhao B, Lu X (2017) A general framework for edited video and raw video summarization. IEEE Transactions on Image Processing 26(8):3652–3664
Li T, Chen X, Zhu F, Zhang Z, Yan H (2021) Two-stream deep spatial-temporal auto-encoder for surveillance video abnormal event detection. Neurocomputing 439:256–270
Liao W, Yang C, Ying Yang M, Rosenhahn B (2017) Security event recognition for visual surveillance. ISPRS Ann Photogramm Remote Sens Spat Inf Sci 4(1W1):19–26
Lienhart R, Maydt J (2002) An extended set of Haar-like features for rapid object detection. In: International conference on image processing. Proceedings, Rochester, NY, USA, pp I–I
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European conference on computer vision, pp. 740–755. Springer, Cham
Lin T, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature Pyramid Networks for Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 936–944
Liu H, Chen S, Kubota N (2013) Intelligent video systems and analytics: a survey. IEEE Transactions on Industrial Informatics 9(3):1222–1233
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: Single shot multibox detector. In: European conference on computer vision, pp. 21–37. Springer, Cham
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110
Mahmood, K, Takahashi H (2015) Cloud based sports analytics using semantic Web tools and technologies. In 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE), pp. 431–433. IEEE
Markowska-Kaczmar U, Kwasnicka H (2018) Deep learning: a new era in bridging the semantic gap. Bridging the semantic gap in image and video analysis 2018, Springer, Cham, pp 123–159
Meditskos G, Kompatsiari, iknow: ontology-driven situational awareness for the recognition of activities of daily living. Pervasive Mobile Comput 40:17–41. In the same way, Meditskos and Kompatsiaris (2017)
Meditskos G, Dasiopoulou S, Efstathiou V, Kompatsiaris I (2013) SP-ACT: A hybrid framework for complex activity recognition combining OWL and SPARQL rules, 2013 IEEE Int. Conf. Pervasive Comput. Commun. Work. PerCom Work. 2013, no. March, pp. 25–30
Miao Y, Song J (2014) Abnormal event detection based on SVM in video surveillance. In: Proc. - 2014 IEEE Work. Adv. Res. Technol. Ind. Appl. WARTIA 2014, pp 1379–1383
Milan A, Leal-Taixé L, Reid I, Roth S, Schindler K (2016) MOT16: A Benchmark for Multi-Object Tracking. arXiv:1603.00831 [cs], (arXiv: 1603.00831)
Mitra S, Acharya T (2003) Data Mining: Concepts and Algorithms From Multimedia to Bioinformatics. 2003
Monfort M et al (2018) Moments in Time Dataset: one million videos for event understanding. CoRR abs-1801.0:1–11
Muneeb ul Hassan (2018) VGG16 - Convolutional Network for Classification and Detection, Neurohive
Nabati M, Behrad A (2020) Multi-Sentence Video Captioning using Content-oriented Beam Searching and Multi-stage Refining Algorithm. Inf Process Manag 57(6):102302
Najibi M, Rastegari M, Davis LS (2016) G-cnn: an iterative grid based object detector. In: IEEE conference on computer vision and pattern recognition, pp. 2369–2377
Nallaivarothayan H, Fookes C, Denman S, Sridharan S (2014) An MRF based abnormal event detection approach using motion and appearance features. In: 11th IEEE Int. Conf. Adv. Video Signal-Based Surveillance, AVSS 2014, pp 343–348
Naphade M et al (2006) Large-scale concept ontology for multimedia. IEEE Multimed. 13(3):86–91
Nevatia R, Hobbs J, Bolles B (2004) An ontology for video event representation. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML’11). Omnipress, Madison, WI, USA, 689–696
Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation, in Proceedings of the IEEE International Conference on Computer Vision,pp. 1520–1528
OM P, A V, A Z, C V (n.d.) Jawahar, The Oxford-IIIT Pet Dataset. Available: https://www.robots.ox.ac.uk/vgg/data/pets/
Oquab M, Bottou L (2014) Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1717–1724
Oquab M et al (2015) Weakly supervised object recognition with convolutional neural networks, HAL Id: hal-01015140
Ouyang W, Wang X, Zeng X, Qiu S, Luo P, Tian Y, Li H et al (2015) Deepid-net: Deformable deep convolutional neural networks for object detection. In: IEEE conference on computer vision and pattern recognition, pp. 2403–2412
Pan J-Y, Faloutsos C (2002) GeoPlot: Spatial data mining on video libraries. In:Proc. Elev. Int. Conf. Inf. Knowl. Manag. (CIKM 2002), pp. 405–412
Pantoja C, Ciapetti A, Massari C, Tarantelli M (2015) Action recognition in surveillance videos using semantic web rules. In: 6th international conference on imaging for crime prevention and detection (ICDP-15), pp 1–6
Papadopoulos GT, Mezaris V, Kompatsiaris I, Strintzis MG (2007) Semantic multimedia: second international conference on semantic and digital media technologies, SAMT 2007, Genoa, Italy, December 5–7, 2007, Proceedings. Ontology-driven semantic video analysis using visual information objects. Springer, Berlin, pp 56–69
Pareek P, Thakkar A (2021) A survey on video-based human action recognition: recent updates, datasets, challenges, and applications. Artif Intell Rev 54(3):2259–2322
Patel AS, Merlino G, Bruneo D, Puliafito A, Vyas OP, Ojha M (2021) Video representation and suspicious event detection using semantic technologies. Semantic Web 12(3):467–491
Patino L, Cane T, Vallee A, Ferryman J (2016) PETS 2016: Dataset and Challenge, IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work., pp. 1240–1247
Patino L, Ferryman J (2014) PETS 2014: Dataset and challenge, in 11th IEEE International Conference on Advanced Video and Signal-Based Surveillance, AVSS 2014
Petrucci G, Ghidini C, Rospocher M (2016) Ontology learning in the deep. In: European Knowledge AcquisitionWorkshop EKAW2016: Knowledge Engineering and Knowledge Management, pp. 480–495
Pinheiro PO, Lin TY, Collobert R, Dollár P (2016) Learning to refine object segments. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Qiu Z, Yao T, Mei T (2017) Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks, Proc. IEEE Int. Conf. Comput. Vis., vol. 2017-Octob, pp. 5534–5542
Quack T, Ferrari V, Van Gool L (2006) Video mining with frequent itemset configurations, Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 4071 LNCS, pp. 360–3696
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEEConference on Computer Vision and Pattern Recognition pp. 779–788
Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: IEEE conference on computer vision and pattern recognition, pp. 7263–7271
Redmon J, Farhadi A (2018) YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767
Ren X, Ramanan D (2013) Histograms of Sparse Codes for Object Detection. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3246–3253
Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Ristani E, Solera F, Zou R, Cucchiara R, Tomasi C (2016) Performance measures and a data set for multi-target, multi-camera tracking, Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9914 LNCS, no. c, pp. 17–35
Ryoo MS, Matthies L (2013) First-person activity recognition: What are they doing to me?. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 2730–2737
SanMiguel JC, Martínez JM, García Á (2009) An Ontology for Event Detection and its Application in Surveillance Video, IEEE Int. Conf. Adv. Video Signal-Based Surveill., pp. 220–225
Sanmiguel JC, Martínez JM (2012) A semantic-based probabilistic approach for real-time video event recognition. Comput Vis Image Underst 116(9):937–952
Sanmiguel JC, Martínez JM (2013) A semantic-guided and self-configurable framework for video analysis. Mach Vis Appl 24(3):493–512
Saini R, Ahmed A, Dogra DP, Roy PP (2018) Proceedings of 2nd International Conference on Computer Vision & Image Processing, vol. 703, pp. 261–271
Saravanan D, Srinivasan S (2010) Data mining framework for video data. Recent Adv. Sp. Technol. Serv. Clim. Chang. 2010 (RSTS CC-2010), pp 167–170
Sermanet P, Kavukcuoglu K, Chintala S,Lecun Y (2013) Pedestrian detection with unsupervised multi-stage feature learning, In: IEEE Conference on Computer Vision and Pattern Recognition
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229
Shen J, Tao D, Li X (2008) Modality mixture projections for semantic video event detection. IEEE Transactions on Circuits and Systems for Video Technology 18(11):1587–1596
Shen J, Wang M, Chua TS (2016) Accurate online video tagging via probabilistic hybrid modeling. Multimedia Systems 22(1):99–113
Shen Z, Liu Z, Li J, Jiang Y-G, Chen Y, Xue X (2017) Dsod: Learning deeply supervised object detectors from scratch. In: Proceedings of the IEEE international conference on computer vision, pp. 1919–1927
Si Z, Pei M, Yao B, Zhu SC (2011) Unsupervised learning of event AND-OR grammar and semantics from video, In: Proc. IEEE Int. Conf. Comput. Vis., pp. 41–48
Sikos LF, Powers DMW (2015) Knowledge-Driven Video Information Retrieval with LOD: From Semi-Structured to Structured Video Metadata, Proc. Eighth Work. Exploit. Semant. Annot. Inf. Retr., pp. 35–37
Sikos LF (2016) A Novel Approach to Multimedia Ontology Engineering for Automated Reasoning over Audiovisual LOD Datasets, Springer-Verlag Berlin Heidelb, 9621:3–12
Sikos LF (2017) Description logics in multimedia reasoning. In: Springer, Cham, ISBN: 978-3-319-54066-5
Sikos LF (2018) VidOnt: a core reference ontology for reasoning over video scenes scenes. J Inf Telecommun 1–13
Sigari MH, Soltanian-Zadeh H, Pourreza HR (2016) A framework for dynamic restructuring of semantic video analysis systems based on learning attention control. Image Vis Comput 53:20–34
Sivic J, Zisserman A (2004) Video data mining using con .gurations of viewpoint invariant regions, Proc. 2004 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognition, 2004. CVPR 2004., pp. 488–495
Smeulders AWM, Worring M, Santini S, Gupta A, Jain R (2000) Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12):1349–1380
Snidaro L, Belluz M, Foresti GL (2007) Representing and recognizing complex events in surveillance applications, 2007 IEEE Conf. Adv. Video Signal Based Surveillance, AVSS 2007 Proc., pp. 493–498
Snoek CGM, Huurnink B, Hollink L, De Rijke M, Schreiber M, Worring M (2007) Adding semantics to detectors for video retrieval. IEEE Transactions on Multimedia 9(5):975–986
Sobhani F, Straccia U (2019) Towards a forensic event ontology to assist video surveillance-based vandalism detection. arXiv preprint arXiv:1903.09012
Son J, Baek M, Cho M, Han B (2017) Multi-object tracking with quadruplet convolutional neural networks. In: 30th IEEE Conf. Comput. Vis. Pattern Recognition, pp. 3786–3795
Sreeja MU, Kovoor BC (2021) A unified model for egocentric video summarization: an instance-based approach. Comput Electr Eng 1(92)
Sreenu G, Durai MS (2019) Intelligent video surveillance: a review through deep learning techniques for crowd analysis. Journal of Big Data 6(1):48
Stavropoulos TG, Meditskos G, Kompatsiaris I (2017) DemaWare2: integrating sensors, multimedia and semantic analysis for the ambient care of dementia. Pervasive Mobile Comput 34:126–1
Suresh V, Mohan CK, Kumaraswamy R, Yegnanarayana B (2005) Combining multiple evidence for video classification. In: Proc. - 2005 Int. Conf. Intell. Sens. Inf. Process. ICISIP’05, vol 2005, pp. 187–192
Szegedy C, Toshev A, Erhan D (2013) Deep neural networks for object detection. In Advances in neural information processing systems, pp. 2553–2561
Szegedy C et al. (2014) Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9
Tani MYK, Lablack A, Ghomari A, Bilasco IM (2015) Events detection using a video-surveillance ontology and a rule-based approach, Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 8926:299–308
Tani MYK, Ghomari A, Lablack A, Bilasco IM (2017) OVIS: ontology video surveillance indexing and retrieval system. Int J Multimed Inf Retr 6(4):295–316
Tasnim N, Islam MK, Baek JH (2021) Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints. Appl Sci 11(6):2675
Town C (2006) Ontological inference for image and video analysis. Mach Vis Appl 17(2):94–115
2014 TRECVID Multimedia Event Detection & Multimedia Event Recounting Tracks (2011) Available: http://nist.gov/itl/iad/mig/med14.cfm
Turaga PK, Veeraraghavan A, Chellappa R (2007) From videos to verbs: Mining videos for activities using a cascade of dynamical systems, In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognition
Uijlings JRR, Van De Sande KEA, Gevers T, Smeulders AWM (2012) Selective Search for Object Recognition
Ullah A, Muhammad K, Ding W, Palade V, Haq IU, Baik SW (2021) Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Appl Soft Comput 103:107102
Vallet D, Castells P, Fernández M, Mylonas P, Avrithis Y (2007) Personalized content retrieval in context using ontological knowledge. IEEE Trans. Circuits Syst. Video Technol. 17(3):336–345
Van de Sande K, Gevers T, Snoek C (2010) Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1582–1596
Vijayakumar V, Nedunchezhian R (2012) A study on video data mining. Int J Multimed Inf Retr 1(3):153–172
Wadley FM (2006) Probit Analysis: A Statistical Treatment of the Sigmoid Response Curve. 2nd ed. D. J. Finney. New York-London: Cambridge Univ. Press, 1952. 318 pp. $7.00, Science
Wang H (2015) Semantic Deep Learning, University of Oregon, pp. 1–42
Wang T, Snoussi H (2014) Detection of abnormal visual events via global. IEEE Trans Inf Forensics Secur 9(6):988–998
Wang B, Li W, Yang W, Liao Q (2011) Illumination normalization based on weber’s law with application to face recognition. IEEE Signal Process Lett
Wang M, Hong R, Li G, Zha ZJ, Yan S, Chua TS (2012) Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia 14(4):975–985
Wang X, Ji Q (2015) Video event recognition with deep hierarchical context model. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 07-12-June, pp. 4418–4427
Wang L et al (2016) Temporal segment networks: Towards good practices for deep action recognition. In: Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9912 LNCS, pp. 20–36
Wang H, Dou D, Lowd D (2016) Ontology-based deep restricted boltzmann machine. In: 27th International Conference on Database and Expert Systems Applications, DEXA 2016, Porto, Portugal, September 5–8, 2016, Proceedings, Part I, pp. 431–445. Springer International Publishing
Wang X, Girshick R, Gupta A, He K (2018) Non-local Neural Networks. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 7794–7803
Wojke N, Bewley A, Paulus D (2018) Simple online and realtime tracking with a deep association metric, Proc. - Int. Conf. Image Process. ICIP, vol. 2017-Septe, pp. 3645–3649
Wu Z et al (2015) Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: proceedings of the 23rd ACM international conference on Multimedia
Wu G, Liu L, Guo Y, Ding G, Han J, Shen J, Shao L (2017) Unsupervised deep video hashing with balanced rotation. In: IJCAI
Xie L, Sundaram H, Campbell M (2008) Event mining in multimedia streams. In: Proc. IEEE 96(4):623–647
Xu Z, Mei L, Liu Y, Hu C (2013) Video structural description: a semantic based model for representing and organizing video surveillance big data. In: 2013 IEEE 16th international conference on computational science and engineering (CSE), IEEE, pp 802–809
Xu Z, Liu Y, Mei L, Hu C, Chen L (2015) Semantic based representing and organizing surveillance big data using video structural description technology. J Syst Softw 102:217–225
Xu D, Zhu Y, Choy CB, Fei-Fei L (2017) Scene graph generation by iterative message passing. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5410–5419)
Xuan Wang HC, Song H (2017) Pedestrian abnormal event detection based on multi-feature fusion in traffic video. Optik (Stuttg) 11(3):29–38
Xue J, Li J, Gong Y (2013) Restructuring of deep neural network acoustic models with singular value decomposition, In: Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 2365–2369
Yao BZ, Yang X, Lin L, Lee MW, Zhu SC (2010) I2t: image parsing to text description. In: Proc IEEE 98(8):1485–150
Yoo D, Park S, Lee J-Y, Paek AS, Kweon IS (2015) Attentionnet: Aggregating weak directions for accurate object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2659–2667
Yu J, Lee Y, Yow KC, Jeon M, Pedrycz W (2021) Abnormal event detection and localization via adversarial event prediction. IEEE Transactions on Neural Networks and Learning Systems
Zablocki M, Gosciewska K, Frejlichowski D, Hofman R (2014) Intelligent video surveillance systems for public spaces-a survey. Journal of Theoretical and Applied Computer Science 8(4):13–27
Zaidenberg S, Boulay B, Brémond F (2012) A generic framework for video understanding applied to group behavior recognition, Proc. - 2012 IEEE 9th Int. Conf. Adv. Video Signal-Based Surveillance, AVSS 2012, pp. 136–142
Zeiler MD, Krishnan D, Taylor GW, Fergus R (2010) Deconvolutional networks. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, pp. 2528–2535
Zhang T, Yang Z, Jia W, Yang B, Yang J, He X (2016) A new method for violence detection in surveillance scenes. Multimed Tools Appl 75(12):7327–7349
Zhang T, Jia W, Yang B, Yang J, He X, Zheng Z (2017) MoWLD: a robust motion image descriptor for violence detection. Multimed Tools Appl 76(1):1419–1438
Zhao Y, Qiao Y, Yang J, Kasabov N (2015) Abnormal activity detection using spatio-temporal feature and Laplacian sparse representation. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Zhao ZQ, Xie BJ, Cheung Y, Wu X (2015) Plant Leaf Identification via a Growing Convolution Neural Network with Progressive Sample Learning. In: Cremers D, Reid I, Saito H, Yang MH (eds) Computer Vision - ACCV 2014. Lecture Notes in Computer Science, vol 9004. Springer, Cham
Zhang Y, Lin W, Zhang G, Luo C, Jiang D, Yao C (2014) A new approach for extracting and summarizing abnormal activities in surveillance videos. In: 2014 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2014
Zhang Y, Sohn K, Villegas R, Pan G, Lee (2015) Improving object detection with deep convolutional networks via bayesian optimization and structured prediction. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 249–258
Zhang X et al (2018) Qiniu Submission to Activity Net Challenge. pp 1–4
Zhou B, Andonian A, Oliva A, Torralba A (2018) Temporal Relational Reasoning in Videos. In: Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11205 LNCS, pp. 831–846
Zhu X, Wu X, Elmagarmid AK, Feng Z, Wu L (2005) Video data mining: semantic indexing and event detection from the association perspective. IEEE Trans Knowl Data Eng 17(5):665–667
Zitnick CL, Dollár P (2014) Edge boxes: Locating object proposals from edges. In: European conference on computer vision, pp. 391–405. Springer, Cham
Acknowledgements
We thank Dr. Vivek Tiwari (Department of Computer Science and Engineering at International Institute of Information Technology Naya Raipur) for improving the technical writing and flow of the paper.
Cite this article
Patel, A.S., Vyas, R., Vyas, O.P. et al. A study on video semantics; overview, challenges, and applications. Multimed Tools Appl 81, 6849–6897 (2022). https://doi.org/10.1007/s11042-021-11722-1
DOI: https://doi.org/10.1007/s11042-021-11722-1