Introduction

In the field of computer vision, action recognition has gained considerable attention since convolutional neural networks (CNNs) emerged as a tool for solving complex vision tasks, and it has attracted growing interest in the research community over the past few years [1]. Action recognition has been used in several real-life applications such as safety and security AI [2, 3], healthcare AI [2, 4, 5], and media AI [6, 7]. Developing algorithms that can intuitively detect actions in video streams presents an opportunity to advance the research frontier of AI for human action recognition. Action recognition comprises three major activities: feature detection, action representation, and action classification. Detecting actions in sequences of images or video streams poses unique challenges arising from cluttered backgrounds, occlusion, and the difficulty of labeling human actions, which vary from one person to another [8].

Action recognition restricted to a single action class per video stream has limited practical application. In a multi-class AR task, however, action localization in untrimmed video is tedious, as it involves developing architectures that can accurately set action boundaries and training end-to-end algorithms to recognize different action classes [9]. Pose estimation algorithms have been widely proposed for action recognition problems to help recognize and understand how each action happens [10], and pose estimation has succeeded in multiple human action recognition tasks [11,12,13,14]. However, it is not well suited to vehicle accident detection, given the specificity of the problem and the differences between humans and vehicles in physical behavior and construction.

Transfer learning is a commonly used technique for reusing features of deep neural networks trained on a robust dataset in one domain in a new domain or application area with reduced computational resources. Previous research has leveraged the transfer learning approach to improve action localization in video streams. Iqbal et al. [15] experimented with action localization on pre-selected frames by leveraging transfer learning from an existing model. The overarching goal was to avoid the complex architectures, expensive computation costs, and inefficient inference of existing methodologies.

The current research trend in action recognition focuses on deep neural networks with two-stream architectures combining optical flow and RGB (red, green, and blue) inputs [16]. Transferring features from a model pre-trained on small action classes significantly improves AR models' performance, while other work has focused on temporal localization and segmentation of actions in untrimmed video. Hidden Markov models have been used to capture long-range dependencies in frame-wise action recognition [17], while spatiotemporal convolution with a semi-hidden Markov model has been used to capture multiple action transitions in untrimmed video [18]. Iqbal et al. [15] applied a transfer learning technique with the I3D network to temporally untrimmed video to localize all action class instances in a video stream. Their experiments, using a deep network of vanilla temporal convolutions on features extracted from I3D, yielded state-of-the-art results with a lightweight model: a simple convolutional network extracting features from an existing model without multiple layers or gated convolutions [15].
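
To make the transfer-learning recipe concrete, the sketch below (ours, not the implementation of [15]) freezes a pre-trained 3D CNN as a feature extractor and trains only a small temporal convolution head; torchvision's R3D-18 stands in for I3D, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

num_classes = 4                                    # illustrative action classes

backbone = r3d_18(weights=R3D_18_Weights.DEFAULT)  # pre-trained 3D CNN
backbone.fc = nn.Identity()                        # expose 512-d clip features
for p in backbone.parameters():
    p.requires_grad = False                        # freeze transferred weights

# Lightweight "vanilla" temporal convolutions over per-clip features,
# in the spirit of the simple head described in [15].
head = nn.Sequential(
    nn.Conv1d(512, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(256, num_classes, kernel_size=1),
)

clips = torch.randn(8, 3, 16, 112, 112)            # (clips, C, T, H, W) dummy video
with torch.no_grad():
    feats = backbone(clips)                        # (8, 512), one vector per clip
logits = head(feats.T.unsqueeze(0))                # (1, num_classes, 8) over time
```

Only the head's parameters require gradients, so training cost stays far below that of fine-tuning the full 3D backbone.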

The main focus of this paper is the investigation of seminal articles on accident detection and the review of methods explored by researchers who use computer vision and action recognition techniques for detecting traffic accidents. Given the background on action recognition above, accident detection requires pattern matching and the capture of spatial–temporal information, along with other road structure artifacts, in order to detect traffic accidents. The research presented in [19, 20] combined sensors and AR methods to detect traffic accident anomalies in traffic flows. Recent deep learning techniques have employed transformer models, graph-based methods, and attention mechanisms for accident detection [21, 22]. Traffic accident detection is beneficial for managing urban traffic and providing motorists with adequate information about alternate routes while helping emergency responders take quick action. Yu et al. [23] proposed a road-level accident detection method that integrates internal factors (road type, road structure, environment) with external factors such as driver behavior, weather, and road congestion. We reviewed an extensive number of seminal articles on accident detection, including but not limited to [21, 24,25,26]. Road traffic accidents are among the leading causes of non-natural death, and artificial intelligence plays a significant role in detecting accidents and recognizing scene activity in autonomous transportation. Much research has focused on developing algorithms for detecting accidents and modeling the spatiotemporal information found in road structures. However, previous studies have not extensively addressed the different techniques and criteria for establishing new benchmark datasets for accident detection in smart cities. In an effort to develop a consistent benchmark, our study examined seminal articles on accident detection published in the past ten years, aiming to provide a more comprehensive understanding of the performance of each model. This paper provides a comprehensive review of action recognition focusing on accident detection and autonomous transportation in smart city transportation systems. The review covers the state-of-the-art techniques researchers have proposed, accident detection algorithms, the application of AR/accident detection in smart cities, and transfer learning approaches from complex architectures. Furthermore, we identify gaps in the existing literature on accident detection and formulate research questions to stimulate further research on public traffic safety using an accident detection model integrated into automated smart city traffic monitoring and safety technologies. The main contributions of this paper are summarized below:

  • Provided a comprehensive comparison of different action recognition techniques used in smart city transportation systems and synthesized state-of-the-art research findings from the past ten years on autonomous transportation and accident detection.

  • Interpreted and analyzed benchmark datasets, algorithms, and metrics used by relevant research in the traffic control and accident detection domain.

  • Explored gaps in existing methodologies that can be addressed by current technological advancements.

  • Identified potential future research questions that leverage existing methodologies with reduced model complexity and computational resources.

The structure of this paper is organized as follows: “Action recognition applications” section presents background and a review of existing literature on the domains mentioned above. The literature search, methodology, and inclusion and exclusion criteria are discussed in “Literature search” section. The research findings and detailed analysis are discussed in “Results” section. Finally, “Limitation” and “Conclusion” sections elaborate on the limitations and conclusions of the study.

Action recognition applications

Action recognition is a revolutionary topic in machine learning and computer vision that has been utilized in intelligent systems such as human-assisted AI (e.g., surgery [27, 28], sports [29, 30], education [6]), smart cities [31], safety and security [32, 33], smart home [34], crisis informatics [35], medical imaging [36, 37], and robotics [38, 39]. Considering the wide application area of AR, in this research, we limit our scope to the application of action recognition addressing accident detection in smart city autonomous transportation.

Action recognition in smart city

A futuristic direction in computer vision is the application of intelligent systems to autonomously perform human activities that are repetitive in nature and capital intensive.

In a smart city surveillance system, automated analysis of surveillance camera footage can efficiently spot violence and alert the appropriate enforcement agencies [40]. For example, SenSquare is a mobile crowd-sensing framework for smart cities that involves users’ participation in large-scale data gathering [41]. The SenSquare system was implemented using crowd-sensing heterogeneous data sources to gather data and develop classification algorithms for detecting potentially hazardous behavior in the environment [41,42,43]. The community-based monitoring paradigm focuses on tracking users, monitoring emergencies, and responding to them. Law enforcement agencies continuously face an uphill battle in controlling rising crime rates and gun violence. Deploying intelligent surveillance cameras can assist in the automatic detection of firearms and alert security agencies in near real time when a firearm is detected. Romero et al. [44] developed an object detection model that can detect firearms and crime scenes in dangerous situations based on the YOLO object detection framework using surveillance cameras. Jamil et al. [45] proposed a human action recognition system utilizing a spatial–temporal weighted BiLSTM-CNN framework for accurate recognition of firefighters’ activities during hazardous scenarios, integrating a 1D-CNN and a context-aware-enhanced BiLSTM in a three-stream architecture.
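
As a rough illustration of how such a camera-based alerting loop could look (a hypothetical sketch, not the system of [44]; the weights file, stream URL, and class name are placeholders):

```python
import cv2
from ultralytics import YOLO

model = YOLO("firearm_yolo.pt")                  # placeholder custom-trained weights
cap = cv2.VideoCapture("rtsp://camera/stream")   # placeholder surveillance stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run the detector on each frame and flag any firearm-class boxes.
    for result in model(frame, verbose=False):
        for box in result.boxes:
            if result.names[int(box.cls)] == "firearm":   # assumed class name
                print("Firearm detected, alerting agency:", box.xyxy.tolist())
cap.release()
```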

Human behavior and specific actions can be analyzed and classified using imaging and AI technologies. Patil et al. [46] demonstrated the feasibility of visual-based methods for facial emotion recognition (FER), leveraging both visual and physiological biosignals, which has potential applications in areas such as lie detectors and human–machine interfaces on portable hardware. Similarly, the application of AR models to understanding human behavior offers possibilities for smart city safety, especially in tracking drivers’ behavior. The National Highway Traffic Safety Administration (NHTSA) reported an increase in the number of fatalities caused by distracted drivers between 2019 and 2020, exceeding the corresponding figure for 2017, when distracted driving accounted for more than 8.5% of total traffic fatalities [47]. Celaya et al. [48] proposed a deep convolutional neural network for detecting texting-and-driving behavior using a car-mounted wide-angle camera with a pre-trained Inception v3 model. Emerging technologies like the AR model can be integrated with CCTV cameras to reduce fire accidents in smart cities. As described in [49], a fire detection method for smart city environments using the YOLOv4 algorithm, a robust model trained on augmented data (different weather environments) with a reduced network structure, demonstrated excellent performance and is highly effective for detecting fire disasters. In this paper, we focus on accident detection using data obtained from different types of surveillance cameras in smart city transportation safety and monitoring systems.
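
A hedged sketch of the transfer-learning setup behind such driver-behavior classifiers follows (a pre-trained Inception v3 with its classifier replaced by a binary head; the class labels are illustrative, not the exact configuration of [48]):

```python
import torch.nn as nn
from torchvision import models

model = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                           # keep ImageNet features fixed

# Replace the main and auxiliary classifiers with a 2-way head
# (e.g., "texting" vs. "attentive"); only these layers are trained.
model.fc = nn.Linear(model.fc.in_features, 2)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 2)
```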

Action recognition in autonomous transportation and accident detection

Robotics and auto-navigation systems have also benefited from using AR for autopilot functions, specifically obstacle detection, accident prevention, and lane departure assistance [40]. Accident detection in autonomous transportation systems is essential for tracking vehicles and identifying anomalies in traffic patterns. Cai et al. [50] discussed detecting abnormal traffic flow by clustering main flow direction vectors with a k-means algorithm to identify outliers that deviate from normal trajectory patterns or motion flows on highways. Previous research explored intelligent visual descriptions of scenes with connected image points using spatiotemporal dynamics in a hidden Markov model [51], while recent research has approached this challenge using machine learning and deep learning techniques [52,53,54]. Robles-Serrano et al. [54] combined convolutional layers and long short-term memory (LSTM) architectures to capture spatiotemporal features from sequences of images in video streams, an approach that has proven to achieve better performance [55, 56] because convolutional layers extract features from each image in the video stream while the LSTM learns temporal relationships between images in the sequence [57,58,59]. Obstacle detection is an integral part of intelligent transportation systems; Liang et al. [60] presented a refined multi-object detection algorithm combining DarkNet-53 with the enhanced features of DenseNet. Evaluated on benchmark datasets (KITTI and Pascal VOC), the proposed system showed notable improvements in model adaptability, especially in addressing the challenges of occlusion, underscoring its value for obstacle detection in intelligent transportation [60]. The accident detection task includes detecting spatiotemporal dependencies across multiple frames of video surveillance; correctly classifying video input as an accident is therefore the more challenging part of developing accident detection models and requires highly voluminous data. Carreira et al. [61] introduced a new two-stream inflated 3D ConvNet (I3D) based on 2D ConvNet inflation. The authors sought to unravel the correlation between performance gains and network complexity by inflating the filters and pooling kernels of image classification architectures into two-stream 3D ConvNets (I3D). The results of their proposed framework suggested that pre-training a model always boosts performance, although the extent of the boost varies significantly with the type of architecture.
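
As an illustration of this CNN-plus-LSTM pattern (a generic sketch with illustrative layer sizes, not the exact architecture of [54]):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class CnnLstmClassifier(nn.Module):
    """A 2D CNN embeds each frame; an LSTM models the temporal order."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        cnn = resnet18(weights=ResNet18_Weights.DEFAULT)
        cnn.fc = nn.Identity()                   # 512-d per-frame embeddings
        self.cnn = cnn
        self.lstm = nn.LSTM(512, 128, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, video):                    # video: (B, T, C, H, W)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)           # final hidden state summarizes the clip
        return self.fc(h_n[-1])                  # e.g., accident / no-accident logits

logits = CnnLstmClassifier()(torch.randn(2, 16, 3, 224, 224))
```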

Accident detection methods

Most researchers propose their own datasets and evaluation criteria for action recognition tasks, making it challenging to identify the most appropriate datasets and to compare results. Performance metrics also vary across research works; developing a standardized evaluation technique would lead to more robust research on applying AR to accident detection tasks. Current practice allows some data samples to be repeated or duplicated across train/test splits, which directly biases the measured performance when evaluating new work [62]. Stisen et al. [63] examined the effects of heterogeneous devices (variations in training and test device hardware) on human activity recognition model performance using hand-crafted features and popular classifiers such as nearest neighbor, support vector machines, and random forests; they observed sampling instabilities across the various devices. The dataset source also plays a crucial role in designing accident detection models, because videos captured by dashcams have different trajectories and street views than those from highway or traffic-light surveillance cameras. Dashcams capture traffic video from a horizontal view in which both the camera and the surrounding objects are moving. This increases the problem’s complexity, especially when distinguishing objects approaching the dashcam from objects the car itself is moving toward. Traffic-light and highway cameras record the scene from a vertical view with the camera in a fixed position, so moving objects are recorded from a fixed viewpoint. Addressing each type of video content therefore plays a significant role in calculating object trajectories, accelerations, and moving directions. Sayed et al. [64] highlighted challenges in AI-based traffic flow prediction, such as the scarcity of high-quality training data and of computationally efficient methods; these issues, coupled with underutilized spatiotemporal correlations in deep learning methods, restrict advancement in traffic flow prediction [64].
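
One simple guard against the train/test duplication problem noted above is to split at the level of the source recording; a minimal sketch with scikit-learn's GroupShuffleSplit follows (the clip and video identifiers are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

clips = [f"clip_{i}" for i in range(100)]        # candidate training samples
video_ids = [i // 5 for i in range(100)]         # 5 clips cut from each source video

# Group-aware split: clips from the same video never land on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(clips, groups=video_ids))

train_videos = {video_ids[i] for i in train_idx}
test_videos = {video_ids[i] for i in test_idx}
assert not train_videos & test_videos            # no source video is shared
```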

Machine learning and statistical models

Most machine-learning algorithms focus on vehicle trajectory, motion, acceleration, and car position to detect car accidents. Singh et al. [64] combined an object detection algorithm with an anomaly detection algorithm to identify accidents, proposing a framework that extracts deep representations using autoencoders and applies an unsupervised model (SVM) to detect the possibility of an accident; vehicle trajectories at intersection points were used to increase the architecture’s precision and reliability. Joshua et al. [19] proposed mathematical relationships, obtained through multiple linear and Poisson regression analyses, to identify factors contributing to significant truck accidents on the highway, using an accident dataset from Virginia highway traffic in combination with other geometric variables to model the percentage of trucks involved in road accidents. Arvin et al. [20] leveraged the availability of extensive data from interconnected devices to correlate erratic driving volatility with historical crash datasets from intersections in Michigan; statistical approaches such as fixed-parameter, random-parameter, and geographically weighted Poisson regressions, together with longitudinal and lateral acceleration, were used to identify road crash hotspots.
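
For readers unfamiliar with the count-regression side of this line of work, a minimal Poisson-regression sketch follows (the synthetic data and covariate names are illustrative, not the specification of [19]):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # e.g., lane width, roadway grade
X = sm.add_constant(X)                        # intercept column

# Synthetic accident counts whose rate depends on the first covariate.
counts = rng.poisson(np.exp(0.3 + 0.5 * X[:, 1]))

# Poisson GLM: exponentiated coefficients give multiplicative effects
# of each geometric variable on the expected accident count.
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.summary())
```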

Deep machine learning models

Most deep learning algorithms likewise focus on vehicle trajectory, motion/acceleration, and car position for detecting car accidents. Chan et al. [65] proposed a dynamic-spatial attention (DSA) recurrent neural network (RNN) for anticipating accidents in dashcam videos based on vehicle trajectory and motion. The algorithm uses an object detector to dynamically gather subtle cues and models the temporal dependencies among all cues, predicting accidents two seconds before they occur with a recall of 80% but a low precision of 56.14%. The model’s generalizability to varying weather conditions was not measured, given the limited number of videos covering rain, snow, day/night, and other conditions. Robles-Serrano et al. [54] explored deep neural networks for accident detection using a three-stage approach: first segmenting the visual characteristics of objects in the dataset, then building on the Inception V4 architecture to extract the temporal components used in detecting accidents, followed by temporal video segmentation. As part of the temporal video segmentation, a structural similarity index (SSIM) was applied to the dataset at preprocessing time to accurately select image frames representing an accident or no accident and to eliminate frames that contain no event occurrence or merely repeat a selected event. During preprocessing, pixel-to-pixel comparisons were made to select a certain number of consecutive frames containing features to train the model, based on a specified threshold. Finally, the framework was designed to detect accidents automatically using convolutional LSTM (ConvLSTM) layers to capture the spatial and temporal dependencies in the input data [66, 67]; this type of neural network has proven to perform better than LSTM and CNN architectures on datasets with both spatial and temporal structure. One potential limitation is model bias arising from vehicle types and environmental conditions, such as limited vehicle variety and the absence of pedestrians and cyclists.
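
A minimal sketch of this kind of SSIM-based frame selection is shown below (the threshold and the keep-on-change policy are illustrative, not the exact procedure of [54]):

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def select_keyframes(video_path: str, threshold: float = 0.85):
    """Keep a frame only when it differs enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    kept, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Low SSIM against the previous keyframe means new event content.
        if prev is None or ssim(prev, gray) < threshold:
            kept.append(frame)
            prev = gray
    cap.release()
    return kept
```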

Social network and geosocial media data

The enormous amount of information constantly shared across social media platforms contains artifacts that can be analyzed to generate meaningful insights into traffic events [68]. However, manually monitoring and analyzing this exploding volume of information is effectively impossible given its scale and unstructured format [69]. Monitoring traffic-related information on social media has proven beneficial for detecting traffic events. Xu et al. [70] synthesized research that explored the use of geosocial media data for detecting traffic events. Events such as road accidents, road closures, and traffic conditions are typically shared among networks of people through social media platforms; such events can be tracked with the aid of GPS to direct first responders to the event location, and the posts often contain information about what triggered the events. Xu et al. [71] utilized Twitter data, mining and filtering noisy data via association rules among words related to traffic events; the proposed framework achieved 81% accuracy in classifying posts into non-traffic events, traffic accidents, roadwork, and severe weather conditions. Similarly, Salas et al. [72] developed a framework that crawls, processes, and filters social media data to infer traffic incidents and detect traffic events in real time with a text classification algorithm [73].
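
A hedged sketch of such a text-classification pipeline follows (TF-IDF features plus a linear classifier; the example posts and labels are invented for illustration, not the data of [71, 72]):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["Pileup on I-95 near exit 4", "Great concert tonight!",
         "Road closed for construction on Main St", "Heavy snow, cars sliding"]
labels = ["accident", "non-traffic", "roadwork", "weather"]

# Word/bigram TF-IDF features feed a simple multinomial classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(posts, labels)
print(clf.predict(["Two-car crash blocking the left lane"]))
```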

Literature search

The literature search process consists of four steps: (i) selecting eligibility criteria (inclusion and exclusion criteria), (ii) formulating research objectives, (iii) identifying a search strategy, and (iv) data extraction [74, 75]. This study employed a systematic review methodology to address the research questions through a systematic and replicable process [76]. Specifically, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement was used as a model for this review [77, 78]. The selected papers were analyzed and synthesized against established eligibility criteria to address the research questions postulated in the following subsection.

Research questions and objectives

Developing AR models for specific tasks will enhance the use of AI systems in automating human actions and autonomously detecting actions in live feeds. Once the inclusion selection process had been carried out based on pre-established criteria, the main results of the selected works were codified, extracted, and synthesized to guide this research. The following research questions were addressed in this comprehensive review:

  • RQ1: What are the main action recognition techniques/applications in accident detection and autonomous transportation?

  • RQ2: What are the main taxonomies and algorithms used in action recognition for accident detection and autonomous transportation?

  • RQ3: What are the main datasets, features, and metrics used in action recognition for accident detection tasks?

Selecting eligibility criteria

This review includes research articles related to action recognition, covering the topics of autonomous transportation, traffic control, and accident detection using computer vision, published in peer-reviewed journals. Given the continuously evolving advancements in the field, we limited our scope to research articles published in the ten years preceding this review (from 2012 to 2022), and only articles published in English were used. The inclusion and exclusion criteria are detailed in “Inclusion criteria” and “Exclusion criteria” sections, respectively. This systematic review is based primarily on computer vision tasks using AR models in autonomous transportation and smart city accident detection.

Inclusion criteria

To be included, publications needed to meet the following criteria:

  1. Articles should be in the action recognition and computer vision research domain.

  2. Studies include validation of the proposed techniques.

  3. Published within the last ten years (i.e., between 2012 and 2022).

  4. Peer-reviewed full research papers.

  5. Contain analysis of spatial/temporal information.

Exclusion criteria

In our exclusion criteria, the following exclusions were applied:

  1. Does not contain video/motion analysis.

  2. Published before 2012.

  3. Not a peer-reviewed research paper.

  4. Does not provide clear findings and analysis of results.

  5. Written in a language other than English.

  6. Duplicated studies.

Information sources and search strategy

The papers included in our review were identified by searching electronic databases for publications in English. The databases in Table 1 were used as the primary sources of articles for this review; they provide impactful articles from full-text journals and conferences relevant to action recognition tasks in smart city automation, autonomous transportation, and accident detection. The first phase involved searching the databases in Table 1 with advanced search and filtering techniques to limit the results to relevant studies. The number of articles retrieved from each database and the final number of papers selected are shown in Fig. 1. Combining the following keywords with the conjunction “AND” and disjunction “OR” yielded a total of 2030 papers in an automated search, as shown in Table 1. The most common terms used in our search were:

  1. Action Recognition.

  2. Transportation.

  3. Traffic control.

  4. Accident Detection.

Table 1 Article data source
Fig. 1 Proportion of selected studies

The results of our search and the corresponding queries used are as follows:

  • IEEE Xplore: We received 299 papers from IEEE using the search string: [((“All Metadata”: Action Recognition) AND (“All Metadata”: Transportation) OR (“All Metadata”: Action Recognition) AND (“All Metadata”: Traffic) OR (“All Metadata”: Action Recognition) AND (“All Metadata”: Accident Detection))] between 2013 and 2022

  • ACM: We received 181 papers from ACM using the search string: [AllField:(“Action Recognition”) AND AllField:(“Transportation”) OR AllField:(“Action Recognition”) AND AllField:(“Traffic”) OR AllField:(“Action Recognition”) AND AllField:(“Accident Detection”)]

  • Web of Science: We received 445 papers from Web of Science using the search string: [((ALL = (Action Recognition) AND ALL = (Transportation OR Traffic OR Accident Detection))) AND (PY = (“2022” OR “2021” OR “2020” OR “2019” OR “2018” OR “2017” OR “2016” OR “2015” OR “2014” OR “2013”))]

  • Springer Link: We received 572 papers from Springer Link using the search string: [(“Action Recognition”) AND ((“Transportation”) OR (“Traffic”) OR (“Accident Detection”))] between 2013 and 2022

  • Science Direct: We received 533 papers from Science Direct using the search string: [(“Action Recognition” AND “Transportation”) OR (“Action Recognition” AND “Traffic”) OR (“Action Recognition” AND “Accident Detection”)] between 2013 and 2022

Study selection

The articles were evaluated and selected according to the criteria mentioned in “Literature search” section. After a preliminary database search using the approved search strategy, conducted by student researchers, and the elimination of duplicates, a total of 1829 articles were screened independently by two faculty researchers and one student researcher, all domain experts. The abstracts, titles, and keywords of the selected articles were reviewed for relevance against the inclusion and exclusion criteria, and articles that did not meet the eligibility criteria or were not relevant to the research questions were removed. The independent researchers rated each article against the eligibility criteria; this painstaking selection protocol ensures that all included articles are relevant to this study. A total of 1650 papers were excluded because they did not contain video analysis or did not employ AR techniques for detecting accidents; 33 papers were excluded because they lacked validation of the proposed methodology; 108 papers identified as review papers were excluded; and 17 papers contained only abstracts. Duplicate studies covering the same issues were also excluded. Figure 1 showcases the proportion of initial and final articles selected from each of the five online data sources listed in Table 1. Finally, only 21 papers were selected for analysis, as shown in Fig. 2.

Fig. 2 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart of the systematic review

Results

Following the PRISMA guidelines, 2030 publications were identified through the five databases included, and the results of the 21 papers selected for review are presented in this section. Figure 3 shows the publication year of the selected papers; notably, the majority were published between 2019 and 2021. Taking advantage of advancements in technology and smart city automation, recent research employs deep learning techniques to model traffic-related activities in the smart city using computers equipped with high-performance GPUs.

Fig. 3 The number of surveyed papers published per year

RQ1: main action recognition techniques/applications in accident detection and autonomous transportation

The first research question of our study examines the main AR techniques and applications within smart cities and autonomous transportation, as framed in “Research questions and objectives” section. Many researchers have proposed methods to model traffic management and traffic prediction, including vector auto-regression, support vector regression, the auto-regressive integrated moving average (ARIMA), Kalman filters, RNNs, and transformer models [79, 80]. For time series data such as traffic control data, these approaches have not been able to capture both spatial and temporal information concisely; recent efforts based on GNN and GaAN architectures have improved accuracy [81, 82]. Ijjina et al. [83] proposed a supervised deep learning framework to detect and identify roadside vehicular accidents by extracting feature points such as car trajectory, weather conditions (daylight variations), and velocity to detect traffic anomalies in real time. Fernandez-Llorca et al. [84] utilized a disjoint two-stream convolutional network and a spatiotemporal multiplier network with visual cues extracted from the camera to detect lane changes and vehicle maneuvers. You et al. [85] found that temporal segmentation methods such as SS-TCN and MS-TCN were more successful at higher IoU thresholds. Their experiments also suggest that the region convolutional 3D network (R-C3D) algorithm achieves results comparable to segmentation-based approaches, while newer methods like R(2+1)D and the SlowFast network have improved accuracy. Most techniques fail to capture traffic anomalies accurately on the DoTA dataset, suggesting that traffic anomaly classification is a challenging task. Yao et al. [86] note that distant and occluded objects are difficult to classify because of their low visibility; collisions with moving vehicles present a similar problem because the vehicle ahead is substantially obscured by the vehicle it impacts. There are also instances where a vehicle hits obstacles that are not detected, such as bumpers or traffic cones, and most often the anomalous vehicles themselves occlude the obstacles. Horizontal vehicle collisions are likewise hard to detect when the involved trajectories run vertically in the frame, making such traffic anomalies subtle. The joint sparse modeling (JSM) method extracts motion trajectories to evaluate traffic scenes but ignores traffic events that occur in unusual ways [86, 87]. Srinivasan et al. [24] developed a scalable algorithm (DETR) for high-speed object detection, with a less complex architecture and higher accuracy than other object detection algorithms, using correlation techniques between objects in video data. Tables 2 and 3 address the research question on the main action recognition techniques and applications in autonomous transportation. The notation “–” indicates that the corresponding research paper did not address our research question.

Table 2 Studies were used to address the research question on main action recognition techniques and applications in autonomous transportation
Table 3 Keynote of studies that were used to address the research question on main Action Recognition techniques and applications in autonomous transportation

RQ2: Algorithms and taxonomies in accident detection and autonomous transportation

In order to answer our second research question, we identified the most critical taxonomies and algorithms used in AR systems for autonomous transportation and accident detection. Table 4 shows the models, architectures, and features used by other researchers, along with the metrics used to evaluate the performance of the proposed models. It is noteworthy that most research proposing novel algorithms employs different metrics to evaluate performance. Yao et al. [86] proposed a new metric for computing traffic anomaly scores, the spatial–temporal area under the curve (STAUC), used with a future object localization (FOL) method for unsupervised video anomaly detection (VAD). More than 60% of the reviewed papers evaluated their algorithms using mean absolute percentage error (MAPE), mean absolute error (MAE), mean average precision (MAP), intersection over union (IOU) [97], or detection rate (DR). Reddy et al. [26] developed a spatiotemporal graph neural network for managing and predicting traffic accidents, whereas RNN, LSTM, and other architectures could not fully capture both the spatial and temporal information relevant to accident detection. Their study combined a GNN, an RNN, and a transformer layer to model complex topological and temporal relationships in traffic video data, including adjacent traffic flows. Yu et al. [23] proposed a new graph-based spatiotemporal model to predict future traffic accidents; integrating spatial, temporal, and external features achieved a performance improvement of around 5% over the spatial autoencoder (SAE). Ali et al. [22] developed a graph convolutional network coupled with a dynamic deep hybrid spatiotemporal neural network (DHSTNet), called GCN-DHSTNet, an enhanced GCN model for learning the spatial dependencies of dynamic traffic flow; an LSTM was used to capture dynamic temporal correlations with other external features. In terms of RMSE and MAPE, the proposed model is 27.2% and 11.2% better, respectively, than the current state of the art (AAtt-DHSTNet). Wang et al. [80] focused on accident prediction that considers spatiotemporal dependence and other external factors in anticipating accident occurrence and classifying the external factors that lead to it. Reddy et al. [26] proposed a hybrid method for detecting stationary objects, moving vehicles, traffic lights, and road signs using deep Q-learning with YOLOv3. Bortnikov et al. [92] developed a hierarchical recurrent neural network (HRNN) for detecting accidents in CCTV surveillance by exploring temporal and spatial features of video footage. Yang et al. [94] proposed a tracking-based object detection (TDO) technique with a feature-fused SSD; TDO significantly improved detection results over the state of the art on established vehicle datasets for highway scene analysis. Huang et al. [52] developed a supervised learning algorithm to detect crash patterns from historical traffic data and examined different prediction methods to estimate crash risk and accident occurrence. You et al. [85] also created a cause-and-effect-based traffic accident benchmark dataset with temporal intervals for each traffic accident event; the dataset provides atomic cues for reasoning in complex environments and planning future actions, including mitigating ambiguity in traffic accidents. The framework developed by Tang et al. [91] can classify traffic data into different categories, such as detecting vehicle turning directions, bicycle lanes, and pedestrians, within two seconds of traffic footage.
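
Since these metrics recur throughout Table 4, minimal reference implementations may help; the formulas below follow the standard definitions, with the IoU given in its temporal (interval) form used for action localization.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no zero targets)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def temporal_iou(seg_a, seg_b):
    """IoU of two (start, end) intervals, e.g., predicted vs. true accident span."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))   # 0.333...
```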

Table 4 Identifying the main taxonomies and algorithms used in AR for autonomous transportation based on relevant studies to our second research question

RQ3: Main dataset, features, and metrics for action recognition for accident detection

Our third research question focused on the datasets used for accident detection. Table 5 showcases the dataset features, types of sensor/video data, and links to publicly available datasets for accident detection in a smart city. Yao et al. [86] developed a benchmark dataset to assess the quality of traffic accident detection and anomaly detection across nine action classes. Given the limited availability of annotated real-life accident datasets, Bortnikov et al. [92] utilized simulated game video data with varied weather and scene conditions; the method yielded results comparable to real-life traffic videos from YouTube, as shown in Table 5. The majority of datasets used in accident detection and autonomous vehicle research are collected from dashcams, traffic surveillance cameras, drones (e.g., the HighD, InD, and Interaction datasets [98, 99]), and cameras installed on buildings. For example, the NGSIM HW101 and NGSIM I-80 datasets [100, 101] contain 45 min of images recorded from a building by eight synchronized cameras at 10 Hz. Fernandez-Llorca et al. [84] suggest that this dataset (NGSIM HW101) is not fully applicable to onboard detection applications, even though it is beneficial for understanding and assessing the motion and behavior of vehicles and drivers under different traffic conditions. The PKU dataset includes more than 5700 environmental trajectories collected using multiple horizontal 2D LiDARs covering 360°, including vehicle trajectory data over 64 km and 19 h of footage [102]. The Prevention dataset includes data from three radars, two cameras, and one light detection and ranging (LiDAR) sensor, covering a range of 80 m around an ego-vehicle, to support the development of intelligent systems for vehicle detection and tracking [103]. Similarly, the ApolloScape dataset was developed to support automatic driving and navigation in smart cities; it contains about 100 K image frames and 1000 km of trajectories collected using four cameras and two laser scanners with 3D-perception LiDAR [104]. Ijjina et al. [83] compiled surveillance videos at 30 frames per second (FPS), trimmed into 20-s chunks, from CCTV recordings at road intersections in different parts of the world under diverse ambient conditions such as harsh sunlight, daylight hours, snow, and night hours.
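
Trimming long recordings into fixed-length clips, as in [83], is a routine preprocessing step; a hedged OpenCV sketch follows (the output naming, codec, and assumption of a constant 30 FPS are illustrative):

```python
import cv2

def split_into_chunks(src: str, chunk_seconds: int = 20, fps: int = 30):
    """Split a long surveillance video into fixed-length clips."""
    cap = cv2.VideoCapture(src)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    frames_per_chunk = chunk_seconds * fps
    chunk, frame_idx, writer = 0, 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frames_per_chunk == 0:   # start a new 20-second clip
            if writer:
                writer.release()
            writer = cv2.VideoWriter(f"chunk_{chunk:04d}.mp4", fourcc, fps, (w, h))
            chunk += 1
        writer.write(frame)
        frame_idx += 1
    if writer:
        writer.release()
    cap.release()
```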

Table 5 Overview of datasets used in AR for autonomous transportation, features of the datasets, and download links to the datasets

Discussion

Key summary and recommendations RQ1

From the analysis of the selected papers, the main action recognition techniques in accident detection and autonomous transportation include a supervised deep learning framework for detecting and identifying roadside vehicular accidents by extracting feature points, as proposed by Ijjina et al. [83], and a disjoint two-stream convolutional network with a spatiotemporal multiplier network for detecting lane changes or vehicle maneuvers, as studied by Fernandez-Llorca et al. [84]. The research of You et al. [85] demonstrates that temporal segmentation methods such as the single-stream temporal convolutional network (SS-TCN) and multi-stream temporal convolutional network (MS-TCN) performed better at higher intersection over union (IoU) thresholds, indicating the methods’ effectiveness in capturing fine-grained temporal and complex patterns in action recognition tasks. The region convolutional 3D network (R-C3D) algorithm shows results comparable to segmentation-based approaches, while newer methods such as the residual (2+1)D convolutional network (R(2+1)D) and the SlowFast network have improved accuracy. Current state-of-the-art techniques still have limitations in accurately capturing traffic anomalies on the DoTA dataset, especially for occluded and distant objects. Further exploration of AR techniques such as R(2+1)D, the SlowFast network, and the DETR algorithm has shown promising results in terms of accuracy, performance, and detection speed. Additionally, it would be beneficial to research methods that can better handle occluded objects, distant objects, and horizontal vehicle collisions. Improving the performance of these techniques and addressing their limitations will enhance the overall safety and efficiency of accident detection and autonomous transportation systems within smart cities.

Key summary and recommendations RQ2

The methodology proposed by Yao et al. [86] is beneficial in unsupervised scenarios and can be deployed to detect traffic anomalies in real time, especially given the lack of publicly available annotated datasets. Their unsupervised video anomaly detection calculates traffic anomaly scores using the spatial–temporal area under the curve (STAUC) and employs the future object localization (FOL) method to detect anomalous events in videos. Another technique, by Reddy et al. [26], combines a GNN, an RNN, and a transformer layer to model complex topological and temporal relationships in traffic video data; by capturing both spatial and temporal information, the spatiotemporal graph neural network outperforms RNN- and LSTM-based methods in predicting traffic accidents. Similarly, Yu et al. [23] utilized a graph-based spatiotemporal model for predicting future traffic accidents, integrating spatial, temporal, and external features to improve overall prediction accuracy; the model performs around 5% better than the spatial autoencoder (SAE). Another methodology for capturing external features combined a graph convolutional network with a dynamic deep hybrid spatiotemporal neural network (DHSTNet) to capture the spatial dependencies of dynamic traffic flow, using LSTM cells to capture temporal correlations with external features [22]. Other taxonomies in accident detection and autonomous transportation include deep Q-learning with YOLOv3, a hybrid approach that combines the strengths of deep Q-learning and YOLOv3 for efficient object detection in traffic scenes [26]. A hierarchical recurrent neural network (HRNN) focused specifically on detecting accidents in CCTV surveillance, exploring the temporal and spatial features of video footage to identify accident occurrences effectively [92]. Yang et al. [94] proposed a tracking-based object detection (TDO) and feature-fused SSD technique that improves detection results over state-of-the-art methods on established vehicle datasets for highway scene analysis. Based on our findings from RQ2, we recommend exploring the potential benefits of combining hybrid taxonomies and multiple algorithms to better capture the complex spatial and temporal relationships in traffic video data; the spatiotemporal graph neural network by Reddy et al. [26] demonstrates the effectiveness of such an approach. However, it is essential to consider the potential trade-offs of these methods, such as increased computational costs and difficulties in real-time deployment. Complex models can be challenging to deploy on edge devices, such as cameras or other IoT devices, due to their limited processing capability; model compression or pruning techniques can optimize algorithms that require high processing power and memory, as sketched below.
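
As a minimal illustration of the pruning option (a generic PyTorch sketch, not tied to any reviewed model):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_convs(model: nn.Module, amount: float = 0.3) -> nn.Module:
    """Remove the smallest-magnitude weights from every Conv2d layer."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")   # make the sparsity permanent
    return model
```

Unstructured pruning alone only zeroes weights; pairing it with quantization or structured pruning is usually needed before real latency gains appear on edge hardware.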

Key summary and recommendations RQ3

Our analysis highlights the importance of the various datasets for accident detection and autonomous vehicles. These datasets are collected from different sources such as dashcams, traffic surveillance cameras, drones, and cameras installed on buildings. Examples of popular datasets include HighD, InD, NGSIM HW101, NGSIM I-80, PKU, the Prevention dataset, and the ApolloScape dataset. The features in these datasets vary but include environmental trajectories, vehicle trajectory data, and footage captured under diverse ambient conditions. To improve the generalizability of action recognition models for accident detection, we recommend that future research utilize diverse datasets encompassing various traffic conditions, weather conditions, and geographical locations, as this has demonstrated improved model performance [23, 92] and helps ensure that the developed models perform well in real-world scenarios. Furthermore, there is a need for benchmark datasets that can help assess the quality of traffic accident detection across different action classes, facilitating fair comparison of the performance of various models and techniques. While simulated game video data has been shown to yield results comparable to real-life traffic videos [92], it is essential to prioritize real-life data to ensure the effectiveness of the developed models in real-world situations. Limited annotated real-life accident datasets pose a challenge for researchers; we believe that investing in annotating and sharing such datasets will encourage more researchers to develop sophisticated methodologies and algorithms specific to accident detection and will help improve the performance of action recognition models. This will not only contribute to the advancement of the field but also support the goal of interconnected smart city automation, enhancing traffic safety and efficiency in urban environments.

Limitation

Our research focused on papers relevant to action recognition, accident detection, and autonomous transportation, and our systematic literature review has some limitations arising from the inclusion and exclusion criteria applied during the search process. The time constraint of including only articles published within the last ten years (between 2012 and 2022) may have excluded relevant research published before 2012, potentially limiting our understanding of the evolution of action recognition techniques and their application to accident detection. Because our criteria excluded articles written in languages other than English, we may also have missed research findings and advancements in action recognition and accident detection from non-English-speaking research communities; this language barrier may limit the comprehensiveness of our review and introduce potential biases into our findings. Furthermore, the requirement that studies validate their proposed techniques and analyze spatial/temporal information may have excluded some potentially relevant studies that focused on theoretical developments, proposed novel techniques without immediate validation, or used alternative methods for action recognition. These limitations could affect the overall comprehensiveness and generalizability of our systematic literature review.

Conclusion

This systematic literature review aims to determine the state of the art in action recognition for accident detection and autonomous transportation in smart cities. We used the PRISMA guidelines to select seminal articles related to our topic domain, based on the inclusion and exclusion criteria discussed in “Literature search” section. We selected 21 papers from an initial list of 2030 publications and categorized and analyzed the relevant papers along the three pillars of our research questions. This paper discussed the leading techniques and applications of action recognition in autonomous transportation and explored the main taxonomies and algorithms used in AR for the domain. Finally, we presented an overview of the datasets used in AR for autonomous transportation and their features; download links to the datasets are embedded in the accessibility column of Table 5.

In the quest for a smart city, automating city traffic by capturing spatial and temporal information with deep neural networks is a significant step toward smart city automation. Bao et al. [88] developed a model to handle the challenges of relational feature learning and uncertainty anticipation from traffic video, predicting accident occurrence within 3.53 s with an average precision of 72.22% using a graph convolutional network (GCN) and Bayesian neural networks (BNNs). Several factors are involved in traffic accident detection, including driver behavior, weather conditions, traffic flow, and road structure. Yu et al. [23] examined spatial–temporal relationships in heterogeneous data to develop a road-level accident prediction system. Besides sequential patterns in the temporal dimension, traffic flow is strongly affected by other road networks in the spatial dimension; many studies of traffic flow prediction cannot account for these spatial and temporal dependencies [80]. Reddy et al. [26] aimed to extract roadway characteristics relevant to the trajectory of an autonomous vehicle from real-world road conditions using deep Q-learning. Analyzing and forecasting dynamic traffic patterns within smart cities is necessary for planning and managing transportation, and forecasting traffic flow is made more difficult by the volatility of vehicle flow in the temporal dimension and the uncertainty surrounding accident occurrence and traffic movements. Ali et al. [22] proposed a hybrid model composed of a GCN and DHSTNet that can forecast short-term traffic patterns in urban areas for improved traffic management. Similarly, Alkandari et al. [89] developed a methodology for determining how long a vehicle stays in traffic based on traffic flow and congestion.

Automating accident detection with AI systems based on traffic cameras will be a step toward saving more lives. It will also support the transformation of traffic cameras for smart city automation, providing first responders and law enforcement agencies with information about road accidents. Based on the foregoing, we recommend:

  • Experimental research that combines action recognition techniques for object and human action classification, since both have been developed using similar model architectures.

  • Future reviews in this area should consider addressing the limitations of this study by including a broader range of publication years, languages, and publication types to ensure a more comprehensive understanding of action recognition techniques and their application in accident detection.

  • Future research should focus on scaling up accident detection systems that can be integrated into smart city automation for alerting first responders about traffic accidents.

Finally, adopting an automated accident detection system will support first responders in providing a quick response to victims, thereby reducing human error and response time.