Prediction and Analysis of the Severity and Number of Suburban Accidents Using Logit Model, Factor Analysis and Machine Learning: A case study in a developing country

The purpose of this study is to investigate and determine the factors affecting vehicle and pedestrian accidents taking place in the busiest suburban highway of Guilan Province located in the north of Iran and provide the most accurate prediction model. Therefore, the effective principal variables and the probability of occurrence of each category of crashes are analyzed and computed utilizing the factor analysis, logit, and Machine Learning approaches simultaneously. This method not only could contribute to achieving the most comprehensive and efficient model to specify the major contributing factor, but also it can provide officials with suggestions to take effective measures with higher precision to lessen accident impacts and improve road safety. Both the factor analysis and logit model show the significant roles of exceeding lawful speed, rainy weather and driver age (30–50) variables in the severity of vehicle accidents. On the other hand, the rainy weather and lighting condition variables as the most contributing factors in pedestrian accidents severity, underline the dominant role of environmental factors in the severity of all vehicle-pedestrian accidents. Moreover, considering both utilized methods, the machine-learning model has higher predictive power in all cases, especially in pedestrian accidents, with 41.6% increase in the predictive power of fatal accidents and 12.4% in whole accidents. Thus, the Artificial Neural Network model is chosen as the superior approach in predicting the number and severity of crashes. Besides, the good performance and validation of the machine learning is proved through performance and sensitivity analysis.


Introduction
With the growing economy in developing countries, suburban traffic plays a crucial role in the country's comprehensive transportation system. The increase in road transport in comparison to less progress in other types of transportation systems and insufficient infrastructures in Iran, has significantly increased the urban pollution, road users wasted time and above all the damages caused by traffic accidents [1,2]. The high death rate of traffic accidents in suburban roads is considered as one of the challenging safety issues in developing countries like Iran. According to World Health Organization (WHO), there are more than 20,000 fatalities, and around 300,000 injuries in road traffic accident occurred in Iran each year, which 69% of them belongs to suburban roads' crashes [3,4]. Therefore, the analysis and investigation of suburban road accidents and providing solutions to reduce them due to local traffic and environmental characteristics are essential to be investigated. It is obvious that such recognition will lead to the feasibility of developing traffic safety programs of engineers and will enable them to better understand the factors that have a positive or negative impact on the severity of crashes. The ultimate goal of analyzing and studying the data gathered by experts is to reach the most accurate and comprehensive method to forecast type and number of accidents considering the given characteristics such as geographical, physical, and human factors of studied road. Limited available accidents data especially in short-term period or pedestrian crashes is deemed one of the main challenges of engineers. On the other side, dealing with limited number of accidents is the nature of road accidents analysis. Due to numerous preventive and corrective measures utilized by governments, the number of accidents should be reduced as many as possible. Therefore, due to this limitation, it is decided to utilize the statistical approaches (factor analysis and logit model) and machine learning in order to investigate the occurred accidents in one of the busiest suburban highways of Guilan Province located in North of Iran. The final goal is to determine and analyze the most effective parameters on increasing the severity of accidents and present the most accurate prediction model for vehicle and pedestrian accidents separately.
This paper is organized as follows. Section 2 describes the past studies about the application of statistical and artificial neural network approaches in generating and analyzing the prediction model of accidents. Section 3 introduces the study route and the utilized methodology in this paper. Section 4 describes the details of the factor analysis, logit, and Machine Learning approaches and presents the obtained results. Finally, Sect. 5 and 6 demonstrate the differences in results using different modeling methods and present the main conclusions of this study.

Previous Studies
A review of past studies in the field of predicting number and severity of crashes indicates that each of them has examined the relationship between effective parameters in accidents with the severity that are classified into different categories [5]. Most of the previous studies in this area have been conducted in two categories of statistical and artificial neural network approaches. Firstly, many researchers had focused on generating models based on a statistical methods to predict the crash numbers. The significant difference in their conducted researches was between the type of model and the number of parameters or independent variables influencing the severity of crashes. Thus, various models of logit or probit had been utilized according to their proportion. In studies that the severity of accidents is divided into two categories, binary logit or probit models have been necessarily utilized, and in studies with more categories of severity, the multiple logit or probit models have been used. Jason and Shanker [6] studied the impact of the fixed roadside objects on the entire urban state route system in Washington State. The utilized models in this research were multivariate nested logit models of injury severity and the severity of collisions were classified into five categories: property damage, minor injury, moderate injury, severe and fatal collisions. The proposed model showed that the utilization of well-designed leading ends of guardrails decreases the number of fatal accidents. The model also indicated the importance of protecting vehicles from collisions with trees stumps and rigid poles that cause severe injury or death. Yan et al. [7] studied the multiple logistic regression model and Quasi-induced exposure concept for rear-end accidents occurring at signalized intersections. In order to study the characteristics of accidents, parameters related to the road environment, striking and struck role were investigated. The most important factors were influencing these types of accidents included number of lanes, divided/undivided highway, accident time, road surface condition, highway character, urban/rural, and speed limit, vehicle type, driver age, alcohol/drug use and driver residence. Deng et al. [8] investigated the severity of head-on collisions in Connecticut State utilizing a sequential probit model. Their studies showed that the wet surface of the pavement and the time of the collision at night are highly correlated with the severity of the collision, while the increase in the width of the lane decreases the severity of the collisions. Kim et al. [9] investigated the severity of bicycle injuries in bicycle-motor vehicle accidents and the factors affecting it. The utilized multinomial logit model could predict the probability of four categories of collisions severity, including fatal, incapacitating, non-incapacitating, and possible or no injury. The results of their modelling showed that a lot of factors such as a truck involving in a collision, high speed, consuming alcohol by driver or cyclist, the age of over 55 years old for the cyclist, inclement weather and head-on collisions lead to an increase in the severity of injuries leading to death. In a study by Peter Savolainen and Fred Mannering in 2007, modeling was utilized once for single-vehicle crashes and once for multi-vehicle crashes, which Nested logit and standard multinomial logit model were used for modeling [10]. The results showed that the parameters such as age, roadway characteristics, alcohol consumption, helmet use, unsafe speed were the most prominent factors which increase the severity of crashes. Pengfei Liu et al. [11] studied the contributing factors that affect the severity of head-on crashes in North Carolina in United States utilizing mixed logit model. Results of their studies maintained that adverse weather condition, two-way divided road, traffic control, young drivers, and pickups would decrease the injury severity of head-on crashes.
The majority of statistical methods have their assumptions and predefined relations between independent and dependent variables, and if these assumptions are violated, the model will provide incorrect prediction of accidents. Machine learning tool seem to be one of the most reliable and efficient approaches dealing with everyday human challenges with the capability of skipping theoretical assumptions. [12] Machine learning approaches using Artificial Neural Network (ANN) could be utilized in various areas such as environment and business sectors to develop prediction models of ambient temperature, energy production and consumption. [13][14][15] Demirezen et al. proved the competence of artificial neural network (ANN) as a dependable and powerful predicting approach of outdoor temperature with minimum error in two different studies. [16,17] Banan et al. utilized deep learning neural network as a smart and real-time approach to present an automate identification process of fish species [18]. Fan et al. adopted the multilayer perceptron (MLP) together with spatiotemporal model and the long shortterm memory (LSTM) network to make an estimation of temperature distributions during the thermal process. [19] Wu et al. selected ANN to present a rainfall prediction model due to its high efficiency in training large-size samples. [20] Since crashes are directly related to the human lives, the artificial neural network will have widespread application in making major decisions including prediction of the type and severity of collisions and proposing alternatives in order to reduce it, without the requirement for any predefined assumptions and relations, and with higher accuracy than statistical methods [21,22]. Nonlinear relationship between variables can be modelled with various types of ANN in order to recognize the effect of influential factors in an event occurred and predict the future events [23][24][25][26]. Chang utilized two models of artificial neural network and negative binomial regression for analyzing and modeling road crashes. Comparing these two methods, he concluded that the artificial neural network model is a more accurate and influential method for analyzing freeway accidents [27]. Akgungor and Dogan [28] proposed two models to estimate number of accidents, injuries and fatalities by making us of artificial neural networks and nonlinear regression. Their study showed that the artificial neural network model could present the prediction model with the lowest error. They used the acquired results to evaluate the performance of proposed model for the future of road safety programs in Turkey. In another study in 2009 [29], they presented an artificial neural network and genetic algorithm (GA) model to acquire prediction model for the number of fatal and injury accidents in Ankara, Turkey. The results showed that the artificial neural network model have the least error in training and testing data, resulting in a more reliable and better prediction model for crashes comparing to GA model. Cansız [30] modelled the accidents with the help of Smeed equation and ANN to estimate the number of fatalities in accidents. This study proved their model accuracy and competency of dead prediction's numbers.The artificial neural network along with log-normal regression models were utilized in a freeway accidents prediction studied by Bagheri et al. [31]. They considered three-year accident data and parameters such as average daily traffic volume, percentage of heavy vehicle, average speed and pavement condition as input variables. At the end of their study, they proved the ANN model efficiency over log-regression model and concluded that the average speed of vehicles and average daily traffic volume are the most influential factors in freeway accidents. Khair et al. [32] predicted crashes that occurred under Jordanian local conditions, utilizing novel artificial neural network model. They asseted that the estimated collisions based on sufficient data were close to the actual number of crashes and thus considered the proposed model reliable for forecasting number of occurred accidents. Afandizadeh et al. [33] started modeling the role of human factors in collisions utilizing the artificial neural network. In this study, they considered accident-prone violations in the suburban highways to select the effective variables in the model designing process. Afterward, they categorized the collision into three levels of severity, property damage, injury, and death, then different structures were built using the artificial neural network, and eventually the model was validated using new data, and the results of the optimal network parameters showed a high accuracy of the neural network in building the model. E. Contreras et al. [34] utilized a model by using ANN to predict traffic accidents in urban zones of Nuevo León city. In this study Scilab development software was used to validate the maximum sensitivity of intended Neural Network. The satisfactory mean square gradient error of the presented model demonstrated the validation of the prediction model.

Study route and methodology
In this research, the study route (Chaboksar-Lahijan) is a busy road in the north of Iran, which is known as the most accident-prone suburban highway in the Guilan province. The length of this route is 61 km, and due to lots of accesses, commercial and residential land uses in many parts, especially at the city entrances in which urban texture overcomes suburban texture; therefore, highway traffic performance in this route is challenged. This road is categorized as the most traveled highway in this province, so the accidents' frequency and severity analysis are critical to be investigated. Generally, the whole data consists of 1117 accidents which 56 of them have some deficiencies. Eventually, 1061 accidents (956 vehicle accidents and 105 pedestrian accidents) are obtained for analysis.
In this paper, the dependent variable is the different levels of accident severity, which have been divided into three categories of fatal, injury, and property damage only (PDO) accidents. Since the number of fatal accidents is few compared to total accidents and by considering the three levels of dependent variables, the independent variables significance and goodness-of-fit of a model have not been achieved, therefore, in the case of vehicle accidents, fatal accidents has been merged with accidents leading to injury and the dependent variable is divided into two categories. It should be noted that, in many cases, traffic polices consider the injured persons just in accident scenes; however, the injured may die after being transferred to the hospital or on the way of the hospital; so it leads to an inconsistency in the accidents fatalities statistics. Therefore, merging these two categories is practically sensible, and there is no interference in the study's objective, which is understanding the most effective factors on the severity of accidents. Furthermore, for the analysis of pedestrian accident severity, the dependent variable has been divided into two levels of injury and fatal.
Independent variables affecting the severity of accidents have been categorized for both vehicle and pedestrian accidents according to Table 1. The data should be converted to nominal variables to be used in the modeling process; therefore, all variables have become nominal in a way that number 1 indicates the variable intervention in the accident, and zero indicates the variable non-intervention in the accident. After preparation of data and converting the dependent and independent variables into dummy variables, vehicle and pedestrian accidents will be separately modelled and analyzed using factor analysis, logit and machine learning approaches.

Exploratory factor analysis
In studies with large number of variables, researchers are looking to reduce the number of variables and form a new structure for more practical and accurate data analysis. Therefore, factor analysis is used to identify the principal variables in order to explain the correlation pattern between the observed variables. Factor analysis plays a very important role in identifying hidden variables or factors through observed variables.
The results of Kaiser-Meyer-Olkin (KMO) indexes and the Bartlett tests for vehicle and pedestrian accidents are shown in Table 2. Since the KMO index for vehicle accidents in 2018 and pedestrian accidents are less than 0.5, the factor analysis results would not be reliable for these two mentioned cases. Moreover, the significance value of Bartlett's test for all cases is less than 5%, which rejects the assumption of the known correlation matrix.
The eigenvalues and remaining factors in the analysis should be recognized in order to perform factor analysis. The factors with an eigenvalue of less than one should be excluded from the analysis. Table 3 shows the eigenvalues of sum of three years vehicle accidents occurred between 2017 and 2019.
According to Table 3, factors one to six have an eigenvalue more than one and remain in the analysis. Therefore, Table 4 represent rotated component matrix, which contain estimates of the correlations between each of the variables and the estimated components. The higher coefficients in each row represents the more importance of that variable.
According to factor analysis on the 13 variables affecting the vehicle accidents (2017-2019), six factors are recognized as principal factors. The factor analysis shows that collision with, type of collision and the main cause variables are considered as the first factor affecting the severity of accidents. In addition, the variables of the road surface and weather condition are considered as the second factor. Moreover, accident time and lighting conditions are categorized as the third factor. The at-fault vehicle, age of driver and driver's gender are the fourth factor and road geometric characteristic is regarded as the fifth factor. Finally, the season and day of the accident are considered as the sixth factor. In a nutshell, the importance of "collision with, type of collision and main cause" as the first influential factors on increasing the severity of accidents asserts further attention to details of these sub-variables. The frequency analysis of accidents shows the large share of light vehicle, rear-end and side-impact, lack of attention and driving too close to the car in front in total number of accidents. All of these behaviors are the direct result of careless driving and they are extremely dangerous. They may also result in a serious car crash that has a long-lasting influence on innocent people, drivers, pedestrians, and cyclists alike. Therefore, imposing more penalties such as dramatically increase of insurance rates and driving license suspensions for novice drivers would seem reasonable. It is also suggested to alert inattentive drivers of potential danger by implementing pavement warning methods such as alert strips (sleepy bumps) or installing speed humps, especially at the city entrances along this road in which urban texture overcomes suburban texture. Due to the large number of rainy days in this highway, and considering the surface, weather, time and lighting condition as second and third most influential factors, both these mentioned issues indicate the importance of implementing corrective actions such as increasing highway lighting condition and improving pavement surface quality in addition to preventive measures such as installing more VMS (Variable Message Signs) and traffic speed cameras especially in bad weather conditions.

Modeling using logit model
In order to analyze data and obtain a prediction model using logit model, there are three ways to enter variables, including Entering, Backward and Forward approaches. Since all variables are entered simultaneously into the equation in the first method (Enter), this model does not have the opportunity to process data appropriately and extract the most significant variables, so it cannot be a suitable method. Therefore, the Backward and Forward methods are used to enter data into the logit equation. The one with higher accuracy in predicting the number of accidents will be recognized as the superior method. Table 5 summarizes logit models in forward and backward methods. As it is mentioned earlier, prediction accuracy determines the superior model, so the Backward method with a higher percentage of accurate predictions is chosen as the best method for making models in all cases. Besides, Tables 6 indicates the chi-square, degree of freedom (df ), and significance (sig) of the Backward method in the modeling process. Since the significance of two backward models used to predict vehicles and pedestrians' accidents are zero, the capability of the model to predict accidents is confirmed.    Tables 7 and 8 indicate the effective variables on making a prediction model for both vehicles and pedestrian accidents. Since the factor analysis specified collision with, type of collision and the main cause variables as the first factor affecting the severity of accidents, the logit model proved this result. According to Table 7, after "collision with" variables, the most effective variables increasing the severity of vehicle crashes are respectively exceeding  lawful speed, rainy weather, driver age (30-40), driver age (40-50). In addition to implementing corrective and preventive actions mentioned in factor analysis sector about rainy weather condition and exceeding lawful speed due to poor visibility and violation of speed limit, this result asserts the role of drivers at the age between 30 and 50 years on the rise of the severity of accidents. It could be related to the tendency of these drivers to higher speeds considering their more skills at this age range. It clearly shows that the new drivers and also the more experienced drivers with age of older than 50 are more cautious in driving. Therefore, government should provide more applicable education by focusing on this age group to warn them about careless driving behaviors. According to the variables' coefficients for pedestrian accidents (Table 8) through the logit model, the three most effective variables influencing the severity of pedestrian crashes are respectively rainy weather, heavy vehicle and lighting condition. The repetition of rainy weather and lighting condition as effective variables on the severity of pedestrians accident, maintains the significant role of these factors.

Modeling using artificial neural network
Several types of neural networks can be used to make an artificial neural network prediction model. Considering that the qualitative data used in this study, a neural network with pattern recognition capability is used to make the prediction model. Pattern recognition is an important component of neural network applications in computer vision, radar processing, speech recognition, and text classification. It works by classifying input data into objects or classes based on key features, using either supervised or unsupervised classification. The input attributes and output labels used in the machine learning approach are the same as the mentioned variables in Table 1. It is worth mentioning that, as it is clarified in "study route and methodology" section, the dependent variable (output class) is the different levels of accident severity. It has been divided into two categories of fatal/injury, and property damage only (PDO) for vehicle accidents, and two levels of injury and fatal for pedestrian accidents. Then it is time to build a neural network by software. In this study, the used ANN is an application of an existing algorithm. The neural network's input data is divided into three categories: • Training: These are presented to the network during training for learning process, and the network is adjusted according to its error. • Validation: These are used to measure network generalization, and to halt training when generalization stops improving. • Testing: These have no effect on training and so provide an independent measure of network performance during and after training. In other words, it is the main criterion to realize how much the neural network's findings are similar to the actual result.
The details of the accident data entry to the software and its Mean Squared Error and Percent Error are shown in Table 9. Since the number of occurred accidents separately between 2017 and 2019, as well as the sum of three years, is sufficient for the network training process, 70% of the data is used for network training and 15% of the data is used for validation process and the remaining 15% is considered as a test of the built network. Furthermore, due to fewer pedestrian occurred accidents during three years, 80% of data is used for network training, 10% as validation, and the remaining 10% as testing. Figure 1 indicates the confusion matrix of three modes of training, testing and validation of the created neural network of vehicle-pedestrian accidents. This matrix helps to show the accuracy of the network in the prediction of accidents (PDO, injury and fatal). The squares (1.1) and (2.2) indicated in green squares are the cases which correctly predicted by the network and the squares (1, 2) and (2, 1) indicated in red squares are the cases which present an false prediction of the network. Finally, the blue square shows the total predictive power of the network. As an illustration, Fig. 1d demonstrates the confusion matrix of vehicle accidents for sum of three years. According to the this matrix, which represents the result of the three processes of training, validation and testing of the network, out of 514 property-damage accidents, 452 cases, and out of 442 injury/fatal accidents, 341 cases are predicted correctly by the model. The prediction accuracy of property-damage accidents in the model is 87.9% and the prediction accuracy of injury/fatal accidents is 77.1%. For a more accurate explanation of the squares of the matrix, the square (1.1) indicates that 452 accidents are correctly predicted as PDO and square (1.2) denotes  that 101 accidents leading to fatality or injury are wrongly predicted as PDO. In addition, square (2.1) indicates that 62 fatal or injury accidents are also mistakenly predicted as PDO and square (2.2) suggests that 341 accidents are correctly predicted as fatal or injury accidents. Finally, the blue square represents the overall vehicle accidents predictive power of the network is 82.9%. Figure 2 indicates the performance of neural network training process of vehicle and pedestrian accidents. The indicated circle on these figures shows that since that point on, the answers do not improve and after repeating the process to a given value, which has specified in the horizontal axis, the training process has stopped. The mentioned point is the compromise point with specific mean squared error indicates the best point for the completion of the calculation and the creation of an artificial neural network for the given data.

Sensitivity and specificity analysis of the neural network for the given accident data
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (sensitivity) versus the false positive rate as the threshold is varied. A perfect test would show points in the upper-left corner, with 100% sensitivity and 100% specificity. Figure 3 analyzes the sensitivity of the network to the correct prediction for vehicle-pedestrian accidents. Class 1 on the diagram indicates network accuracy for existing accidents and class 2 indicates the accuracy of network prediction for future accidents. As the curve goes more upper-left corner, the network is more powerful to predict and estimate correctly the answers which perform very well in this network for both passenger and vehicle accidents.

Discussion on the differences in results using different modeling methods
Using factor analysis and logit model simultaneously could contribute to achieving the most comprehensive and efficient model to specify the major contributing factors and their effects on accidents. This being the case, using these results together with the ANN approach as a strong predictive solution provide officials with suggestions to take effective measures to lessen accident impacts and improve road safety. According to Fig. 4, considering two types of accidents leading to damage and the ones leading to death or injury in the vehicle accidents and the accidents leading to death or injury in pedestrian accidents, the artificial neural network modelling has higher accuracy than the statistical methods such as Logit. Therefore, the machine learning model's competence in accident modelling and prediction has been proved as one of the metaheuristic methods compared to statistical methods.

Conclusions
In this study, data of accidents in one of the suburban and most accident-prone highways of Guilan province in the north of Iran were collected from 2017 to 2019, including both vehicle and pedestrian accidents. Firstly, the factor analysis was utilized to obtain the classification of the most significant factors affecting the severity of accidents. The combined result of factor analysis and logit model proved the substantial roles of exceeding lawful speed, rainy weather, driver age (30-50) variables in the severity of vehicle accidents. The repetition of rainy weather and lighting condition as influential variables on the severity of pedestrians accident asserted the significant roles of these factors in the whole accident numbers. Thus, the officials should pay more attention to corrective measures such as increasing highway lighting condition and improving pavement surface quality in addition to preventive measures such as installing more VMS (Variable Message Signs) and traffic speed cameras, especially in bad weather conditions. Finally, machine learning was used to build a prediction model of accidents. Comparing the accuracy of logit and the utilized ANN model in this study, the results showed that the machine learning as a metaheuristic approach could lead to better prediction power, particularly in pedestrians' accident in all cases. By using this approach, the effect of corrective measures in reducing the number and severity of accidents would be predictable with high precision. For future studies, it is suggested that other significant data relevant to accidents, including pedestrian clothing color, traffic characteristics of the road, and geographic coordinates of the accident location, would be considered for analyzing the accidents. Implementing these valuable data by Geographical Information System alongside statistical analysis and machinelearning approaches could provide valuable information for a more precise and comprehensive analysis of accident severity.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.

Consent for publication
The authors declare that the contents of this article have not been published previously. All the authors have contributed to the work described, read and approved the contents for publication in this journal.
Human or animal rights All the authors have been certified by their respective organizations for human subject research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.