Real-time crash prediction on freeways using data mining and emerging techniques

Recent advances in intelligent transportation systems allow traffic safety studies to extend from historic data-based analyses to real-time applications. This study presents a new method to predict crash likelihood with traffic data collected by discrete loop detectors as well as web-crawl weather data. The matched case–control method and the support vector machines (SVMs) technique were employed to identify the risk status. The adaptive synthetic over-sampling technique was applied to address the imbalanced dataset issue. The random forest technique was applied to select the contributing factors and avoid over-fitting. The results indicate that the SVMs classifier could successfully classify 76.32% of the crashes on the test dataset and 87.52% of the crashes on the overall dataset, which is relatively satisfactory compared with the results of previous studies. Compared with the SVMs classifier without the weather data, the SVMs classifier with the web-crawl weather data increased the crash prediction accuracy by 1.32% and decreased the false alarm rate by 1.72%, showing the potential value of the massive web weather data. The mean impact value method was employed to evaluate the variable effects, and the results are consistent with those of most previous studies. The emerging technique based on the discrete traffic data and web weather data proves to be more applicable to real-time safety management on freeways.


Introduction
In recent years, dynamic safety management systems for freeways have been emerging. There is a growing trend to investigate the relationship between crash mechanism and traffic operating characteristics such as traffic state, road environment and weather condition. Several data mining techniques were utilized to integrate historical operating data on freeways into crash risk prediction models. Lee et al. [1] proposed the concept 'crash precursor' and developed an aggregate log-linear model with crash data of the upstream detectors, showing that high-speed variation and high traffic density play a key role in the crash occurrence. Various traffic datasets from different traffic surveillance systems were obtained to estimate the crash likelihood, such as dual loop detectors [1][2][3][4][5], single loop detector [6][7][8] and automatic vehicle identification system [9][10][11].
Although previous models have been proven capable of predicting crash likelihood in order to proactively improve traffic safety on freeways, different modeling techniques result in different prediction accuracies. For instance, Hossain et al. [4] developed a Bayesian belief network for real-time crash prediction and achieved a crash prediction accuracy of 66% with a false alarm rate below 20%. Ahmed et al. [9] employed a Bayesian updating approach and increased the crash prediction accuracy to 72%, but with a relatively high false alarm rate of 42.01%. Xu et al. [12] utilized the Bayesian inference method, and the developed model achieved 36.8% crash prediction accuracy with a low false alarm rate of 5%. Moreover, these models show various limitations. Traditional generalized linear models such as the logistic regression model [2,5,7,13] can evaluate each contributing factor effectively and efficiently. Even so, some studies [14] find that a generalized linear model-based approach may lead to biased estimates when the independent variables demonstrate strong nonlinear features. The commonly used techniques in predicting real-time crash likelihood are neural networks (NN) [3,15,16] and support vector machines (SVMs) [17,18].
However, NN models work as a black box, and this strategy may raise over-fitting and local extremum issues [17]. Furthermore, the traditional method of selecting samples in case-control studies often applies a crash/non-crash ratio of 1:4, which creates an imbalanced dataset. Mujalli et al. [19] found that data mining algorithms tend to produce lower prediction accuracy on the minority class in an imbalanced dataset. Meanwhile, the SVMs approach works better than NN when dealing with small sample sizes [17]. SVMs address over-fitting by introducing a kernel function and seek the global optimal solution by solving a convex optimization problem. Despite the convincing results obtained by utilizing SVMs to evaluate real-time crash risk [17], SVMs also have difficulty in dealing with imbalanced datasets, and additional efforts should be made to optimize the function parameters of the model and preprocess the raw data to achieve a higher prediction accuracy. In general, because they require high-quality traffic flow data, the majority of existing models cannot be applied in other regions where limited detectors or surveillance devices are installed on the freeways, even though the transferability of the models has been validated [8]. Moreover, without consideration of human factors and traffic patterns, the validity of the models cannot be verified.
In China, traffic flow detectors are generally not as common as in the USA and Europe. The detectors or surveillance devices are often installed in road sections with frequent congestion or between two interchanges. Because the potential value of the data collected by these devices has been neglected, the limited data have not been fully utilized. A previous study conducted by You et al. [20] shows that it is feasible to predict crashes in real time with discrete ultrasonic detectors; it achieved a crash prediction accuracy of 61.9%.
Due to the limited data sources in many developing regions, proactive safety management cannot be implemented to alleviate crash risk. Multiple data sources of crash occurrence and traffic flow have shown promising effects on dynamic real-time crash prediction. Recent studies show that it is practicable to evaluate traffic incidents by data mining of social media data on the web [21,22] or mobile phone usage data [23]. Schulz et al. [21] proposed a supervised learning technique and trained SVMs models for specific event types. Results indicated that the method could detect multiple labels with a match rate of 84.35%. With the rapid development of information technology, such diverse and complex datasets provide a significant opportunity for researchers to investigate the crash mechanism. However, these studies mostly focused on incident detection and evaluation after occurrence, and seldom investigated proactive safety countermeasures to optimize the traffic condition prior to the incidents.
Weather significantly affects road safety, especially bad weather conditions such as snow and heavy fog. However, it is common in China that real-time weather information is not available in the Department of Traffic Management due to the lack of weather detectors installed on the freeways. This work crawls historical weather data from the Internet with a web-crawling method. Once the method proves to be valid and helpful for predicting crash risk, real-time dynamic weather data could be crawled from the website of the weather institute or department and utilized to evaluate the real-time crash risk on certain freeway segments by developing a real-time traffic management system.
The objective of this paper is to develop a comprehensive real-time crash prediction model based on the data mining and emerging techniques. This paper includes five sections. Section 2 discusses the data preparation and the web-crawling process. Section 3 explains the methods including the over-sampling technique, the SVMs modeling technique and Random Forest technique. The estimation results and further discussion are presented in Sect. 4.

Data collection and preparation
The test area in this study is a part of the mainline of the G60 Freeway in Shanghai, China. The total length of the road segment is 48.7 km with 6-10 lanes (3-5 lanes in each direction). As factors causing crashes mainly include human factors, vehicle factors, road geometric factors, traffic factors and weather factors, it is necessary to obtain as much related data as possible to explore the crash mechanism. However, human factors and vehicle factors cannot be detected in real time due to the lack of a data feedback mechanism. Road geometric data can be obtained from a geographic information system such as Google Earth with a brief description. As the alignment of the study area is relatively flat, the descriptive data were of little use in this study. Historical traffic data and crash data were provided by the Department of Freeway Operation and Management. The historical weather data were collected from massive web data by a web data crawling technique.

Traffic data
The primary traffic dataset includes data of nine pairs of loop detectors along the G60 Freeway. Five pairs are located on the mainline (as shown in Fig. 1), and four pairs are installed on the ramps. The average distance between the loop detectors on the mainline is approximately 6.6 km. Several traffic flow characteristics associated with crash occurrence were collected by the loop detectors, such as vehicle type, vehicle speed for corresponding vehicle type and vehicle occupancy every 20 seconds on each lane. The database also stores the device working state, data validity and timing record, etc.
The next step in data preparation is data aggregation. Considering the random noise issue, Ahmed et al. [9] recommended aggregating the raw traffic data to the 5-min level. In this study, the extracted raw data were selected 5-10 min prior to the crash occurrence time in order to avoid confusing pre- and post-crash conditions. The original 20-s traffic data, namely flow ($Q_n$), speed ($V_n$) and occupancy ($C_n$) on each lane ($L_n$), are aggregated to the 5-min level.
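The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the function name `aggregate_window`, and the choice to sum flow while averaging speed and occupancy are assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def aggregate_window(records, crash_time, start_min=10, end_min=5):
    """Collapse 20-s records from the 5-10 min window before a crash
    into one 5-min observation for a single lane.

    records: iterable of (timestamp, flow, speed, occupancy) tuples
             (hypothetical layout, assumed for this sketch).
    Returns None when the window contains no record.
    """
    lo = crash_time - timedelta(minutes=start_min)
    hi = crash_time - timedelta(minutes=end_min)
    # Keep only records inside [crash - 10 min, crash - 5 min).
    window = [r for r in records if lo <= r[0] < hi]
    if not window:
        return None
    return {
        "flow": sum(r[1] for r in window),        # vehicles counted in 5 min
        "speed": mean(r[2] for r in window),      # average speed
        "occupancy": mean(r[3] for r in window),  # average occupancy
        "n_records": len(window),                 # 15 when no record is missing
    }
```

A full 5-min window holds 15 records of 20 s each, so `n_records` gives a quick check for missing detector data.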

Crash data
The target crash dataset includes 913 crashes that occurred in the study area between January 2014 and September 2015. Only rear-end crashes and sideswipe crashes (the total number is 551) were utilized in this study.

Weather data
In this study, web data crawling was carried out with a Python script developed in the PyCharm software. Web crawling can retrieve data faster and in greater depth than manual collection. Many historical weather data pages are hidden in the deep or invisible web. These pages are typically only accessible by submitting dynamic queries for a certain region and a certain time to a background database. The data were extracted with regular expressions by parsing the structured HTML pages. The extracted data columns include time, region, temperature and weather. As the obtained weather data record text information such as sunny, cloudy, rainy and snowy with different degrees, effect coding was applied to allow for nonlinear effects in the levels of attributes. For instance, sunny weather was coded as 0 and cloudy weather was coded as 1. Moreover, the weather data extracted by the web crawlers were stored in a MySQL database as a separate dataset for the following data emerging procedure.
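The parsing and coding step might look like the sketch below. The HTML row layout, the regular expression, and the function name `parse_weather_rows` are hypothetical, since the paper does not describe the page structure. Only the codes for sunny (0) and cloudy (1) come from the text; the codes for rainy and snowy are placeholders assumed for illustration.

```python
import re

# Hypothetical row layout: <tr><td>time</td><td>region</td><td>temp</td><td>weather</td></tr>
ROW_RE = re.compile(
    r"<tr>\s*<td>(?P<time>[^<]+)</td>\s*<td>(?P<region>[^<]+)</td>\s*"
    r"<td>(?P<temp>-?\d+(?:\.\d+)?)</td>\s*<td>(?P<weather>[^<]+)</td>\s*</tr>"
)

# sunny=0 and cloudy=1 follow the paper; rainy and snowy are assumed values.
WEATHER_CODE = {"sunny": 0, "cloudy": 1, "rainy": 2, "snowy": 3}


def parse_weather_rows(html):
    """Extract (time, region, temperature, weather) rows from an HTML table
    and attach a numeric weather code (None for labels not in the mapping)."""
    rows = []
    for m in ROW_RE.finditer(html):
        label = m.group("weather").strip().lower()
        rows.append({
            "time": m.group("time").strip(),
            "region": m.group("region").strip(),
            "temperature": float(m.group("temp")),
            "weather": label,
            "code": WEATHER_CODE.get(label),
        })
    return rows
```

Returning `None` for unknown labels keeps unexpected weather descriptions visible instead of silently mapping them to a valid class.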

Matched case-control method and data filtering
To eliminate seasonal and day-of-week factors, the matched case-control method was utilized to avoid possible bias resulting from dissimilar traffic patterns on different days of the month and the week. A 4:1 control-case ratio was recommended by an existing study [24]. For each specific crash case, four non-crash samples were selected at the recorded crash time on four other days: 14 days before, 7 days before, 7 days after and 14 days after the crash. Control samples with invalid traffic data were removed from the control dataset.

Fig. 1 Locations of the loop detectors on the mainline of G60 Freeway in Shanghai
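The control selection rule can be illustrated with a short sketch. The function names and the `is_valid` callback, which stands in for the check on detector data validity, are assumptions introduced for this example.

```python
from datetime import datetime, timedelta

# Offsets taken from the text: 14 and 7 days before, 7 and 14 days after.
CONTROL_OFFSETS_DAYS = (-14, -7, 7, 14)


def control_times(crash_time):
    """Return the four candidate control timestamps for one crash.
    Multiples of 7 days keep the weekday and the time of day unchanged."""
    return [crash_time + timedelta(days=d) for d in CONTROL_OFFSETS_DAYS]


def select_controls(crash_time, is_valid):
    """Keep only candidates whose traffic data pass the validity check.
    is_valid: hypothetical callable, timestamp -> bool."""
    return [t for t in control_times(crash_time) if is_valid(t)]
```

Because invalid candidates are dropped rather than replaced, a crash can end up with fewer than four controls, which is in line with the final control dataset being slightly smaller than four times the crash dataset.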
Because the loop detectors are discrete, traffic data from the freeway segment between one on-ramp and the next off-ramp were utilized to predict the crashes, based on the hypothesis that the crash potential is closely related to the traffic condition of that segment. Several variables were collected by the loop detectors and may be relevant to the model. A pre-analysis was performed to minimize the number of potential explanatory variables. The variable definitions and formulations are listed in Table 1.
A few data filtering rules were applied to avoid possible bias. 'No data' or 'invalid data' are defined if there is no loop detector on certain segments of the G60 Freeway, or if the loop detectors fail. The dataset consists of the traffic flow data corresponding to each crash record. To summarize, the final crash dataset includes 138 observations and the final control dataset includes 549 non-crash samples. The statistics of the variables are listed in Table 2.

Methodology and modeling technique

Over-sampling technique
Data mining algorithms often find it difficult to deal with imbalanced datasets. Under-sampling and over-sampling are data analysis techniques used to adjust the class distribution. As under-sampling often leads to the loss of potential information in the samples, over-sampling was utilized in this study to create artificial samples. The adaptive synthetic sampling technique (ADASYN) is one of the commonly used over-sampling techniques. It uses a weighted distribution for different minority class samples according to their level of difficulty in learning. More synthetic data are generated for minority class samples that are harder to learn, thus reducing the bias introduced by the imbalanced data distribution [25]. After over-sampling, the final crash dataset includes 537 samples, and the crash/non-crash ratio approaches 1:1.
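To make the weighting idea concrete, the following is a compact pure-Python sketch of ADASYN. It is an illustration under stated assumptions, not the implementation used in the study: the function name `adasyn`, the default `k=5`, the balance level `beta=1.0`, and the use of Euclidean distance are all choices made for this example.

```python
import math
import random


def adasyn(minority, majority, k=5, beta=1.0, seed=0):
    """Generate synthetic minority samples, giving more of them to minority
    points whose neighbourhoods are dominated by the majority class."""
    rng = random.Random(seed)
    # Minority points come first, so index i is the same in both lists.
    data = [(x, 1) for x in minority] + [(x, 0) for x in majority]
    n_new = int((len(majority) - len(minority)) * beta)

    # Difficulty ratio r_i: share of majority points among the k nearest neighbours.
    ratios = []
    for i, x in enumerate(minority):
        others = [(math.dist(x, p), lab) for j, (p, lab) in enumerate(data) if j != i]
        others.sort(key=lambda t: t[0])
        ratios.append(sum(1 for _, lab in others[:k] if lab == 0) / k)

    total = sum(ratios)
    if total == 0 or n_new <= 0:
        return []

    synthetic = []
    for i, x in enumerate(minority):
        g_i = round(ratios[i] / total * n_new)  # samples allotted to x_i
        nbrs = sorted((p for j, p in enumerate(minority) if j != i),
                      key=lambda p: math.dist(x, p))[:k]
        if not nbrs:
            continue
        for _ in range(g_i):
            z = rng.choice(nbrs)   # random minority neighbour
            lam = rng.random()     # interpolation weight in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic
```

Each synthetic point lies on the segment between a minority sample and one of its minority neighbours, so no sample is placed outside the region already covered by observed crashes. Because of rounding, the number of generated samples can differ slightly from the target.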

Support vector machines
The support vector machines (SVMs) modeling technique has been widely applied in machine learning tasks such as text classification, image recognition and voice recognition. The method can often be employed for high-dimensional data and nonlinear problems. SVMs models have also been employed in several areas of transportation, such as traffic flow prediction, incident detection and crash frequency studies. Based on structural risk minimization theory, SVMs generate an optimal classification hyperplane, which maximizes the margin between the hyperplane and the nearest samples of each class and sets an equal margin on both sides. SVMs attempt to achieve the global optimal solution, which helps the classifier obtain better generalization ability. Meanwhile, SVMs outperform other machine learning techniques when dealing with small sample sizes. The C-SVC (C-Support Vector Classification) model was employed in this study with the penalty coefficient $C$, and the RBF (Radial Basis Function) kernel was utilized to deal with the high-dimensional variables. The RBF kernel takes the form

$$K(x_i, x) = \exp\left(-\gamma \lVert x_i - x \rVert^2\right)$$

where $\gamma$ denotes the width parameter, $x_i$ a point in the space, and $x$ the central point.
The decision function based on the RBF kernel can be written as

$$f(x) = \operatorname{sgn}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right)$$

where $y_i$ denotes the class label, $\alpha_i$ the Lagrange multiplier, and $b$ the intercept of the hyperplane. The grid search method was employed to select the optimized parameters $(C, \gamma)$ for the decision function. The modeling process was based on the LIBSVM tool developed by Chang and Lin [26] in the MATLAB software.
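As an illustration of the RBF kernel, the resulting decision function, and the candidate grid for the two tuning parameters, consider the sketch below. It does not train a model: the support vectors, multipliers and intercept are assumed to come from a fitted classifier, and the exponent ranges in `parameter_grid` are a common convention assumed here rather than the values used in the study.

```python
import math


def rbf_kernel(x_i, x, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x_i, x)))


def decision(x, support_vectors, labels, alphas, b, gamma):
    """Sign of the kernel expansion over the support vectors plus intercept.
    Returns +1 (crash) or -1 (non-crash)."""
    score = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1


def parameter_grid(c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs on an exponential grid (assumed ranges)."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exps for g in g_exps]
```

In a grid search, each pair from `parameter_grid()` would be scored by cross-validation on the training data, and the pair with the best score would be used to fit the final classifier.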

Random forest
Despite these advantages, SVMs models lack the capability of identifying the contributing factors, and using all the variables as input makes the estimation inefficient. As suggested by Yu et al. [17], a variable selection procedure is needed prior to the SVMs estimation. Variable selection also helps to address over-fitting. Hence, random forest was employed to select the contributing factors, as it is well known for selecting significant contributing variables from a set of factors [27]. In a random forest, every tree is built with a subset of the factors and grows from a bootstrap sample of the cases. The cases that are not drawn into the bootstrap sample are not used in the development of that tree and are called Out-Of-Bag (OOB) data. The OOB data are then used to validate the built trees, which provides an unbiased error estimate as well as estimates of variable importance. To test whether the number of trees is sufficient to reach relatively stable results, the OOB error rate is plotted against the number of trees. The optimal number of trees is the one at which the OOB error rate reaches its minimum and remains nearly constant. A MATLAB wrapper interface to the C code used in the R package randomForest [28] was employed to select the contributing factors. The tool provides the 'mean decrease in Gini index' method to select contributing variables. A higher magnitude implies a higher variable importance. Hassan et al. [27] chose several variables with higher scores than the remaining variables (approximately 50% of the total score of all variables). In this paper, the variables that score higher than the mean score of all the variables were selected for the modeling process.
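Two pieces of this procedure are simple enough to sketch directly: how a bootstrap sample leaves some cases out of the bag, and how variables are kept when their importance exceeds the mean. The function names are hypothetical, and the importance scores would in practice come from the fitted forest.

```python
import random


def bootstrap_split(n_cases, rng):
    """Draw n_cases indices with replacement; indices never drawn are OOB."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    oob = [i for i in range(n_cases) if i not in drawn]
    return in_bag, oob


def select_above_mean(importance):
    """Keep variables whose importance exceeds the mean over all variables,
    ordered from most to least important.

    importance: dict of variable name -> mean decrease in Gini index.
    """
    threshold = sum(importance.values()) / len(importance)
    kept = [name for name, score in importance.items() if score > threshold]
    return sorted(kept, key=lambda name: -importance[name])
```

With a sample of the size used here, roughly a third of the cases are expected to fall out of the bag for each tree, which is what makes the OOB error a usable validation estimate without a separate hold-out set.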

Results and conclusions

Contributing factors by random forest
The random forest technique was employed to select the contributing factors before the first SVMs modeling process. The results are shown in Fig. 2. Figure 2b indicates that the optimal number of trees is approximately 330, at which the OOB error rate reaches its minimum and stays nearly constant at 0.16. From Fig. 2a, six key factors ($Q$, $C_M$, $Q_D$, etc.) can be selected by the mean magnitude value; these are utilized in the subsequent SVMs modeling process.

SVMs classifier performance
In the SVMs modeling process, the full sample was randomly divided into training data and test data with a ratio of 7:3. Three indexes were employed to evaluate the classifier performance: accuracy, True Positive Ratio (TPR) and False Positive Ratio (FPR). Additionally, the area under the Receiver Operating Characteristic (ROC) curve (AUC) was utilized to evaluate the classifier performance. The larger the AUC value, the better the classifier performs.
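The four performance measures can be computed as in the following sketch. Labels are assumed to be coded 1 for crash and 0 for non-crash, and the AUC is computed through its rank interpretation (the probability that a randomly chosen crash sample receives a higher score than a randomly chosen non-crash sample), which is one of several equivalent ways to obtain it.

```python
def confusion_rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for binary labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn)  # share of crashes correctly flagged
    fpr = fp / (fp + tn)  # share of non-crashes raising a false alarm
    return accuracy, tpr, fpr


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The TPR corresponds to the crash prediction accuracy and the FPR to the false alarm rate reported in earlier studies.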
To evaluate the potential value of the web-crawl weather data, a comparative study was conducted. First, the modeling process was conducted with the selected variables including the weather data. Next, the modeling process was conducted without the weather data. The SVMs classifier performances of both methods are listed in Tables 3 and 4.
The results in Table 3 show that the SVMs approach performs satisfactorily: the AUC value is relatively high and the TPR value on the overall dataset is 87.52%. From the results on the test dataset, it can be concluded that, despite the limited data from discrete loop detectors, a workable approach for real-time crash prediction can be implemented with the SVMs modeling technique. The TPR value is 76.32%, which indicates that about three-quarters of the crashes can be detected, with an FPR value of 33.91%. The results are relatively satisfactory when compared with the results of previous studies, several of which are listed in Table 5. The results show the potential value of traffic data from discrete loop detectors.
The results in Tables 3 and 4 show that the SVMs classifier with web-crawl weather data performs relatively better than the one without it. On the test dataset, the TPR value increased by 1.32% and the FPR value decreased by 1.72% after the weather parameter was added to the SVMs model. The AUC values lead to the same conclusion. In general, the weather factors have a certain impact on the safety performance of the G60 Freeway. This illustrates the potential value of the web-crawl data as well. To implement the technique, massive real-time weather data can be crawled from the website of the meteorological bureau. Thus, a comprehensive analysis of the real-time crash risk on certain segments can be done for proactive safety management based on the emerging techniques of the traffic data and weather data.

Importance analysis for variable effects
SVMs have been criticized as a black-box technique because the variable effects cannot be evaluated directly. To unveil the variable effects, the mean impact value (MIV) method was employed, which has been widely used with neural networks to evaluate the relative effects of the variables. The method calculates the change in predicted probability caused by each variable by increasing and decreasing the value of that variable by 10%. The results are listed in Table 6. From Table 6, it can be concluded that lower values of $Q$, $C_M$ and $Q_{DL}$ are likely to increase the crash risk, and higher values of $Q_D$, $V_D$ and $W_{ea}$ lead to higher crash risk as well. It seems that crash risk is more associated with the flow of the corresponding segment. In a real traffic operation environment, drivers tend to be more aggressive when traffic is smooth, thus leading to higher rear-end or sideswipe crash risk. Meanwhile, several other conclusions can be drawn, for example that the crash risk increases when the deviation of flow and speed increases. These conclusions are consistent with the results of most previous studies.
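The MIV calculation can be sketched as follows. The function name and the `predict` callback, which stands in for the trained classifier's probability output, are assumptions made for this example; the 10% perturbation follows the text.

```python
def mean_impact_values(predict, samples, names, delta=0.10):
    """Mean impact value of each variable.

    predict: hypothetical callable mapping one feature list to a crash
             probability (stands in for the trained classifier).
    samples: list of feature lists, ordered as in names.
    For each variable, every sample is evaluated with that variable raised
    and lowered by delta; the mean difference in output is the MIV.
    """
    miv = {}
    for j, name in enumerate(names):
        diffs = []
        for x in samples:
            up = list(x)
            up[j] = x[j] * (1 + delta)
            down = list(x)
            down[j] = x[j] * (1 - delta)
            diffs.append(predict(up) - predict(down))
        miv[name] = sum(diffs) / len(diffs)
    return miv
```

The sign of each value indicates whether raising the variable raises or lowers the predicted crash probability, and the magnitude indicates how strongly the model responds to that variable.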
This study presents a comprehensive SVMs model built on traffic data collected by discrete loop detectors and web-crawl weather data. It is common in China that real-time weather information is not available in the Department of Traffic Management due to the lack of weather detectors installed on the freeways. A new method was proposed to crawl real-time weather data from the Internet, where massive weather forecasting data and real-time weather data are available. Once a potential crash segment is identified in real time, traffic guidance measures may be implemented to warn drivers of the potential risk ahead.