1 Introduction

The safe driving of vehicles has always been an important research direction in the field of transportation. There are about 8 million traffic accidents every year, causing about 7 million injuries and about 1.3 million deaths, and traffic problems reduce global gross domestic product by about 2% [1, 2]. The annual cost of personal automobile transportation (excluding commercial and public transportation) in the United States is about 3 trillion US dollars, of which about 40% comes from parking, vehicle collisions, and similar costs [2, 3]. Vehicle collision prediction is therefore an important topic in the field of traffic safety.

Traditional vehicle collision prediction mainly relies on equipment carried by the vehicle itself, typically millimeter-wave radar, sensors, and cameras. This equipment perceives and recognizes objects around the vehicle: the collected information about surrounding objects is fed into the vehicle's own algorithms, which judge whether the vehicle is in an emergency state [4]. Because the traditional method issues warnings based only on information collected by the single vehicle itself, it has certain limitations. In bad weather or harsh environmental conditions, the vehicle-mounted sensors may collect information containing errors that deviate from the real situation, and such deviations are often unacceptable in real traffic scenarios.

The Internet of Vehicles provides a new direction for the development of automotive technology by integrating global positioning system technology, vehicle-to-vehicle communication, wireless communication, and remote sensing [5]. Some scholars have already studied vehicle collision prediction based on the Internet of Vehicles. Gumaste et al. [6] used V2V (vehicle-to-vehicle) technology and GPS positioning to predict a vehicle's potential collision position, generate the vehicle collision area, and design a collision avoidance system that controls the vehicle's movement to avoid collision. Sengupta et al. [7] proposed a cooperative collision avoidance system based on the acquired pose information of the ego vehicle and neighboring vehicles, which uses time to collision and collision distance to determine whether a collision will occur. Yang Lan et al. [8] constructed a highway collision warning model based on a vehicle-road collaboration environment; simulation results show that the model can effectively warn of rear-end and side collision accidents. X. H. Xiang et al. [9] used DSRC (Dedicated Short Range Communication) technology to establish a neural-network-based collision prediction model that addresses the high false alarm rate of rear-end collision systems and invalid early warnings in emergency situations. C. M. Huang et al. [10] proposed an ACCW (advanced vehicle collision warning) algorithm to correct errors caused by speed and direction changes; the results show that the ACCW algorithm achieves higher warning accuracy at intersections and on curved roads.

By analyzing existing vehicle collision prediction models, we propose an active collision prediction model based on the Internet of Vehicles. The model combines SMOTE (Synthetic Minority Oversampling Technique) with LightGBM (a highly efficient gradient boosting decision tree) and uses big data computation on a cloud computing platform to predict whether vehicles may collide and, if so, when the collision will occur. If a collision is predicted, an early warning signal is proactively sent to the vehicles involved.

2 Background

2.1 Internet of Vehicles Platform Architecture

The Internet of Vehicles platform [11] mainly includes the OBU (onboard unit) and the mobile communication network. Vehicles are required to be able to broadcast and receive V2N (Vehicle to Network) messages, that is, each vehicle communicates with the cloud computing server, as shown in Fig. 1.

Fig. 1. Schematic diagram of the communication network based on the Internet of Vehicles

The OBU is carried by the vehicle and is equipped with a mobile communication network interface. The communication network base stations provide wide network coverage and ensure communication between the vehicle and the cloud computing server. At the same time, the vehicle-mounted OBU can connect to surrounding vehicles that also carry an OBU. Each vehicle-mounted OBU has a unique electronic tag, so each vehicle can receive early warning information directly. Vehicle information is uploaded in real time to the database module of the cloud computing server, where the data is processed and calculated, and the processed information is fed back to the vehicle in real time.

2.2 Task Description

On the cloud computing server, real-time information from a large number of vehicles is obtained through the Internet of Vehicles in order to identify whether a vehicle will have a collision and to predict when the collision will occur. The prediction model is therefore divided into two layers: the first layer predicts the vehicle collision state, and the second layer predicts the precise time of the vehicle collision on the basis of the first.

The vehicle prediction model in our research mainly predicts the vehicle collision state and collision time from the large amount of vehicle information obtained by the cloud computing server, and the proposed model is then verified. After the prediction is completed, the warning signal is transmitted to the vehicle in advance through the communication network; this transmission is not the main focus of our research.

3 Methodology

3.1 Sampling

The problem of class imbalance often leads to large deviations in model training results. Therefore, when the numbers of samples in the positive and negative classes differ greatly, sampling techniques are generally used to add samples to, or delete samples from, the original data to build a new data set. Doing so makes the training results of the model more stable.

SMOTE.

The SMOTE algorithm [12, 13] generates new samples by random linear interpolation between each minority-class sample and its neighbors, in order to balance the data set. The algorithm works as follows:

  1. 1)

    For each minority-class sample Xi (i = 1, 2, 3, …, n), find its M nearest minority-class neighbors (Y1, Y2, Y3, …, Ym) according to the Euclidean distance.

  2. 2)

    Several samples are randomly selected from the M nearest neighbors, and random linear interpolation is performed between each selected sample Yj and the original sample Xi to generate a new sample Snew. The interpolation is shown in Eq. (1), where rand(0,1) denotes a random number in the interval (0,1).

    $$ S_{new} = X_{i} + {\text{rand}}\left( {0,1} \right) \cdot \left( {Y_{j} - X_{i} } \right) $$
    (1)
  3. 3)

    Add the newly generated samples to the original data set.

The SMOTE algorithm is an improvement over random oversampling: it is simple and effective, and it mitigates the over-fitting caused by simply duplicating minority samples.
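As a concrete illustration, the interpolation rule of Eq. (1) can be sketched in a few lines of Python. This is a from-scratch sketch for clarity, not the implementation used in the paper; in practice a library such as imbalanced-learn provides SMOTE directly.

```python
import numpy as np

def smote(minority, n_new, m=5, seed=0):
    """Generate n_new synthetic samples from a minority-class array
    (shape: n_samples x n_features) by linear interpolation, Eq. (1)."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(n)                      # pick a minority sample X_i
        x = minority[i]
        # M nearest minority-class neighbors of X_i by Euclidean distance
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:m + 1]  # skip X_i itself
        y = minority[rng.choice(neighbours)]     # random neighbor Y_j
        # S_new = X_i + rand(0,1) * (Y_j - X_i)
        new_samples.append(x + rng.random() * (y - x))
    return np.array(new_samples)
```

Stacking the returned samples onto the original data set completes step 3) of the algorithm.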

3.2 LightGBM

LightGBM [14, 15] is a GBDT (Gradient Boosting Decision Tree) framework based on the decision tree algorithm. Compared with the XGBoost (eXtreme Gradient Boosting) algorithm, it is faster and uses less memory.

One optimization in LightGBM is the histogram-based decision tree algorithm: continuous feature values are discretized into K values to form a histogram of width K. When traversing the samples, the discrete value is used as an index to accumulate statistics in the histogram; the discrete values in the histogram are then traversed to find the optimal split point.

Another optimization in LightGBM is a leaf-wise tree growth strategy with a depth limit. Unlike the level-wise strategy, the leaf-wise strategy finds the leaf with the largest split gain among all current leaves and splits it, which effectively improves accuracy, while the maximum depth limit prevents over-fitting.

The LightGBM algorithm uses steepest descent: the negative gradient of the loss function under the current model is taken as an approximation of the residual, and a regression tree is fitted to it. After multiple rounds of iteration, the outputs of all regression trees are accumulated to obtain the final result. Unlike the node splitting algorithms of GBDT and XGBoost, each feature is divided into buckets to construct a histogram, and node splitting is then calculated from the histogram. For each leaf node of the current model, all features are traversed to find the feature with the largest gain and its division value, and the leaf node is split accordingly. The steps of node splitting are as follows:

  1. 1)

    Discretize the feature values, assigning the feature value of every sample to a bin.

  2. 2)

    Construct a histogram for each feature; each bin of the histogram stores the sum of the gradients of the samples falling in that bin and the number of such samples.

  3. 3)

    Traverse all bins, taking the current bin as the split point, and accumulate the gradient sum SL and sample count nL of the bins up to and including the current bin. From the total gradient sum SP and total sample count nP of the parent node, the gradient sum SR and sample count nR of all bins to the right are obtained by histogram subtraction. Calculate the gain value as in Eq. (2); take the maximum gain over the traversal, and use the corresponding feature and bin feature value as the splitting feature and split value.

$$ gain = \frac{{S_{L}^{2} }}{{n_{L} }} + \frac{{S_{R}^{2} }}{{n_{R} }} - \frac{{S_{P}^{2} }}{{n_{P} }} $$
(2)
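The bin traversal and the gain of Eq. (2) can be sketched as follows. This is a simplified single-feature version; real LightGBM also uses second-order statistics and regularization.

```python
import numpy as np

def best_split(bin_ids, gradients, n_bins):
    """Find the best split for one feature from its histogram (Eq. (2)).
    bin_ids: bin index of each sample; gradients: per-sample gradients."""
    # 1) build the histogram: per-bin gradient sum and sample count
    grad_hist = np.zeros(n_bins)
    count_hist = np.zeros(n_bins)
    for b, g in zip(bin_ids, gradients):
        grad_hist[b] += g
        count_hist[b] += 1
    s_p, n_p = grad_hist.sum(), count_hist.sum()   # parent totals
    parent_term = s_p ** 2 / n_p

    best_gain, best_bin = -np.inf, None
    s_l = n_l = 0.0
    # 2) sweep bins left to right, getting the right side by subtraction
    for b in range(n_bins - 1):
        s_l += grad_hist[b]
        n_l += count_hist[b]
        s_r, n_r = s_p - s_l, n_p - n_l            # "histogram difference"
        if n_l == 0 or n_r == 0:
            continue
        gain = s_l ** 2 / n_l + s_r ** 2 / n_r - parent_term
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Obtaining the right-hand statistics by subtracting the left accumulation from the parent totals is exactly the histogram-difference trick that halves the work per split.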

3.3 Prediction Model

The predictive model in this paper first preprocesses the data set; second, it extracts features to build the training set; it then generates new samples with the SMOTE algorithm and adds them to the original training set to balance it; after that, the LightGBM algorithm is trained on the new training set using the features constructed by feature engineering; finally, the SMOTE-LightGBM predictive model is established.

The prediction modeling process is shown in Fig. 2, and the specific implementation process is as follows:

  1. 1)

    Input the data set D and preprocess it, including clearing vacant values, deleting invalid data, and handling abnormal values, to form a new data set D1.

  2. 2)

    Feature engineering 1 selects new features to form a new data set D2.

  3. 3)

    Apply the SMOTE algorithm to data set D2 to synthesize new minority-class samples, and add them to the original data set to form a new data set D3.

  4. 4)

    Train the LightGBM algorithm on the new data set D3, using a Bayesian optimization algorithm to determine the best parameter combination, and obtain the prediction model of the vehicle state.

  5. 5)

    To better predict the vehicle collision time, feature engineering 2 revises the features from feature engineering 1. Since the collision time prediction applies only to the collision vehicles, their features form a new data set D4. The K-means algorithm is then used to predict the collision time of each collision vehicle, giving the final prediction model.

  6. 6)

    Test with the test set to verify the effect of the prediction model.

Fig. 2. Predictive model process

4 Experiments

4.1 Data Set

The data used in the predictive model comes from the Internet of Vehicles of a Chinese automobile company. The data mainly includes vehicle state information and vehicle movement information; each CSV file corresponds to one vehicle. Table 1 gives the specific vehicle information.

Table 1. Vehicle information data format.

The data set is divided into training data and testing data. There are 120 CSV files of training data; each file contains 2–5 days of data, with between 4324 and 114460 records per file. There are 90 CSV files of testing data; each file contains 1–4 days of data, with between 3195 and 116899 records per file.

The data set has a label CSV file for collision prediction. The “Vehicle number” column is the vehicle number corresponding to the data file, the “Label” column is the label information of the vehicle (1 means collision, 0 means no collision), and the “Collect Time” column is the time when the vehicle collision occurred. Table 2 gives the label file format.

Table 2. Label file format.

The training data, together with its label data, is used for training; the test data is used to predict whether each test vehicle will collide and when, and the labels of the test set are used for evaluation.

4.2 Data Preprocessing and Feature Engineering

First, missing data, redundant data, and abnormal values are processed. The data is sorted by “collect time”, and features are then extracted from the preprocessed data.

Feature engineering 1, which serves vehicle collision state prediction, mainly considers two aspects: vehicle state information and movement information. Fig. 3 shows the role of feature engineering 1 in the predictive model process.

Fig. 3. Feature engineering 1

For the state information, features such as “battery pack negative relay status”, “brake pedal status”, “main driver's seat occupancy status”, “driver demand torque value”, “handbrake status”, “vehicle key status”, “vehicle total current” and “vehicle total voltage” are selected. The most important is the construction of the new start-stop features “if_off” and “if_on”. When the relay changes from connected to disconnected, if_off gradually changes from −5 to −1 and is 0 the rest of the time; when the relay changes from disconnected to connected, if_on gradually changes from −1 to −5 and is 0 the rest of the time.

For the vehicle motion information, three features are selected: “accelerator pedal position”, “steering wheel angle” and “vehicle speed”. The features “instantaneous acceleration”, “local acceleration” and “speed difference” are newly constructed, and several important features such as “accelerator pedal position”, “vehicle speed” and “speed difference” are bucketed. These new features have a strong correlation with the collision labels, making subsequent sampling and model construction easier.
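One possible construction of these features is sketched below with pandas. All column names, the smoothing window length, and the bucket boundaries are assumptions, and the −5…−1 ramp follows one reading of the if_off description, not a confirmed specification.

```python
import numpy as np
import pandas as pd

def build_motion_features(df):
    """Sketch of the constructed features; column names are assumptions."""
    df = df.sort_values("collect_time").reset_index(drop=True)
    # speed difference and instantaneous acceleration between consecutive records
    dt = df["collect_time"].diff().dt.total_seconds()
    df["speed_difference"] = df["vehicle_speed"].diff()
    df["instantaneous_acceleration"] = df["speed_difference"] / dt
    # local acceleration: instantaneous acceleration smoothed over a short window
    df["local_acceleration"] = (
        df["instantaneous_acceleration"].rolling(5, min_periods=1).mean()
    )
    # bucketing: discretize speed into coarse bins (boundaries are assumptions)
    df["speed_bucket"] = pd.cut(
        df["vehicle_speed"], bins=[-1, 20, 40, 60, 80, 300], labels=False
    )
    # if_off ramp: the five records starting at a relay on->off transition
    # count -5, -4, ..., -1 (one possible reading of the description)
    df["if_off"] = 0
    off_events = df.index[(df["relay_status"].shift(1) == 1) & (df["relay_status"] == 0)]
    for start in off_events:
        for k in range(5):
            if start + k < len(df):
                df.loc[start + k, "if_off"] = -5 + k
    return df
```

The if_on ramp would be built symmetrically from off-to-on transitions.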

Feature engineering 2 serves the prediction of the vehicle collision time; it constructs the features “current instantaneous acceleration”, “next instantaneous acceleration”, “collision judgment”, and “main driver's seat occupancy status”.

To convert the time prediction into a two-class problem, a “time_label” is added: the time at which if_off = −5 in the labeled data set is marked as 1, and all other times are marked as 0. Fig. 4 shows the role of feature engineering 2 in the predictive model process.

Fig. 4. Feature engineering 2

4.3 Sampling

Because the positive and negative samples of the vehicle collision labels are extremely unbalanced, the SMOTE algorithm is used to oversample the minority-class samples. After sampling, the numbers of positive and negative samples are close, which improves the generalization ability of the prediction model.

4.4 Model Evaluation Index

Classification Evaluation.

For the evaluation of the vehicle collision state results, the four basic indicators of classification are used: TP (true positives), FP (false positives), TN (true negatives), and FN (false negatives). These indicators measure the numbers of correctly and incorrectly classified positive and negative samples in the prediction results.

Precision is the proportion of correct predictions among all examples predicted positive, as shown in Eq. (3).

$$ P = \frac{TP}{{TP + FP}} $$
(3)

Recall is the proportion of real positive examples that the model predicts correctly, as shown in Eq. (4).

$$ R = \frac{TP}{{TP + FN}} $$
(4)

F1 can be regarded as the harmonic mean of precision P and recall R; its maximum value is 1 and its minimum is 0. F1 is used as the evaluation index of the collision classification result, as shown in Eq. (5).

$$ F_{1} = \frac{2 \cdot P \cdot R}{{P + R}} $$
(5)
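The three indices of Eqs. (3)–(5) can be computed directly from the label and prediction lists, e.g.:

```python
def classification_scores(y_true, y_pred):
    """Precision, recall and F1 from Eqs. (3)-(5)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # Eq. (3)
    recall = tp / (tp + fn) if tp + fn else 0.0      # Eq. (4)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # Eq. (5)
    return precision, recall, f1
```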

Evaluation of Collision Time Prediction Results.

The evaluation standard for the predicted collision time is the absolute error MAE, shown in Eq. (6), where abs is the absolute value function, f is the predicted collision time, and y is the real collision time.

$$ MAE = abs(f - y) $$
(6)

The error MAE maps to a Score, as shown in Table 3.

Table 3. The corresponding relationship of MAE and Score.

F2 is the evaluation standard for the collision time prediction, shown in Eq. (7), where sum is the summation function.

$$ F_{2} = \frac{sum(score)}{{{\text{(total number of samples)}} \cdot {10}}} $$
(7)

Final Evaluation.

The comprehensive evaluation standard for vehicle collision state and collision time is Eq. (8).

$$ F = \frac{{F_{1} + F_{2} }}{2} $$
(8)
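A sketch of the combined evaluation of Eqs. (6)–(8): since Table 3's MAE-to-Score mapping is not reproduced here, the mapping is passed in as a caller-supplied function, and the one used in the test is purely hypothetical.

```python
def final_score(y_time_true, y_time_pred, f1, score_of_mae):
    """F2 (Eq. (7)) and overall F (Eq. (8)).
    score_of_mae: stand-in for the Table 3 MAE-to-Score mapping."""
    # Eq. (6): per-sample absolute error, mapped to a Score
    scores = [score_of_mae(abs(f - y))
              for f, y in zip(y_time_pred, y_time_true)]
    f2 = sum(scores) / (len(scores) * 10)   # Eq. (7)
    f = (f1 + f2) / 2                       # Eq. (8)
    return f2, f
```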

4.5 Experimental Results and Analysis

The experiments are implemented in Python using five-fold cross-validation, and the final results are averaged. In the prediction model of Fig. 2, after preprocessing and feature engineering of the data set, the GBDT, XGBoost, and LightGBM algorithms are first evaluated, and the LightGBM algorithm after SMOTE sampling is then compared with them. These algorithms mainly predict the collision state of vehicles and take the earliest predicted collision time as the result; the values of F1, F2, and F are obtained for each. The experimental results are shown in Table 4.

Table 4. Experimental results.

A single LightGBM model outperforms the other single models, and the LightGBM model with sampling achieves the highest values on all three indicators. However, the prediction results for the vehicle collision time are still not very good.

Since the prediction of the vehicle collision time is not ideal, following the prediction model of Fig. 2, feature engineering 2 is performed after predicting the vehicle collision state, the time prediction is converted into a two-class problem, and the K-means algorithm is used to predict the collision time. The experimental results in Table 4 show that the best results are obtained by combining SMOTE sampling, the LightGBM algorithm for collision state prediction, and the K-means algorithm for collision time prediction.
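A minimal sketch of such a K-means time-prediction step, under two assumptions not stated in the paper: that the first feature column is the instantaneous acceleration, and that the collision cluster is the one whose centroid has the larger magnitude in that column.

```python
import numpy as np
from sklearn.cluster import KMeans

def predict_collision_time(times, features, seed=0):
    """Two-cluster K-means over per-record features; the cluster with the
    larger |first-feature| centroid (assumed to be instantaneous
    acceleration) is treated as the collision cluster, and its earliest
    timestamp is returned as the predicted collision time."""
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(features)
    collision_cluster = int(np.argmax(np.abs(km.cluster_centers_[:, 0])))
    mask = km.labels_ == collision_cluster
    return min(np.asarray(times)[mask])
```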

5 Conclusion

This paper proposes a vehicle collision prediction model: data preprocessing improves data quality; sampling improves the accuracy of collision label prediction; feature engineering and the LightGBM model improve the robustness of the model; and the K-means model for time prediction improves the collision time prediction accuracy. The whole model runs stably, and the total running time of the code on the data set is only 60–90 s.

Next, we will optimize the model according to the importance of different features, process the feature space in more detail, and further improve the model's results. In the current data, the vehicles that collided are relatively obvious; to consider more types of collisions, the amount of data in the training and test sets must be increased to enhance the generalization ability of the model.