Complementary parametric probit regression and nonparametric classification tree modeling approaches to analyze factors affecting severity of work zone weather-related crashes

Identifying risk factors for road traffic injuries can be considered one of the main priorities of transportation agencies. More than 12,000 fatal work zone crashes were reported between 2000 and 2013. Despite recent efforts to improve work zone safety, the frequency and severity of work zone crashes are still a big concern for transportation agencies. Although many studies have been conducted on different work zone safety-related issues, there is a lack of studies that investigate the effect of adverse weather conditions on work zone crash severity. This paper utilizes probit–classification tree, a relatively recent and promising combination of machine learning technique and conventional parametric model, to identify factors affecting work zone crash severity in adverse weather conditions using 8 years of work zone weather-related crashes (2006–2013) in Washington State. The key strength of this technique lies in its capability to alleviate the shortcomings of both parametric and nonparametric models. The results showed that both presence of traffic control device and lighting conditions are significant interacting variables in the developed complementary crash severity model for work zone weather-related crashes. Therefore, transportation agencies and contractors need to invest more in lighting equipment and better traffic control strategies at work zones, specifically during adverse weather conditions.


Introduction
The widespread aging of the US roads and the increase in traffic demand have raised the need for more transportation maintenance projects; hence, affecting both safety and operations of roadways. Over 27% of 67,523 work zone crashes in 2013 involved injuries or fatalities. In fact, over 579 fatalities were reported due to work zone-related crashes in 2013 [1]. A number of studies have been conducted on work zone crashes. Results from these studies showed that crash rate and frequency are increased by the presence of work zones [2][3][4][5][6]. Weng et al. [7] used association rules to investigate factors that might have a significant effect on work zone crash casualties. They found that driving under the influence, the speed limit over 40 mph and work zones without traffic control devices have the most significant effects on work zone casualty risk. Debnath et al. [8] identified the most frequent hazards at work zones from road-worker perspective. They found that speeding vehicles are the common work zone hazard. Wang and Qin [9] investigated the severity of single-vehicle crashes at work zones. They concluded that speeding is one of the key factors affecting the severity of work zone crashes. Characteristics of work zone rear-end crashes were analyzed by Qi et al. [10], and they found that driving under the influence, lighting condition, the presence of pedestrians, and roadway defects have the highest effects on severity of work zone rear-end crashes.
It is worth mentioning that there is no agreement among different studies about the severity of work zone crashes. More specifically, some studies reported that work zone crashes were significantly more severe than non-work zone crashes [11]. On the other hand, some studies claimed that work zone crashes were less severe than non-work zone crashes [12,13]. There is another group of studies, which mentioned that there is no significant difference between the severity of work zone and non-work zone crashes [14,15]. While these studies are important and useful to understand factors affecting work zone crashes, it is largely unknown whether the harmful effects of adverse weather conditions can exacerbate the severity of work zone crashes or not.
Prior road safety studies suggest that adverse weather conditions significantly affect the operations and safety of roadways [16][17][18][19][20]. More than 7,400 people are killed, and over 673,000 people are injured, in weather-related crashes every year on US roadways [21]. Qin et al. [22] found that heavy rain leads to fewer but more severe crashes. Yu et al. [23] found that crashes are strongly affected by weather conditions, especially those occurring on mountainous freeways. Huang et al. [24] found that the severity of fog- and smoke-related crashes is higher than that of crashes in clear weather conditions. Eisenberg et al. [25] analyzed the effects of snowfall on crash injury severity. They found that snowy days have fewer fatal crashes than dry days but more injury and property damage only (PDO) crashes. Ahmed et al. [26] analyzed the interaction between roadway geometry and real-time weather and traffic data on mountainous freeways. They found that the probability of being involved in a crash could double during the snowy season. Another study showed that both injury and non-injury crash rates increase during the winter season; however, injury crashes are more severe during the snowy winter season than in non-snow seasons [27].
Nevertheless, there is still a lack of studies examining the impact of weather conditions on work zone injury severity. This paper utilized data from the second Strategic Highway Research Program (SHRP2) Roadway Information Dataset (RID) to shed some light on the factors affecting work zone crash frequency and severity in different weather conditions. The contributions of this paper are as follows. First, because weather data extracted from weather stations are of limited accuracy, this study utilized the weather information provided in police crash reports. A previous study by the authors showed about 60% accuracy in identifying weather-related crashes using weather station data [28]. Second, a new methodology that combines traditional probit regression analysis with the data mining decision tree technique was utilized to overcome the limitations of each standalone modeling technique. Third, factors affecting the severity of work zone weather-related crashes were identified and discussed.

Data source
This study utilized 8 years (2006–2013) of crashes including 3,028 weather-related crashes (1,887 PDO and 1,141 fatal + injury) to determine factors affecting work zone weather-related crashes in Washington State. The crash data were extracted from the Strategic Highway Research Program 2 (SHRP2) Roadway Information Dataset (RID).
RID consists of roadway data collected from mobile data collection project, government, public, and private parties, in addition to supplemental crash history data which is used in this study. For more information about RID, see [29].

Methodology
In this study, factors affecting work zone weather-related crashes were identified using a complementary parametric and nonparametric crash severity model. In fact, both parametric and nonparametric models have their own advantages and disadvantages. For example, parametric models such as probit and logistic regression can provide the relationship between a response variable and predictors, and the results obtained from them are easy to interpret and understand [30]. However, the problem with these models is that there are many pre-assumptions in parametric models, which might negatively affect the accuracy of the results. Risk factors can also exhibit various exposure effects in different circumstances in parametric models (hidden effects problem). These shortcomings cannot be addressed using the common parametric models such as logistic and probit regression models [31]. One of the major solutions to address the hidden effects problem is to split the full sample data into several sub-datasets using the nonparametric classification tree method [31]. This method might be an effective way to see the effects of risk factors' hidden exposure in different circumstances. In addition, using nonparametric models is beneficial as these models have the ability to provide high prediction accuracy [31]. Therefore, this study utilized the probit-classification tree method, which is a complementary method utilizing both nonparametric (classification tree) and parametric (probit regression) models, to analyze the effect of different risk factors on work zone weather-related crashes. The dependent variable in the model is the crash severity levels, and the explanatory variables are the factors which influence crash severity levels. In this study, the crash severity is defined as a binary variable, which has two levels including severe crashes (injury and fatal crashes) and non-severe crashes (PDO crashes).

Decision tree
A decision tree can be used for both continuous and nominal target variables. When a decision tree is used to predict a continuous target variable, it is called a regression tree, and when it is used to classify a nominal target variable, it is called a classification tree [32]. Two main components of decision trees are the ''root node'' and the ''leaf node.'' The root node is the node located at the top of the tree and contains all the data, and the ''leaf node'' refers to the termination node and has the lowest impurity. More specifically, based on the independent variable (splitter) that creates the best homogeneity, the root node is divided into two child nodes. This procedure (partitioning the target variable recursively) will be continued until all the data in each node reach their highest homogeneity; then, tree growing will stop. This node, which does not have any branches, is called the ''leaf node,'' and each path from the top of the tree (the root node) to each leaf (the terminal node) can be considered as a rule. It is worth mentioning that data in each child node are purer (more homogenous) than the data in the upper parent node [33]. In order to find possible splits among all variables, a splitting criterion (test) is performed. Splitting criterion is the main design component of a decision tree [34]. More specifically, in the decision tree learning algorithm, the splitting criterion's role is to measure the quality of each possible split among all variables. Two common splitting criteria that can be used to grow a decision tree are Chi-square and Gini reduction. In this study, Gini splitting criterion is used to select which variable and split pattern will be used to best split the node.
Gini impurity measures the level of data impurity in a node. More specifically, it is the probability that a record chosen at random from the node would be classified incorrectly. The procedure for selecting the variable and split scheme that best split a parent node is as follows [35]:
1. Determine the node impurity: Taking t as a parent node, the node impurity i(t) is calculated using the Gini index definition in Eq. (1):
$$ i(t) = 1 - \sum_{j=1}^{M} \left( \frac{n_j}{N} \right)^2 \qquad (1) $$
where M represents the number of classes, n_j the number of class j elements, and N the total number of elements in the node. If a node is homogeneous, the value obtained from Eq. (1) is minimal; the value is higher for less pure nodes.
2. Determine the impurity reduction (Δi): For all possible splits in the values of the variable x, the impurity reduction on the parent node t caused by a split s is calculated as follows:
$$ \Delta i(x, s, t) = i(t) - \sum_{j=1}^{n_t} F_j \, i(t_j) \qquad (2) $$
where n_t is the number of child nodes of the parent node t, t_j is the j-th child node, and F_j is the proportion of the elements of node t that fall in child node t_j. Δi > 0 indicates that elements in the child nodes are purer than elements in the parent node, and Δi ≤ 0 denotes that elements in the child nodes are not purer than the parent node. The best split s* associated with the variable x is identified by exhaustively searching all possible splits of x; that is, s* yields the maximum impurity reduction.
3. Determine the global maximum impurity reduction: The previous step is repeated to determine the best split for every variable. Among all the best splits, the split s* associated with the variable x* gives the global maximum impurity reduction.
4. Determine the leaf node: If Δi(x*, s*, t) > 0, choose the variable x* and split s* to split the parent node. If Δi(x*, s*, t) ≤ 0, the parent node is considered a leaf node.
5. Stop the splitting: Splitting stops when either of these two criteria is satisfied: (a) no node can be split further (Δi(x*, s*, t) ≤ 0), or the number of elements in the node is less than the preset minimum number; (b) the current tree depth reaches the preset maximum tree depth. Otherwise, go to Step 1.
For more information about the decision tree method, see [32].
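The split-selection steps above can be illustrated with a minimal Python sketch. It is not the software used in this study: the function names, the toy records, and the restriction to binary "one value versus the rest" partitions are assumptions made purely for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 - sum_j (n_j / N)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

def best_split(rows, labels, min_node_size=1):
    """Search every variable and every 'value vs. rest' partition.

    Returns (reduction, variable, value); variable is None when no split
    gives a positive impurity reduction, i.e., the node is a leaf.
    """
    best = (0.0, None, None)
    for var in rows[0]:
        for value in sorted({row[var] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[var] == value]
            right = [y for row, y in zip(rows, labels) if row[var] != value]
            if len(left) < min_node_size or len(right) < min_node_size:
                continue  # respect the preset minimum number of elements
            gain = impurity_reduction(labels, [left, right])
            if gain > best[0]:
                best = (gain, var, value)
    return best

# Hypothetical toy records (not data from this study).
rows = [
    {"traffic_control": 2, "lighting": 1},
    {"traffic_control": 2, "lighting": 2},
    {"traffic_control": 2, "lighting": 1},
    {"traffic_control": 1, "lighting": 2},
    {"traffic_control": 1, "lighting": 1},
    {"traffic_control": 1, "lighting": 1},
    {"traffic_control": 1, "lighting": 2},
    {"traffic_control": 1, "lighting": 1},
]
labels = ["severe", "severe", "severe", "severe", "pdo", "pdo", "pdo", "pdo"]

gain, variable, value = best_split(rows, labels)
print(variable, value, round(gain, 3))
```

In this toy example the search selects traffic_control, because splitting on it yields a larger impurity reduction than splitting on lighting. Growing a full tree would apply best_split recursively to each child node until one of the stopping criteria in Step 5 is met.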

Probit regression
Probit regression is a common statistical method that can estimate the coefficients of predictors [36]. This method was chosen to develop a separate probit regression for each group identified by the decision tree model. The forward selection method was used to find the best fitting model. The advantage of using the probit regression model is the ability to estimate marginal effects. Probit regression is accordingly applied to each data group to find the coefficient of each explanatory variable. The probit response curve is S-shaped and is suited to binary response variables. The probit model is given in Eq. (3):
$$ \text{probit}[\pi(x)] = \alpha + \beta x \qquad (3) $$
where π(x) is the probability of response at x, α determines the probability of response when the explanatory variables are at the reference level (x = 0), and β, a parameter to be estimated, represents the change in the probit of that probability per unit change in x. The cumulative distribution function (CDF) F(x) of a random variable X is defined as F(x) = P(X ≤ x), where x can vary between −∞ and ∞. As x increases, F(x) increases from 0 to 1 because the probability of having success increases. Considering x as a continuous random variable and plotting the CDF as a function of x, the result looks S-shaped when β > 0 [36].
The relation among the CDF, the normal distribution, and the probit model is summarized by Agresti [36]: "when F is the CDF of a normal distribution, the model provided in Eq. (3) is equivalent to the probit model." This statement is expressed in Eqs. (4), (5), and (6), where Eq. (6) gives the normal density function and Eq. (5) the CDF of the normal distribution [36].
As mentioned, assuming that F follows a normal cumulative distribution:
$$ \pi(x) = \Phi(\alpha + \beta x) \qquad (4) $$
$$ \Phi(\alpha + \beta x) = \int_{-\infty}^{\alpha + \beta x} \phi(z) \, dz \qquad (5) $$
where φ(z) is the normal density function, defined in Eq. (6):
$$ \phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) \qquad (6) $$
where z follows the normal distribution with a mean of 0 and a standard deviation of 1.
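To make the link between the normal CDF and the marginal effects reported later more concrete, the following Python sketch evaluates a one-predictor probit model using only the standard library. The coefficient values are hypothetical and are not estimates from this study; the function names are likewise illustrative assumptions.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_probability(alpha, beta, x):
    """P(severe crash | x) under a one-predictor probit model."""
    return normal_cdf(alpha + beta * x)

def marginal_effect_binary(alpha, beta):
    """Change in P(severe) when a 0/1 indicator moves from the reference level (0) to 1."""
    return normal_cdf(alpha + beta) - normal_cdf(alpha)

def marginal_effect_continuous(alpha, beta, x):
    """Slope of P(severe) with respect to a continuous x, evaluated at x."""
    return beta * normal_pdf(alpha + beta * x)

# Hypothetical coefficients, for illustration only.
alpha, beta = -0.2, -0.35
effect = marginal_effect_binary(alpha, beta)
print(f"P(severe | x=0) = {probit_probability(alpha, beta, 0):.3f}")
print(f"P(severe | x=1) = {probit_probability(alpha, beta, 1):.3f}")
print(f"marginal effect  = {effect:.3f}")
```

A negative marginal effect for an indicator variable is read in the same way as the entries of Table 2: the category coded 1 is less likely to be associated with a severe crash than the reference level. With several predictors, the effect depends on the values at which the other variables are held, which is a choice the analyst has to make.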
For more information about probit regression, see [36]. Table 1 shows the selected variables for developing the work zone crash severity model in adverse weather conditions. The dependent variable is crash severity with two levels: severe crashes (fatal and injury crashes) and non-severe crashes (PDO crashes). Explanatory variables can be considered as driver behavior, roadway conditions, environmental conditions, and crash characteristics.

Results and discussion
The dataset was divided into three sub-datasets (training, validation, and testing). Among them, 60% of the data were assigned to the training sub-dataset, 30% to the validation sub-dataset, and 10% to the testing sub-dataset. More clearly, 3023 observations were used and divided among mentioned sub-datasets (i.e., 1813 observations were assigned to the training sub-dataset, 906 observations were assigned to the validation sub-dataset, and 302 observations were assigned to the testing sub-dataset). In addition, the maximum tree depth of 10, which is suitable considering the number of variables, was selected. The minimum number of data per node was selected as 40 so that all the crash severity classes have enough data for the subsequent probit analysis in each leaf node. The Gini reduction test found any node at level four at the significance level of 0.05 to be insignificant based on the obtained results. Hence, the tree stops growing at this level, and the tree comprising four leaf nodes is selected. Optimal decision tree size ensures mitigating the overfitting issue. Two criteria were considered: First, the misclassification rate for the training and validation datasets should be similar to some extent; and second, the misclassification rate and average squared error should be as low as possible. Misclassification rates for the training and validation were obtained as 0.32 and 0.35, respectively. The average square errors obtained were 0.21 and 0.22 for the training and validation datasets, respectively. Therefore, both criteria were satisfied based on the obtained results. Figure 1 shows the selected decision tree structure for the probit-classification tree model. As can be seen, collision type, the presence of a traffic control device, and lighting conditions were found to be significant interacting variables. Based on the classification shown in Fig. 
1, the training dataset was divided into four groups representing four leaf nodes: Group (1): collision type = 3 and 4 (sideswipe and other crashes); Group (2): collision type = 1 and 2 (rear-end and angle crashes) and traffic control = 2 (none); Group (3): collision type = 1 and 2 (rear-end and angle crashes) and traffic control = 1 (stop/signal/yield sign/other) and lighting condition = 2 (other); and Group (4): collision type = 1 and 2 (rear-end and angle crashes) and traffic control = 1 (stop/signal/yield sign/other) and lighting condition = 1 (daylight).
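The four leaf-node rules can be written as a small routing function that sends each crash record to the group whose probit model applies. The sketch below is illustrative: the function name and the error handling are assumptions, while the category codes follow the group definitions given above.

```python
def assign_group(collision_type, traffic_control, lighting):
    """Return the leaf-node group (1-4) for one crash record.

    Codes as described in the text:
      collision_type: 1 = rear-end, 2 = angle, 3 = sideswipe, 4 = other
      traffic_control: 1 = stop/signal/yield sign/other, 2 = none
      lighting: 1 = daylight, 2 = other
    """
    if collision_type in (3, 4):
        return 1
    if collision_type not in (1, 2):
        raise ValueError(f"unknown collision type: {collision_type}")
    if traffic_control == 2:
        return 2
    if traffic_control != 1:
        raise ValueError(f"unknown traffic control code: {traffic_control}")
    if lighting == 2:
        return 3
    if lighting == 1:
        return 4
    raise ValueError(f"unknown lighting code: {lighting}")

# Hypothetical records: (collision_type, traffic_control, lighting).
records = [(3, 1, 1), (1, 2, 1), (2, 1, 2), (1, 1, 1)]
groups = [assign_group(*record) for record in records]
print(groups)
```

Each path from the root to a leaf corresponds to one branch of the function, which is why the tree can be read as a set of rules.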
For each of the above-mentioned groups, a separate probit regression model was developed, and the marginal effects of risk factors for work zone weather-related crashes were calculated. For instance, in the case of Group (2), 19 variables are available for developing a probit model, since traffic control is equal to 2 (none), which means that this variable is fixed in this model. For Group (2), collision type still has two categories (rear-end and angle crashes); therefore, this variable is not fixed and is used for further analysis in the probit model. Table 2 shows the results obtained from developing a separate probit model for each group classified by the estimated decision tree. In total, four models were estimated for work zone weather-related crashes using the probit-classification tree procedure. To confirm the suitability and fit of the models, three statistics were used: the Akaike information criterion (AIC), the Schwarz criterion (SC), and the -2 log-likelihood; the results are given in Table 3.
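For readers unfamiliar with the three fit statistics, the following Python sketch shows how they are commonly computed from a model's log-likelihood, using the standard definitions AIC = -2 log L + 2k and SC = -2 log L + k ln(n), where k is the number of estimated parameters and n the number of observations. The function names and the input values are hypothetical and are not taken from the models in this study.

```python
import math

def bernoulli_log_likelihood(outcomes, probabilities):
    """Log-likelihood of binary outcomes (1 = severe, 0 = PDO) given fitted probabilities."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )

def fit_statistics(log_likelihood, n_params, n_obs):
    """Return -2 log-likelihood, AIC, and the Schwarz criterion (SC)."""
    neg2ll = -2.0 * log_likelihood
    return {
        "-2LL": neg2ll,
        "AIC": neg2ll + 2.0 * n_params,
        "SC": neg2ll + n_params * math.log(n_obs),
    }

# Hypothetical values, for illustration only.
stats = fit_statistics(log_likelihood=-100.0, n_params=3, n_obs=100)
print(stats)
```

Lower values of all three statistics indicate a better fit; SC penalizes additional parameters more heavily than AIC once n exceeds about 7 observations, since ln(n) > 2 there.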
As can be seen in model 1, the marginal effects of risk factors on the crash severity of a driver who is involved in collision types 3 and 4 (sideswipe and other) in work zone weather-related crashes can be identified. Speed limit is one of the factors found to be significant in model 1. In fact, a higher speed limit dramatically increases the severity of work zone weather-related crashes. More specifically, drivers in work zone weather-related crashes occurred in a segment with less than 55 mph posted speed limit were 13% less likely to be involved in severe crashes (the marginal effect is -0.13 in Table 2) in comparison with the reference level (drivers who were driving in a work zone segment with a posted speed limit greater than 55 mph). In fact, driving over the speed limit could be risky because it provides insufficient time for suitable response to control and handle unexpected situations, and this might be exacerbated during adverse weather conditions such as heavy rain or fog. The findings indicated that among those drivers who had sideswipe and other types of crashes (all crashes except rear-end and angle crashes), female drivers had a higher casualty risk in comparison with male drivers. Two factors have been identified for the higher risk of being injured associated with the female drivers at work zone crashes in the literature, i.e., risk-taking behavior, which is higher in younger female drivers, and increased number of female drivers (exposure effect) [31,37]. Based on the identified marginal effects for model 1, male drivers were 5% less likely to be involved in severe crashes in comparison with female drivers.
Vehicle age is another factor that turned out to be a significant predictor in model 1. Results showed that older vehicles were more likely to be involved in severe work zone weather-related crashes. This result confirms previous studies showing the impact of vehicle age on driver injury severity [38,39]. More specifically, vehicles less than 5 years old were 13% less likely to be involved in severe work zone weather-related crashes in comparison with those over 10 years old. It was also found that 5-10-year-old vehicles were 2% less likely to be involved in severe work zone weather-related crashes in comparison with older vehicles.
Driving under the influence is another factor that came out to be significant in model 1. Particularly, sober drivers were 13% less likely to be involved in severe work zone weather-related crashes in comparison with drivers under the influence, which supports previous studies [40,41].
Roadway characteristics (presence of a curve) were also found to have a significant effect in this model. More specifically, drivers who were involved in a work zone weather-related crash at non-curved sections were 6% less likely to be involved in severe crashes in comparison with drivers who were involved in a work zone weather-related crash at a curve. The presence of curves at work zone segments could adversely affect the sight distance and handling of the vehicle in critical situations. The adverse weather might exacerbate this negative effect. Risky overtaking might be another reason for severe crashes at work zones on road curves.
The model also estimated the effect of collision type (sideswipe crashes) on the severity of crashes. More clearly, drivers who were involved in sideswipe crashes were 7% less likely to be involved in severe crashes in comparison with other collision types. It is worth mentioning that most of the sideswipe crashes were not severe in comparison with head-on crashes at work zones, which will be discussed in model 2 interpretation.
Model 2 presents the results of the probit model for drivers who were involved in rear-end and angle crashes in locations with no traffic control devices. In particular, model 2 showed that route type is a contributing factor in the crash severity model. Drivers who were involved in work zone weather-related crashes on rural or secondary routes were 37% more likely to be involved in severe crashes. Vehicle type was also found to be a significant factor in model 2. Drivers who were driving passenger cars and involved in work zone weather-related crashes were 2.5 times less likely to be involved in severe crashes in comparison with those driving other vehicles.
Crash location was found to be another significant factor in this model. The results showed that drivers who were involved in crashes within a work zone area were 11% less likely to be involved in severe crashes in comparison with drivers who were involved in crashes in the external traffic backup caused by a work zone. This finding is supported by other studies, which noted that lane closures might increase traffic conflicts, as road work mostly requires closing part of the roadway [42]. This situation might be worse in adverse weather conditions, as low visibility could prevent drivers from performing appropriate maneuvers in critical circumstances.
The model also showed a significant effect of surface conditions on the severity of crashes. Drivers involved in work zone crashes were 29% more likely to be involved in severe crashes on a wet surface than on a dry surface. The marginal effect of rear-end crashes (collision type = 1) given in Table 2 indicates that drivers who were involved in rear-end crashes (the most common crashes in work zones) were 18% less likely to be involved in severe crashes in comparison with angle crashes during adverse weather conditions.
Model 3 presents the results of the probit model for drivers who were involved in rear-end and angle crashes, with traffic control devices present and in the absence of daytime lighting conditions. Similar to model 1, sobriety was found to be significant, which means that crash severity is affected by this factor. The marginal effects also showed that drivers of vehicles equipped with airbags were 13% less likely to be involved in severe crashes, which is consistent with a previous study [31].
In addition, results showed that crashes with fewer vehicles involved were 21% less likely to be severe. This could be because of the severity of chain crashes. Asking drivers to keep the safe following distance and avoid tailgating might be helpful. Installing dynamic message signs at work zones could be beneficial to provide advisory messages for drivers.
Finally, model 4 reveals the results of the probit model for those drivers who were involved in rear-end and angle crashes, with the presence of traffic control devices and in daytime lighting conditions. Vehicle action before crash, speed limit, weather conditions, and land use came out to be significant factors affecting crash severity model. Indeed, vehicles making a left or right turn before a crash were 38% less likely to be involved in severe crashes in comparison with vehicles going straight (category 5 in Table 1).
It was also found that drivers who were driving on roads with posted speed limits less than 55 mph were 12% less likely to be involved in severe crashes. In addition, drivers who were involved in rainy weather condition crashes were 13% more likely to be involved in severe crashes. Finally, drivers who were driving in a rural area were 10% more likely to be involved in severe work zone weather-related crashes in comparison with those driving on urban roadways. This might be due to the fact that risky behaviors such as not wearing seatbelts are more common among rural drivers [43].
As mentioned earlier, another conventional probit model with all variables was developed to compare the results with complementary probit-classification tree models. All the 20 variables used for developing probit-classification tree were considered in developing the conventional probit model as well. Forward stepwise selection method was used to find the best model fit. The results of the probit regression model are given in Table 3.
As can be seen, 11 variables came out to be significant in the conventional probit model. However, some important contributing factors including vehicle age, vehicle type, crash location, airbag, number of vehicles involved, lighting, and weather conditions are not represented in this model. The conventional probit model fit statistics are provided in Table 4.
Receiver operating characteristic (ROC) curves were used to visualize the models' performance. The predictive capability of the probit-classification tree models and the conventional probit model (base model) is shown in Fig. 2 using ROC curves. An ROC curve shows a model's performance by plotting sensitivity versus 1-specificity. If a model carries no relevant information, its ROC curve lies close to the diagonal line. On the other hand, the curve of the best-performing model lies close to the upper left corner. Figure 2 shows that for the training dataset, which contains most of the data, all developed probit-classification tree models performed better than the conventional probit model. It is also worth mentioning that even though smaller portions of the data were assigned to the validation and testing datasets than to the training dataset, the probit-classification tree method still performed better than the conventional probit model.
This study explored the effects of vehicle, driver behavior, and environmental factors on the severity of work zone weather-related crashes using a method that combines a data mining technique with conventional parametric modeling. In addition, the results obtained from the conventional probit model and the proposed probit-classification tree model were compared, using the 8 years of work zone weather-related crashes, to better understand the advantages of the proposed model.
The complementary methodology utilized in this study has the advantage of being used in safety analysis, as this model can compensate for the weak points of parametric and nonparametric models.
In comparison with the probit-classification tree model, some important contributing factors, such as vehicle age, vehicle type, crash location, airbag, number of the vehicles involved, lighting, and weather conditions, were excluded in the conventional probit model, which shows the capability of the developed model in identifying risk factors affecting work zone weather-related crashes.
The most interesting finding of this study is that both the presence of traffic control device and lighting conditions were found to be significant interacting variables in the developed complementary crash severity model for work zone weather-related crashes. This finding shows that more attention should be paid to the effect of weather conditions in different stages of work zone projects by transportation agencies and contractors. It is highly recommended to transportation agencies and contractors to invest in  adequate lighting equipment for nighttime work, which is the only option in most urban areas. In 2013, the American Traffic Safety Services Association developed the first and the most comprehensive set of nighttime lighting guidelines for work zones called ''Nighttime Lighting Guidelines for Work Zones: A guide for developing a lighting plan for nighttime work zones'' [44]. The manual provides a simple procedure for designing a nighttime lighting system for work zones that can be easily adopted by engineers, designers, and contractors without prior experience in illumination. However, more studies are needed to clarify and characterize driver behavior in different lighting, visibility, and weather conditions. This study found that the presence of a traffic control device is a significant contributing factor to crash injury severity. Improving the temporary traffic control (TTC), including but not limited to the portable variable speed limit (VSL), to adjust speed limits during adverse weather conditions can be beneficial in this regard. Portable changeable message signs (PCMS) are also highly recommended to warn drivers about work zones to reduce the severity of work zone crashes.