The dataset was divided into three sub-datasets: training (60% of the data), validation (30%), and testing (10%). Specifically, 3023 observations were used, of which 1813 were assigned to the training sub-dataset, 906 to the validation sub-dataset, and 302 to the testing sub-dataset. In addition, a maximum tree depth of 10 was selected, which is suitable considering the number of variables. The minimum number of observations per node was set to 40 so that every crash severity class has enough data in each leaf node for the subsequent probit analysis.
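As a rough illustration of a 60/30/10 partition, the sketch below shuffles the observation indices and cuts them into three lists. The function name, the seed, and the rule that the testing share receives the remainder are assumptions made for this example; the text does not describe how rounding was handled, so the testing count produced here need not match the reported one exactly.

```python
import random

def split_indices(n, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle range(n) and cut it into training, validation, and testing index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # fixed seed (assumed) so the split is reproducible
    n_train = int(n * fractions[0])    # training share, rounded down
    n_valid = int(n * fractions[1])    # validation share, rounded down
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]     # remainder goes to testing (assumed rule)
    return train, valid, test

train, valid, test = split_indices(3023)
print(len(train), len(valid), len(test))
```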
Based on the obtained results, the Gini reduction test found no node at level four to be significant at the 0.05 significance level. Hence, the tree stops growing at this level, and the tree comprising four leaf nodes was selected. Selecting an optimal tree size helps mitigate overfitting. Two criteria were considered: first, the misclassification rates for the training and validation datasets should be similar; and second, the misclassification rate and average squared error should be as low as possible. The misclassification rates for the training and validation datasets were 0.32 and 0.35, respectively, and the average squared errors were 0.21 and 0.22, respectively. Therefore, both criteria were satisfied.
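The quantities named above can be sketched in a few lines. This is a minimal illustration, not the software procedure used in the study: it computes the Gini impurity reduction of a candidate split and the two error measures for a binary outcome, and it does not implement the significance test itself. The tolerance used to judge whether two rates are "similar" is a hypothetical value chosen for the example.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_reduction(parent, children):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

def misclassification_rate(actual, predicted):
    """Share of observations whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def average_squared_error(actual, predicted_prob):
    """Mean squared gap between 0/1 outcomes and predicted probabilities."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted_prob)) / len(actual)

def rates_are_similar(train_rate, valid_rate, tolerance=0.05):
    """First criterion; the tolerance is an assumed, illustrative threshold."""
    return abs(train_rate - valid_rate) <= tolerance

# Misclassification rates reported in the text for training and validation.
print(rates_are_similar(0.32, 0.35))
```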
Figure 1 shows the selected decision tree structure for the probit–classification tree model. As can be seen, collision type, the presence of a traffic control device, and lighting conditions were found to be significant interacting variables. Based on the classification shown in Fig. 1, the training dataset was divided into four groups, one for each leaf node:
Group (1): collision type = 3 and 4 (sideswipe and other crashes);
Group (2): collision type = 1 and 2 (rear-end and angle crashes) and traffic control = 2 (none);
Group (3): collision type = 1 and 2 (rear-end and angle crashes) and traffic control = 1 (stop/signal/yield sign/other) and lighting condition = 2 (other);
Group (4): collision type = 1 and 2 (rear-end and angle crashes) and traffic control = 1 (stop/signal/yield sign/other) and lighting condition = 1 (daylight).
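The four leaf nodes listed above amount to a small set of nested rules. The sketch below encodes them using the category codes given in the group definitions; the function and argument names are hypothetical and introduced only for illustration.

```python
def assign_group(collision_type, traffic_control, lighting):
    """Map one crash record to its leaf-node group (1 to 4).

    Codes follow the group definitions in the text:
    collision_type: 1 rear-end, 2 angle, 3 sideswipe, 4 other
    traffic_control: 1 stop/signal/yield sign/other, 2 none
    lighting: 1 daylight, 2 other
    """
    if collision_type not in (1, 2, 3, 4):
        raise ValueError("unknown collision type code")
    if traffic_control not in (1, 2) or lighting not in (1, 2):
        raise ValueError("unknown traffic control or lighting code")
    if collision_type in (3, 4):
        return 1          # sideswipe and other crashes
    if traffic_control == 2:
        return 2          # rear-end/angle, no traffic control
    if lighting == 2:
        return 3          # rear-end/angle, traffic control, non-daylight
    return 4              # rear-end/angle, traffic control, daylight

print(assign_group(collision_type=2, traffic_control=1, lighting=2))
```

Each record falls into exactly one group, so a separate model can then be fitted on each subset.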
For each of the above-mentioned groups, a separate probit regression model was developed, and the marginal effects of risk factors for work zone weather-related crashes were calculated. For instance, in the case of Group (2), 19 variables are available for developing a probit model, since traffic control is equal to 2 for every observation in this group, which means that this variable is fixed in this model. It is worth mentioning that for Group (2), collision type still has two categories (rear-end and angle crashes); therefore, this variable is not fixed and is used for further analysis in the probit model.
Table 2 shows the results obtained from developing a separate probit model for each group classified by the estimated decision tree. In total, four models were estimated for work zone weather-related crashes using the probit–classification tree procedure. To confirm the suitability and fit of the models, three statistics were used: the Akaike information criterion (AIC), the Schwarz criterion (SC), and the −2 log-likelihood. The results are given in Table 3.
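For reference, the three fit statistics are related through the model log-likelihood. The sketch below uses the standard definitions (AIC = −2 log L + 2k, SC = −2 log L + k ln n, where k is the number of estimated parameters and n the number of observations); the numeric inputs are hypothetical and are not values from the study's tables.

```python
import math

def minus_two_log_likelihood(log_likelihood):
    """The -2 log-likelihood statistic."""
    return -2.0 * log_likelihood

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def schwarz_criterion(log_likelihood, k, n):
    """Schwarz criterion (also known as BIC) for k parameters and n observations."""
    return -2.0 * log_likelihood + k * math.log(n)

# Hypothetical inputs, for illustration only.
ll, k, n = -250.0, 6, 400
print(minus_two_log_likelihood(ll), aic(ll, k), round(schwarz_criterion(ll, k, n), 2))
```

Lower values indicate a better trade-off between fit and model size; SC penalizes additional parameters more heavily than AIC once n exceeds about 7.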
Model 1 identifies the marginal effects of risk factors on the crash severity of drivers involved in collision types 3 and 4 (sideswipe and other) in work zone weather-related crashes. Speed limit is one of the factors found to be significant in model 1: a higher speed limit substantially increases the severity of work zone weather-related crashes. More specifically, drivers whose crashes occurred in a segment with a posted speed limit of less than 55 mph were 13% less likely to be involved in severe crashes (the marginal effect is −0.13 in Table 2) in comparison with the reference level (drivers who were driving in a work zone segment with a posted speed limit greater than 55 mph). Indeed, driving over the speed limit could be risky because it leaves insufficient time to respond to and handle unexpected situations, and this might be exacerbated during adverse weather conditions such as heavy rain or fog.
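To make the notion of a marginal effect concrete, the sketch below shows how the effect of a 0/1 regressor can be computed, assuming a binary probit with outcome "severe crash". The effect is the change in predicted probability when the indicator switches from 0 to 1, so a value of −0.13 corresponds to a 0.13 drop in the predicted probability of a severe crash. The baseline index and coefficient are hypothetical numbers, not estimates from the study, and the study's software may average effects over observations rather than evaluate them at a single point.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dummy_marginal_effect(baseline_index, coefficient):
    """Change in P(severe) when a 0/1 regressor switches from 0 to 1.

    baseline_index is the linear predictor with the indicator set to 0.
    """
    return norm_cdf(baseline_index + coefficient) - norm_cdf(baseline_index)

# Hypothetical values, for illustration only.
effect = dummy_marginal_effect(baseline_index=-0.5, coefficient=-0.6)
print(round(effect, 3))
```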
The findings indicated that among drivers who had sideswipe and other types of crashes (all crashes except rear-end and angle crashes), female drivers had a higher casualty risk in comparison with male drivers. The literature identifies two factors behind the higher injury risk of female drivers in work zone crashes: risk-taking behavior, which is higher in younger female drivers, and the increased number of female drivers (exposure effect) [31, 37]. Based on the identified marginal effects for model 1, male drivers were 5% less likely to be involved in severe crashes in comparison with female drivers.
Vehicle age is another factor that turned out to be a significant predictor in model 1. The results showed that older vehicles were associated with more severe work zone weather-related crashes. This result confirms previous studies, which showed the impact of vehicle age on driver injury severity [38, 39]. More specifically, vehicles less than 5 years old were 13% less likely to be involved in severe work zone weather-related crashes in comparison with those over 10 years old. It was also found that 5–10-year-old vehicles were 2% less likely to be involved in severe work zone weather-related crashes in comparison with older vehicles.
Driving under the influence is another factor that came out to be significant in model 1. Particularly, sober drivers were 13% less likely to be involved in severe work zone weather-related crashes in comparison with drivers under the influence, which supports previous studies [40, 41].
Roadway characteristics (presence of a curve) were also found to have a significant effect in this model. More specifically, drivers who were involved in a work zone weather-related crash at non-curved sections were 6% less likely to be involved in severe crashes in comparison with drivers who were involved in a work zone weather-related crash at a curve. The presence of curves at work zone segments could adversely affect the sight distance and handling of the vehicle in critical situations. The adverse weather might exacerbate this negative effect. Risky overtaking might be another reason for severe crashes at work zones on road curves.
The model also estimated the effect of collision type (sideswipe crashes) on the severity of crashes. Specifically, drivers who were involved in sideswipe crashes were 7% less likely to be involved in severe crashes in comparison with other collision types. It is worth mentioning that most of the sideswipe crashes were not severe in comparison with head-on crashes at work zones, which will be discussed in the interpretation of model 2.
Model 2 presents the results of the probit model for drivers who were involved in rear-end and angle crashes in locations with no traffic control devices. In particular, model 2 showed that route type is a contributing factor in the crash severity model. Drivers who were involved in work zone weather-related crashes on rural or secondary routes were 37% more likely to be involved in severe crashes in comparison with drivers who were driving on interstate highways. This could be because most rural roads are narrow, two-way, two-lane undivided roads that do not have enough shoulder width. Secondary roads likewise do not have the same quality as interstate roadways, which might intensify the severity of crashes on rural and secondary roads.
Vehicle type was found to be a significant factor in model 2. Drivers who were driving passenger cars and involved in work zone weather-related crashes were 2.5 times less likely to be involved in severe crashes in comparison with those driving other vehicles.
Crash location was found to be another significant factor in this model. The results showed that drivers who were involved in crashes within a work zone area were 11% less likely to be involved in severe crashes in comparison with drivers who were involved in crashes in the external traffic backup caused by the work zone. This finding is supported by other studies, which noted that lane closure might increase traffic conflicts, as road work mostly requires closing a part of the roadway. This situation might be worse in adverse weather conditions, as low visibility could prevent drivers from performing appropriate maneuvers in critical circumstances.
The model also showed the significant effect of surface conditions on the severity of crashes. Drivers who were involved in work zone crashes were 29% more likely to be involved in severe crashes if the surface was wet in comparison with a dry surface condition. The marginal effect of rear-end crashes (collision type = 1) given in Table 2 indicated that drivers who were involved in rear-end crashes (the most common crashes in work zones) were 18% less likely to be involved in severe crashes in comparison with angle crashes during adverse weather conditions.
Model 3 presents the results of the probit model for drivers who were involved in rear-end and angle crashes, with the presence of traffic control devices and in the absence of daytime lighting conditions. Similar to model 1, sobriety came out to be significant, which means that the crash severity is affected by this factor. The marginal effects also showed that drivers of vehicles equipped with airbags were 13% less likely to be involved in severe crashes, which is consistent with a previous study.
In addition, the results showed that crashes with fewer vehicles involved were 21% less likely to be severe. This could be because of the severity of chain crashes. Asking drivers to keep a safe following distance and avoid tailgating might be helpful. Installing dynamic message signs at work zones could be beneficial for providing advisory messages to drivers.
Finally, model 4 presents the results of the probit model for drivers who were involved in rear-end and angle crashes, with the presence of traffic control devices and in daytime lighting conditions. Vehicle action before the crash, speed limit, weather conditions, and land use came out to be significant factors affecting crash severity in this model. Indeed, vehicles making a left or right turn before a crash were 38% less likely to be involved in severe crashes in comparison with vehicles going straight (category 5 in Table 1).
It was also found that drivers who were driving on roads with posted speed limits of less than 55 mph were 12% less likely to be involved in severe crashes. In addition, drivers who were involved in crashes in rainy weather conditions were 13% more likely to be involved in severe crashes. Finally, drivers who were driving in a rural area were 10% more likely to be involved in severe work zone weather-related crashes in comparison with those driving on urban roadways. This might be because risky behaviors such as not wearing seatbelts are more common among rural drivers.
As mentioned earlier, a conventional probit model with all variables was also developed so that its results could be compared with those of the probit–classification tree models. All 20 variables used for developing the probit–classification tree were considered in developing the conventional probit model as well. A forward stepwise selection method was used to find the best model fit. The results of the probit regression model are given in Table 3.
As can be seen, 11 variables came out to be significant in the conventional probit model. However, some important contributing factors including vehicle age, vehicle type, crash location, airbag, number of vehicles involved, lighting, and weather conditions are not represented in this model. The conventional probit model fit statistics are provided in Table 4.
Receiver operating characteristic (ROC) curves were used to visualize the models’ performance. The predictive capability of the probit–classification tree models and the conventional probit model (base model) is shown in Fig. 2. A ROC curve shows a model’s performance by plotting sensitivity versus 1 − specificity. If a model provides no relevant information or produces biased results, its ROC curve lies close to the diagonal line. The best-performing model, on the other hand, has a curve close to the upper left corner.
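A ROC curve of this kind can be traced by sweeping a probability threshold over the model's predictions. The sketch below is a minimal, library-free illustration for a binary outcome; the labels and scores are made-up values, not data from the study, and the function names are hypothetical. It also computes the area under the curve with the trapezoidal rule, a common one-number summary that the text itself does not report.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per distinct score threshold.

    labels are 0/1 outcomes; at least one positive and one negative are required.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Observations scoring at or above the threshold are predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up example: 1 = severe crash, scores = predicted probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
points = roc_points(labels, scores)
print(points, area_under_curve(points))
```

A curve that follows the diagonal gives an area near 0.5, while a curve hugging the upper left corner gives an area near 1.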
Figure 2 shows that for the training dataset, which contains most of the data, all developed probit–classification tree models performed better than the conventional probit model. It is also worth mentioning that even though only small portions of the dataset were assigned to the validation and testing datasets (in comparison with the training dataset), the probit–classification tree method still performed better than the conventional probit model.