Background

Concerns exist about high rates of interpersonal violence, mental disorders, and suicidality among U.S. Army soldiers [1,2,3,4]. Although intensive preventive interventions have been developed in the civilian population and shown to reduce risk of some of these outcomes, including those involving physical and sexual violence (e.g., [5, 6]), cost-effective implementation of these interventions would require that they be delivered only to soldiers judged to be high-risk. It has been shown that useful risk targeting systems can be developed for these outcomes based on administrative data available for all U.S. Army soldiers using machine learning methods, with the small proportions of soldiers predicted to be at high risk by these systems accounting for substantial proportions of subsequently observed instances of the outcomes [7,8,9,10,11,12,13]. However, many known risk factors for these outcomes are not assessed in Army administrative records, raising the possibility that risk targeting could be improved by expanding the predictor sets to include information from such additional data sources as self-report surveys [13] and social media postings [14].

Given that the risk of many negative outcomes, including involvement in physical and sexual violence [4, 15, 16], is especially high in the early years of Army service, a survey carried out at the beginning of service might be especially useful in providing information that would help increase the accuracy of risk-targeting beyond the accuracy achieved by exclusively using administrative data as predictors. A New Soldier Survey (NSS) of this sort was administered as part of the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS) [17]. As reported previously [13], prediction models derived from NSS data found that the small proportions of new soldiers judged to be at high risk based on NSS predictors accounted for relatively high proportions of attempted suicides, psychiatric hospitalizations, positive drug screens, and several types of violent crime perpetration and victimization. For example, the 10% of new male soldiers estimated in cross-validated models to have highest risk of major physical violence perpetration in the early years of service accounted for 45.8% of actual acts of major physical violence in the sample.

To date, no results have been reported about the extent to which information obtained in the NSS could be used to increase the accuracy of predictions based exclusively on administrative data. Such increases might be especially large early in the Army career, when administrative data are sparse, with the predictive power of NSS data decreasing relative to that of administrative predictors over time. The current report presents the results of the first attempt to add data from the NSS survey to previously-developed predicted risk scores based on administrative data. We focus on predicting physical and sexual violence perpetration among males and sexual violence victimization among females during the early years of Army service because these are the three outcomes for which separate risk models based on NSS and administrative data were previously developed.

Methods

Sample

The NSS was administered to representative samples of new U.S. Army soldiers beginning Basic Combat Training (BCT) at Fort Benning, GA, Fort Jackson, SC, and Fort Leonard Wood, MO between April 2011 and November 2012. Recruitment began by selecting weekly samples of 200–300 new soldiers at each BCT installation to attend an informed consent presentation within 48 h of reporting for duty. The presenter explained study purposes, confidentiality, and voluntary participation, then answered all attendee questions before seeking written informed consent to give a self-administered computerized questionnaire (SAQ) and neurocognitive tests and to link these data prospectively to the soldier’s administrative records. These study recruitment and consent procedures were approved by the Human Subjects Committees of all Army STARRS collaborating organizations. The 21,790 NSS respondents considered here represent all Regular Army soldiers who completed the SAQ and agreed to administrative data linkage (77.1% response rate). Data were doubly-weighted to adjust for differences in survey responses among the respondents who did versus did not agree to administrative record linkage and differences in administrative data profiles between the latter subsample and the population of all new soldiers. More details on NSS weighting are reported elsewhere [18]. The sample size decreased with duration both because of attrition and because of variation in time between survey and end of the follow-up period. The sample included 18,838 men (decreasing to 16,479 by 12 months, 15,306 by 24 months, and 3729 by 36 months) and 2952 women (decreasing to 2300 by 12 months, 2094 by 24 months, and 687 by 36 months).

Measures

Outcomes

Outcome data were abstracted from Department of Defense criminal justice databases through December 2014 (25–44 follow-up months after NSS completion). Dependent variables were defined as first occurrences of each of the three outcomes for which predictive models had previously been developed from both administrative data and NSS data: major physical violence (i.e., murder-manslaughter, kidnapping, aggravated arson, aggravated violence, or robbery) perpetration by men, sexual violence perpetration by men, and sexual violence victimization of women, each coded according to the Bureau of Justice Statistics National Corrections Reporting Program classification system [19]. The perpetration outcomes were defined from records of “founded” offenses (i.e., where the Army found sufficient evidence to warrant full investigation). The victimization outcome was defined using any officially reported victimization regardless of evidence.

Predictors

As reported in previous publications, separate composite risk scores for each outcome were developed based on models from either the STARRS Historical Administrative Data System (HADS) [8, 9, 12] or the NSS [13]. The details of building the models that generated these scores are reported in the original papers and will not be repeated here other than to say that they involved the use of iterative machine learning methods [20] with internal cross-validation to predict the outcomes over a one-month risk horizon in a discrete-time person-month data array [21]. The HADS models were developed using all of the nearly 1 million soldiers on active duty during the years 2004–2009 and were estimated for all years of service rather than only for the first few years of service, whereas the NSS models were developed using the NSS sample. We then applied the coefficients from these models to the data from the soldiers in the present samples to generate composite prediction scores. Thus, each person-month had a single score from each model representing the predicted log odds of the outcome occurring (note that this score changed each month for the HADS models, but remained the same within each person for the NSS models because the NSS was administered only once). Each score was then standardized to a mean of 0 and variance of 1 in the total sample. These composite prediction scores were used as the input in the current analysis. In other words, for each of the models reported here, there were two possible independent variables (plus their transformations and interactions): the standardized log odds of the event occurring according to the HADS model and the standardized log odds of the event occurring according to the NSS model.
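The standardization step can be illustrated with a minimal sketch. The function name, the variable names, and the toy log-odds values are hypothetical, and the sketch ignores the survey weights that the study applied, so it should be read as an illustration of the rescaling only, not as the procedure actually used.

```python
from statistics import fmean, pstdev

def standardize(log_odds):
    """Rescale predicted log odds to mean 0 and variance 1 (population SD)."""
    mu = fmean(log_odds)
    sd = pstdev(log_odds)
    if sd == 0:
        raise ValueError("scores have zero variance and cannot be standardized")
    return [(x - mu) / sd for x in log_odds]

# Hypothetical person-month log odds from one model (illustrative values only).
hads_log_odds = [-7.2, -6.8, -6.5, -7.9, -5.9, -6.1]
hads_z = standardize(hads_log_odds)
```

After this rescaling, a one-unit change in either score corresponds to one standard deviation of that score, which is what makes the odds ratios for the two scores directly comparable.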

The potential predictors selected for inclusion in the iterative model-building process for the HADS and NSS models operationalized 8 classes of variables found in prior studies to predict the outcomes: socio-demographics (e.g., age, sex, race-ethnicity), mental disorders (self-reported Diagnostic and Statistical Manual of Mental Disorders, 4th edition [DSM-IV] disorders in the NSS and medically recorded International Classification of Diseases [ICD] disorders in the HADS models), suicidality/non-suicidal self-injury (self-reported in the NSS and medically recorded in the HADS models), exposure to stressors (assessed in detail in the NSS models with questions about childhood adversities, other lifetime traumatic stressors, and past-year stressful life events and difficulties; assessed in the HADS models with a small number of available markers of financial, legal, and marital problems, information about deployment and stressful career experiences, and military criminal justice records of prior experiences with crime perpetration and victimization), military career information (for new soldiers, Armed Forces Qualification Test [AFQT] scores; physical profile system [PULHES] scores used to indicate medical, physical, or psychiatric limitations; enlistment military occupational specialty classifications; and a series of indicators of enlistment waivers; and for the HADS models, increasing information over the follow-up period about promotions, demotions, deployments, and other career experiences), personality (only in the NSS models), and social networks (only in the NSS models). Results of performance-based neurocognitive tests administered in conjunction with the NSS were also included in the NSS models [22]. More detailed descriptions of the HADS and NSS predictors, the final form of each model (i.e., the variables that were ultimately selected for inclusion by the algorithms), and predictive performance are presented in the original reports [8, 9, 12, 13].

Analysis methods

Analysis was carried out remotely by Harvard Medical School analysts on the secure University of Michigan Army STARRS Data Coordination Center server. Given that respondents differed in number of months of follow-up, we began by inspecting observed outcome distributions by calculating survival curves using the actuarial method [23] implemented in SAS PROC LIFETEST [24]. We projected morbid risk to 36 months even though some new soldiers were followed for as long as 44 months because the number followed beyond 36 months was too small for stable projection. Discrete-time survival analysis with person-month as the unit of analysis and a logistic link function [21] was then used to estimate a series of nested prediction models for first occurrence of each outcome. Models were estimated using SAS PROC LOGISTIC [24].
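For readers unfamiliar with the actuarial (life-table) method, the following sketch shows the core calculation under the usual assumption that soldiers censored within an interval are at risk for half of it on average. The function name, the interval widths, and the counts are hypothetical; the study itself used SAS PROC LIFETEST with weighted data, which this unweighted sketch does not reproduce.

```python
def actuarial_survival(intervals):
    """Life-table (actuarial) survival estimates.

    intervals: ordered list of (n_entering, n_events, n_censored) per interval.
    Returns the cumulative survival probability at the end of each interval.
    """
    survival = []
    s = 1.0
    for n_entering, n_events, n_censored in intervals:
        # Censored subjects are assumed at risk for half the interval on average.
        n_effective = n_entering - n_censored / 2.0
        q = n_events / n_effective if n_effective > 0 else 0.0
        s *= 1.0 - q  # multiply by the conditional probability of surviving
        survival.append(s)
    return survival

# Hypothetical counts for three consecutive intervals (not the study's data).
toy = [(1000, 4, 100), (896, 5, 96), (795, 3, 600)]
curve = actuarial_survival(toy)
morbid_risk_per_1000 = 1000 * (1 - curve[-1])
```

Morbid risk at the end of follow-up is one minus the final survival probability, which is why it can be projected to a fixed horizon even when individual follow-up times differ.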

The model-testing process involved two steps: first, determining the best model using the HADS risk score only, and then finding the optimal strategy for combining NSS data with the best model from the first step. Specifically, we began with a model including only the composite predicted risk score based on the HADS (expressed as a predicted log odds standardized to have a mean of 0 and a variance of 1), controlling (as in all subsequent models) for time in service; we then estimated models including a quadratic effect of HADS risk score, an interaction of the risk score with time, and their combination. In the second step, we tested the effect of adding the NSS composite predicted risk score to the best HADS model, followed by combinations of a quadratic NSS term, an interaction of NSS score with HADS score, and an interaction of NSS risk score with historical time. Importantly, whereas the values of the NSS composite risk score did not change with time in service because the NSS was administered only once, the values of the HADS composite risk score did change due to the addition of new administrative data each month. We tested the significance of interactions between the composite risk scores and time in service to evaluate the assumption that the HADS composite risk score might become more important over time and the NSS composite risk score less important. Design-based Wald χ² tests based on the Taylor series method [25] were used to select the best-fitting model for each outcome. This method took into consideration the weighting and clustering of the NSS data in calculating significance tests.
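The family of nested models described in this paragraph can be summarized in a single expression. The notation is introduced here for illustration only and does not appear in the original reports: H_it is the standardized HADS score for soldier i in month t, N_i is the standardized NSS score, α(t) is the control for time in service, and g(t) is whatever coding of time is used in the interaction terms.

```latex
\[
\operatorname{logit}\,\Pr\bigl(Y_{it}=1 \mid Y_{is}=0 \ \text{for all } s<t\bigr)
  = \alpha(t)
  + \beta_1 H_{it} + \beta_2 H_{it}^{2} + \beta_3 H_{it}\,g(t)
  + \gamma_1 N_i + \gamma_2 N_i^{2} + \gamma_3 N_i H_{it} + \gamma_4 N_i\,g(t)
\]
```

Under this notation, the base HADS model sets every coefficient except α(t) and β₁ to zero, the first step tests β₂ and β₃, and the second step adds γ₁ and then tests γ₂, γ₃, and γ₄.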

Once the best-fitting model for each outcome was selected, we exponentiated the logistic regression coefficients and their design-based standard errors for that model to create odds-ratios (ORs) and 95% confidence intervals (95% CIs). We then divided the sample into 20 separate groups (ventiles), each representing 5% of respondents ranked in terms of their risk scores in the best-fitting models, and calculated concentration of risk for each ventile: the proportions of observed cases of the outcome in each ventile. If the models were strong predictors, we would expect high concentration of risk in the upper ventiles. Concentration of risk was calculated and compared not only for the best-fitting models but also for the HADS-only models to determine the improvement in prediction strength achieved by adding information from the NSS rather than relying exclusively on HADS risk scores. We also calculated concentration of risk for the NSS-only models for comparative purposes. Finally, we calculated positive predictive value: the proportion of soldiers in each ventile that had the outcome over the follow-up period. As with morbid risk, positive predictive value was projected to 36 months using the actuarial method to adjust for the fact that the follow-up period varied across soldiers.
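Concentration of risk by ventile can be sketched as follows. The function name and the toy data are hypothetical, and the sketch ranks unweighted persons by a single score, whereas the study used weighted data; it is meant only to make the definition concrete.

```python
def concentration_of_risk(scores, events, n_groups=20):
    """Share of observed cases falling in each risk group, highest risk first.

    scores: predicted risk per person; events: 1 if the outcome occurred, else 0.
    """
    total_cases = sum(events)
    if total_cases == 0:
        raise ValueError("no observed cases")
    # Rank people from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    shares = []
    for g in range(n_groups):
        # Integer boundaries give groups of (nearly) equal size.
        lo, hi = g * n // n_groups, (g + 1) * n // n_groups
        cases = sum(events[i] for i in order[lo:hi])
        shares.append(cases / total_cases)
    return shares

# Hypothetical toy data: 100 people, cases concentrated among the highest scores.
scores = [i / 100 for i in range(100)]
events = [1 if i in (99, 98, 97, 90, 50) else 0 for i in range(100)]
shares = concentration_of_risk(scores, events)
top_ventile_share = shares[0]
```

A score with no predictive value would place about 5% of cases in every ventile, so the share in the top ventile relative to 5% is a direct measure of how useful the score is for targeting.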

Results

Outcome distributions

A total of 186 male NSS respondents were accused of founded major physical violence perpetration and 132 of sexual violence perpetration by the end of the follow-up period, and 135 female NSS respondents reported sexual violence victimization over the same time period. These numbers correspond to incidence estimates per 1000 person-years of 4.5 for male physical violence perpetration, 3.1 for male sexual violence perpetration, and 19.5 for female sexual violence victimization. 36-month morbid risks per 1000 soldiers are 10.8 for male physical violence perpetration, 7.7 for male sexual violence perpetration, and 42.6 for female sexual violence victimization (computed using the actuarial method [23] implemented in SAS PROC LIFETEST [24]). Survival curves show that all outcomes were much less likely to occur in the first months of service, when new soldiers are in training, than later in the follow-up period (Fig. 1). Median (interquartile range) months-to-occurrence were 20 (13–25) for male physical violence perpetration, 14 (7–22) for male sexual violence perpetration, and 9 (6–15) for female sexual violence victimization.

Fig. 1 Survival curves for the outcomes over the 36-month follow-up period (n = 18,838 men and 2952 women)

Correlations between predictions based on the separate NSS and HADS models

The correlations between HADS and NSS composite risk scores varied over time because of monthly changes in the administrative variables used to generate the HADS predictions. Median (interquartile range) within-month Pearson correlations between the two scores were .36 (.34–.38) for major physical violence perpetration among men, .06 (.05–.07) for sexual violence perpetration among men, and .26 (.24–.27) for sexual violence victimization among women (Table 1). The magnitudes of the associations between the two composite risk scores decreased over time for physical violence perpetration and sexual violence victimization, with Pearson correlations of −.78 and −.84 between number of months in service and magnitude of the within-month association between the two scores. The associations increased over time, in comparison, for sexual violence perpetration, r = .84 (Fig. 2).

Table 1 Pearson correlations between composite risk scores based on the HADS and the NSS by month in the NSS sample (n = 18,838 men and 2952 women)a

Fig. 2 Pearson correlations between composite risk scores based on the HADS and the NSS by month in the NSS sample (n = 18,838 men and 2952 women)

Relative fit of the base models and extensions

In the first analytic step, none of the expansions of the base HADS models for nonlinearities or interactions improved model fit in predicting either physical violence perpetration among men or sexual violence victimization among women. However, the addition of the NSS risk score improved model fit in both cases. We consequently focused on the additive model (i.e., HADS plus NSS composite risk score) for these outcomes. For sexual violence perpetration among men, however, model fit was improved by inclusion of the interaction between the NSS composite risk score and time since survey administration (χ² = 6.8, df = 2, p = .034) relative to the additive model (Table 2). (See Additional file 1: Table S1, for odds ratios and chi-square values for all models tested.)
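As a plausibility check on the reported interaction test, and assuming the subscript on the reported statistic denotes 2 degrees of freedom, the upper-tail probability of a chi-square variable with 2 degrees of freedom has the closed form exp(−x/2). The helper name below is hypothetical, and the check ignores any design-based adjustment applied in the original analysis.

```python
import math

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Reported test statistic for the NSS-by-time interaction.
p_value = chi2_sf_2df(6.8)
```

The result is approximately .033, which agrees with the reported p = .034 once rounding of the test statistic to one decimal place is taken into account.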

Table 2 Model fit statistics and model comparison tests (n = 18,838 men and 2952 women)a,b

Coefficients in the best-fitting models

Inspection of the odds ratios (ORs) of univariate models with either the NSS or HADS composite risk scores as the only predictors shows that each score is associated with significantly increased odds of each outcome, with ORs relatively comparable in magnitude for NSS (OR = 1.9–2.1) and HADS (OR = 1.5–2.5). Due to their collinearity, individual predictors’ ORs are lower but remain significant in the two additive models that include both composite risk scores as predictors (HADS OR = 2.1 for physical violence perpetration and 1.3 for sexual violence victimization; NSS OR = 1.6 for physical violence perpetration and 1.8 for sexual violence victimization). In the model for sexual violence perpetration, the HADS composite risk score is significant (OR = 1.4) and stable over the follow-up period, whereas the NSS composite risk score is a significant predictor in the first 12 months of service (OR = 2.3), decreases but remains significant during the second year of service (months 13–24; OR = 1.7), and becomes nonsignificant beyond the second year of service (months 25+; OR = 1.3) (Table 3).

Table 3 Odds ratios for univariate and best-fitting models (n = 18,838 men and 2952 women)a,b

Concentration of risk and positive predictive value in the best-fitting models

Concentration of risk was strongly elevated compared to the 5% of observed cases expected by chance in the top 3 predicted risk ventiles of all three best-fitting models (Fig. 3). 39.5% of observed physical violence perpetration, 26.1% of sexual violence perpetration, and 29.4% of sexual violence victimization occurred among the 5% of soldiers in the top risk ventiles for those outcomes (Table 4). Between 49.8% (sexual violence victimization) and 56.3% (physical violence perpetration) of observed cases of the outcomes occurred among the 15% of soldiers in the top three risk ventiles.

Fig. 3 Concentration of risk by ventiles for best model of each outcome

Table 4 Performance of univariate and best-fitting models (n = 18,838 men and 2952 women)a

These proportions were for the most part meaningfully higher than those achieved by using only the HADS predicted risk score (Table 4). For example, the 39.5% concentration of risk of physical violence perpetration among soldiers in the top risk ventile of the best-fitting model was proportionally 16.6% greater than the 33.9% concentration of risk among soldiers in the top risk ventile of the model based only on the HADS predicted risk score (i.e., 39.5/33.9). Three of these 9 proportional improvements (i.e., the top 3 ventiles for each of 3 outcomes) were less than 15% (4.8–11.2%). Four others were 15–30% (16.6–29.6%). The largest two were 45.9% and 67.9%.

Despite the high concentrations of risk in the top predicted risk ventiles, positive predictive value was low even in the highest risk ventiles due to the rarity of the outcomes. In any given month, 3.4/1000 male soldiers in the highest predicted risk ventile of physical violence perpetration were accused of that outcome, 1.5/1000 male soldiers in the highest predicted risk ventile of sexual violence perpetration were accused of that outcome, and 11.5/1000 female soldiers in the highest predicted risk ventile of sexual violence victimization experienced that outcome. However, cumulative positive predictive value projected over the first 36 months of service was considerably higher, between 34.6 and 241.7/1000 soldiers in the highest risk ventile across the outcomes.

Discussion

Prediction of all three outcomes considered here was improved, in some cases substantially so, by adding information from the NSS predicted risk score to information from the HADS predicted risk score. One would expect this improvement to shrink somewhat in out-of-sample performance because the NSS predicted risk score was developed in the same sample to which it was applied. However, a counter-balancing consideration is that incremental prediction accuracy might increase beyond the level found here if an NSS survey became a routine part of Army accession, as the sample available for analysis would then be large enough for disaggregated analyses of individual predictors from both administrative and survey data rather than requiring the use of the composite predicted risk scores we were forced to use here because of the small NSS sample size.

We found unexpectedly that the strength of the NSS predicted risk scores remained stable over the time period of the study for physical violence perpetration and sexual violence victimization. This suggests that the NSS tapped into relatively stable individual differences in risk factors for these two outcomes rather than situational risk factors that became less relevant over time. A review of the most important predictors making up the NSS predicted risk scores showed, consistent with this interpretation, that these variables are dominated by measures of personality, history of pre-enlistment lifetime psychopathology, and history of pre-enlistment lifetime trauma exposures, most notably prior sexual violence victimization among women and prior involvement in violence among men [13]. For sexual violence perpetration, however, the NSS risk score was no longer a significant predictor after the end of the second year in service (i.e., in months 25 and beyond). This could simply be a function of greater uncertainty in the model as time progresses (as the confidence interval for the odds ratio at months 25+ still contains relatively large values), or it could reflect a true decrease in the predictive strength over time. Regardless, the NSS data were valuable for predicting the majority of sexual violence perpetration outcomes, because 83.2% of reported assaults occurred within the first two years of service.

It is less clear why the HADS predicted risk scores did not increase in strength over time, as administrative information about soldiers became richer over time. One plausible interpretation is that the early months of service, when administrative records are sparse, are also characterized by lower prevalence of the outcomes considered here, as new soldiers are more closely supervised during Basic Combat Training (BCT; the first 10–16 weeks in service) and Advanced Individual Training (2–12 months after completion of BCT) so that opportunities for violence are lower. Administrative records become richer after the end of training, at which time prevalence of violence outcomes also increases. The extent to which the temporal consistencies in prediction accuracy continue beyond the early years of service studied here is unclear, although it seems unlikely that the prediction accuracy of survey reports obtained from 18-year-old new soldiers will maintain constant strength over many years in service. This question will be the focus of ongoing analyses of the STARRS data as the NSS cohort ages.

Even though the addition of NSS data improved prediction of all outcomes beyond the HADS predictions, the question remains whether the magnitudes of these improvements are large enough to justify implementing an ongoing NSS for all new soldiers. The answer depends on the number of interventions the Army might want to implement (which could involve many more than the three outcomes considered in this report), the proportional increases in concentration of risk of a composite risk score using the NSS as well as the HADS compared to the HADS alone at the thresholds selected by the Army for intervention implementation, and the costs, benefits, and competing risks of those interventions in relation to the costs of implementing an ongoing NSS. Uncertainties about these values make it impossible to calculate these cost-benefit ratios here, but these are the calculations the Army needs to make if it is interested in using evidence-based standards for targeting preventive interventions for high-risk new soldiers. If so, future research might also consider other data sources that could be added beyond an ongoing NSS to improve prediction accuracy even further over the accuracy of models based exclusively on administrative predictors.

The performance of these models is on par with, or better than, other attempts to use machine learning or more traditional methods to predict risk of crime (e.g., [26, 27]), but the accuracy is nonetheless intermediate in strength. Consequently, using these predictions as a basis for decision-making (whether with or without the NSS predictor) requires that the benefits of the action taken for those accurately classified as high-risk outweigh the costs to those misclassified as high-risk. For instance, it would certainly be beneficial to deliver a reasonably low-cost intervention that does no harm to those to whom it is administered, but has some effect on reducing interpersonal violence, to a group of soldiers identified as high risk using these models. However, classification might not be accurate enough to deliver an intervention with high per-person expense, or one that causes some kind of harm (e.g., stigma, limiting career advancement) to those who are misclassified.

In addition to their implications for informing U.S. Army decision-making regarding data collection and use, these findings may be relevant to other researchers using machine learning methods to predict various outcomes for individual humans. In this study, even an extremely rich passively-collected administrative data set was no substitute for querying individuals directly about psychologically relevant variables. Administrative/institutional data are often abundant and incur relatively low additional cost to collect, so they have formed the typical feature sets used in machine learning algorithms. However, prediction may be considerably improved through the addition of self-report data, especially (1) when an outcome is partly psychologically driven, and consequently, subjective information may be a powerful predictor, and/or (2) when important predictors in administrative data sets may be noisy or inaccurate because they are not fully captured by institutional systems (e.g., health events when no medical care was sought, covert antisocial behaviors). The control conferred when administering self-report questionnaires is an additional advantage; researchers can select the variables they consider to be most essential, and questionnaires can be designed in such a way that data require little processing prior to use in algorithms.

Conclusions

Self-report information can substantially improve prediction of risk for interpersonal violence beyond the information available in administrative databases for Regular Army soldiers early in their careers. The U.S. Army may benefit from ongoing administration of self-administered questionnaires to new soldiers. Other researchers may want to consider collecting self-report data to augment administrative/institutional data sets when developing machine learning algorithms, especially to predict psychologically driven outcomes.