Introduction

In the USA, police officers sometimes face the most unpredictable, traumatic, and violent circumstances of any profession (White et al., 2019). Although much of an officer’s workday consists of brief, unremarkable events, some calls for service can escalate into life-threatening situations. To mitigate the risks adequately, officers must be well informed about the types of risks they might face, the settings that are especially dangerous, and the tactics that can enhance their safety.

Prior literature has focused heavily on how technical advances such as body armor and patrol car design can improve officer safety in the field, especially when coupled with better training (Cunningham et al., 2021). Less attention has been given to the role that 911 professionals (i.e., call-takers and dispatchers) play in enhancing officer safety. 911 professionals are responsible for extracting critical details from 911 callers, assessing the risks, and transmitting such information clearly to responding police units (Gillooly, 2020; 2022; Neusteter et al., 2019; https://www.911.gov/). Consequently, they play a substantial role in shaping officers’ perceptions of risk when responding to calls (Gillooly, 2022; Taylor, 2020).

Call-takers and dispatchers are not immune to fatigue, overload, and questionable judgments that can influence their assessments of risk (Gillooly, 2022; Manning, 1988). In addition, callers will sometimes provide incorrect and confusing information that typically is transmitted in a very short time; there is little opportunity to confirm and double-check key information. Despite these and other challenges, the information conveyed to dispatched police officers will in practice become the raw material for an operational forecast of what might happen at the crime scene when the police arrive. It is the forecast, not a set of loosely connected and incomplete facts, that helps frame how police officers respond. Yet, this forecasting framing has received almost no attention when dispatching is studied.

Forecasts have long been at least implicit in a wide variety of police planning and decision making. The allocation of police assets to different neighborhoods is one common example. Over the past several decades, forecasting in many law enforcement settings has become more data driven, more statistical, and more concerned with statistical and causal inference. COMPSTAT was an early illustration, but currently “AI” in the form of predictive policing and offender risk assessment has made explicit forecasting almost commonplace (Berk, 2021).Footnote 1

The pages ahead draw on these criminal justice forecasting experiences. We offer a demonstration of concept illustrating the possibilities from using machine learning to forecast the risks that police officers face when dispatched in response to a call for service.Footnote 2 Our forecasting algorithm is developed from information provided by 911 calls, officers’ completed offense forms, and a variety of particulars from other sources. We quantify the uncertainty in our forecasts using nested conformal prediction sets, which are a relatively recent statistical development (Gupta et al., 2022).

We hold that useful forecasts of officer risk are feasible with only modest improvements in our approach. Some might argue that our current work already could be useful in practice. We make no claim that our data or procedures are the last word.

Past Research

Risks for Police Officers

Policing in the USA is widely considered a high-risk occupation (Bierie et al., 2016; Bierie, 2017; Crifasi et al., 2016; Mumford et al., 2021; Ricciardelli, 2018). As Brandl and Stroshine (2012: 268) note, police are “responsible for intervening in situations where they may not be invited and where they may be dealing with hostile citizens and suspects.” Policing differs from other occupations “in that injury and death come not just from accidents, but from job performance” (Moskos, 2009: 1). The characterization of policing as a high-risk occupation is also supported by the conclusion that seven of the ten National Institute for Occupational Safety and Health’s risk factors for workplace violence are central to police work (Crifasi et al., 2016; Fridell et al., 2009).

The numbers support most broad claims (Bierie et al., 2016; Bierie, 2017; Brandl & Stroshine, 2003; Hine et al., 2018; Sierra-Arévalo et al., 2022; White et al., 2019). For example, Maguire and colleagues (2002) found that the fatality rate of police officers is nearly three times that of the average US worker. Sierra-Arévalo et al. (2022) found that the rate of gun homicide of police is 1.6 times larger than the US rate, and for nonfatal firearm assaults, the rate is 2.7 times larger. Peek-Asa and colleagues (1997), in their study of nonfatal workplace assault injuries, report that police were 73.1 times more likely to be assaulted at work than the average across other occupational settings. In addition, approximately 10% of police officers are assaulted each year (Bierie, 2017). According to the FBI’s Law Enforcement Officers Killed and Assaulted database, 43,649 officers were assaulted in 2021 and 15,368 of those officers sustained injuries.Footnote 3 Lastly, Sierra-Arévalo et al. (2023) found that the murder of George Floyd in 2020 was associated with a 3-week spike in firearm assaults on police.

But such statistics do not convey a full story. Officer injury data surely are imperfect. Uchida and King (2002) highlight that some agencies may not keep precise counts of officer injuries, which, in turn, compromises the precision of national estimates. Furthermore, injuries are not always reported. For example, injuries sustained during higher status calls for service are more likely to be reported than those sustained during lower status calls for service. There is little to be gained from reporting a bruised hip caused by slipping on an icy sidewalk while ticketing a parking violation.

There are also concerns that the reality of police work can be misconstrued when operationalized primarily by killings and assaults of officers in the line of duty. Such incidents are very rare (Brandl & Stroshine, 2003; Bierie et al., 2016; Hine et al., 2018; Sierra-Arévalo & Nix, 2020), and the very low risk probabilities offer an incomplete accounting. White et al. (2019) point out that, nevertheless, the large volume of police and citizen encounters makes violence against police “a daily event” widely noted in the interpersonal networks of sworn officers. Recruiting, retention, and morale can be adversely affected (Fridell et al., 2009; Kaminski & Sorensen, 1995).

Looking more closely complicates appearances. Although being shot was the leading cause of law enforcement deaths between 2012 and 2022, deaths from COVID-19Footnote 4 and motor vehicle-related incidents, including being struck by a vehicle or being involved in a crash, have consistently been among the leading causes of line-of-duty deaths of law enforcement officers.Footnote 5 Moreover, nonfatal assaults represent the most common type of violence directed at police (Sierra-Arévalo & Nix, 2020), and accidental injuries are the most common job-related hazard (Brandl & Stroshine, 2003). Accidental injuries also can have undesirable consequences. The raw number of police officers on work-related disability can be quite large and create a drain on department staffing and budgets (Brandl & Stroshine, 2002; Bierie, 2017; Fridell et al., 2009; Kaminski & Sorensen, 1995). In short, although shootings of police garner the most media attention, there is a wide variety of other risks that are far more numerous and can have adverse consequences for police officers and the departments in which they serve.

The Importance of 911 Professionals

In the USA and elsewhere, law enforcement agencies have made efforts to improve officer safety. These efforts include investments in body armor (e.g., bulletproof vests, shields, helmets), technology (e.g., electronic control weapons and conducted energy devices), and training (e.g., tactical preparedness training), among others (Cunningham et al., 2021). One domain that has not received the same attention is dispatch. Dispatch allocates police units to calls for service and transmits information about the incident to the responding units (Gillooly, 2020). “When dispatched to a distal call, an officer’s initial understanding of the incident will be formed almost entirely by the information received from dispatch” (Taylor, 2020: 315).Footnote 6

Some scholars characterize 911 professionals as gatekeepers (Lum et al., 2020), in part because call-takers and dispatchers filter out calls that may not require a police response. Lum and colleagues (2020) observed that approximately 50% of calls were resolved without a police response. Gillooly (2020) appends the characterization of “risk appraisers” to the job description. As a practical matter, 911 professionals determine the priority level of each call and the number of units that will initially respond (Gillooly, 2020). Their summary of risk then informs how quickly a unit arrives at the scene, whether the unit should wait for back-up, and how the unit should prepare tactically, among other considerations.

Gillooly (2022: 766) also stresses that 911 professionals are not “neutral conduits” that simply transfer information to responding units. Call-takers and dispatchers interpret information provided by callers, which inherently introduces personal judgments (Gillooly, 2022; Manning, 1988). In her study of how call-takers appraise risk and classify calls, Gillooly (2022) found that different call-takers will often classify similar calls differently. Accuracy can suffer as well. Call-takers and dispatchers tend to overestimate the severity of calls, with some call centers having between 20 and 40% of all crime calls answered by call-takers downgraded once police are at the scene (Gillooly, 2020; 2022).

And it matters. Call-takers can exert substantial influence over police perceptions of the calls to which they are responding (Gillooly, 2022: 780). For mental health and public assault calls, police officers were much more likely to classify the incidents as high priority when the call-taker initially classified the incident as high priority. In his study examining dispatch priming and the decision to use deadly force, Taylor (2020) found that when officers were told earlier that a potential perpetrator appeared to be talking on a cell phone and that individual subsequently produced the cell phone during the encounter, 6% of officers made a shooting error. Conversely, when officers were told earlier that a potential perpetrator might be holding a gun, and that individual subsequently produced a cell phone during the encounter, 62% of the officers made a shooting error. Yet, in a replication of Taylor’s (2020) study, Potts et al. (2022) used a realistic virtual reality scenario to test the effects of dispatch priming and found no overall effect of dispatch priming on the responder’s likelihood of firing a gun. In short, the precise nature and size of call-taker influence on police perceptions are unresolved.

In summary, 911 professionals provide consequential links connecting calls for service to police responses. Some communication errors are inevitable. Some overestimates and underestimates of risk are inevitable as well. Thus, Bierie (2017) has emphasized the potential in “risk assessment tools for police.”

We agree. To set the stage, we see this exercise as a demonstration of concept. Could dispatch data be employed to anticipate especially risky situations for police officers when they respond to 911 calls? We use the Camden County, New Jersey Police Department as a study site. Insofar as our forecasting tools work well with the data available, we hope that at other sites, better data might be collected, 911 professionals might be trained to improve risk forecasts, and the dangerous work undertaken by police officers can be made safer.

Data Collection and Preparation

Our research team partnered with the Camden County Police Department (CCPD) to evaluate the promise of machine learning forecasts of police officer risks for improving the information extracted from 911 calls and the quality of subsequent dispatches. The CCPD is the primary provider of law enforcement services to the City of Camden, New Jersey. Camden is located in southern New Jersey directly across the Delaware River from Philadelphia, Pennsylvania, and has an estimated population of 73,562. Camden has historically had one of the highest homicide rates in the country, with 87 murders per 100,000 residents in 2012. In recent years, the homicide rate has fallen significantly, a change that is often attributed to the adoption of “community policing” in 2013.

CCPD’s primary response mechanism is its neighborhood response team mobilized units. According to CCPD’s website, these units “serve as the primary tiered responders to emergency calls for service and perform neighborhood directed patrols in alignment with the daily resource deployment plan.”Footnote 7 The department also uses a call prioritization system in which the closest unit is automatically dispatched to emergency calls. Units are dispatched through the Camden County Communications Center’s Police Central, which is responsible for the dispatching of police from 27 municipalities and has designated dispatchers for the city of Camden.Footnote 8

The CCPD provided us with data on every 911 call for service from January 1, 2015, until December 31, 2019. During this time, the CCPD averaged 105,000 calls for service every year or about 290 calls for service every day.Footnote 9 However, any interest in forecasting the risks for police officers anticipated by these calls was complicated by the relative lack of calls from incidents that actually placed police officers in harm’s way. Armed robberies in progress, for example, are known to be dangerous for police officers, but are typically quite rare. Most calls for service do not encode such risks. Nevertheless, progress can be made, as we hope to show below.

The unit of analysis for this forecasting enterprise is the call for service.Footnote 10 We view these calls as exchangeable. The calls are seen as generated by the social, psychological, and other factors responsible for 911 calls such that the order in which the calls are received does not affect the probability distribution of those calls. This will suffice to justify the nested conformal prediction sets we later construct to convey forecasting uncertainty, and in any case is probably a realistic assumption.Footnote 11
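Stated formally, and as a standard textbook definition rather than anything specific to these data, let \(Z_1, \dots, Z_n\) denote the calls, where \(Z_i\) is notation introduced here for the predictors and outcome of call \(i\). The calls are exchangeable when their joint distribution is unchanged by any reordering:

\[
(Z_1, \dots, Z_n) \overset{d}{=} (Z_{\pi(1)}, \dots, Z_{\pi(n)}) \quad \text{for every permutation } \pi \text{ of } \{1, \dots, n\}.
\]

Independent and identically distributed calls satisfy this condition, but the condition itself is weaker than i.i.d.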

Our binary response variable is whether a dispatch leads to a high risk rather than a low risk encounter, both defined in some detail shortly. For our CCPD data, the proportion of incidents overall putting police officers in substantial danger is less than 1% because the base number of calls overall is so large (i.e., 309,490). Statistically, the several thousand calls conveying significant risk were relatively rare events. It follows that our binary response variable is very badly unbalanced.Footnote 12

In some preliminary analyses, the forecasts of risk for police officers using all the available data were very accurate and demonstrably pointless. Using no predictors at all, a forecast of low risk would be correct over 99% of the time employing information solely from the Bernoulli distribution of the response variable. Very high accuracy from the marginal distribution of a binary response variable by itself typically makes moot information available from promising predictors. Simply put, the vast majority of dispatches from 911 calls in our data posed virtually no risk to responding officers, and forecasting low risk automatically for all new calls as they were received would almost certainly be right most of the time. Nevertheless, the relatively small number of dispatches that posed a danger to responding officers remained a legitimate concern. Police officer injuries and deaths could result. And even for incidents in which the dangers were safely managed, emotional stress would likely be significantly elevated. For police officers, these were low probability, high cost incidents. We were tasked with searching for the proverbial needle in a haystack.Footnote 13
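The arithmetic behind this point can be sketched in a few lines. The function name is hypothetical, and the count of 3,000 high risk calls is an illustrative placeholder consistent with "several thousand" and "less than 1%," not a figure reported in the text.

```python
def majority_class_accuracy(n_total: int, n_high_risk: int) -> float:
    """Accuracy obtained by always forecasting the more common class."""
    n_low_risk = n_total - n_high_risk
    return max(n_low_risk, n_high_risk) / n_total


N_CALLS = 309_490            # total calls, as reported in the text
N_HIGH_RISK_ASSUMED = 3_000  # ASSUMPTION: placeholder, not a reported count

baseline = majority_class_accuracy(N_CALLS, N_HIGH_RISK_ASSUMED)
print(f"accuracy of always forecasting low risk: {baseline:.4f}")
```

Any classifier has to be judged against this no-predictor baseline, which is why overall accuracy is a poor yardstick when the response is this unbalanced.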

It is routine practice when using the statistical tools that we describe shortly to subset the data into random disjoint subsets, usually labeled training data and testing (or test) data (Berk, 2018). The training data are used to construct a forecasting algorithm. The testing data are used to evaluate the algorithm in an “honest” manner, uncontaminated by how the forecasting algorithm was built. There were 232,118 calls in the training data and 77,372 calls in the testing data, representing 75% and 25% of the data, respectively.Footnote 14
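A random disjoint split of this kind can be sketched as follows. This is an illustration only: the function name, the seed, and the choice to round the training size up are assumptions, and the code actually used for the analysis is not reproduced here.

```python
import math
import random


def split_indices(n_cases, train_fraction=0.75, seed=0):
    """Randomly partition case indices into disjoint training and testing sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    # ASSUMPTION: round the training size up when the fraction is not a whole number.
    n_train = math.ceil(train_fraction * n_cases)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(309_490)
print(len(train_idx), len(test_idx))
```

With 309,490 calls and a 75% training fraction, rounding up gives the 232,118 and 77,372 cases reported above.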

Addressing the highly unbalanced response variable led to additional data splits. We further subdivided the training and testing data by broad categories for different kinds of calls for service and then examined each subset separately. We were seeking crime types with less unbalanced response variables. In effect, we sought to make the haystack smaller. For example, one such subset was calls stating or implying the use or presence of firearms. In the training data, there were 1928 such calls or about 4.5 calls per day. Of these, 278 resulted in weapons-related criminal charges. Although knowing that these calls resulted in weapons-related charges does not indicate that the specific calls were truly high-risk when they occurred, we can reasonably assume that at least a portion of these 278 calls required police officers to apprehend an armed offender at some substantial risk to themselves.Footnote 15

The resulting response variable was far less unbalanced. 86% of the calls, not 99%, were low risk. An illustrative very low risk incident had the alleged offender leaving the scene before the police arrived. In short, the response variable was still unbalanced but within a range that often allows for performance gains from available predictors.

We generated 81 predictor variables conceivably known to a dispatcher, from a call for service and other information that could be rapidly and routinely accessed. We were careful to exclude predictors that could only be known after officers arrived at the scene of the 911 call because such information could not be known when a dispatch was made. These 81 variables fell into eight categories as follows.

  • Crime type included in the dispatch

  • Date and time: We included information about the hour, day of week, month, and year in which the incident took place, which allows the algorithm to identify some important temporal patterns

  • Initiation type: Whether a community member or an officer initiated the call for service

  • Weather conditions: We included several variables related to the weather conditions on the day that the incident took place, including falling snow, rain, temperature, and the presence of fog

  • Local trends: Local patterns of activity may be important predictors of how a particular incident will unfold, so we included counts of arrests and calls for service for any reason within the past 30, 90, and 180 days in the same police sector (a sector is a spatial unit used by the CCPD roughly comparable to a neighborhood). All counts are generated one day prior to the incident in question, which means that the counts include only past events for any given call for serviceFootnote 16

  • Local timing details: We included for each police sector the number of days since the last injury, arrest, or call for service for any reason at the address because repeated calls to the same address may be a risk factor. All counts were generated one day prior to the incident in question, which means that the counts include only past events for any given call for service (see also footnote 16)

  • Census tract information: We included aggregated information on housing vacancies, employment, population density, and race for the census tract in which the incident took place, using data from the 2010 Census. Neighborhood characteristics raise important questions about fairness but may also be important predictors of call outcomes. We do not know, however, how census tracts overlap with the spatial units used by the CCPD, which makes them additionally problematic for our analyses.
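The local trend counts in the list above can be illustrated with a short sketch. It assumes each call is recorded as a (sector, date) pair and counts earlier calls from the same sector in a look-back window that ends the day before the call, so no same-day or future information leaks into a predictor. The function name and data layout are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date, timedelta


def prior_call_counts(calls, windows=(30, 90, 180)):
    """For each (sector, day) call, count calls from the same sector in the
    `w` days ending the day before the call, for each window length `w`."""
    days_by_sector = defaultdict(list)
    for sector, day in calls:
        days_by_sector[sector].append(day)
    for days in days_by_sector.values():
        days.sort()

    results = []
    for sector, day in calls:
        days = days_by_sector[sector]
        end = bisect_left(days, day)  # index of first call on or after `day`
        counts = {}
        for w in windows:
            start = bisect_left(days, day - timedelta(days=w))
            counts[w] = end - start   # calls dated in [day - w, day - 1]
        results.append(counts)
    return results


# Made-up example calls, for illustration only.
calls = [
    ("A", date(2019, 1, 1)),
    ("A", date(2019, 1, 20)),
    ("A", date(2019, 3, 1)),
    ("B", date(2019, 3, 1)),
    ("A", date(2019, 3, 1)),
]
results = prior_call_counts(calls)
print(results[2])
```

The two sector A calls on March 1 do not count each other, which mirrors the rule that all counts are generated one day prior to the incident.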

The binary response variable was constructed using several indicators of risk to police officers, some taken from the dispatch information and some from officer-provided offense forms.Footnote 17 Risk included officer injuries, incidents in which the suspect(s) eluded or resisted arrest, in which the suspect(s) possessed a weapon(s), and all crimes “in progress” when the call for service was received. Incidents in which the suspect(s) eluded arrest were defined as dangerous because they often involved a chase on foot. A crime in progress was defined as dangerous because police officers likely would drive at high speeds to reach the incident location and because perpetrators were more likely to still be actively engaged in their criminal behavior. “High risk” was coded “1,” and “low risk” was coded “0.”

It was apparent that many of the predictors were highly correlated and challenging to interpret in a “held constant” framework. To illustrate, for a given police sector, what would one make of the relationship between the risk for police officers and the number of past calls for service over the preceding 180 days holding constant the number of past calls over the preceding 90 days and also the preceding 30 days? Using methods described in the next section, we were able to reduce the number of predictors to seven with no meaningful reduction in forecasting performance. The seven predictors were (1) Fridays, (2) Saturdays, (3) evening hours, (4) the month of June, (5) the month of July, (6) the month of August, and (7) the number of 911 calls from the particular police sector over the 30 days prior to a given call. The locale was the same as the spatial origination of the call. Our variable selection was undertaken by applying stochastic gradient boosting to the training data as described immediately below and by removing predictors that were not “important” for the fit. Importance was measured by contributions to reductions in the deviance. Given the bias likely introduced by variable selection (Berk et al., 2013), all results reported below are computed as needed from the testing data, not the training data. Post-model-selection bias from empirically chosen predictors is not carried forward in testing data (Berk et al., 2013).

Statistical Methods for Forecasting Risk

We begin this section with the most relevant meta-issues. In mathematical statistics and common statistical applications, a model is meant to represent literally how the data were generated (Freedman, 2009). Conventional linear regression is a popular example. Developing a model for this paper was ruled out a priori because there is now ample evidence and formal mathematics demonstrating that machine learning algorithms in the social and biomedical sciences typically will forecast at least as well as models, and usually better (Berk & Bleich, 2013; Berk, 2018). Moreover, models typically require, before the analysis begins, far greater subject-matter knowledge than algorithms require. In particular, models depend on a pre-specified structure (e.g., linear and additive in the predictors), whereas most forecasting algorithms are non-parametric and can respond adaptively to linear as well as nonlinear associations found in the data (Berk et al., 2023). As Kearns and Roth emphasize in their book The Ethical Algorithm (2020: 4), “An algorithm is nothing more than a very precisely specified series of instructions for performing some concrete task.” Consequently, misspecification is not even defined. This can serve forecasting skill very well, but usually makes explanation a secondary consideration.

A valid data analysis requires “training data” to develop a forecasting procedure and separate “testing data” to properly evaluate its performance. The data from which forecasts subsequently are constructed must be generated in the same manner as the data used to build and assess the forecasting tool. All such observations should be realized independently and identically from the same distribution (i.e., i.i.d.). Exchangeable observations can be a proper fallback position. Note that testing data are well known to be essential in algorithmic forecasting for obtaining valid measures of forecasting uncertainty, protecting against potential cherry-picked results, and precluding overfitting (Hastie et al., 2016).

If the outcome to be forecasted is categorical, the forecasting procedure should be a classifier. We applied stochastic gradient boosting (Friedman, 2001) as a classifier because it can be easily tuned to capture relatively rare events.Footnote 18 Stochastic gradient boosting is a form of supervised machine learning.

We applied an excellent implementation of stochastic gradient boosting, gbm, available in the R programming language. Stochastic gradient boosting is among the best performing classifiers readily available and will yield asymptotically unbiased forecasts conditional on the predictors available and the values of tuning parameters. Interaction depth was set at 5, because deep classification trees were needed to find the rare, high risk events. Minimum node size was fixed at 1 consistent with the recommendations of Wyner et al. (2015), which helps our classifier perform like an interpolator that, in turn, can formally determine how superior performance may be achieved (Liang & Recht, 2021). Because the number of predictors was small, the fit stopped improving in a meaningful way at about 50 iterations.Footnote 19

Finally, the consequences of false negatives and false positives are not the same in our forecasting application. Their differential costs were incorporated into the analysis by weighting the data. In particular, the costs of having responding police officers surprised by unexpected risk were seen by stakeholders as substantially worse than having officers over-prepared.Footnote 20 Provisionally, the costs were set at 10 to 1; being under-prepared was seen as about 10 times worse than being over-prepared. With a forecast of high risk labeled a “positive,” and low risk labeled a “negative,” the weighting meant that the boosting algorithm would work substantially harder to avoid false negatives than to avoid false positives. This was precisely the intent.Footnote 21 A cost ratio is a subject-matter decision, not a tuning parameter. Different cost ratios might be appropriate depending on the settings and stakeholders.Footnote 22
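One way to picture the weighting is sketched below, under the assumption that each training case simply carries a weight equal to the cost of misclassifying it. The function names are hypothetical, and the sketch is not the weighting code used with the boosting software.

```python
def case_weights(labels, cost_ratio=10.0):
    """Weight high risk cases (label 1) by the cost ratio and low risk cases (label 0) by 1."""
    return [cost_ratio if y == 1 else 1.0 for y in labels]


def weighted_error_cost(labels, predictions, cost_ratio=10.0):
    """Total cost of classification errors: false negatives cost `cost_ratio`,
    false positives cost 1."""
    weights = case_weights(labels, cost_ratio)
    return sum(w for y, p, w in zip(labels, predictions, weights) if y != p)


# Made-up example: one false negative and two false positives.
labels = [1, 0, 0, 1]
predictions = [0, 1, 1, 1]
print(weighted_error_cost(labels, predictions))
```

Under a 10 to 1 ratio, the single missed high risk call costs five times as much as the two false alarms combined, so a procedure minimizing weighted error will accept many false positives to avoid one false negative.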

Results

Stochastic gradient boosting was applied to each dataset for each broad crime category whose response variable imbalance was not an insurmountable obstacle. In the interest of space, we focus on weapons-related dispatches and consider other kinds of dispatches only in passing.

Table 1 is a standard confusion table constructed using the testing data. It is a cross-tabulation of the outcome class labels in the testing data and the outcome classes determined by the trained risk algorithm applied to the testing data. One can see that the target cost ratio of 10 to 1 is very well approximated (\(524/53 = 9.88\)) despite being a product of the testing data.Footnote 23

Table 1 Confusion table for weapons dispatches (labeled outcome classes high risk or low risk refer to outcome classes in the data; predicted outcome classes high risk or low risk are outcome classes determined by the algorithm)

An apparent problem is that there are nearly 5 times more false positives than true positives (i.e., \(524/109 = 4.81\)). However, the large number of false positives (i.e., 524) follows directly from the 10 to 1 cost ratio. When a classifier is working especially hard to avoid false negatives, the mathematical tradeoff encourages false positives, which are far less costly. Should stakeholders choose to make the cost ratio more balanced, there would be fewer false positives and more false negatives. This tradeoff depends on a policy choice and can easily be changed.

The cost ratio leads as well to rather different misclassification rates for classes labeled in the data as low risk or high risk calls. Sixty-four percent of the calls labeled low risk in the data are misclassified as high risk calls. Thirty-three percent of the calls labeled as high risk in the data are misclassified as low risk calls. The good news is that about two-thirds of the calls labeled high risk were correctly classified. More good news, also a product of the 10 to 1 cost ratio, is that although the confusion table forecasting error for the high risk class is 82% (i.e., 524/633), the confusion table forecasting error for low risk dispatches is only 16% (i.e., 53/339).Footnote 24 With the algorithm working so hard to avoid false negatives, it should be no surprise that there are only 53 of them. Forecasts of low risk from the confusion table can, on this analysis, be quite credible.
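The rates discussed above can be recomputed from the counts given in the text. In the sketch below, the false positive, false negative, and true positive counts are those reported in the text; the true negative count of 286 is inferred here from the 339 forecasts of low risk and is an assumption rather than a value read from Table 1.

```python
FALSE_POSITIVES = 524  # labeled low risk, classified high risk (from the text)
FALSE_NEGATIVES = 53   # labeled high risk, classified low risk (from the text)
TRUE_POSITIVES = 109   # labeled high risk, classified high risk (from the text)
# ASSUMPTION: inferred as 339 low risk forecasts minus 53 false negatives.
TRUE_NEGATIVES = 339 - FALSE_NEGATIVES

metrics = {
    # Ratio of false positives to false negatives, the empirical cost ratio.
    "achieved_cost_ratio": FALSE_POSITIVES / FALSE_NEGATIVES,
    # Misclassification rates: conditional on the label in the data.
    "low_risk_misclassified": FALSE_POSITIVES / (FALSE_POSITIVES + TRUE_NEGATIVES),
    "high_risk_misclassified": FALSE_NEGATIVES / (FALSE_NEGATIVES + TRUE_POSITIVES),
    # Forecasting errors: conditional on the class the algorithm assigned.
    "high_risk_forecast_error": FALSE_POSITIVES / (FALSE_POSITIVES + TRUE_POSITIVES),
    "low_risk_forecast_error": FALSE_NEGATIVES / (FALSE_NEGATIVES + TRUE_NEGATIVES),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The distinction in the comments matters for practice: misclassification rates condition on what actually happened, whereas forecasting errors condition on what the algorithm said, which is the only information available when a call is dispatched.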

Note that the term “forecasting error” is with respect to outcome classes as labeled in the data on hand, not to outcome classes that are at this point unknown. Those unknown outcome classes to be forecasted are addressed with conformal prediction sets provided shortly.

Also of interest is which predictors are driving the risk algorithm’s fitted values. This information can have important policy implications. For example, even very good forecasting results can be questioned if the most important predictors strongly contradict existing research and/or widely accepted subject-matter assumptions.Footnote 25

Figure 1 shows the relative contribution of each predictor to the classifier’s fit of the training data (i.e., the contributions sum to 100%). About 80% of the deviance reduction can be attributed to the number of 911 calls from particular locations in the past 30 days, even though most are not repeat calls. This replicates the well known finding that crime and calls for service are spatially concentrated. All of the other predictors matter as well, but far less. For example, there are greater risks to police officers on weekends, evenings and during the summer months. None of these is a surprise.

Fig. 1 Weapons dispatches: relative variable importance for the boosting fit

However, all of the predictors are empirically related such that, for example, a substantial fraction of the 80% relative deviance reduction attributed to the number of past 911 calls is actually shared with other predictors because of statistical interaction effects. Providing more detail about these interaction effects can be somewhat involved (Molnar, 2022) and is beyond the scope of this paper. We would proceed with such a discussion were we building a causal model.Footnote 26

Further information can be extracted from partial dependence plots (Friedman, 2001). These show the relationship between a given predictor and the response, holding all other predictors constant in a novel manner.Footnote 27 For example, Fig. 2 displays how the number of incidents leading to calls for service in the preceding 30 days from a given police sector is related to the probability of a high risk dispatch. The rug plot at the bottom indicates that there are relatively few calls from police sectors with more than 1200 calls for service. The plot to the right of 1200 should probably be ignored. It rests on very sparse data, and the smoothed values in black are distorted downward. For the remaining cases, one essentially has a step function, shown in red.Footnote 28
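The computation behind a partial dependence plot can be sketched generically. For each value on a grid, the predictor of interest is set to that value in every case, all other predictors are left as observed, and the predictions are averaged. The function names, the toy predictor, and the example rows below are hypothetical stand-ins, not the fitted boosting classifier or the CCPD data.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the observed data are untouched
            modified[feature] = value  # overwrite only the predictor of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """ASSUMPTION: a made-up step-shaped risk function standing in for a fitted classifier."""
    base = 0.2 if row["calls_30d"] < 500 else 0.6
    return base + (0.1 if row["friday"] else 0.0)


rows = [{"calls_30d": 100, "friday": 1}, {"calls_30d": 900, "friday": 0}]
curve = partial_dependence(toy_predict, rows, "calls_30d", [100, 800])
print(curve)
```

Because the other predictors keep their observed values, the averaged curve reflects the fitted relationship for one predictor without requiring a parametric "held constant" model.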

Fig. 2 Weapons dispatches: smoothed dependence plot for the number of incidents in the past 30 days in particular police reporting districts

The relationship in Fig. 2 is highly nonlinear. A credible conclusion is that the probability of a high risk dispatch increases sharply from around .20 to about .60 (with a maximum of more than .80) as the number of prior incidents increases from about 100 to about 800 (with considerable local variation) and then levels off. The slight decline in risk from about 800 calls to about 1200 calls cannot be distinguished from noise. Within the range of calls with sufficient data, a greater density of 911 calls is associated in a complex fashion with a higher risk for police officers, all other included predictors held constant.Footnote 29
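The logic behind a partial dependence plot can be sketched in a few lines. For each value on a grid, the predictor of interest is set to that value for every case, the other predictors are left at their observed values, and the fitted probabilities are averaged. The toy classifier and the data below are hypothetical stand-ins, not the boosting fit reported here.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value for all rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original case is untouched
            modified[feature] = value  # overwrite only the predictor of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_predict(row):
    """Hypothetical stand-in for a fitted classifier's probability of high risk."""
    base = 0.2 if row["prior_calls_30d"] < 400 else 0.6
    return base + (0.1 if row["weekend"] else 0.0)


rows = [
    {"prior_calls_30d": 50, "weekend": 0},
    {"prior_calls_30d": 900, "weekend": 1},
]
curve = partial_dependence(toy_predict, rows, "prior_calls_30d", [100, 800])
```

Because the other predictors keep their observed values, each point on the curve is an average over the data rather than a prediction for any single case.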

One must be careful about attributing a causal effect to the past number of calls for service. It is unlikely that the number of past calls for service in a given locale directly increases risk. More likely, the number of past calls for service is an indicator of the causal social factors responsible for high crime rates, such as social disorganization (Sampson & Groves, 1989). These, in turn, affect the risks experienced by police officers. Also, if our boosting formulation were (inappropriately) treated as a causal model, it would no doubt be badly misspecified. That is acceptable when forecasting accuracy is the priority. Recall that the forecasts are asymptotically unbiased given the tuning and predictors included. But that does not preclude greater accuracy with a better set of predictors.

Prediction Accuracy Through Nested Conformal Prediction Sets

One of the major problems with interpretations of confusion tables is that the reliability of the predictions is not considered. The outcome class predicted by the classifier is the outcome class with the largest estimated probability. For example, if for a given case the estimated probability of high risk is .80, and the estimated probability of low risk is .20, most would consider the forecast of high risk quite reliable. But if the two probabilities are .53 and .47 respectively (or even closer), the forecast is being made by a procedure that is just a little better than a 50-50 coin flip. The reliability is low and likely provides poor guidance for 911 professionals to pass along to the responding officers.

When the two outcome probabilities are near one another, a proper policy conclusion is that the algorithm is unable to make a definitive decision about the most likely outcome class. These and other difficulties with confusion tables are discussed in Kuchibhotla and Berk (2023). We need to do better when forecasting police officer risk.

Conformal prediction sets provide a better, principled solution (Angelopoulos & Bates, 2022). They have some of the look and feel of confidence intervals. But whereas a confidence interval is an estimated region in which population parameter estimates (e.g., an estimate of a population proportion) will fall with a certain probability, conformal prediction sets estimate the future outcome class (or classes) that will be right with a certain probability. The former conveys the “certainty” associated with, say, the estimated proportion of high risk incidents over many past dispatches to armed robberies in progress. The latter conveys the “certainty” associated with forecasts of high risk for many new dispatches in response to new armed robberies in progress.Footnote 30

Consider now the two outcome classes we have been using: high risk and low risk. With conformal prediction sets and two outcome classes coded “1” for high risk and “0” for low risk, there are four prediction sets logically possible for each case needing a forecast: {1}, {0}, {1, 0}, and the empty set ∅. The first set is a forecast of high risk, the second set is a forecast of low risk, and the third set is a “can’t tell” result because the classifier is unable to make a reliable distinction between high risk and low risk. The fourth is an empty set indicating that no forecast at all is made because the given case likely is an outlier that is very different from the data on which the classifier was trained.
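To make the four possible sets concrete, the sketch below uses ordinary split conformal classification. This is a simpler, generic procedure and not the nested method applied in this paper. The score for a class is one minus its estimated probability, and a class enters the prediction set when its score is no larger than a cutoff computed from calibration cases. All function names and data values are hypothetical.

```python
import math


def nonconformity(prob_high, outcome_class):
    """Score = 1 - estimated probability of the given class (1 = high risk)."""
    return 1.0 - prob_high if outcome_class == 1 else prob_high


def conformal_threshold(calibration_probs, calibration_labels, coverage=0.75):
    """Cutoff taken from the sorted scores of the true classes in calibration data."""
    scores = sorted(
        nonconformity(p, y) for p, y in zip(calibration_probs, calibration_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * coverage)  # finite-sample adjusted rank
    if rank > n:
        return float("inf")  # too few calibration cases: every class is included
    return scores[rank - 1]


def prediction_set(prob_high, threshold):
    """Include every class whose score does not exceed the cutoff."""
    return {c for c in (1, 0) if nonconformity(prob_high, c) <= threshold}


# Hypothetical calibration cases: estimated P(high risk) and the true class.
cal_probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
cal_labels = [1, 1, 1, 1, 0, 0, 0]
threshold = conformal_threshold(cal_probs, cal_labels, coverage=0.75)
```

With a cutoff below .50, a case with an estimated probability near .50 receives the empty set. With a cutoff of .50 or more, the same case receives both classes, which is the “can’t tell” result.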

We held out 100 cases at random from the calibration data and treated them as if we did not know the outcome class, just as if each was a dispatch for which a forecast was needed. We constructed a nested conformal prediction set for each dispatch based on the pseudocode provided by Kuchibhotla and Berk (2023).Footnote 31 The coverage probability was fixed in advance at .75.

Of those 100 randomly selected holdout cases, the prediction set included only the high risk outcome class for 26% of the cases, meaning that for these cases, high risk was the forecast. The prediction set included only the low risk outcome class for 24% of the cases, meaning that for these cases, low risk was the forecast. Each of these two prediction sets contains the true future outcome with a probability of at least .75. In practice, this means that about 75 out of 100 such conformal forecasts will be correct.

For 50% of the 100 cases, both outcome classes were included in the prediction set, meaning that the classifier could not make a reliable distinction between the two. The 50% figure is an “honest” reliability evaluation for the forecasting exercise undertaken, but is otherwise unsatisfactory. For real dispatches, having half of the risk assessments too unreliable to provide dispatch guidance is a policy disappointment. The large fraction of “can’t tells” likely results substantially from the large imbalance in the binary outcome coupled with the need for better predictors. The implications will be further addressed shortly.

There were no empty prediction sets, which is a necessary result from employing nested conformal scores (Kuchibhotla & Berk, 2023). Each nested conformal prediction set will always include at least one outcome class. Consequently, the issue of outlier cases requiring forecasts does not arise as it can for other conformal prediction set methods.

There is nothing special about the coverage probability of .75. It represents odds of 3 to 1 (.75/.25); the odds are at least 3 to 1 that the true outcome class is included in each prediction set. If one is prepared to accept a lower coverage probability (e.g., .70), the fraction of “can’t tell” prediction sets could be reduced. If one prefers a higher coverage probability (e.g., .80), the “can’t tell” fraction could increase. This is a call to be made by stakeholders, although work in progress may make the choice moot. In any case, there is no formal rationale for automatically employing a common default probability such as .95 or .99.
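The conversion from a coverage probability to odds is simple arithmetic, sketched here with a hypothetical helper function so that alternative coverage values can be compared on the same footing.

```python
def coverage_to_odds(coverage):
    """Odds that the true class is in the prediction set: p / (1 - p)."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    return coverage / (1.0 - coverage)


# Odds implied by the coverage values mentioned in the text.
odds_by_coverage = {c: coverage_to_odds(c) for c in (0.70, 0.75, 0.80)}
```

A coverage of .70 corresponds to odds of roughly 2.3 to 1, .75 to 3 to 1, and .80 to 4 to 1.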

Results for Other Crime Categories

We applied stochastic gradient boosting to four other broad crime categories. The results for assaults and the results for robberies were qualitatively very similar to the results for weapons offenses. For domestic violence and general disturbances, the number of high risk dispatches was roughly the same as for weapons offenses. However, the number of calls for service was about five times larger. This made the lack of balance far more dramatic, and we could not produce any useful results.Footnote 32

Thinking about steps toward implementation, careful consideration must be given to how different sets of crime categories should be determined. That will depend on local stakeholder views and on the local proportions of dispatches that are high risk for police officers. In addition, a lot will rest on making improvements in forecasting accuracy.

The broad crime categories chosen have more than statistical consequences. One can imagine a 911 professional having access to several sets of forecasting algorithms, each for a different broad crime category. With the crime category chosen, the relevant forecasting algorithm could be selected by dispatchers (and others) in an informed manner. A forecast could then be produced in a few seconds.
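One way to picture this arrangement is a simple lookup from crime category to a trained forecasting function. The sketch below is only illustrative: the category labels, the registry, and the toy forecasters are hypothetical and stand in for whatever trained classifiers a department might deploy.

```python
def select_forecaster(registry, crime_category):
    """Return the forecasting function registered for a broad crime category."""
    if crime_category not in registry:
        raise KeyError(f"no forecasting algorithm for category {crime_category!r}")
    return registry[crime_category]


def weapons_forecaster(predictors):
    """Hypothetical stand-in for a classifier trained on weapons dispatches."""
    return 0.8 if predictors["prior_calls_30d"] >= 400 else 0.2


def robbery_forecaster(predictors):
    """Hypothetical stand-in for a classifier trained on robbery dispatches."""
    return 0.7 if predictors["prior_calls_30d"] >= 500 else 0.3


registry = {"weapons": weapons_forecaster, "robbery": robbery_forecaster}
forecast = select_forecaster(registry, "weapons")({"prior_calls_30d": 650})
```

A category with no trained algorithm raises an error rather than silently falling back on a classifier built for a different kind of dispatch.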

Discussion

Implementation

Using information that is routinely available, we have shown that interpretable and technically defensible forecasts of risk can be produced for police officers responding to a dispatch. But we make no claim that our procedures are ready for implementation. To begin, in most applications, “can’t tell” prediction sets can occur at least once in a while. They indicate that for a given case when a forecast is sought, the classifier cannot make a reliable distinction between different outcome classes. Stakeholders might decide to default such prediction sets to either a high risk or a low risk outcome, depending on whether false alarms are more or less costly than the absence of an alarm when needed. The choice will probably vary across different police departments. It may also be possible to introduce outside information to help inform an otherwise ambiguous forecast. For example, the caller’s tone of voice may convey imminent danger. In short, careful thought must still be given to possible responses to equivocal forecasts.
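The default rule for equivocal forecasts could be expressed as follows. This is a minimal sketch that assumes stakeholders have supplied relative costs for a false alarm and for a missed alarm; the function name and the cost values are hypothetical.

```python
def resolve_forecast(pred_set, cost_false_alarm, cost_missed_alarm):
    """Turn a conformal prediction set into a dispatch message."""
    if pred_set == {1}:
        return "high"
    if pred_set == {0}:
        return "low"
    if pred_set == {0, 1}:
        # "Can't tell": default toward the class that avoids the costlier error.
        return "high" if cost_missed_alarm >= cost_false_alarm else "low"
    raise ValueError("prediction set must contain at least one outcome class")


# Hypothetical costs: a missed alarm is treated as five times worse.
message = resolve_forecast({0, 1}, cost_false_alarm=1.0, cost_missed_alarm=5.0)
```

Reversing the two costs would send the same equivocal case to the low risk default, which is why the choice is likely to differ across departments.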

Either for accuracy or policy reasons, the outcome risk variable might be usefully defined in other ways, depending substantially on data availability and quality. For example, risk might be more narrowly defined as an actual injury or fatality if either is, unfortunately, sufficiently common. In other jurisdictions, injuries and fatalities might be so rare that they can perhaps be ignored in the risk definition. In addition, the risk outcomes do not have to be binary. In principle, one might construct a scale of risk with fatalities having the highest score, threats of violence having the lowest score, and non-fatal injuries in the middle. None of these options was a practical choice for the data from the CCPD.
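A non-binary outcome of this kind could be coded as an ordered scale. In the sketch below, the numeric values and the extra "no harm" level are assumptions made for illustration; only the ordering of threats, non-fatal injuries, and fatalities comes from the text.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    NONE = 0             # assumed baseline for incidents with no recorded harm
    THREAT = 1           # threats of violence: lowest score
    NONFATAL_INJURY = 2  # non-fatal injuries: middle of the scale
    FATALITY = 3         # fatalities: highest score


def incident_risk(outcomes):
    """Score an incident by its most severe recorded outcome."""
    return max(outcomes, default=RiskLevel.NONE)


score = incident_risk([RiskLevel.THREAT, RiskLevel.NONFATAL_INJURY])
```

An ordered outcome would call for a multiclass or ordinal classifier rather than the binary one used here.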

There are likely to be useful predictors that were not available from the CCPD dispatches but perhaps available elsewhere. For example, it might be very instructive if the alleged perpetrator’s gender, approximate age, and relationship to the caller/victim were elicited. Young males can be especially dangerous (Berk et al., 2009). Were drugs or alcohol involved? Romantic couples in the process of separating can place officers in situations that are emotionally charged (Berk et al., 2016). Articulated threats of violence might be predictive as well. And, there also might be useful predictors that vary by locale such as whether gang hostilities are involved. In short, a richer set of predictors is needed and surely considerable progress can be made.Footnote 33

We appreciate that it can be difficult to obtain fruitful information from some calls for service, and that key information must be obtained quickly. But, we believe that in many police departments it is possible to improve on current practice. For example, when forms are filled out as each call is taken (either by hand or by data entry), those forms might be improved by having boxes to be filled in for the alleged perpetrator’s age and gender. Relying solely on an incident narrative that might include such information will typically produce less complete results. Good advice for designing effective data entry forms is readily available (Wiggins et al., 2011).
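The contrast between a free-text narrative and dedicated fields can be sketched as a record type with explicit slots for the items of interest. The field names below are hypothetical and do not describe the CCPD data entry forms.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CallRecord:
    narrative: str                                 # free-text incident narrative
    suspect_age: Optional[int] = None              # approximate age, if elicited
    suspect_gender: Optional[str] = None           # if elicited
    relationship_to_caller: Optional[str] = None   # if elicited


def missing_fields(record):
    """List the structured fields that were left blank for this call."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


record = CallRecord(narrative="Caller reports a man with a gun", suspect_age=25)
blanks = missing_fields(record)
```

Explicit fields make it possible to see at a glance which predictors were not collected, something a narrative alone cannot show.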

Finally, there are a host of operational details to be worked out site by site.

1. A feasibility assessment must be mounted.

2. If the estimated feasibility conveys promise, support (or at least not opposition) from the relevant stakeholders must be obtained.

3. Resources must be provided for staff time, IT support, dedicated computer hardware and software, and perhaps outside consultants. The computation costs themselves should be modest. On a modern laptop, the elapsed time to train a risk forecasting algorithm should be less than a minute, and subsequently, forecasts can be produced almost instantaneously.

4. Methods must be provided to transmit relevant information from each call for service to a risk algorithm along with existing information such as the number of past calls in the recent past from the same neighborhood.

5. The forecasting algorithm must be made operational, which should be done with foresight and technical skill in a site-specific manner. Note that there is no need to re-train an algorithm with each 911 call. All that is required is a module from the trained algorithm that does the forecasting and some code to input predictor values and output the forecasts. This form of AI should not be confused with current Large Language Models that power ChatGPT and related software. Currently, these algorithms are so expensive to train and use that even deep-pocket firms like Google and Microsoft are having serious difficulties finding a sustainable business model for them (Oremus, 2023).Footnote 34

6. Some retraining of dispatchers will likely be needed.

7. Ample time should be provided to thoroughly test each step in the implementation.
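The data handling described in steps 4 and 5 might look something like the sketch below, which counts recent calls from the same location and assembles predictor values for a forecasting module that has already been trained. The log format, the sector labels, and the definitions of evening and summer are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def prior_call_count(call_log, location, now, window_days=30):
    """Count logged calls from `location` in the `window_days` before `now`."""
    window_start = now - timedelta(days=window_days)
    return sum(
        1
        for logged_location, timestamp in call_log
        if logged_location == location and window_start <= timestamp < now
    )


def build_predictors(call_log, location, now):
    """Assemble the predictor values a trained forecasting module would receive."""
    return {
        "prior_calls_30d": prior_call_count(call_log, location, now),
        "weekend": now.weekday() >= 5,     # Saturday or Sunday
        "summer": now.month in (6, 7, 8),  # assumed definition of summer
        "evening": 18 <= now.hour < 24,    # assumed definition of evening
    }


# Hypothetical call log of (location, timestamp) pairs.
log = [
    ("sector_12", datetime(2023, 7, 1, 21, 0)),
    ("sector_12", datetime(2023, 7, 10, 9, 30)),
    ("sector_12", datetime(2023, 5, 1, 12, 0)),  # outside the 30-day window
    ("sector_07", datetime(2023, 7, 12, 23, 15)),
]
now = datetime(2023, 7, 15, 22, 0)
predictors = build_predictors(log, "sector_12", now)
```

Nothing here involves re-training. The predictor values are simply handed to the forecasting module each time a call arrives.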

These requirements may seem daunting, but similar challenges have been successfully overcome in other criminal justice settings (Berk, 2017).

There is also an important methodological message going forward. The use of conformal prediction sets underscores that the reliability of forecasted response variable classes from a confusion table can be very low. For our analysis of police officer risks, the large fraction of “can’t tell” prediction sets means that about half of the forecasted outcome classes from a classifier’s confusion table could be properly treated as unreliable and difficult to justify for real dispatches. Part of the reason is the very limited dataset given to the classifier we used. But a lot also depends on how the conformal enterprise is tuned. Kuchibhotla and Berk (2023) show that even strong performance by a classifier will produce a large proportion of legitimate “can’t tell” results when a very high coverage probability is required. In short, relying solely on a confusion table to forecast outcome classes can be ill advised, or at least insufficient, and conformal prediction tuning should not be perfunctory.

The Role of Race

Because of the recent concerns about artificial intelligence and the particular claim of racial bias (Berk et al., 2023), use of almost any risk algorithm by a police department could become very controversial. Yet, our proposed algorithm does not target individuals based on their race. Neither the race of the caller nor the race of the alleged perpetrator is a predictor, and we are not proposing that either should be.

The role of neighborhoods is more subtle because of residential segregation based on income, nationality, and race. There can then be, in particular, concerns about “over-policing” (Boehme et al., 2022), sometimes characterized as racial profiling in certain areas (Grogger & Ridgeway, 2006). Within such scenarios, police choose to concentrate their assets in disadvantaged neighborhoods as a top-down process, even if well-intentioned.

People in disadvantaged communities are inordinately victims of crime, and we capitalize on a bottom-up process initiated by concerned neighborhood residents themselves. To some observers, this may look like over-policing. But it is a response to requests for help that arguably are justified by a Hobbesian-like social contract. One consequence of that contract is that police who respond can be put in harm’s way. Better informed dispatches may help to reduce that risk.

It also is important to appreciate that our algorithm only provides a forecast of risk. What responding police officers do with that information can be consequential but should be determined by department policy and police officer training. The risk algorithm is not responsible for either. One might add to the implementation steps listed above that any police department adopting a risk algorithm for police officers should integrate and train for the best police practices that depend on the amount of risk. Also, for some calls, the best response may be deferring to responders who are not police officers. In Brooklyn, New York, for example, the police have from time to time stepped aside and let community members of the Brownsville Safety Alliance respond to 911 calls that do not involve a violent crime in progress (Cramer, 2023). These low risk incidents can account for a large fraction of calls for service. Informal assessments look promising.

Some may wonder why a similar effort could not be made for predicting when citizens are at risk from unnecessary force and/or racially motivated practices by police officers. There are a large number of studies about these matters, such as the path-breaking work by Grogger and Ridgeway (2006) and by Ridgeway and MacDonald (2014), but none to our knowledge have been framed explicitly as a forecasting problem. Moreover, there are operational challenges. For example, it is very difficult, except in extreme cases, to determine when a police apprehension is racially tainted. It is also hard to tell, again except in extreme cases, when a use of force is unnecessary. There would likely be resistance from police officer organizations as well.

Conclusions

Police officer safety has long been a policy concern. If high risk incidents could be better anticipated, there is the possibility of introducing more effective and more timely safety measures. We have shown that improved forecasting of the risks when responding to calls for service is possible, although our work is a somewhat circumscribed first attempt. The requisite statistical tools are readily available and can be implemented and maintained in police departments with sufficient IT expertise. IT expertise can also be purchased. Perhaps the major obstacle is appropriate, readily available data, in part because calls for service are a very small window of data collection opportunity. Still, the skills of call-takers and dispatchers can be improved along with the ways 911 call data are organized and stored. And some very simple improvements have the promise to move this demonstration of concept toward at least a provisional implementation.