Introduction

Violence is a common and serious public health problem. In 2019, there were an estimated 5.4 million violent victimization experiences among U.S. residents 12 or older (Morgan and Jennifer 2020). In order to track incidence, accurate surveillance of violent events is critical. Administrative data for violence identification typically occur in two systems: first, systems where the primary purpose is violence identification, and, second, systems where the primary purpose is not violence identification. An example of the former is child protective services (CPS) data, where the primary function is to identify children who are victims of violence and related victimization (e.g., neglect). These and other types of administrative systems designed to capture violence cases, such as police data, are most commonly used for violence surveillance. However, there has been concern about bias in these data, because they only capture a fraction of cases, due to under-reporting (Rochelle and Buonanno 2018; Gray et al. 2017; Lipsky et al. 2012; Feldman et al. 2017). Further, the cases that are captured may skew toward more highly scrutinized communities (e.g., those with interaction with mandated reporters through public benefits programs) (Maguire-Jack et al. 2018).

Because of concerns with administrative data designed for violence identification, there has been increasing attention to the utility of administrative data where the primary function is not violence identification. An example of this second type of administrative data is hospital claims databases, where the primary function is hospital billing. These databases are formed using medical records. The medical records are used by health information professionals to code a patient's visit for billing purposes, using three main types of codes: International Classification of Diseases, 9th Revision, Clinical Modification codes (ICD-9 CM codes, hereafter ICD-9 codes), external cause codes (E-codes), and supplementary classification of factors influencing health status codes (V-codes); since October 2015, billing coding has been based on the ICD-10-CM. ICD-9 CM codes are primarily used to assess cost and reimbursement, but they are widely available and therefore are also useful in surveillance and research on disease morbidity and mortality (Thomas et al. 2002), and may be similarly used to study violence (Ahern et al. 2018).

A common way to identify violence in hospital billing data is through ICD codes that explicitly diagnose an injury as caused by violence (e.g., ICD-9 CM: 995.81 for physical abuse) (Scott et al. 2009). However, these so-called “explicit” codes are underutilized (Muldoon et al. 2019) and may be biased in similar ways to police or child protection data. For example, for an ICD violence code to be assigned, either the patient must reveal that the injury that brought them into the hospital was due to violence, or the provider must make a subjective assessment that the intent of the injury was violence, since the coding system does not allow for suspected abuse or neglect to be documented. There are several reasons why patients may not disclose violence, including presence of the perpetrator in the hospital room with the patient/victim (Lipsky et al. 2009; Hymel et al. 2018; Lane et al. 2002). Other sources of bias in violence codes could be related to the patient's socioeconomic status (Keenan et al. 2017), resource barriers (Beynon et al. 2012), and urban versus rural setting (Edwards 2015). More specifically, these codes could be overutilized for patients of color or individuals of low socioeconomic status and underutilized for rural communities, where violence tends to be underreported. This would, in turn, bias estimated associations between these factors and violence. Thus, cases identified through these explicit codes may not be representative of the true distribution of violence-related injuries (Scott et al. 2009; Schnitzer et al. 2011; Lloyd and Rissing 1985; McKenzie et al. 2011; Hooft et al. 2015).

A second option for assessing violence in hospital data is through “proxy” codes for injuries. Proxy codes identify injuries that are common outcomes of violence (Schnitzer et al. 2011; Bhargava et al. 2011). Using these “proxy” codes for violence identification may yield a more representative distribution of violence that is less prone to bias, because it does not rely on patient reports or subjective assessments by providers. Past research has examined the correlation of these proxy codes with “gold standard” violence identification methods. For example, in child maltreatment, one study (Schnitzer et al. 2011) used medical record review by project staff trained in child maltreatment, as well as consultation by an advisory board, to identify maltreatment cases. An ICD code was defined as “suggestive of child maltreatment” when greater than 66% of the visits with that code were thought to be caused by child maltreatment (Schnitzer et al. 2011). One example of these newly identified proxy codes is retinal hemorrhage. Intimate partner violence (IPV) proxy codes have been identified through a confluence of studies linking head, neck and face injuries to IPV (Bhargava et al. 2011; Perciaccante et al. 1999, 2010; Schafer et al. 2008; Davidov et al. 2015; Wu et al. 2010; Petridou et al. 2002; Bhandari et al. 2006; Halpern et al. 2009; Sheridan and Nash 2007). Specifically, one study used predictive models to examine which injuries correctly predicted IPV cases confirmed through telephone or clinical diagnoses (Bhargava et al. 2011). Another study found that head, neck and face injuries had a 91% sensitivity and 59% specificity for violence against women (Perciaccante et al. 1999). It is important to note that high specificity means low type 1 error (a low likelihood of false positives), so low specificity introduces potential false positives, whereas low sensitivity would introduce type 2 error (false negatives). Lastly, elder abuse proxy codes have been identified through common diagnoses among individuals who reported to Adult Protective Services (APS) (Wiglesworth et al. 2009). These methodologies (and others, Lachs et al. 1997; Gironda et al. 2016) provide an evidence base for use of proxy ICD codes to identify violent injury.
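To make the sensitivity and specificity terminology concrete, the following minimal Python sketch computes both quantities from a 2x2 classification table. The counts are hypothetical, chosen only so that the results match the 91% and 59% figures quoted above; they are not taken from Perciaccante et al. (1999).

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of true violence cases that the proxy code flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of non-violence cases that the proxy code correctly leaves unflagged."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts (assumption): 100 true violence cases, 100 non-violence cases.
tp, fn = 91, 9
tn, fp = 59, 41

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

# Low specificity shows up as a high false-positive rate, not as false negatives.
false_positive_rate = 1 - spec
```

Under these illustrative counts, 41 of every 100 non-violence injuries would still be flagged by the proxy code, which is the trade-off that proxy identification accepts in exchange for high sensitivity.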

These two options of using explicit versus proxy ICD hospital codes for violence identification have trade-offs. For instance, explicit codes are very likely to reflect true violence, but are potentially biased. In contrast, proxy codes are likely to include injuries not due to violence, but this misclassification is more likely to be non-differential and may result in less systematic bias. While proxy codes are not currently used for violence surveillance, they could potentially address the impacts of systematic underreporting, improve case ascertainment, and thus reveal the hidden burden of violence. To move toward the use of proxy codes for surveillance, more needs to be understood about how proxy codes compare and contrast with explicit codes with regard to their temporal and geographic distribution. However, there is no research, to our knowledge, that compares proxy codes to explicit codes in terms of trends over time and association with county-level predictors of violence such as poverty, racial/ethnic demographics, urbanicity, employment, and education level. Such a comparison adds important information on potentially differential patterns, especially in specific subgroups that are over-scrutinized and those where violence tends to be culturally silent (e.g., rural communities and elder abuse).

The goals of this paper are therefore to compare the trends in violence in Minnesota by county from 2004 to 2014, and associations of county-level demographic characteristics with violence rates as measured through explicit, proxy and a combination of explicit and proxy codes using Minnesota Hospital Discharge Data. Three violence subtypes (child maltreatment, elder abuse, and intimate partner abuse) are examined to represent three important types of violence over the lifetime.

Methods

Data

Minnesota hospital discharge data

Population representative hospital administrative data from 2004 to 2014 were obtained through the Minnesota Hospital Association (MHA). Minnesota hospitals (n = 246) are required to submit all inpatient, outpatient, and emergency department claims data to MHA, which compiles these data in a statewide administrative claims database. This database contains a data point for each patient encounter with a health care provider, including diagnoses (ICD codes). Individuals could appear in the database multiple times.

ICD-9 CM codes are used to describe the diagnosis of the condition being treated. For injuries, they describe the nature of the injury (e.g., fracture, cut, etc.) and body part (skull, arm, etc.). ICD-9 CM codes are the main codes included in administrative datasets because they are required for billing and reimbursement. External cause codes (E-codes) are optional additional descriptors to ICD-9 CM codes that describe when and where the injury happened, to whom or by whom, how, and intentionality (Injury Data and Resources 2015). V-codes are a supplementary classification of factors influencing health status. Cross-sectional (not longitudinally linked) MHA data on ICD-9 CM, E-codes, and V-codes from 2004 to 2014 were used to measure cases of violence for this study. These data form the numerator for the violence rates.

Population data

Population denominators to calculate county-level violence-related injury rates were obtained from the Surveillance, Epidemiology, and End Results (SEER) Program (National Cancer Institute 2021) which provides annual population level estimates by county, sex, and age for each year from 2004 to 2014. The following denominators were used for each violence subtype: age 0 to 18 for child abuse, 65 plus for elder abuse, and 16 plus for intimate partner violence.

Sociodemographic data

The 2010 Decennial Census (US Census Bureau 2021a) and the 2010 American Community Survey (ACS) (US Census Bureau 2021b) were used to provide county-level sociodemographic predictors (correlated with violence in traditional systems) of county-level violence (United States Census Bureau 2020) including: percent poverty, percent minority, urban, percent unemployed and percent less than high school education. The continuous variables are dichotomized at the mean.
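As an illustration of the dichotomization step, the sketch below splits a county-level percentage at the mean across counties. The county names and values are hypothetical, and the "greater than or equal to the mean" rule follows the wording used in the Results; this is not the study's actual code.

```python
from statistics import mean


def dichotomize_at_mean(values_by_county):
    """Return (indicator by county, cutoff); indicator is 1 when value >= mean."""
    cutoff = mean(values_by_county.values())
    indicators = {
        county: int(value >= cutoff)
        for county, value in values_by_county.items()
    }
    return indicators, cutoff


# Hypothetical percent-poverty values (assumption, not study data).
pct_poverty = {"County A": 8.0, "County B": 11.0, "County C": 17.0}

high_poverty, cutoff = dichotomize_at_mean(pct_poverty)
```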

Case ascertainment

Violence-related injuries were identified using both explicit and proxy methods. To avoid duplication, encounters with both an explicit and a proxy code were identified using a unique encounter-specific identifier and counted only in the explicit count. Three subtypes of violence were analyzed: child maltreatment (people ages 0–17 years), elder abuse (people ages 65+ years) and intimate partner violence (people ages 16+ years). Appropriate population denominators were applied to create incidence rates. These three subtypes of violence were chosen because they represent different forms of violence throughout one’s lifespan and because research identifying proxy codes was available, as described below.
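The deduplication rule can be sketched as follows: each encounter is assigned to exactly one category, with explicit codes taking precedence over proxy codes. The encounter identifiers and the proxy code label are hypothetical placeholders; only 995.81 appears in the text, and the full code lists are those in Table 1.

```python
from collections import Counter

# Illustrative code sets (assumption): the real lists are given in Table 1.
EXPLICIT_CODES = {"995.81"}
PROXY_CODES = {"PROXY_PLACEHOLDER"}


def classify_encounter(codes):
    """Assign an encounter to 'explicit', 'proxy', or None; explicit wins ties."""
    code_set = set(codes)
    if code_set & EXPLICIT_CODES:
        return "explicit"
    if code_set & PROXY_CODES:
        return "proxy"
    return None


# Hypothetical encounters keyed by a unique encounter identifier.
encounters = {
    "enc-1": ["995.81"],
    "enc-2": ["PROXY_PLACEHOLDER"],
    "enc-3": ["995.81", "PROXY_PLACEHOLDER"],  # both types: counted as explicit
    "enc-4": ["OTHER"],
}

labels = (classify_encounter(codes) for codes in encounters.values())
counts = Counter(label for label in labels if label is not None)

# Because the categories are mutually exclusive, the combined count is a simple sum.
combined = counts["explicit"] + counts["proxy"]
```

Keeping the two categories mutually exclusive is what allows the combined operationalization to be a sum without double counting.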

Explicit operationalization of violence

ICD-9 CM codes, E-codes, and V-codes that indicate a diagnosis of violence (explicit codes) are listed in Table 1 along with corresponding average yearly counts.

Table 1 Explicit and proxy ICD-9, E- and V-codes used to define violence in Minnesota hospital discharge data

Proxy operationalization of violence

ICD-9, E-codes, and V-codes indicating injury suggestive of violence (proxy codes) are listed in Table 1 along with corresponding average yearly counts. The proxy operationalizations here were based on a review of the literature. The final selection of proxy codes was based on studies identifying these codes through confirmed violence cases via in-depth medical record review (Schnitzer et al. 2011; Gironda et al. 2016; Btoush et al. 2009; Barlow et al. 1998), predictive modeling (Bhargava et al. 2011; Perciaccante et al. 1999, 2010; Reis et al. 2009), common diagnoses of known violent encounters (Schafer et al. 2008; Davidov et al. 2015; Wu et al. 2010; Petridou et al. 2002; Bhandari et al. 2006; Halpern et al. 2009; Sheridan and Nash 2007; Wiglesworth et al. 2009; Nannini et al. 2008; Rosen et al. 2016) and linking hospital records with known cases of violence through administrative data systems (Schnitzer et al. 2011; Lachs et al. 1997) such as Child Protection Services (CPS) or Elder Protection Services. Codes from these studies were selected only if they were consistent across the literature for certain types of injuries and/or were most likely to reflect violence, as indicated by the study. Motor vehicle crashes were excluded from IPV to minimize non-violence-related injuries (Sheridan and Nash 2007). Significantly less literature was available on elder abuse. Therefore, to increase certainty in elder abuse proxy codes, injuries identified as being of “unintentional intent” (e.g., E928.9) were excluded from elder abuse. The resulting elder abuse proxy codes were therefore restricted to either undetermined or intentional injuries. Further descriptions of the proxy codes are in Additional file 1: Appendix Table 1.

Combined operationalization of violence

The counts for the number of explicit and proxy operationalization of violence codes were summed to create a third operationalization of violence. The combined outcome was meant to serve as an intermediary approach for identifying violence.

Analysis

The distribution and time trends of explicit and proxy child abuse, IPV, and elder abuse by all 87 counties in Minnesota from 2004 to 2014 were examined. For incidence rates, the yearly sum of cases in a county, defined by the given set of codes, served as the numerator and yearly county population data served as the denominator.
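The rate calculation described above can be sketched in a few lines. The case and population figures are hypothetical, and the per-1000 scaling is assumed from the units reported in the Results.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Yearly county incidence rate per 1000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000 * cases / population


# Hypothetical county-year (assumption): 25 coded cases among 12,500 residents
# in the relevant age group.
rate = rate_per_1000(cases=25, population=12_500)
```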

To estimate associations between county-level risk factors (e.g., poverty) and county violence rates, negative binomial regression models with generalized estimating equations (GEE) were run to estimate incidence rate ratios (IRRs) with 95% CIs, accounting for within-county clustering over time. Two separate models were run for each outcome, crude and adjusted. First, in crude models, the yearly count totals of each outcome were regressed separately on each individual socio-demographic variable and on year, with the yearly county-level population denominator as the offset (rate denominator). Second, fully adjusted models that included all the county-level socio-demographic variables and year were estimated. Finally, as a sensitivity analysis, the fully adjusted models were run on a subset of codes within each violence subtype (e.g., any burn code for IPV) to examine whether certain codes were driving these associations (Additional file 1: Appendix Table 2). No null hypothesis significance testing was conducted; results instead focus on estimation (Lash 2017). This study was deemed not human research by the University of Minnesota Institutional Review Board.
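For readers less familiar with how incidence rate ratios arise from such models, the sketch below shows only the final conversion step: exponentiating a log-scale coefficient and its Wald interval. It assumes a log link with the log of the population as the offset, which is the usual setup for count models with a rate denominator; the coefficient and standard error are hypothetical, and the model fitting itself (done in a statistical package) is not shown.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def irr_with_ci(beta: float, se: float):
    """Convert a log-scale coefficient and its standard error to (IRR, lower, upper)."""
    irr = math.exp(beta)
    lower = math.exp(beta - Z_95 * se)
    upper = math.exp(beta + Z_95 * se)
    return irr, lower, upper


# Hypothetical estimate (assumption): beta = 0.30 with robust SE = 0.10 for a
# binary county-level indicator such as "poverty at or above the mean".
irr, lower, upper = irr_with_ci(beta=0.30, se=0.10)
```

With a population offset, the exponentiated coefficient is read as a ratio of rates rather than a ratio of counts, which is why the results are reported as incidence rate ratios.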

Table 2 Mean violence incidence rate ratios and percent county population characteristics for Minnesota 2004–2014

Results

Table 2 describes the average rates of explicit- and proxy-identified violence subtypes and socio-demographic characteristics across counties in the sample. Rates estimated using explicit and proxy codes were substantially different, especially for elder abuse (2 per 1000 for explicit vs. 106 per 1000 for proxy) and intimate partner violence (5 per 1000 for explicit vs. 294 per 1000 for proxy).

Table 3 describes the crude bivariate associations of year and each county socio-demographic factor with explicit-, proxy-, and combined-identified violence rates. Generally, associations with county-level factors are stronger in magnitude for explicit codes than for proxy codes.

Table 3 Crude bivariate negative binomial regression with GEE: incidence rate ratio for the association between each county-level socio-demographic characteristic and injury codes

The fully adjusted models are in Table 4. Using explicit codes, the rate of elder abuse appears to slightly increase from 2004 to 2014 (IRRexplicit per year: 1.03; 95% CI 1.01–1.06). The time trends for child abuse and IPV are both flat or slightly decreasing (child maltreatment IRRexplicit per year: 0.98; 95% CI 0.97–1.00, and IPV IRRexplicit per year: 0.98; 95% CI 0.96–1.01, respectively). In contrast, using proxy codes, there appears to be a substantial upward trend in elder abuse rates (IRRproxy per year: 1.12; 95% CI 1.11–1.13) from 2004 to 2014. Child abuse and IPV measured using proxy codes are slightly increasing over time (child abuse IRRproxy per year: 1.03; 95% CI 1.02–1.05, and IPV IRRproxy per year: 1.04; 95% CI 1.03–1.04). The combined explicit and proxy codes for child abuse, elder abuse and intimate partner violence mimic the proxy codes in magnitude and pattern.
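To give a sense of scale for the per-year estimates, the sketch below compounds a per-year IRR over the ten year-to-year steps between 2004 and 2014. It assumes a constant multiplicative yearly trend (a single linear year term on the log scale) and uses the rounded point estimates reported above, so the results are approximate illustrations rather than additional findings.

```python
def cumulative_ratio(irr_per_year: float, n_years: int) -> float:
    """Rate ratio implied over n_years by a constant per-year IRR."""
    return irr_per_year ** n_years


YEAR_STEPS = 2014 - 2004  # ten year-to-year steps

elder_proxy = cumulative_ratio(1.12, YEAR_STEPS)
elder_explicit = cumulative_ratio(1.03, YEAR_STEPS)
child_explicit = cumulative_ratio(0.98, YEAR_STEPS)
```

Under this assumption, a per-year IRR of 1.12 corresponds to roughly a threefold higher rate at the end of the period than at the start, whereas 1.03 corresponds to an increase of about one third and 0.98 to a decrease of a little under one fifth.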

Table 4 Fully adjusted negative binomial regression with GEE: incidence rate ratio for the association between all county-level socio-demographic characteristics and injury codes

In the fully adjusted models, explicit codes for child abuse indicate that counties with greater than or equal to the mean (11.3%) of people living in poverty have 1.36 (95% CI 1.09–1.68) times the rate of child abuse compared to counties that had less than 11.3% of people living in poverty. In general, the associations of child abuse with county-level measures of poverty, people of color, unemployment, and education all decrease in magnitude when comparing explicit to proxy codes, although confidence intervals often overlap. For example, using the proxy measure of child abuse, counties with greater than or equal to 11.3% of people living in poverty have 1.12 (95% CI 0.88–1.43) times the rate of child abuse compared to counties with less than 11.3% of people living in poverty. Using combined explicit and proxy codes, counties at or above the average poverty level have 1.27 (95% CI 1.03–1.55) times the rate of child abuse of counties below the average poverty level. The combined-code estimates tend to fall between those for proxy and explicit codes.

Using explicit elder abuse codes, counties with greater than or equal to the mean (5.8%) of people unemployed have a 58% increased rate of elder abuse compared to counties that had less than 5.8% unemployed. The associations between elder abuse and county-level socio-demographic characteristics are also lower in magnitude when using proxy rather than explicit codes, except for county-level education. For example, using proxy-identified and combined-identified elder abuse, counties with higher unemployment had about a 6% increased rate of elder abuse compared with counties that had lower unemployment (IRRproxy 1.06, 95% CI 0.90–1.25; IRRcombined 1.06, 95% CI 0.91–1.25).

As with child maltreatment and elder abuse, the associations between IPV and county-level socio-demographic characteristics were generally closer to the null when using proxy and combined codes compared with explicit codes, although two associations flipped directionality. For example, counties with higher percent unemployment, and those with more people of color, were found to have greater relative rates of IPV using the explicit versus proxy and combined codes (IRRexplicit 1.73, 95% CI 1.30–2.30 vs. IRRproxy 1.13, 95% CI 1.00–1.28 vs. IRRcombined 1.14, 95% CI 1.01–1.29) and (IRRexplicit 1.30, 95% CI 0.93–1.84 vs. IRRproxy 1.16, 95% CI 1.02–1.32 vs. IRRcombined 1.16, 95% CI 1.03–1.32), respectively. Associations between IPV and poverty, education and urbanicity were close to the null regardless of whether explicit, proxy or combined codes were used.

Lastly, the sensitivity analysis showed that, in the fully adjusted models, the malnutrition subset of codes drove most of the associations between elder abuse and each county-level sociodemographic characteristic. For the child abuse and IPV code subsets, it was less clear which individual codes drove the associations with county-level sociodemographic characteristics.

Discussion

The goal of this paper is to determine how proxy codes, in relation to and in combination with a more traditional approach (explicit codes), describe violence using population-based data for one state. Our main findings are that the magnitude of violence rates, and the patterns of violence across time and by county-level characteristics, differed depending on whether one used explicit or proxy codes. In particular, explicit codes suggested that child abuse and IPV trends were flat or decreased slightly from 2004 to 2014, while proxy codes suggested the opposite. Elder abuse increased during this timeframe for both explicit and proxy codes, but more dramatically when using proxy codes. In regard to the associations between county-level characteristics and each violence subtype, previously identified county-level risk factors were more strongly related to explicitly identified violence than to proxy-identified violence. Given the larger number of proxy-identified than explicitly identified violence cases, the trends and associations of combined codes align more closely with proxy codes, especially for elder abuse and IPV.

The finding of increasing violence over time using proxy codes contrasts with evidence of declining trends of child abuse (Finkelhor et al. 2014), elder abuse (Morgan and Mason 2014) and IPV (Catalano 2013) from data sources such as the Uniform Crime Reports (UCR), Child Protection (Finkelhor et al. 2013a), and the National Crime Victimization Survey (NCVS) (Powers and Kaukinen 2012; Morgan and Kena 2017). There are several possible reasons for differences in findings between this study and these other data sources. Traditional surveillance systems for violence may have systemic selection bias. For example, UCR relies on police data that may be likely to over-report crime in communities of color and under-report in white communities (Myers 1980; Mesic et al. 2018; Voigt et al. 2017). Further, UCR excludes sexual assault and crimes not reported to the police (Planty et al. 2014). Underreporting is also a problem in Child Protection data (Wildeman et al. 2014). For example, in 2011, the National Child Abuse and Neglect Data System (NCANDS) reported approximately three million U.S. children who received an investigation or response from a state child protection service agency (Maltreatment 2020), but in the same year, according to the National Survey of Children’s Exposure to Violence, approximately 10 million U.S. youth had experienced maltreatment by their caregiver (Finkelhor et al. 2013b). Therefore, different types of selection into and usage of these different surveillance systems (such as health care utilization) could be a reason behind different trends, because each system is measuring different populations or is measuring violence differently.

Generally, the associations between violence and county socio-demographic compositional factors are smaller for proxy codes than for explicit codes. For example, after adjustment for all other county-level demographic characteristics, explicit codes indicated that violent injuries for elder abuse are highest in counties that had population percentages at or above the Minnesota mean percent of people of color. When violence is measured with proxy codes or combined codes, these associations are still elevated but move toward the null. One possible explanation of this could be that proxy codes could be less systemically racially biased, or conversely, explicit codes are more greatly influenced by racial bias. Explicit codes require a subjective judgment by medical providers, which leaves them vulnerable to individuals’ implicit biases. Proxy codes do not require this judgment and may be less affected by this bias. On the other hand, proxy codes trade greater potential representativeness for lower specificity, potentially leading to non-differential misclassification, which could also move effect estimates closer to the null. These different associations by proxy versus explicit coding suggest that sole reliance on explicit coding of violence for surveillance and research may be insufficient and proxy codes may potentially help to address under- and biased reporting (Shepherd and Sivarajasingam 2005), yet research is required to understand the potential misclassification in proxy codes. Since ICD codes are a universal coding system in the U.S., further testing and application should be done to assess proxy codes’ validity for violence identification.

The reasons for the differences in predictors and trends over time between proxy and explicit codes are unknown. The differences, in part, are likely due to a trade-off in specificity and sensitivity of the codes. For instance, explicit codes likely have high specificity (i.e., those with violence codes are very likely to reflect true violence), but may suffer from low sensitivity (i.e., many or most cases of violence are not coded as violence and are thus missed). If this misclassification is non-differential (Aschengrau and Seage 2021) with respect to predictor variables, then it may attenuate associations. However, if, as suspected, this misclassification is differential due to greater suspicion of violence in certain populations associated with our hardship measures, then associations may be biased away from the null. Proxy codes could see a decrease in specificity but an increase in sensitivity, and the final estimate may still be an underestimate of the actual effect but perhaps closer in magnitude to the actual effect. These codes are also misclassified, but perhaps in a less systematic way (less biased). The use of proxy codes allows for identification of violence cases that may not have been otherwise detected, which is important for prevention. The utility of proxy codes for prevention may make false positives from these codes more tolerable. To our knowledge, this is the first study to compare the predictors and trends over time for proxy and explicit codes; therefore, future validation work is important.

This study has limitations that should be considered. First, this study missed those who experienced violent events but did not go to the hospital. Second, hospital data may oversample those with health insurance in the population. That said, more severe or urgent injuries may bring people in for care despite the lack of health insurance coverage (Sommers and Simon 2017). Thus, this study may be seen as an analysis of more severe violence-related injuries. Third, hospital data lack details such as perpetration and location of the event that are available in sources like UCR, NCVS, and NCANDS, which could help to identify potential points for intervention. Fourth, the analysis includes a set of county-level covariates from a single time point (the 2010 decennial census or ACS). This limits the ability to account for variation across time in the covariates. However, there is minimal change over this 11-year time frame in these measures at the state level (Minnesota Compass 2020), and a middle timepoint captures mean covariate levels across the study period. Fifth, while these data are representative of violence-related injuries in Minnesota given the census of hospital records as the data source, the results may not be generalizable outside of Minnesota. Sixth, this study uses ICD-9 codes while the current version of hospital discharge codes is ICD-10; thus the study stopped at 2014. Despite the use of an older coding system, both injury coding systems (ICD-9 and ICD-10) use similar approaches that are translatable (Gibson et al. 2016). In addition, the ICD-10 external cause framework was developed to be as consistent as possible with ICD-9 codes (Injury Data and Resources 2019).

This study has several strengths that mitigate its weaknesses. First, this study utilizes a population-based administrative dataset at the county level for Minnesota, allowing generalizability to the entire population. Second, while proxy codes are likely to have some misclassification, they are subject to potentially less systemic bias and may thus better capture violence in communities where violence is not traditionally identified, such as whiter or wealthier communities (Sumner et al. 2015). Given the different strengths and weaknesses of explicit and proxy codes, and the lack of a gold standard for violence identification, it is useful to consider both approaches in research and for replication in other studies. The combined code approach could be a possible way to be more inclusive for studies that are attempting to target a broad pool. Third, this study utilizes county-level geography to distinguish associations between violence and county socio-demographic characteristics, which can be useful for local public health agency surveillance. In contrast, violence trend data such as NCVS are commonly reported at the national or regional level, which limits the more granular assessment of trends that occur in different parts of the United States.

There are several implications for future research from these findings. Violence surveillance utilizing hospital discharge data, and particularly proxy codes, may add important data that traditional surveillance lacks. Most importantly, explicit and proxy codes indicate different geographic patterns and trends over time. The use of proxy codes for violence identification may provide an avenue for capturing violence that traditional surveillance misses (Boyle and Kirkbride 2005). Accurate surveillance of violence is critical for resource allocation for prevention and intervention. Utilizing proxy codes in conjunction with explicit codes may be one step towards more comprehensive surveillance. More specifically, hospital records could be used as a syndromic surveillance system for violence, which could lead to potentially more timely and impactful interventions. Future research examining hospital discharge data for violence identification utilizing and verifying proxy codes can help to identify the hidden burden of violence.