Introduction

Every year in the US, approximately 600,000 patients undergo a hysterectomy, a surgery during which the uterus is removed [1, 2]. An estimated 20 million patients in the US have had a hysterectomy [1, 3, 4]. Hysterectomies are often performed for benign gynecologic conditions such as fibroids, endometriosis, pelvic pain, and uterine prolapse [5]. After cesarean sections, hysterectomies are the second most common surgery performed on US women under age 65, with one-third receiving a hysterectomy before age 60 [6]. While the surgery can be an effective treatment for debilitating gynecologic conditions and lowers endometrial cancer risk, there are severe, irreversible consequences to consider, such as infertility, surgical complications, and increased stroke and mortality risks, especially when operating for benign gynecologic conditions [7,8,9,10].

Most studies of gynecologic treatments like hysterectomy are unable to account for the treatment trade-offs and complex decision-making described above. A major reason is that these studies fail to capture multilevel influences on hysterectomy decision-making in diverse, population-based samples [11, 12]. Multilevel influences include patient factors (e.g., symptom severity, sociodemographic factors, other clinical factors), clinician factors (e.g., sociodemographics, clinical experience), and health care setting factors (e.g., type of practice, case mix, insurance mix). For instance, population-based studies using insurance claims or national databases, such as the National Inpatient Sample and the National Hospital Discharge Survey, cannot measure key patient clinical factors, such as the symptom severity of benign gynecologic diseases, or potentially important provider and health system factors [13]. In addition, insurance claims-based studies are often limited to single payor types, leaving the uninsured rarely studied [14,15,16,17,18]. In contrast, EMR-based studies have rich data on patient, provider, and health care setting factors. However, like insurance claims-based studies, EMR-based studies from single hospitals or small health systems fail to capture the breadth of people affected by gynecologic conditions [19,20,21].

In this paper, we first describe a novel approach to overcoming these barriers. We leveraged multiple data sources to supplement EMR data from a large (n > 1800), population-based series of hysterectomy patients treated in a large health system in a southeastern state. We supplemented structured EMR data with manual EMR abstraction to collect in-depth patient clinical data unavailable in diagnostic codes. We also linked these data to surgeon-level licensing data and patient residential census data. Second, to demonstrate one strength of the data, we evaluated whether patient and surgeon characteristics differed between the uninsured (who are not represented in insurance claims-based analyses) and the rest of the population.

Materials and methods

Study design

This study consists of a case series of patients who underwent hysterectomy between October 2nd, 2014, and December 31st, 2017, in one of 10 hospitals that were part of a large health care system in one southeastern state. Analysis of this growing system, comprising academic centers and large and small community hospitals, is an enormous strength, allowing us to examine the heterogeneity of care. This healthcare system has a presence in 100 counties in the state, extending into neighboring states, including metro areas as well as suburban and rural areas. One of the hospitals is an academic medical center that serves a high proportion of uninsured patients in addition to the privately and publicly insured population.

Individuals were included if they underwent hysterectomy for a benign gynecologic condition between ages 18 and 44 years (see Fig. 1 for exclusion criteria). To ensure a sufficient lookback period (6 months) of patients’ gynecological history, patients were excluded if their surgery occurred less than 6 months after the rollout of the EMR system (n = 668). Patients were also excluded if they were pregnant at the time of surgery (n = 58), had prior or active breast, ovarian, uterine, or cervical cancer diagnoses, or diagnoses for other cancers whose treatment plans may involve hysterectomy (bladder, anal, colorectal) (n = 165).
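The eligibility rules above can be sketched as a simple filter. This is a minimal illustration, not the study's actual selection code: the record fields and function name are hypothetical, the 6-month lookback is approximated as 180 days, and the order in which exclusions were applied (shown in Fig. 1) is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: the 6-month lookback is approximated as 180 days.
LOOKBACK = timedelta(days=180)
STUDY_START = date(2014, 10, 2)
STUDY_END = date(2017, 12, 31)


@dataclass
class Candidate:
    """Hypothetical record for one hysterectomy patient."""
    age_at_surgery: int
    surgery_date: date
    emr_rollout_date: date      # EMR rollout date at the treating hospital
    pregnant_at_surgery: bool
    excluding_cancer_dx: bool   # e.g., breast, ovarian, uterine, cervical,
                                # bladder, anal, or colorectal cancer


def is_eligible(c: Candidate) -> bool:
    """Return True if the candidate passes every stated criterion."""
    if not STUDY_START <= c.surgery_date <= STUDY_END:
        return False
    if not 18 <= c.age_at_surgery <= 44:
        return False
    # Require a full lookback window of EMR history before surgery.
    if c.surgery_date - c.emr_rollout_date < LOOKBACK:
        return False
    if c.pregnant_at_surgery or c.excluding_cancer_dx:
        return False
    return True
```

A patient whose surgery fell only 60 days after the EMR rollout at her hospital would be excluded for insufficient lookback, regardless of her other characteristics.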

Fig. 1 Flow chart of inclusion/exclusion criteria of CHC, ages 18 to 44 years undergoing surgery between 10/02/2014–12/31/2017

Data sources

The parent study linked data from the healthcare system, the U.S. census, and the state physician licensing board to develop a multifaceted approach for understanding determinants of treatment with hysterectomy (see Fig. 2).

Fig. 2 Data flow from various sources used to derive the Carolina Hysterectomy Cohort analytical dataset

Electronic medical record data

Structured EMR data – Data Warehouse: The health care system licenses a data warehouse to facilitate access to EMR data for research. Honest broker analysts from the data warehouse identified the study population by searching the EMR for hysterectomy procedure codes and relevant diagnostic codes provided by the research team. The data warehouse also provided our team with structured data on those patients, including patient demographics, diagnoses and procedures associated with the surgery, and other pre-specified treatments, prescriptions, and health care encounters.

EMR free text – Medical Record Abstraction: Additionally, we leveraged the unstructured EMR data to abstract detailed clinical information. After a thorough pilot study to refine an EMR abstraction process, the study team created a comprehensive abstraction protocol and a data collection tool in Research Electronic Data Capture (REDCap) to input patient data. The EMR abstraction protocol, REDCap data collection tool, and accompanying data codebook are freely available for review [22,23,24]. Using these tools, a team of professional medical record abstractors reviewed surgeon-reported patient progress notes, operative notes, and imaging and pathology reports. The information collected from the surgeon-reported notes included the presence or absence of patient symptoms and diagnoses, previous treatments, imaging and pathology report findings, primary assessments of the reason for surgery, and notes from the surgical operative report. The abstraction process and validation are described in more detail by Doll et al. [25].

Census data

Finally, the Data Warehouse linked the geocoded addresses of hysterectomy patients at the time of surgery to U.S. census tract data. Census data from the U.S. Decennial Census and the ongoing 2014 American Community Survey provided data about patients’ residential contexts.

Payor grouping

We grouped patients by payor according to the following categories: public, private, and uninsured. Public includes patients covered by Medicare (n = 67) or Medicaid (n = 226), or receiving care at a prison facility. Private includes those with private insurance (n = 1340) or Tricare (n = 53), the coverage provided to military service members and their families. Finally, the uninsured category includes those patients whose records indicated “self-pay,” either partially (n < 10) or wholly (n = 141).
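The grouping described above amounts to a lookup from the recorded payor to one of three categories. The following is a minimal sketch; the raw payor labels and the function name are hypothetical, since the actual values stored in the EMR are not reported here.

```python
# Hypothetical raw payor labels mapped to the three analytic categories.
PAYOR_GROUPS = {
    "medicare": "public",
    "medicaid": "public",
    "prison facility": "public",
    "private insurance": "private",
    "tricare": "private",
    "self-pay (partial)": "uninsured",
    "self-pay (whole)": "uninsured",
}


def payor_group(raw_payor: str) -> str:
    """Map a recorded payor label to public, private, or uninsured."""
    key = raw_payor.strip().lower()
    if key not in PAYOR_GROUPS:
        # Fail loudly rather than silently assigning an unknown payor.
        raise ValueError(f"Unrecognized payor label: {raw_payor!r}")
    return PAYOR_GROUPS[key]
```

Raising an error on unrecognized labels, rather than defaulting to a category, is a design choice made for this sketch so that unexpected payor values surface during data cleaning.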

Medical conditions

Because we performed medical record abstraction on all surgeries, we were able to classify surgeries by associated gynecologic conditions in two ways: (1) all gynecologic diagnoses listed at the time of surgery and (2) the primary indication recorded in the surgeon’s pre-operative note. The former represents formal diagnoses associated with the surgery that have been identified using ICD-9 and ICD-10 codes and is commonly used in claims-based research. The latter, the primary reason for the surgery that the surgeon recorded in the text of the pre-operative note, is often not available in insurance claims. While both categorization systems allow for multiple conditions to be listed, the list of gynecologic diagnoses can be quite long, whereas the list of primary indications from the pre-operative note was usually limited to 1 or 2 conditions.

Patient symptom severity scores

We scored each patient on the severity of their gynecologic symptoms to create a composite severity index on three domains: bulk, vaginal bleeding, and pelvic pain [25]. Example candidate input factors for the bulk score (score range 0 to 39) were bloating (1 point), pelvic pressure (1 point), bulk diagnosis at surgery or preoperatively (2–3 points), and uterine size (4 points). Example candidate input factors for the bleeding score (score range 0 to 26) were vaginal bleeding (1 point), period lasting more than 7 days (2 points), anemia (4 points), and a history of blood transfusion (5 points). Example candidate input factors for the pain score (score range 0 to 14) were pelvic pain (1 point), pain diagnosis code in the year before surgery (3 points), and opioid usage (4 points).
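To illustrate how such a domain score could be computed, the sketch below sums the points for the example factors listed above. It assumes the scores are additive and uses only the example factors named in the text, so it does not reproduce the full index, which is described in [25]. The factor keys and function name are hypothetical.

```python
from typing import Iterable, Mapping

# Example factors only; the full list of inputs is described in [25].
BLEEDING_POINTS = {
    "vaginal_bleeding": 1,
    "period_over_7_days": 2,
    "anemia": 4,
    "blood_transfusion_history": 5,
}
PAIN_POINTS = {
    "pelvic_pain": 1,
    "pain_dx_code_prior_year": 3,
    "opioid_use": 4,
}


def domain_score(findings: Iterable[str], points: Mapping[str, int]) -> int:
    """Sum the points for each listed factor present in the findings.

    Assumption: the domain score is a simple sum of factor points.
    """
    present = set(findings)
    return sum(value for factor, value in points.items() if factor in present)
```

Under these assumptions, a patient with documented pelvic pain and opioid usage would receive 1 + 4 = 5 points from these two example pain factors.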

Linking surgeon data

We merged surgeon demographic and occupational data with the CHC patient dataset to understand practice associations with hysterectomy decision-making. We obtained surgeon data from a state-specific Health Professions Dataset (HPDS). The HPDS maintains and disseminates licensed medical professionals’ demographics (e.g., gender, race, ethnicity), education (e.g., graduation year, where trained), and practice-level data (practice setting, total hours worked in an average week, and percent time in direct patient care). The HPDS has produced and maintained continuous data files since 1979. We identified and linked the primary surgeon’s information to the patient data. See Fig. 3 for the primary surgeon identification algorithm. Briefly, the patient’s billing surgeon was identified as the lead surgeon. If the patient had multiple billing surgeons, then the lead surgeon was the surgeon who was listed as the primary surgeon in the OR log. An MD collaborator reviewed the algorithm results and adjudicated and confirmed the lead surgeon by reviewing patient records. Using practice name and address, we grouped all surgeons by the practices in which they work. The linked physician licensing data will be used to create surgeon- and practice-level variables, including surgeon volume as well as novel measures such as distinctive practice-level treatment patterns. These variables will be utilized in future studies.
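The two rules summarized above can be sketched as follows. This is a simplified illustration rather than the full algorithm in Fig. 3: the function and argument names are hypothetical, and the handling of cases that neither rule resolves (returning None so the record can be flagged for manual review) is an assumption.

```python
from typing import Optional, Sequence


def identify_lead_surgeon(
    billing_surgeons: Sequence[str],
    or_log_primary: Optional[str],
) -> Optional[str]:
    """Apply the two stated rules to pick a lead surgeon.

    Returns None when neither rule resolves the case (assumption:
    such records would be flagged for manual adjudication).
    """
    # De-duplicate while preserving order.
    unique = list(dict.fromkeys(billing_surgeons))
    if len(unique) == 1:
        # Rule 1: a single billing surgeon is the lead surgeon.
        return unique[0]
    if len(unique) > 1 and or_log_primary is not None:
        # Rule 2: with multiple billing surgeons, defer to the OR log.
        return or_log_primary
    return None
```

In this sketch every result, resolved or not, would still be subject to the MD review described above.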

Fig. 3 Algorithm showing how the lead surgeon was identified for each patient who underwent hysterectomy between 10/02/2014–12/31/2017

Analysis

We performed descriptive analyses in which we present counts and frequencies stratified by insurance payor. We used chi-squared or Fisher’s exact tests and Kruskal-Wallis tests to assess whether patient and surgeon characteristics differed by insurance payor. All analyses were performed using the Statistical Analysis System (SAS) statistical software package, version 9.4 (SAS Institute Inc., Cary, NC, USA).
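For readers unfamiliar with the chi-squared test used for categorical comparisons, the sketch below computes the Pearson chi-squared statistic and its degrees of freedom for a contingency table. It is written in Python for illustration only; the analyses in this study were run in SAS, and the function name and any counts passed to it are hypothetical. The p-value, which requires the chi-squared distribution, is omitted.

```python
from typing import List, Sequence, Tuple


def chi_squared_statistic(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return the Pearson chi-squared statistic and degrees of freedom.

    Rows could be levels of a patient characteristic and columns the
    payor groups. Assumes every row and column total is positive.
    """
    row_totals: List[float] = [sum(row) for row in table]
    col_totals: List[float] = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

When expected cell counts are small, as can happen for the uninsured group, Fisher’s exact test is the usual alternative to this approximation.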

Results

Description of overall cohort: Carolina hysterectomy cohort

We identified 1857 patients through the EMR who underwent hysterectomy for benign gynecologic conditions and met the inclusion criteria described earlier. As shown in Table 1, most were non-Hispanic White (55.5%); about a third were non-Hispanic Black (30.4%); and the remainder were Hispanic (8.5%), non-Hispanic Asian (1.2%), American Indian/Alaskan Native (0.9%), or identified as other or unknown/refused (3.5%). In our case series, 57.2% were married and 73.8% lived in urban areas. They had a variety of insurance coverage types: public (17.3%), private (75.0%), and uninsured (7.7%) (Table 1). While the majority of patients were treated at a non-academic hospital (57.8%), a substantial share (42.2%) were treated at an academic medical center.

Table 1 Comparison of characteristics of EMR-based CHC versus stratified sample based on payor between 10/02/2014–12/31/2017

The most common diagnosis associated with surgery, identified using diagnostic codes from the structured EMR data provided by the data warehouse, was menorrhagia (68.2%). A larger proportion of publicly insured patients (73.5%) had a diagnosis of menorrhagia. The second most common diagnostic code in the overall sample was fibroids (59.5%), with larger proportions of private (61.5%) and uninsured (67.8%) patients having fibroid diagnoses than the overall sample.

We identified 25 unique practices and 115 primary surgeons who worked in these practices during the study period. More than half of the surgeons identified as female (54.8%); their median age was 43 years (IQR: 34, 50; data not shown). The median age of male surgeons was 49 years (IQR: 40, 56; data not shown). Most surgeons were non-Hispanic White (71.3%); 10.4% were Asian/Pacific Islander, 8.7% were non-Hispanic Black, 7.8% were Hispanic, and the rest identified as “Other.” The median annualized surgeon volume was 4 (IQR: 2, 7.5; data not shown). Over 75% of the surgeons had an annualized volume of 10 or less. Over 90% of the surgeons reported Gynecology, Gynecologic Oncology, or Obstetrics and Gynecology as their primary practice area.

Comparisons between uninsured patients and the rest of the hysterectomy sample

Patient sociodemographic characteristics differed by payor. Hispanic patients were disproportionately likely to be uninsured: while they only represented 8.5% of the total population, they represented almost half (45.5%) of the uninsured population. For White and Black patients, the likelihood of being privately insured was similar, although Black patients were slightly more likely to be publicly insured. Among the uninsured, 41.3% were single and among the publicly insured, 44.9%, whereas among the overall population, only 27.4% were single. Married patients, who comprised 57.2% of the whole population, were disproportionately likely to be privately insured, comprising 64.8% of that population.

The uninsured and the publicly insured patients had higher median bleeding scores (9.0 (4.0, 19.0) and 7.0 (3.0, 12.0), respectively) than the privately insured. The median pain scores in the uninsured and publicly insured were similar (6.0 (3.0, 11.0) and 6.0 (3.0, 11.0), respectively) and higher than in the privately insured (3.0 (0, 7.0)). The Charlson comorbidity index (CCI) in these mostly premenopausal patients with benign conditions was highly skewed. Most patients had a score of 0, unlike in studies of older hysterectomy patients or those receiving hysterectomy as treatment for endometrial cancer [26].

In the overall patient population, based on the abstracted pre-operative EMR notes, the most common surgeon-reported reason for surgery was fibroids, a main reason for 37.3% of patients. Menorrhagia was also a common indication, listed in notes as a main reason for surgery in 34.5% of patients. However, the patterns differed for uninsured patients. Fibroids were the most common indication (39.2%) for the uninsured, but the second most common was abnormal uterine bleeding (36.4%), with menorrhagia noted as a main indication in only 15.4% of uninsured patients. Surgeon characteristics, such as sex, race/ethnicity, surgeon volume, and time since residency, did not differ significantly by insurance payor group.

Discussion

Leveraging EMR data from a single-state health care delivery system allowed us to identify a diverse case series of patients. Our study sample includes a reproductive-aged patient population that is diverse with respect to race/ethnicity, insurance status, and residential environments. Due to our focus on premenopausal hysterectomy, our study population is younger than the state-wide claims-based hysterectomy population; however, the demographic characteristics are similar [17]. Reproductive-aged Hispanic and Black patients were disproportionately likely to be uninsured. Comparing uninsured patients to other payor groups, we found differences in their marital status, primary reasons for surgery, and symptom severity.

An analysis using claims data instead of EMR data would limit analyses to insured (single payor) or Medicaid patients. As a result, these claims-based analyses would disproportionately exclude Black and Hispanic patients from study, further exacerbating health inequities, some of which may be invisible in studies using EMR alone.

This study demonstrates a feasible method for capturing a reproductive-aged population that is more diverse than that captured from private insurance claims data and/or Medicaid data alone. A large issue that remains unaddressed with private insurance claims and Medicaid data is the large proportion of adults in the USA who are uninsured (approximately 12.4% of adults in 2016) [27]. Indeed, in our population of hysterectomy patients, 7.7% were uninsured. This group was disproportionately likely to be Hispanic, to be single, and to live in a census tract with lower median household income than the overall population. Use of claims renders the uninsured absent from research. The ability to achieve more complete condition identification is an advantage of EMR data over claims data [18].

Another advantage of EMR-based research on reproductive-aged populations over claims-based research and the National Inpatient and National Hospital Discharge datasets is the richness of unstructured patient EMR data. The primary indications abstracted from the unstructured EMR gave us more specificity about reasons for surgery than the diagnostic codes associated with surgery. For example, we found that most patients had multiple diagnostic codes associated with their surgeries, making it difficult to distinguish important clinical indications for a hysterectomy. In contrast, abstracting from the preoperative notes helped us focus on the 1 or 2 important indications for the surgery. Additionally, the National Inpatient Sample is not representative of the hysterectomy patient population, as the great majority of hysterectomies now happen in outpatient settings [28].

The main operational challenges we encountered revolved around the need to harmonize structured and abstracted data from electronic medical records while also managing variations in EMR rollout dates. There were also some discrepancies in the structured EMR, which relied on ICD-9/ICD-10 codes, when compared to the unstructured EMR, which was abstracted from notes into REDCap. While this required additional time, labor/personnel, and costs for data cleaning and data management, this enabled us to capture a more detailed picture of how and when encounters happened as well as additional clinical detail when compared to structured EMR data alone. Additionally, we faced dynamic administrative circumstances as hospitals had variable EMR rollout dates. A primary exclusion criterion of our case series is a patient lookback period of 180 days, which resulted in the exclusion of 668 patients due to insufficient lookback period. During the time of patient selection, one site was acquired by an entity external to the health system under study, so this particular site only contributed patients until September 1st, 2018, varying from other sites.

Our study is limited to gynecologic surgery patients within one healthcare system in the U.S. South. Results may differ in larger or other kinds of health systems or for other health conditions. As with claims research, data are often reported by surgeons rather than by the patients themselves. As a consequence, some key variables, such as race, may differ from what patients would self-report. Further research could supplement EMR-derived quantitative study designs with qualitative work or with direct patient surveys. We were also underpowered to examine hysterectomy among Asian-American and American Indian patients and individuals of other racial or ethnic groups. In particular, Asian-Americans had very low numbers of hysterectomies in our sample, reflecting a relatively small proportion within the catchment area for this health system and their low rates of hysterectomy in the South [29]. An additional challenge with studying American Indian populations is that health services databases typically have low sensitivity for identifying them [30]. However, we believe that these descriptive analyses will serve as a foundation for future research on these groups, who are often, unfortunately, excluded from gynecologic health research. Much health services research that investigates influences on outcomes focuses on patient individual and residential environments and neglects multilevel health care factors.

Conclusion

We demonstrated that it is possible to assemble a socioeconomically, racially/ethnically, and clinically diverse cohort of premenopausal patients using EMR data and to link that patient population to surgeon and health care setting data. By considering differences in a single state, we minimized confounding by regional variation in practice. Rather than relying solely on diagnostic codes at single clinical sites, we leveraged richer clinical data. Finally, we included uninsured patients and demonstrated that these patients differ sociodemographically and clinically from insured patients. Other studies that exclude these patients by design do not accurately represent the patient population for premenopausal hysterectomy. Application of this study design will allow for further innovation and exploration of the understudied research area of gynecologic health [13, 31].