Disparities in US physician burnout rates across age, gender, and specialty groups as measured by the Maslach Burnout Inventory-Human Services Survey for Medical Personnel (MBI-HSS) are well documented [1,2,3,4,5,6]. Physicians who are younger, female, and practicing in front-line specialties (e.g., emergency medicine, family medicine, and general internal medicine) have generally reported higher rates of burnout compared to their older, male colleagues practicing in non-front-line specialties [1]. In response, the National Academy of Medicine has recommended that healthcare organizations monitor and intervene on demographic disparities within their institutions [4]. However, it is unclear whether the observed disparities are explained by differences in the MBI-HSS’s functioning, or what is known as a lack of measurement equivalence, across demographic groups [4].

A measure is equivalent when it functions the same way across groups of respondents who might differ in gender, age, or other personal characteristics that may influence their responses to a self-reported measure. However, when a measure lacks equivalence across respondents who differ demographically, subscale score differences may actually reflect systematic differences in the way the demographic groups interpret items or in their willingness to endorse items, as opposed to true differences in the groups’ latent (unobserved) burnout symptom levels [7]. For example, female physicians may have higher observed burnout scores than male physicians because they are more willing than male physicians to report their symptoms, despite both groups having the same latent burnout levels. Establishing the measurement equivalence of an instrument is a key aspect of construct validity and, consequently, is required for the unbiased comparison of physician burnout across demographic groups [8, 9]. However, to our knowledge, no studies have evaluated the demographic measurement equivalence of the MBI-HSS in US physicians [10].

The aim of this study was to examine whether demographic disparities in US physician burnout are explained by differences in the MBI-HSS’s functioning across physician age, gender, and specialty groups.

Methods

Design and sample

This study used secondary, cross-sectional survey data from a national study on the prevalence of physician burnout conducted by Shanafelt et al. [2]. Data were collected in 2014 from physicians of all specialties sampled via email from the American Medical Association Physician Master File. Further sampling design details are reported in Shanafelt et al. [2]. From this dataset, we excluded physicians who were not practicing in the US or were retired.

Measures

The MBI-HSS is an outcome assessment of job burnout containing three subscales: emotional exhaustion (EE) (9 items), depersonalization (DP) (5 items), and personal accomplishment (PA) (8 items). All MBI-HSS items have a 7-point Likert-type, frequency response scale (0 = never, 1 = a few times a year or less, 2 = once a month or less, 3 = a few times a month, 4 = once a week, 5 = a few times a week, 6 = every day). Higher scores on each subscale indicate more of each construct. Burnout symptoms are indicated by high scores on the EE and DP subscales and low scores on the PA subscale. Demographic variables included age group (< 35, 35–44, 45–54, 55–64, and ≥ 65 years), gender (male and female), and specialty.
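The scoring described above can be sketched in a few lines. This is a minimal illustration that assumes raw subscale scores are simple sums of the 0–6 item responses; the input layout, function names, and example responses are hypothetical and are not taken from the study data.

```python
# Item counts per subscale, as stated in the text.
SUBSCALE_SIZES = {"EE": 9, "DP": 5, "PA": 8}
MAX_ITEM_SCORE = 6  # responses run from 0 (never) to 6 (every day)


def score_subscales(responses):
    """Sum 0-6 item responses within each subscale.

    `responses` maps a subscale name to its list of item responses
    (a hypothetical input layout, not the authors' data format).
    """
    totals = {}
    for name, n_items in SUBSCALE_SIZES.items():
        items = responses[name]
        if len(items) != n_items:
            raise ValueError(f"{name} expects {n_items} items, got {len(items)}")
        if any(r not in range(MAX_ITEM_SCORE + 1) for r in items):
            raise ValueError(f"{name} responses must be integers from 0 to 6")
        totals[name] = sum(items)
    return totals


def score_range(name):
    """Possible raw total range for a subscale under simple summation."""
    return 0, SUBSCALE_SIZES[name] * MAX_ITEM_SCORE


# Made-up responses for one respondent.
example = {
    "EE": [3, 4, 2, 5, 3, 1, 4, 2, 3],
    "DP": [1, 0, 2, 1, 0],
    "PA": [5, 6, 5, 4, 6, 5, 5, 6],
}
print(score_subscales(example))
```

Under this assumption the raw totals range from 0 to 54 (EE), 0 to 30 (DP), and 0 to 48 (PA).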

Statistical analyses

We evaluated the demographic measurement equivalence of the MBI-HSS subscales in a series of multi-group item response theory- (IRT-) based differential item functioning (DIF) analyses (Additional file 1: Appendix 1). IRT represents a class of generalized linear mixed effect models for relating observed item responses to latent constructs. Within an IRT framework, a lack of measurement equivalence in an item is called DIF. For a particular item under investigation, DIF occurred when the probability of endorsing one or more item responses differed significantly across reference and focal groups (e.g., males versus females) for physicians with the same latent burnout symptom (EE, DP, or PA) level.
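To make the definition of DIF concrete, the sketch below evaluates one 7-category item for two groups at the same latent level. It assumes a graded response model, which is one common IRT model for ordered responses; the text here does not state which model was fitted, and all parameter values are hypothetical.

```python
import math


def grm_category_probs(theta, slope, thresholds):
    """Category probabilities under a graded response model.

    `thresholds` must be increasing; K thresholds give K + 1 categories.
    """
    # P(X >= k | theta) for k = 1..K, bracketed by 1 (k = 0) and 0 (k = K + 1).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


def expected_item_score(theta, slope, thresholds):
    """Expected raw item score: sum of category index times its probability."""
    probs = grm_category_probs(theta, slope, thresholds)
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical parameters for one 7-category item (six thresholds).
reference = {"slope": 1.5, "thresholds": [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]}
# The focal group needs a higher latent level to endorse each category.
focal = {"slope": 1.5, "thresholds": [-1.7, -0.7, 0.3, 1.3, 2.3, 3.3]}

theta = 0.5  # same latent symptom level in both groups
ref_score = expected_item_score(theta, **reference)
foc_score = expected_item_score(theta, **focal)
print(round(ref_score, 3), round(foc_score, 3))
```

Because the two groups are matched on the latent level, any gap between the two expected scores in this toy example comes from the item parameters alone, which is what DIF refers to.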

IRT-based DIF analyses require that all IRT model assumptions, such as essential unidimensionality, have been met prior to analysis. These assumptions were evaluated and met in a previous IRT calibration of the MBI-HSS in US physicians using the same dataset by Brady et al. [11]. In Brady et al. [11], each scale demonstrated essential unidimensionality in unidimensional or bifactor confirmatory factor analyses. Following Brady et al. [11], we summed items EE4 (“people real strain”) and EE8 (“people too much stress”) to form a single, combined item (EE4EE8) to meet IRT model assumptions.

Our statistical analyses proceeded in two main steps: 1) DIF detection and 2) DIF impact assessment. Our analyses were informed by the scientific standards for instrument development and validation developed by the National Institutes of Health Patient-Reported Outcomes Measurement Information System (NIH PROMIS) [8].

DIF detection (item-level)

Following best practices [8, 12], we employed two IRT-based approaches to detecting DIF in each subscale item: log-likelihood ratio tests (LRTs) and Chalmers et al.’s (2018) signed differential response functioning (sDRF) statistic [13]. These approaches have been shown to be robust detection methods in previous studies [12,13,14]. Both DIF detection approaches require the selection of anchor items that have little to no DIF, which are used to estimate the reference and focal groups’ latent burnout symptom levels in multi-group IRT model estimation [8, 15]. Specialty groups with < 200 respondents were excluded from the DIF specialty analysis to ensure adequate sample size for DIF detection [14].

In the first DIF detection approach, we estimated an unconstrained baseline multi-group IRT model where all item parameters (except anchor items) were estimated freely across reference and focal groups and, for each item, compared its fit using an LRT against a more restrictive model where the item parameters for the studied item were constrained to be equal across groups. In the second DIF detection approach, we detected DIF in each subscale item using the sDRF statistic at the item-level, computed from the unconstrained baseline multi-group IRT model [13]. The item-level sDRF statistic estimates the overall average difference (bias) in the reference and focal groups’ expected item scores (i.e., raw item scores) across the latent burnout symptom continuum due to DIF in an item, after matching physicians on their latent burnout symptom levels [13]. Items showing a significant Benjamini-Hochberg adjusted LRT statistic (p < 0.05) or a Benjamini-Hochberg adjusted item-level sDRF statistic were flagged as displaying statistically significant DIF in one or more item parameters.
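The Benjamini-Hochberg step can be sketched as follows. This is a generic implementation of the adjustment, not the authors' code; the text does not say how tests were grouped into families, so the example simply adjusts one hypothetical set of five p-values with made-up item names.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five items (not results from the study).
raw = {
    "item_a": 0.001,
    "item_b": 0.012,
    "item_c": 0.020,
    "item_d": 0.045,
    "item_e": 0.600,
}
adj = benjamini_hochberg(list(raw.values()))
flagged = [name for name, p in zip(raw, adj) if p < 0.05]
print(flagged)
```

In this toy family, `item_d` has a raw p-value below 0.05 but is no longer flagged after adjustment, which is the kind of false-positive control the adjustment is meant to provide.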

DIF magnitude, or the degree of DIF present in an item, was captured by the size of the item-level sDRF statistic, which is in the same raw score metric as item scores [8, 13]. For example, a negative item-level sDRF statistic of − 1.0 for a particular item indicates that the focal group’s item scores will be, on average, one score point higher than the reference group’s item scores due to DIF, whereas a positive item-level sDRF statistic of 1.0 indicates that the focal group’s item scores will be, on average, one score point lower than the reference group’s item scores due to DIF. To aid in the interpretation of DIF magnitude, we converted absolute item-level sDRF estimates to SD units based on their respective item score distributions.
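The sign convention and the SD conversion can be illustrated with a small numerical sketch. It approximates a signed bias as the weighted average of the reference-minus-focal gap in expected item scores over the latent continuum. The standard normal weight, the evenly spaced grid, and both expected-score curves are assumptions of this example and may differ from what the cited method or the mirt package uses.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def signed_drf(expected_ref, expected_focal, lo=-6.0, hi=6.0, n_points=241):
    """Approximate the weighted average of (reference - focal) expected scores.

    Uses a standard normal weight over an evenly spaced theta grid; the
    weight, the grid, and the normalisation are assumptions of this sketch.
    """
    step = (hi - lo) / (n_points - 1)
    num = 0.0
    den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)
        num += weight * (expected_ref(theta) - expected_focal(theta))
        den += weight
    return num / den


def to_sd_units(sdrf, score_sd):
    """Express the absolute bias as a fraction of the score SD."""
    return abs(sdrf) / score_sd


# Hypothetical expected-score curves for a 0-6 item: the focal curve is shifted
# so that focal respondents score higher at the same latent level.
def ref_curve(theta):
    return 6.0 * logistic(1.2 * theta)


def focal_curve(theta):
    return 6.0 * logistic(1.2 * (theta + 0.3))


bias = signed_drf(ref_curve, focal_curve)
print(round(bias, 3), round(to_sd_units(bias, score_sd=1.8), 3))
```

The result is negative here because the focal curve lies above the reference curve at every latent level, matching the convention that a negative value means higher focal-group scores due to DIF. The `score_sd=1.8` value is a placeholder, not an SD reported in the study.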

DIF impact assessment (subscale-level)

Although items may display statistically significant DIF, the effect of the DIF on subscale scores across reference and focal groups may be negligible [16]. Therefore, an essential part of assessing measurement equivalence is to evaluate the impact of the statistically significant DIF identified [8, 12]. DIF impact relates to the aggregate effect of DIF across all subscale items on group- and individual-level subscale scores [8]. To evaluate DIF impact, we evaluated the size of the sDRF statistics at the subscale-level for all statistically significant DIF identified [13]. The subscale-level sDRF statistic estimates the overall average difference (bias) in the reference and focal groups’ expected subscale scores (i.e., raw total scores) across the underlying burnout symptom continuum due to the aggregate effects of DIF across all subscale items, after matching physicians on their underlying burnout symptom levels [13]. For example, a negative subscale-level sDRF statistic of − 1.0 indicates that the focal group will have total scores that are, on average, one raw score point higher than the reference group’s total scores due to the aggregate effects of DIF in the subscale, whereas a positive subscale-level sDRF statistic of 1.0 indicates that the focal group will have total scores that are, on average, one raw score point lower than the reference group’s total scores due to the aggregate effects of DIF in the subscale. A significant subscale-level sDRF statistic (p < 0.05) indicated that the aggregate effects of DIF across all subscale items resulted in significant differences in the subscale’s functioning across reference and focal groups. To aid in the interpretation of the DIF impact, we converted absolute subscale-level sDRF estimates to SD units based on their respective total score distributions.
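A short arithmetic sketch shows why item-level DIF need not translate into subscale-level impact. It assumes the expected subscale score is the sum of the expected item scores and that every item is averaged over the same latent weighting, in which case the signed subscale-level bias is the sum of the signed item-level biases. The item names and values are hypothetical.

```python
def subscale_bias(item_biases):
    """Sum item-level signed biases into a subscale-level signed bias."""
    return sum(item_biases.values())


def total_absolute_item_bias(item_biases):
    """Sum of absolute item-level biases, ignoring direction."""
    return sum(abs(b) for b in item_biases.values())


# Hypothetical item-level signed values in raw score points (not study results).
item_biases = {"item_1": 0.25, "item_2": -0.20, "item_3": 0.05, "item_4": -0.08}

net = subscale_bias(item_biases)
gross = total_absolute_item_bias(item_biases)
print(round(net, 2), round(gross, 2))
```

In this example the items carry 0.58 points of bias in absolute terms, yet the net effect on the total score is only 0.02 points because the positive and negative item-level biases largely offset each other.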

For aggregate DIF that resulted in a significant subscale-level sDRF statistic (p < 0.05), we assessed its practical impact by comparing differences in individuals’ IRT-estimated subscale scores and burnout symptom prevalence estimates produced from multi-group IRT models that were unadjusted and adjusted for DIF [8, 12].
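The practical-impact comparison can be sketched with three summary quantities: the mean absolute difference between scores from the two models, their correlation, and the gap in prevalence estimates. The scores, the cutoff, and the function names below are hypothetical; the text does not give the thresholds used to define symptom prevalence.

```python
import math


def mean_absolute_difference(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def prevalence(scores, cutoff):
    """Percentage of respondents at or above a cutoff.

    For a subscale where low scores indicate symptoms (such as PA),
    the comparison would be reversed.
    """
    return 100.0 * sum(s >= cutoff for s in scores) / len(scores)


# Hypothetical z-score estimates for six physicians from the two models.
unadjusted = [-1.20, -0.40, 0.10, 0.55, 0.98, 1.70]
adjusted = [-1.18, -0.43, 0.12, 0.52, 1.01, 1.69]
CUTOFF = 1.0  # hypothetical symptom threshold on the z-score metric

mad = mean_absolute_difference(unadjusted, adjusted)
r = pearson_r(unadjusted, adjusted)
prev_gap = abs(prevalence(unadjusted, CUTOFF) - prevalence(adjusted, CUTOFF))
print(round(mad, 3), round(r, 4), round(prev_gap, 1))
```

The toy data include one respondent who crosses the cutoff after adjustment, which shows why prevalence estimates are worth checking separately: individual scores can agree closely while a classification near the threshold still changes.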

All statistical analyses were conducted in R (v3.5.1) using the mirt package (v1.31.4) [17, 18]. This study was approved by the Boston University Medical Campus Institutional Review Board (H-37414).

Results

The overall sample included 6577 multi-specialty US physicians (Table 1). The majority of the sample were male, ≥ 55 years of age, and non-primary care physicians. We used physicians who were ≥ 65 years, male, and practicing in general internal medicine (GIM) as the reference groups in the respective age, gender, and specialty DIF analyses. Physicians in dermatology, neurosurgery, otolaryngology, pathology, radiation oncology, and urology were excluded from the specialty DIF analysis due to inadequate sample sizes.

Table 1 Overall and group-level sample characteristics

Detection of DIF in subscale items

We detected statistically significant DIF (via one or both detection methods) across age, gender, and specialty groups in all MBI-HSS items except EE5 (“burned out from work”) (Tables 2, 3 and 4). Statistically significant age DIF was detected in five EE items (EE1, EE3, EE6, EE7, EE9), three DP items (DP2-DP4), and seven PA items (PA1-PA7) (Tables 2, 3 and 4). Statistically significant gender DIF was detected in four EE items (EE1, EE2, EE6, EE7), one DP item (DP1), and three PA items (PA1, PA4, PA6). Statistically significant specialty DIF was detected in five EE items (EE2, EE6, EE7, EE4EE8, EE9), all DP items (DP1-DP5), and five PA items (PA1, PA3, PA4, PA7, PA8). See Additional file 1: Appendices 2–3 for additional DIF detection results.

Table 2 DIF detection and magnitude results by item and demographic variable – Emotional Exhaustion subscale
Table 3 DIF detection and magnitude results by item and demographic variable – Depersonalization subscale
Table 4 DIF detection and magnitude results by item and demographic variable – Personal Accomplishment subscale

Most EE items that had statistically significant age, gender, or specialty DIF showed DIF of a small magnitude, representing less than 0.10 SD of a given item’s score (Table 2). The DP and PA subscales had several items demonstrating larger age, gender, or specialty DIF, representing greater than 0.20 SD of a given item’s score (Tables 3 and 4). Within the EE, DP, and PA subscales, the largest DIF was observed in items EE6 (across GIM and general pediatrics specialty groups), DP4 (across GIM and anesthesiology specialty groups), and PA8 (across GIM and general surgery subspecialty groups), respectively. On average, general pediatricians, anesthesiologists, and general surgery subspecialists had respective item scores on EE6, DP4, and PA8 that were 0.33 points (0.18 SD), 0.40 points (0.30 SD), and 0.30 points (0.26 SD) lower than general internists due to DIF.

Impact of DIF on subscale scores

A subset of the statistically significant DIF produced significant overall average differences in expected subscale scores across demographic groups (Table 5). However, in all cases, the overall average differences in total scores due to DIF amounted to less than 0.10 SD on each subscale (Table 5, also see Additional file 1: Appendix 4). Age DIF impacted both the PA and EE subscales, but had no significant impact on the DP subscale (Table 5). Compared to physicians ≥ 65 years, physicians 45–54 and 55–64 years had respective total scores on the PA and EE subscales that were, on average, 0.49 score points (0.07 SD) and 0.18 score points (0.01 SD) higher due to the aggregate effects of age DIF. Gender DIF impacted the EE subscale, but had no significant effect on the DP and PA subscales (Table 5). Compared to male physicians, female physicians had EE total scores that were, on average, 0.34 score points (0.03 SD) lower due to gender DIF. This was primarily caused by gender DIF in items EE6 and EE7, where female physicians were systematically less likely to endorse feeling “frustrated with work” and “working too hard” than male physicians, respectively.

Table 5 Summary of statistically significant subscale-level signed differential response functioning (sDRF) estimates - by demographic variable and MBI-HSS subscale

Specialty DIF impacted all three MBI-HSS subscales (Table 5). On the EE subscale: emergency medicine physicians and neurologists had total scores that were, on average, 0.42 score points (0.03 SD) and 0.46 score points (0.03 SD) higher than general internists due to specialty DIF, respectively; and general pediatricians and pediatric subspecialists had total scores that were, on average, 0.63 score points (0.05 SD) and 0.21 score points (0.02 SD) lower than general internists due to specialty DIF, respectively. On the DP subscale: family physicians and neurologists had total scores that were, on average, 0.40 score points (0.06 SD) and 0.41 score points (0.06 SD) lower than general internists due to specialty DIF, respectively; and general pediatricians and OBGYN physicians had total scores that were, on average, 0.35 score points (0.05 SD) and 0.38 score points (0.06 SD) higher than general internists due to specialty DIF, respectively. On the PA subscale: anesthesiologists, emergency medicine physicians, neurologists, and psychiatrists had total scores that were, on average, 0.60 score points (0.09 SD), 0.30 score points (0.04 SD), 0.55 score points (0.08 SD), and 0.24 score points (0.04 SD) higher than general internists due to specialty DIF, respectively; and general surgery subspecialists had total scores that were, on average, 0.53 score points (0.08 SD) lower than general internists due to specialty DIF.

Among the subscales with significant subscale-level sDRF statistics, differences between DIF-unadjusted and DIF-adjusted models in physicians’ individual-level subscale scores and in symptom prevalence estimates were also very small (Table 6). In all cases, mean absolute differences in individual subscale scores were < 0.04 z-score units, and correlations between individual physicians’ subscale scores from the DIF-unadjusted and DIF-adjusted models were > 0.99. Absolute differences in the EE, DP, and low PA prevalence estimates between the two models ranged from 0.00 to < 0.70%.

Table 6 Impact of aggregate DIF within the EE, DP, and PA subscales on individual physicians’ subscale scores and burnout symptom prevalence estimates

Discussion

Studies have consistently demonstrated disparities in physician burnout by age, gender, and specialty on the MBI-HSS [4, 19, 20]. However, the extent to which disparities are explained by differences in the MBI-HSS’s functioning across demographic subgroups of US physicians has been unclear. In this study, we evaluated the measurement equivalence of the MBI-HSS subscales across age, gender, and specialty groups in a sample of US physicians. We found a lack of measurement equivalence across demographic groups in all items except EE5 (“feel burned out from work”), demonstrating that physicians’ age group, gender, or specialty biased nearly all item scores to some degree. However, in all cases, the overall average aggregate effects of DIF on the total subscale scores were small (< 0.10 SD). Furthermore, DIF had very little practical impact on individual physicians’ scores and burnout symptom prevalence estimates. Overall, our findings demonstrate that age-, gender-, and specialty-related disparities in US physician burnout are not explained by differences in the MBI-HSS’s functioning across these demographic groups.

Our study has several important implications for federal agencies and healthcare organizations aiming to monitor demographic disparities in physician burnout using the MBI-HSS [20,21,22]. First, our findings support the use of the MBI-HSS as a valid tool to assess age-, gender-, and specialty-related disparities in US physician burnout. Second, our research underscores the importance of using the full MBI-HSS subscales, rather than individual items, to assess demographic disparities in burnout. At the subscale level, the effects of DIF often cancelled out. Subscale-level cancellation occurs when DIF biases items by similar magnitudes but in opposing directions (e.g., one item biases total scores upward and another biases them downward by the same magnitude). Therefore, the subscale scores generally showed less bias due to DIF than item scores. If researchers are interested in using individual items or subsets of items, however, our analyses can be used to select the items with the least DIF. Furthermore, since the item-level and subscale-level sDRF statistics represent the degree of item- and subscale-level bias in the same raw score metric as item scores and total scores, researchers can use our findings to 1) assess the impact that DIF may have on a particular analysis and 2) adjust cross-group comparisons of raw item and total scores for DIF.

This study has several main limitations. First, DIF analyses can be prone to Type I error due to multiplicity or if the wrong anchor items are selected. We mitigated this not only by applying a multiplicity adjustment but also by thoroughly evaluating whether statistically significant DIF impacted group-level subscale scores, individual-level subscale scores, and burnout symptom prevalence. Second, our analysis computed the overall average difference in item and total scores due to DIF across a range of latent burnout symptom scores. As these differences are overall average differences across the latent metric, the bias in reference and focal groups’ scores at a particular point on the latent metric may be larger or smaller than the overall average [13]. Third, there is a paucity of literature on what constitutes “small” item-level DIF. However, our methods of examining the impact of DIF on subscale scores and burnout symptom prevalence are reasonable solutions. Fourth, although early and late responder analyses by Shanafelt et al. [2] support the demographic representativeness of the sample, it is possible that this sample is not entirely representative of the current US physician population. However, assuming that the items in this sample function the same as in the US physician population, the findings of this study would not be different. Finally, although the MBI-HSS subscales demonstrated measurement equivalence across age, gender, and specialty, they may lack equivalence across other groups that we did not evaluate (e.g., race/ethnicity groups). Future studies are needed to evaluate whether the MBI-HSS functions equivalently across other demographic groups.

Conclusions

As the MBI-HSS is increasingly employed in research and practice to monitor disparities in US physician burnout, it is important to understand its performance across demographic groups. Our findings demonstrate that differences in the way the MBI-HSS subscales function across groups do not account for the observed disparities in US physician burnout across age, gender, and specialty groups. Our findings support the use of the MBI-HSS as a valid tool to assess disparities in burnout across age, gender, and specialty groups in US physicians. Further research is needed to understand how these measures function across other physician subgroups.