INTRODUCTION

Core clerkships are a foundation of medical education, and clerkship assessments are informed by narrative evaluations completed by supervising physicians. These evaluations form the basis of clerkship grades, and their narrative language is quoted in Medical Student Performance Evaluation (MSPE) letters and letters of recommendation, which are core components of residency applications.1 Narrative language, however, is inherently open to bias and to the consequences that can arise from it.2

The National Academies of Sciences, Engineering, and Medicine found that in academic settings, subjective evaluation criteria are often infiltrated with bias that disadvantages women.3 The Association of American Medical Colleges reported that recruitment, evaluation, and promotion processes involve implicit and unconscious bias, inhibiting the development of a diverse medical workforce.4 Research using manual and programmatic approaches to linguistic analysis, such as qualitative coding and automated text analysis, respectively, suggests that narrative evaluations can introduce gender-based stereotypes, including the perception of women as emotional and sensitive,5–11 which can be detrimental to the advancement of the individual being evaluated.12 Furthermore, the consequences of subjective assessments may be even more damaging to racial and ethnic minorities who are underrepresented in these fields.13 For example, underrepresented groups in medicine may be even more “othered,” or differentiated in a manner that excludes, marginalizes, or subordinates.14–18 This phenomenon may reflect an insufficiently diverse physician workforce,19 as well as the reported tendency of supervisors to focus on social and cultural factors rather than competency-related factors.13

In 1999, the Accreditation Council for Graduate Medical Education (ACGME) and the American Board of Medical Specialties endorsed competency-based evaluations in an effort to move toward assessment of specific competency domains based on behavior rather than personal attributes. The ACGME later introduced milestones as a standardized method to operationalize the assessment of students’ progress toward achieving these competencies.20 As a result, American medical schools focus on competency-based assessment.

We aim to characterize narrative language differences in medical student evaluations using natural language processing. Our primary aim is to determine whether students are described differently by gender and by underrepresented minority (URM) status, using metrics commonly employed in natural language processing.

METHODS

Design

This study was approved by the University of California, San Francisco Institutional Review Board (15-18271) and deemed exempt by the Brown University Institutional Review Board. This is a secondary data analysis of narrative evaluations (text) from two medical schools. We applied natural language processing to elucidate differences by gender and URM status.

Data Sources

We included data from all third-year core clerkship evaluations at two medical schools affiliated with public and private academic institutions in large urban settings, with associated information about student demographics, clerkship specialty, and grade. Data were collected from 2006 to 2015 at school 1 (as identified in Table 1) and from 2011 to 2016 at school 2, to exclude years in which major grading practice changes were implemented. At both schools, each clerkship grade was one of three mutually exclusive options: non-pass, pass, or honors, with no intermediate options. Only complete cases containing student gender, URM status, clerkship specialty, and grade received were used in analyses, for a total of 87,922 evaluations meeting these criteria. Students self-identified their ethnicity, and the medical schools determined which racial/ethnic categories were URM. We used the institutional definition of URM status: Black or African American, Hispanic or Latino, and American Indian or Alaska Native. All other self-identified ethnicities were categorized as non-URM.
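As a minimal sketch of this inclusion step in R, the following code applies complete-case filtering and the institutional URM definition; the toy data frame and its column names are hypothetical stand-ins for the schools’ actual records.

```r
# Toy records standing in for the evaluation dataset; the data frame
# and its column names are hypothetical, not the schools' actual schema.
evals <- data.frame(
  gender    = c("F", "M", NA, "F"),
  ethnicity = c("Hispanic or Latino", "White", "Asian",
                "Black or African American"),
  specialty = c("Surgery", "Pediatrics", "Medicine", "Neurology"),
  grade     = c("honors", "pass", "pass", NA),
  stringsAsFactors = FALSE
)

# Keep complete cases only, mirroring the stated inclusion criteria
evals <- evals[complete.cases(evals), ]

# Apply the institutional URM definition to self-identified ethnicity
urm_groups <- c("Black or African American", "Hispanic or Latino",
                "American Indian or Alaska Native")
evals$urm <- ifelse(evals$ethnicity %in% urm_groups, "URM", "non-URM")
```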

Table 1 Dataset Characteristics

Both schools included in this study fully incorporate the ACGME recommendations for Core Competencies for medical student training,20 and these recommendations had been implemented before the study period. Additionally, grades for required core clerkships were determined by a combination of clinical ratings and standardized exam scores, with the National Board of Medical Examiners (NBME) exam accounting for no more than 25% of a grade. At each school, no more than 30% of students received honors in a clerkship. Sample evaluation forms from each institution are available in Appendix Figures 1 and 2 online. Faculty at the two institutions were similar in composition: in 2015, school 1 had 48% female faculty, while school 2 had 45%. At school 1, 8% of faculty were URM, 88% were non-URM, and 4% were unknown; at school 2, 4% of faculty were URM, 84% were non-URM, and 12% were unknown.

Data were collected with unique identifiers for each narrative evaluation, without linkages between multiple evaluations for a single student across clerkships and time. A breakdown of evaluation composition by grade, gender, URM status, and specialty is shown in Table 1.

De-identification

The narrative text of evaluations was de-identified in a two-step process. First, a database of names was compiled from publicly available US Census and Social Security name records,21–23 and the text of evaluations was matched against these names, with a second filter using part-of-speech processing to identify proper nouns; all names identified in this process were replaced with generic fillers. Second, a subset of the narrative evaluations was manually reviewed to verify complete de-identification.
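A minimal sketch of the first step in R, assuming a small hypothetical name list in place of the compiled Census/Social Security database:

```r
# Sketch of the name-replacement step: match tokens against a compiled
# name list and replace with a generic filler. This short name_list is
# a hypothetical stand-in for the Census/Social Security database.
name_list <- c("Maria", "Garcia", "James", "Smith")

deidentify <- function(text, names, filler = "[NAME]") {
  # Word boundaries (\b) prevent replacing substrings of other words
  pattern <- paste0("\\b(", paste(names, collapse = "|"), ")\\b")
  gsub(pattern, filler, text)
}

deidentify("Maria Garcia exceeded expectations on the wards.", name_list)
#> [1] "[NAME] [NAME] exceeded expectations on the wards."
```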

Parsing

We used an open-source, pretrained English-language parser released by Google, built on SyntaxNet,24,25 an open-source neural network framework for TensorFlow. We applied the parser to the narrative evaluations both to assist de-identification and to provide part-of-speech tags and parse structure, which formed the basis of the dataset used in the primary analyses below.
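SyntaxNet itself runs outside R; as an illustrative stand-in (an assumption, not the pipeline used in the study), the udpipe R package provides comparable part-of-speech tagging, which is sufficient to extract the adjective “descriptors” analyzed below:

```r
library(udpipe)  # stand-in tagger; the study itself used SyntaxNet

# Download and load a pretrained English model (one-time download)
dl    <- udpipe_download_model(language = "english")
model <- udpipe_load_model(dl$file_model)

# Tag a sample evaluation and keep adjectives (upos == "ADJ"),
# matching the paper's definition of a "descriptor"
tagged <- as.data.frame(udpipe_annotate(
  model, x = "She was an energetic and compassionate student."))
descriptors <- tolower(tagged$token[tagged$upos == "ADJ"])
descriptors
#> [1] "energetic"     "compassionate"
```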

Analysis

First, we compared the distribution of grades, dichotomized to honors versus pass, across gender, URM status, and clerkship specialty, using Pearson’s chi-squared tests with the Benjamini-Hochberg procedure for multiple testing correction. Second, we examined the length of evaluations, quantifying differences in distribution with the Wilcoxon-Mann-Whitney test.
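A minimal sketch of these two comparisons in R, with hypothetical counts and lengths standing in for the real data:

```r
# Sketch of the grade-distribution comparison with toy counts
# (counts are hypothetical; rows = gender, columns = grade)
honors_by_gender <- matrix(c(5200, 9800,    # women: honors, pass
                             4400, 9900),   # men:   honors, pass
                           nrow = 2, byrow = TRUE,
                           dimnames = list(c("women", "men"),
                                           c("honors", "pass")))
p_overall <- chisq.test(honors_by_gender)$p.value

# One such test per comparison (overall, each specialty, ...),
# then Benjamini-Hochberg adjustment across all of them
p_adjusted <- p.adjust(c(overall = p_overall), method = "BH")

# Evaluation lengths (toy word counts) compared by the
# Wilcoxon-Mann-Whitney test
len_women <- c(120, 98, 143, 110, 87)
len_men   <- c(101, 95, 150, 99, 92)
wilcox.test(len_women, len_men)
```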

Next, we generated a list of frequently used descriptors, defining descriptors as adjectives. The ten most frequent terms did not differ by gender or URM status when stratified by grade (Appendix Tables 2a and 2b online). To characterize word frequency more accurately, we employed a widely used natural language processing method, term frequency-inverse document frequency (TF-IDF),26 a measure of the frequency of a term adjusted for how rarely it is used. Here, we defined term frequency as the frequency of a given word in an evaluation, and inverse document frequency as the inverse of the total frequency of the word’s usage across all evaluations. We then averaged the TF-IDF value for a given word by gender and URM status. Examining TF-IDF by gender and URM status allowed us to infer the significance of a word and whether it was used with similar weight and meaning across evaluations. For example, the word “excellent” has a highly positive connotation, but because it appeared so frequently across all evaluations, it received a low TF-IDF score; thus, any single use of “excellent” carries little distinguishing meaning. In contrast, the word “energetic” appeared in fewer evaluations overall and so received a higher TF-IDF score, making each use of “energetic” carry more weight.
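As a minimal illustration of this weighting, the following R sketch uses one standard TF-IDF formulation, tf × log(N/df), on a toy corpus; the study’s exact weighting may differ in detail.

```r
# Toy corpus of three tokenized "evaluations"; the real analysis
# used adjectives extracted from 87,922 evaluations
docs <- list(c("excellent", "excellent", "student"),
             c("excellent", "energetic", "student"),
             c("excellent", "thorough", "student"))
vocab <- unique(unlist(docs))
N <- length(docs)

# Document frequency: number of evaluations containing each word
df <- sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d)))

# TF-IDF per word per document (tf * log(N/df)), averaged across docs
tfidf <- sapply(vocab, function(w) {
  tf <- sapply(docs, function(d) sum(d == w) / length(d))
  mean(tf * log(N / df[w]))
})
sort(tfidf, decreasing = TRUE)
# "excellent" appears in every document, so its TF-IDF is zero;
# rarer words like "energetic" score higher per use
```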

TF-IDF has been shown in other work,27 primarily in the field of information retrieval, to be superior to absolute term frequency because it weights terms in a manner that reflects their “importance.” As expected, the TF-IDF values of the most commonly used words, as determined by overall frequency, were low, suggesting that their wide usage reflects a range of meanings (Appendix Table 3 online). We ranked the descriptors used in more than 1% of evaluations (a common threshold in large text datasets) by TF-IDF score, by gender, and by URM status.

Finally, we reported which descriptors evaluators used differently by gender and URM status, using Pearson’s chi-squared tests with Benjamini-Hochberg correction. We surveyed this study’s co-authors, who include experts in medical education as well as clinical faculty, about the descriptors found to differ significantly in usage by gender and URM status, asking whether each descriptor reflected a “personal attribute,” a “competency-related” term, or neither. We then categorized each word by majority vote and present this categorization in Table 4.
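A minimal sketch of the per-word test in R: each descriptor gets a 2×2 table of (evaluations containing the word versus not) by group, followed by Benjamini-Hochberg adjustment across words. All counts below are hypothetical.

```r
# Toy per-word counts: evaluations of each group containing the word
# vs. not (values are hypothetical, for illustration only)
word_counts <- data.frame(
  word          = c("pleasant", "energetic", "humble"),
  women_with    = c(900, 410, 60),
  women_without = c(44100, 44590, 44940),
  men_with      = c(700, 280, 120),
  men_without   = c(42300, 42720, 42880)
)

# One chi-squared test per word on its 2x2 (group x word-present) table
p_raw <- apply(word_counts[, -1], 1, function(counts) {
  chisq.test(matrix(counts, nrow = 2, byrow = TRUE))$p.value
})

# Benjamini-Hochberg correction across all word-level tests
word_counts$p_bh <- p.adjust(p_raw, method = "BH")
word_counts[word_counts$p_bh < 0.05, c("word", "p_bh")]
```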

All analyses were performed with scripts written in R version 3.3.0 (2016-05-03). We considered two-sided p < 0.05 to be significant after correction with the Benjamini-Hochberg procedure, a multiple testing correction based on false discovery rate estimation.28,29

RESULTS

Grade Distribution

Overall, 32% of evaluations were associated with honors grades and 66% with passing grades, with the remainder associated with non-pass grades (Table 1). Women received honors more often than men overall and were more likely to receive honors in pediatrics, obstetrics/gynecology, neurology, and psychiatry; men were more likely to receive honors in surgery and anesthesia (Table 2). A comparison of non-URM and URM students showed that evaluations of URM students were associated with fewer honors grades than evaluations of non-URM students. When stratified by clerkship specialty, URM students received fewer honors grades across all specialties. These distributions were comparable at the two schools in our dataset (data not shown).

Table 2 Grade Distribution by Gender, URM Status and Specialty

Evaluation Length

We examined evaluation length by gender and URM status, stratified by grade (Appendix Table 1 online). The distributions of evaluation length across groups were similar and, although statistically significant in some instances, the differences were not meaningful in magnitude.

Common Descriptors by Gender and URM Status

Among descriptors used in more than 1% of evaluations, we examined the highest-ranking words as measured by TF-IDF, by gender and URM status (Table 3). The top ten ranked words were comparable across gender and URM status, suggesting that this measure does not provide sufficient granularity to detect meaningful differences in narrative evaluations.

Table 3 Important and Unique Descriptors, Among Commonly Used Words

Differential Usage of Descriptors by Statistical Significance

We found that among all evaluations, 37 words differed in usage between men and women. Sixty-two percent (23/37) of these descriptors represented personal attributes, and of these, 57% (13/23) were used more in evaluations of women. In evaluations of women, personal attribute descriptors such as “pleasant” were associated with pass grades, while “energetic,” “cheerful,” and “lovely” were neutral in their grade association. Personal attribute descriptors such as “wonderful” and “fabulous,” used more frequently in evaluations of women, were associated with honors grades. In evaluations of men, personal attribute descriptors such as “respectful” or “considerate” were neutral in their association with grade, while “good” was seen more with pass grades and “humble” more with honors grades.

Of the 37 descriptors that differed by gender, only 19% (7/37) were words we categorized as competency-related, and of these, 57% (4/7) were used more in evaluations of women. The descriptors “efficient,” “comprehensive,” and “compassionate” were used more often in evaluations of women and were associated with honors grades; evaluations of men containing “relevant” were also associated with honors grades.

Descriptors with significantly different usage between men and women are shown in Figure 1, distributed along the x-axis; the association of any given word with honors or pass grades is indicated by its position along the y-axis. In addition, words represented in Figure 1 (and Figure 2, described below) were of high importance as measured by TF-IDF, with even higher values than the common words reported in Table 3 (data not shown).

Fig. 1

Descriptors with statistically significant differences in usage by gender. All words were assessed for differential usage between groups of interest, with statistical significance defined as p < 0.05. The location of a word on the men-women axis indicates its preferential use in evaluations of either gender. Distance from the y-axis also indicates increased difference from the expected word distribution; note, however, that all words shown differ significantly in usage by gender. Placement along the pass-honors axis indicates a word’s association with honors- or pass-graded evaluations. Orange highlights words used more in evaluations of women, while blue highlights words used more in evaluations of men. The categorization of these terms as “personal attribute” versus “competency-related” descriptors can be found in Table 4.

Fig. 2

Descriptors with statistically significant differences in usage by URM status. All words were assessed for differential usage between groups of interest, with statistical significance defined as p < 0.05. The location of a word on the non-URM-URM axis indicates its preferential use in evaluations of either group. Distance from the y-axis also indicates increased difference from the expected word distribution; note, however, that all words shown differ significantly in usage by URM status. Placement along the pass-honors axis indicates a word’s association with honors- or pass-graded evaluations. Orange highlights words used more in evaluations of URM students, while blue highlights words used more in evaluations of non-URM students. The categorization of these terms as “personal attribute” versus “competency-related” descriptors can be found in Table 4.

Among all evaluations, 53 descriptors differed in usage between evaluations of URM and non-URM students. Thirty percent (16/53) of these descriptors represented personal attributes, and of these, 81% (13/16) were used more often to describe non-URM students. The descriptors “pleasant,” “open,” and “nice” were used more often to describe URM students and were associated with passing grades. Many personal attribute descriptors used more often to describe non-URM students, such as “enthusiastic,” “sharp,” or “bright,” were neutral in their association with grade, while “mature” and “sophisticated” were more frequently associated with honors grades.

Of the 53 descriptors that differed by URM status, only 28% (15/53) were competency-related, and all of these (15/15) were used more in evaluations of non-URM students. The competency-related descriptors “outstanding,” “impressive,” and “advanced” were more frequently associated with honors, while “superior,” “conscientious,” and “integral” were neutral in their association with grade. Of note, all descriptors (whether personal attribute or competency-related) used more frequently in evaluations of non-URM students were either neutral in their association with grade or associated with honors grades. Descriptors with significantly different usage between URM and non-URM students are shown in Figure 2, distributed along the x-axis; the association of any given word with honors or pass grades is indicated by its position along the y-axis.

DISCUSSION

This novel application of natural language processing to what we believe is the largest sample of medical student evaluations analyzed to date reveals how students are described differently by gender and URM status. Across student evaluations, common, important words were used with similar frequency across gender and URM status. However, our analysis revealed significant differences in the usage of particular words by gender and by URM status. While terms deemed important by the TF-IDF metric reflected both personal attributes and competencies and were comparable across gender and URM status, the words with statistically significant differences in usage between groups reflected personal attributes more than competencies, as defined in Table 4. Although both competency-related and personal attribute descriptors were used differentially by gender and URM status, personal attribute descriptors dominated the words used differently between these groups, a pattern we believe is an important signal of how student performance is assessed.

Table 4 Categorization of descriptors by personal attribute vs. competency

Our study is consistent with previous work examining differences in grading by gender and URM status in limited settings. Lee et al. showed that URM students receive lower grades across clerkships,30 and other work has shown conflicting effects of student demographics, including age, on clerkship grades.31 Whether other objective measures of academic performance, such as prior standardized test scores or undergraduate GPAs, contribute to clerkship grades has also been debated.30,32 However, previous work has been limited in scope, both in the range of clerkship specialties examined and in sample size. The breadth of our data allows identification of infrequent instances of differential descriptors that are concerning when considered in the context of the entire population of medical students.

In prior studies of narrative evaluations, investigators examined differential usage of a pre-determined set of words. Women have been shown to be more likely than men to be associated with words like “compassion,” “enthusiasm,” and “sensitivity,” while other studies have shown that the presence of “standout” words, such as “exceptional” or “outstanding,” predicted evaluations of men but not of women.8 Additional research suggests that similar patterns extend beyond student evaluations.5,33 A strength of natural language processing is that we did not have to pre-specify words that might differ; instead, we extracted all differing words without introducing additional analytic bias.

Despite the intent of clerkship assessment to address competencies by observing behaviors, the differences we found between URM and non-URM assessments were more reflective of perceived personal attributes and traits. In prior work, Ross et al. found that Black residency applicants were more likely to be described as “competent,” whereas White applicants more frequently received standout and ability descriptors like “exceptional.”34 Our findings of variation in descriptors used for URM and non-URM students are pertinent given the discrimination and disparities faced by racial and ethnic minorities as trainees and healthcare providers. Disparities in clerkship grades,30 membership in honors societies,35 and promotion36,37 are well documented. Research on performance assessments suggests that small differences in assessment can compound into larger differences in grades and awards received, a phenomenon referred to as the “amplification cascade.” Teherani et al. illustrated this phenomenon among URM and non-URM students in undergraduate medical education.38 The amplification cascade holds major implications for residency selection and ultimately career progression that can disproportionately affect students from underrepresented groups.

Our study has limitations. First, although our sample size is large, we analyzed evaluations from only two medical schools. Second, baseline measures of academic performance and subsequent markers of career success were unavailable, limiting our ability to extrapolate the effect of these textual differences beyond the grade received in the clerkship. Third, due to data limitations, we were unable to link evaluations of individual students across clerkships to assess patterns in grading behaviors and biases, although all students rotating in core clerkships during the study years were included in the dataset. Fourth, we were unable to assess any interaction between evaluator demographics and narrative language differences, as examined in other studies.8 Fifth, we did not have access to Dean’s Letters, also known as the MSPE, which compile comments from individual clerkships and are the direct link between student evaluations and residency applications. Finally, we did not examine whether narrative differences were more pronounced for students belonging to both groups, such as women who also identify as URM.

Despite efforts to standardize medical student evaluations, the differences in narrative language suggest directions for improvement of medical education assessment. At a minimum, our findings raise questions about the inclusion of verbatim commentary from these assessments in MSPE letters used in residency applications, as is the accepted national standard.1 Similarly, our work demonstrates that the competency-based evaluation framework37 ostensibly in use for evaluating medical students remains incompletely implemented. Finally, behavioral science research has uncovered best practices for reducing bias in evaluations, including comparative evaluations,39 structured criteria,40, 41 task demonstrations, blinded evaluations,42 and evaluator training featuring de-biasing techniques.43 In the future, it may be possible for language processing tools to provide real-time and data-driven feedback to evaluators to address unconscious bias. Perhaps it is time to rethink narrative clerkship evaluations to better serve all students.