Background

Clinical reasoning is of fundamental importance in the practice of medicine [1, 2]. Despite great interest in measuring clinical reasoning ability [3, 4], educators still face challenges in practical application [5, 6]. Currently available clinical reasoning instruments have been validated using construct validity, in which investigators offer evidence of an instrument’s ability to measure the intended topic [6]. Investigators also offer evidence of an instrument’s inter-rater reliability. However, the convergent validity of these instruments has received little attention [7,8,9]. That is, practical application is limited by our lack of understanding of how to compare scores across instruments. Furthermore, given disparate perspectives on how clinical reasoning is defined [10,11,12], robust empirical studies are needed to clarify the context of instrument scores. The medical educator’s ability to compare instruments is paramount in developing robust competency evaluation programs in medical training curricula [13]. Messick’s criteria offer a useful framework for studying the relationships among these instruments.

According to Messick’s criteria, validity evidence comprises five underlying arguments: content (“topic of interest”), response process (“rater and examinee actions’ alignment with construct”), internal structure (“reliability, item analysis, and factor analysis”), consequences (“impact of the assessment”), and convergent validity (“relationship to other variables”) [8, 14]. Convergent validity is a powerful yet underutilized validity argument [6,7,8, 15]. The convergent validity argument is founded on the relationship of a novel instrument’s scores to the scores of associated instruments. Two instruments measuring the same information should be strongly and positively related. Conversely, there should be little to no appreciable correlation between instruments measuring unrelated phenomena. Convergent validity studies allow unexpected data to challenge previously held assumptions about real-world observations and theories about unobservable constructs [16]. Convergent validity does not address whether the intended construct is measured (which is content validity), but rather how similar (or dissimilar) the information captured by the new instrument is to that captured by other instruments.

This study’s purpose is to determine the convergent validity of three clinical reasoning instruments: the Clinical Reasoning Task (CRT) checklist [17], the Patient Note Scoring Rubric (PNS) [18], and the Summary Statement Assessment Rubric (SSAR) [19]. We compared each instrument’s scoring of clinical notes created at the conclusion of a virtual patient module with Clinical Data Interpretation (CDI) test scores [20, 21]. Moreover, to further evaluate real-world associations, we investigated the relationship of each instrument’s scores to student characteristics.

Method

Data collection

In November 2016, at the end of their 18-month pre-clinical curriculum, 235 students began a two-week Foundations of Clinical Reasoning course at a large academic medical center in the United States. All data collection took place on the first day of the course [Fig. 1]. Students completed the CDI test prior to the first session. Students then worked in small groups through a computer-based case presentation of an Exercise in Clinical Reasoning [22]. The case paused twice for students to input a working differential diagnosis and plan. At the conclusion of the case, each student wrote an individual clinical note. We randomly selected four students from each CDI quartile (n = 16) to write a clinical note on a second published clinical case [23].

Fig. 1 Comparison of Three Assessment Instruments of Clinical Reasoning Capability

We collected demographic variables with potential associations with novice clinical reasoning ability: college major, premedical clinical experiences (e.g., medical volunteering, apprenticeships, health professions careers), gender, and self-reported ethnicity [24, 25]. No exclusion criteria were applied. Students were not incentivized for participation.

Instruments

The Clinical Data Interpretation (CDI) test is a 72-item multiple-choice instrument. The CDI is grounded in script concordance theory and seeks to determine clinical reasoning capability during diagnostic uncertainty [21]. Following the same template as the Script Concordance Test [26, 27], for each item students consider a chief complaint and one item of clinical data, then designate a likelihood for a diagnostic hypothesis [20, 27]. In the only divergence from the Script Concordance Test, Williams and colleagues created a scoring key with a single correct answer per item [21]. Each correct answer receives one point. Twenty-six symptoms, correlated with diagnoses across multiple clinical specialties, appear on the test. Two of these symptoms formed the presenting chief complaints for the Exercise in Clinical Reasoning (shortness of breath) and the second clinical case (memory loss). Each student had 60 min to complete the CDI. Raw scores are total points earned out of 72.

The Clinical Reasoning Task checklist (CRT) was developed as a taxonomy of 24 tasks physicians use to reason through clinical cases [17]. These tasks have been used to explore the reasoning patterns of medical students, residents, and attending physicians [28, 29]. In accordance with previously published protocols [29, 30], students earned one point each time a CRT task was used, including repeats; the total score was the number of CRT task uses. The Patient Note Scoring Rubric (PNS) was created to capture student clinical reasoning capability [18]. The authors used the standardized scoring instrument, which covers three domains scored on a scale of 1–4 points: documentation of pertinent history and exam findings, differential diagnosis, and diagnostic workup. The Summary Statement Assessment Rubric (SSAR) is a five-domain instrument validated to evaluate the clinical reasoning documented within summary statements created by medical students [19]. The domains include factual accuracy, appropriate narrowing of the differential diagnosis, transformation of information, semantic qualifier use, and a global rating. Students received 0–2 points on each domain of their summary statement, except factual accuracy, which was scored 0 (inaccurate) or 1 (factually accurate).
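To make these scoring rules concrete, the following minimal Python sketch tallies a CRT total and an SSAR total for a single clinical note. The task labels and domain scores are hypothetical examples rather than data from the study, and the PNS composite is omitted because its conversion to a final score is not detailed here.

```python
from collections import Counter

# Hypothetical CRT task labels identified in one student's clinical note.
# One point is earned per task use, including repeats of the same task.
crt_tasks_observed = [
    "generate differential diagnosis",
    "interpret diagnostic test",
    "generate differential diagnosis",   # a repeat still earns a point
    "articulate need for consultation or follow-up",
]
crt_total = len(crt_tasks_observed)      # total score = number of task uses
crt_task_counts = Counter(crt_tasks_observed)

# Hypothetical SSAR domain scores for the same note's summary statement.
# Four domains are scored 0-2; factual accuracy is scored 0 or 1 (maximum 9).
ssar_domains = {
    "factual_accuracy": 1,               # 0 = inaccurate, 1 = factually accurate
    "narrowing_of_differential": 2,
    "transformation_of_information": 1,
    "semantic_qualifier_use": 0,
    "global_rating": 2,
}
ssar_total = sum(ssar_domains.values())

print(f"CRT total: {crt_total}; task breakdown: {dict(crt_task_counts)}")
print(f"SSAR total: {ssar_total} / 9")
```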

Analysis

Three teams of reviewers scored the clinical notes with the CRT, PNS, and SSAR instruments. The primary investigator (Y.C.) reviewed each clinical note in detail with each team and selected cases to create initial examples of scoring criteria. Each team iteratively refined the scoring criteria by reviewing batches of sample notes until agreement on the criteria was reached. The remaining notes, or summary statements, were then coded by the team, with final scores assigned by agreement. We achieved a high degree of inter-rater agreement in all coding of the clinical notes. The CRT reviewers (Y.C. and N.W.) achieved an intraclass correlation coefficient (ICC) of 0.978. The SSAR reviewers worked in two teams, with K.G.’s scores corroborated by Y.C. (ICC = 0.831) and J.W. (ICC = 0.773). The PNS reviewers (J.W. and K.G.) achieved an ICC of 0.781 when coding the 16 clinical notes for the second case.
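For readers unfamiliar with the agreement statistic reported above, the sketch below computes a two-way, single-rater intraclass correlation coefficient from a subjects-by-raters score matrix. The ratings are hypothetical, and the absolute-agreement form, ICC(2,1), is assumed here because the specific ICC variant used in the study is not stated.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """Two-way random-effects, absolute-agreement, single-rater ICC(2,1).

    `ratings` is an (n_subjects x k_raters) array with no missing values.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-note means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand_mean) ** 2) / (k - 1)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_error = ss_total - (n - 1) * ms_rows - (k - 1) * ms_cols
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss formula for ICC(2,1).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical CRT totals assigned by two raters to eight clinical notes.
scores = np.array([
    [11, 12],
    [8, 9],
    [15, 14],
    [5, 6],
    [20, 19],
    [10, 10],
    [13, 12],
    [7, 8],
], dtype=float)

print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```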

Descriptive statistics were calculated for CDI scores, demographics, and instrument scores. We correlated each instrument’s global score with the CDI score using Pearson correlation. Correlation analyses were performed with each instrument’s global and domain scores. Spearman’s rank correlations were performed to investigate non-linear relationships. Effect sizes were categorized in accordance with published standards [31]. Finally, five student demographic variables were individually compared with instrument scores using one-way ANOVA [Table 1]. To account for multiple comparisons, a two-tailed p-value ≤0.01 was considered statistically significant for all reported analyses. Data were analyzed with SPSS (version 25, 2017; IBM, Armonk, NY). The UT Southwestern Medical Center Institutional Review Board approved this study.
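Although the analyses were run in SPSS, the sketch below illustrates the equivalent correlation and ANOVA steps in Python with SciPy. The simulated scores and the grouping variable are hypothetical placeholders; only the p ≤ 0.01 threshold is taken from the study.

```python
import numpy as np
from scipy import stats

ALPHA = 0.01  # two-tailed significance threshold used for all reported analyses

rng = np.random.default_rng(0)

# Simulated global scores standing in for the real data set.
cdi = rng.normal(45, 5, size=230)    # CDI raw scores (out of 72)
crt = rng.normal(12, 5, size=230)    # CRT checklist totals

# Pearson (linear) and Spearman (rank-based) correlations of global scores.
r, p_pearson = stats.pearsonr(cdi, crt)
rho, p_spearman = stats.spearmanr(cdi, crt)
print(f"Pearson r = {r:.2f} (p = {p_pearson:.3f}); "
      f"Spearman rho = {rho:.2f} (p = {p_spearman:.3f})")

# One-way ANOVA comparing CDI scores across a hypothetical demographic grouping.
group_labels = rng.choice(["science major", "non-science major"], size=230)
groups = [cdi[group_labels == g] for g in np.unique(group_labels)]
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}, "
      f"significant at alpha = {ALPHA}: {p_anova <= ALPHA}")
```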

Table 1 Demographic Characteristics of 121 Medical Students in the Foundations of Clinical Reasoning Course

Results

The CDI test was completed by 234 of 235 students (99.6% response rate). Voluntary demographic data were collected from 121 of 235 students (51.5% response rate) [Table 1]. Women represented 52.1% (n = 63) of respondents. There was no statistical difference in mean (SD) CDI score between respondents and non-respondents [44.9 (5.4) vs 43.9 (4.8); p = 0.35]. We compared five categories of student characteristics (i.e., gender, race/ethnicity, college major, premedical clinical experience type, and premedical clinical time) with the corresponding scores on each clinical reasoning assessment instrument. With one-way ANOVA, we found no significant association between any demographic variable and students’ CDI, CRT, PNS, or SSAR scores.

For the first case, 229/235 clinical notes and 227/235 summary statements were submitted (97.5 and 96.6% response rates, respectively). All 16 students selected to submit written clinical notes for the second clinical case completed the activity. Table 2 and Fig. 2 show the Pearson product-moment correlation coefficients for the correlations of the CDI score with each final clinical note score. Similar results were seen with Spearman’s rank correlations. The average number of CRT checklist items used per clinical note was 11.8 [range 1–30]. The mean SSAR score of the 227 summary statements was 3.68 [range 0–9]. The mean PNS score of the 16 clinical notes from the second case was 45.8 [range 24–68]. Small to medium effect sizes were seen between CDI scores and the three instruments [Table 2 and Fig. 2]. Only the CRT demonstrated a statistically significant correlation with the CDI, with a small effect size (r = 0.16, p = 0.01, df = 228).

Table 2 Correlations of Three Clinical Reasoning SOAP Note Instruments with Clinical Data Interpretation score
Fig. 2 Scatterplot Matrix of Three Clinical Reasoning SOAP Note Instruments with Clinical Data Interpretation test

Among the three instruments, a significant correlation was found between the PNS and the CRT checklist (r = 0.712; p = 0.002; df = 15) [Table 2 and Fig. 2]. To investigate the potential overlap of assessment domains between the PNS and CRT, we compared the frequency of the domains within the PNS (i.e., Workup, Differential Diagnosis, and Documentation) with the fourteen possible items on the CRT checklist. We observed multiple medium effect sizes and one significant correlation between these two instruments [Table 3]. Students who scored well on the PNS Documentation of history and physical exam domain also articulated the need for consultation and follow-up on the CRT (r = 0.631; p = 0.009; df = 15).

Table 3 Correlations of the components of the Patient Note Scoring and Clinical Reasoning Task instruments in 16 SOAP notes

Discussion

We hypothesized that the CDI test would correlate with each of the instruments. We found a small but statistically significant correlation between the CDI and the CRT; the statistical significance of this small effect size stems from the large sample size. Overall, therefore, our findings are consistent with the null hypothesis: none of the clinical reasoning assessment instruments demonstrated a meaningful correlation with the CDI test.

Our findings represent the first large comparison of clinical note-based clinical reasoning assessment instruments against a common standard, the CDI test. Our results contribute to the body of validity evidence for clinical reasoning assessment instruments in the area of convergent validity, or relationship to other variables. These empirical data support the argument that clinical reasoning is currently described by multiple theoretical frameworks that may not describe the same phenomenon [10, 11, 32]. Furthermore, our study mitigated a typical design limitation of convergent validity studies, the participant time burden of multiple assessments [7], by (a) scoring one clinical note with multiple instruments and (b) selecting a subpopulation, stratified by CDI test score, for a second assessment.

We found a large, significant correlation between PNS and CRT global scores [Table 2], as well as between specific scoring domains within each instrument [Table 3]. This suggests substantial overlap in the constructs underpinning these two instruments. These instruments may measure some of the early capabilities novices express as they communicate their clinical reasoning about a clinical case.

We also hypothesized that the selected student characteristics would be associated with the clinical reasoning assessment scores. We did not find statistically significant associations across the four assessment instruments. Educators interested in clinical reasoning assessment of clinical notes should therefore be encouraged that the four instruments tested performed consistently across the demographic characteristics we measured.

Limitations of our study include that it is a single-institution study of students’ clinical reasoning capability on two written cases. Our study benefits from a large, diverse participant sample with minimal attrition, and we did not find any significant associations between student demographics and the clinical reasoning variables. We rated clinical notes from a single time point, which restricted our ability to assess both the temporal stability and the predictive validity of participant scores. Because curricular constraints on student availability precluded repeated measurements over time, we selected a subpopulation of students to participate in a concurrent, separate measurement of a written (paper) clinical note after submitting their electronic clinical note. Given the large number of participants, multiple instruments, and different performance opportunities (written and electronic clinical notes), our results carry meaningful educational implications.

Convergent validity is a powerful yet underutilized validity argument that serves to position an instrument within, and confirm it against, the current understanding of the intended construct. It is accepted that no specific amount of validity evidence fully establishes the “validity” of an instrument. Nevertheless, we propose that educators consider incorporating convergent validity into their validation procedures, especially in conjunction with real-world scores, to further the evidence for extrapolation [9].

The future of clinical reasoning assessment holds promise through increasingly well-designed studies. In our study, the differing strengths of association between clinical reasoning instruments suggest varying degrees of overlap in the clinical reasoning frameworks underpinning their assessment domains. Our future work will include clinical note assessment at all training levels to determine which clinical reasoning instruments are most appropriate for each level of clinical reasoning capability development, and their effect on student learning as part of a larger assessment program for this complex competency [13]. We urge researchers to include convergent validity testing when developing an instrument, to situate it within the context of available assessment instruments. Such investigations will accelerate our understanding of the multidimensional construct of clinical reasoning.