Background

Clinical reasoning has been the focus of research for much of the past thirty years, due as much to an inherent fascination with the topic itself as to the need to reduce the high incidence of adverse events caused by missed and delayed diagnoses [1, 2]. Indeed, the patient safety literature abounds with studies describing common types of diagnostic error [3, 4], but few, if any, propose ways of identifying these errors in practice or approaches to remediation [5]. Nevertheless, there is now not only a much-enhanced understanding of the cognitive processes involved in diagnosis and their relationship to knowledge [6], but also an increased focus on developing and enhancing clinical reasoning skills in students and practitioners [7].

In pursuit of this goal, numerous strategies have been devised to teach and learn the diagnostic process and to develop clinical reasoning skills, using both cognitive and formulaic approaches (such as heuristics and decision trees) [8–13]. Commonly (and perhaps understandably), the indicator of success in these teaching approaches is diagnostic accuracy, with relatively little emphasis placed on the need to develop a sound underpinning reasoning process. To date, a valid, reliable and objective method of identifying and evaluating an individual’s clinical reasoning characteristics and ability remains elusive [14].

In the absence of such a gold standard, developing a suite of methods each able to evaluate one or more aspects of the clinical reasoning process may bring this goal closer. Two established candidates are the Clinical Reasoning Problems (CRPs) [15] and the Script Concordance Test (SCT) [16]. These methods have some attributes in common but are largely complementary in their theoretical framework and assessment approach. Both have been used in a variety of contexts and have demonstrated reliability and validity as tests of clinical reasoning skill in medical students and practitioners [15, 16].

The CRPs aim to assess skill in diagnostic hypothesis generation as well as clinical data identification and interpretation, and thus provide a detailed and comprehensive evaluation of the clinical reasoning process. Each CRP describes a patient’s presentation, history and physical examination findings, and respondents are asked to nominate the two most likely diagnoses based only on the information provided. For each nominated diagnosis, participants choose, from a provided list of clinical features, those features they considered important in reaching their diagnosis, together with a weighting (positive or negative) for each that best describes its influence on their decision. A CRP score comprises three scales: a mark for the diagnoses (d-mark), a mark for feature identification and interpretation (f-mark), and a total mark (d-mark + f-mark). When administered using the web-based version, respondents also receive immediate qualitative feedback in the form of access to the responses of the expert reference group on which the marking scheme is based. CRPs can therefore be used for both teaching and assessment purposes [17].
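For illustration, the structure of a CRP response and its three-scale score can be sketched as follows. This is a minimal sketch in Python; the field names, weighting range and credit rules are assumptions made for exposition and should not be read as the published marking scheme [15].

```python
from dataclasses import dataclass

# Illustrative sketch only: field names, the weighting range and the
# credit rules are assumptions, not the published CRP marking scheme.

@dataclass
class CRPResponse:
    diagnoses: list[str]             # the two nominated diagnoses
    feature_weights: dict[str, int]  # chosen feature -> weighting (+/-)

def score_crp(resp: CRPResponse,
              panel_diagnoses: dict[str, float],
              panel_weights: dict[str, float]) -> dict[str, float]:
    """Score one CRP response against collated expert-panel data.

    d-mark: credit for each nominated diagnosis in proportion to the
    fraction of the panel that nominated it. f-mark: credit for each
    selected feature whose weighting agrees in sign with the panel
    consensus, scaled by the strength of that consensus.
    """
    d_mark = sum(panel_diagnoses.get(dx, 0.0) for dx in resp.diagnoses)
    f_mark = sum(abs(pw) for feat, w in resp.feature_weights.items()
                 if (pw := panel_weights.get(feat, 0.0)) * w > 0)
    return {"d-mark": d_mark, "f-mark": f_mark,
            "total": d_mark + f_mark}
```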

The SCT has been well described in several studies [16, 18, 19]. In contrast to the CRPs, SCTs focus specifically on clinical data interpretation, and their design allows weaknesses in this aspect of the reasoning process to be identified. As with the CRPs, SCTs use a case-based format, consisting of a clinical scenario followed by up to five questions of the “if this… then that…” type. Each question offers a possible diagnosis based on the scenario, followed by additional clinical information. Respondents are asked to indicate the impact of this information on the likelihood of the suggested diagnosis being correct, using a five-point scale from −2 (very unlikely) to +2 (very likely).

The scoring schemes for both methods are derived from the responses of a reference group, with the highest marks awarded to the responses closest to those of the majority of the panel.
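By way of illustration, the sketch below implements one common concordance-based convention, in which the modal panel response earns full credit and other responses earn partial credit in proportion to panel support; whether the present instruments used precisely this rule is an assumption here.

```python
from collections import Counter

# Sketch of concordance-based scoring for a single question, assuming
# the common "aggregate" convention: the modal panel response earns
# full credit; other responses earn proportional partial credit.

def concordance_key(panel_responses: list[int]) -> dict[int, float]:
    counts = Counter(panel_responses)
    modal = max(counts.values())
    return {response: n / modal for response, n in counts.items()}

# Example: a 21-member panel rating one SCT question on the -2..+2 scale.
panel = [+1] * 12 + [+2] * 6 + [0] * 3
key = concordance_key(panel)      # {+1: 1.0, +2: 0.5, 0: 0.25}
print(key.get(-2, 0.0))           # an unchosen response scores 0.0
```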

Both the CRPs and SCTs have been used with large cohorts of medical students [20, 21]. However, although they provide a comprehensive assessment of clinical reasoning, CRPs are relatively time-consuming: each case requires approximately 10 minutes to work through, and 15–20 cases are required for good reliability. SCTs, on the other hand, are more time-efficient, requiring only a few minutes per case, and 30–80 questions in the SCT format have been shown to provide good reliability [19]. Their cases are more narrowly focussed than those used in CRPs, since a number of diagnostic hypotheses (usually five) are supplied within each scenario. Thus, in a given time period, SCTs can test clinical data interpretation over a wider range of possible diagnoses than CRPs.

We speculated that using SCTs as a screening method to identify students with weak clinical reasoning, followed by a more comprehensive evaluation of those students using the CRPs, might provide an efficient yet targeted appraisal of their clinical reasoning process. The resulting detailed clinical reasoning profile could then potentially be used to design customised remediation activities for individual students. For such a combined approach to work, however, there would need to be evidence of at least a moderate correlation between total SCT and CRP scores, and a stronger correlation between total SCT score and the f-subscale of the CRPs.

Consequently, the aim of this study was to test the compatibility of the SCTs and CRPs in a combined approach, and to determine whether this approach would provide a valid, reliable and comprehensive analysis of clinical reasoning characteristics that could subsequently be used to facilitate the development of customised teaching for medical students.

Methods

Subjects

Three groups of subjects were recruited on a voluntary basis, as required for ethical approval of the study by the participating institutions. The first subject-group consisted of general practitioners (GPs) associated with two Australian medical schools. The other two subject-groups were third and fourth (final) year students enrolled in each university’s medical program.

Instruments

The CRPs used in this study were those developed and evaluated previously [15]. They consisted of 20 clinical diagnostic scenarios, divided into two sets of 10 cases, labelled CRP 1 (cases 1–10) and CRP 2 (cases 11–20). The content of the two sets was similar in that both covered a range of patient demographics and contexts representative of the common, undifferentiated clinical presentations that final-year medical students would expect to encounter after graduation (e.g. each set contained one case relating to the cardiovascular system, one relating to the respiratory system, etc.).

Each CRP scenario was also re-formatted as an SCT comprising the case and five questions. This resulted in two corresponding sets of SCTs of 10 cases and 50 questions each, labelled SCT 1 (cases 1–10) and SCT 2 (cases 11–20). Thus, each set of 10 clinical diagnostic scenarios was available in both CRP and SCT format (labelled CRP 1, SCT 1, CRP 2 and SCT 2 respectively). The marking schemes for both formats were drawn from the responses of an expert reference group of 21 experienced Australian GPs/family doctors [15].

An example of one clinical scenario presented in both CRP and SCT formats is provided in Additional file 1.

Procedure

The study used a cross-over design in which one set of 10 CRPs was matched with its complementary set of 10 SCTs, forming two test-groups: CRP 1/SCT 2 and SCT 1/CRP 2. Participants were allocated alternately to one of these two test-groups, so that each completed all 20 cases, half in CRP format and half in SCT format. Participants were emailed their allocated set of SCTs, as well as a login to access their CRPs online at a dedicated website, and were asked to complete and submit both sets of questions electronically within three weeks. Completion time was estimated at about 90 minutes for the CRPs and about 30 minutes for the SCTs, bringing the expected total testing time to approximately two hours. All responses were automatically scored on submission, and both scores and feedback were provided to the participant. For the CRPs this was immediate, through access to the collated responses of the expert reference panel; for the SCTs, participants received a comparison of their responses with the expert panel’s “best” answers by return email.
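For clarity, the alternating allocation amounts to a simple round-robin assignment, sketched below with placeholder participant identifiers.

```python
# Minimal sketch of the alternating cross-over allocation described
# above; participant identifiers are placeholders.

def allocate(participants: list[str]) -> dict[str, str]:
    """Assign successive volunteers to the two test-groups in turn."""
    groups = ("CRP 1/SCT 2", "SCT 1/CRP 2")
    return {p: groups[i % 2] for i, p in enumerate(participants)}

print(allocate(["p01", "p02", "p03", "p04"]))
```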

Data analysis

Statistical analysis was undertaken using SPSS 20. The Shapiro-Wilk statistic was calculated to assess the distribution of scores, and Levene’s test to assess homogeneity of variances. Reliability was estimated using Cronbach’s alpha coefficient for internal consistency. Evidence of construct validity was assessed by calculating the correlations between total SCT scores and the CRP feature (“f-score”) and total scores. Differences between subject-groups were analysed using one-way analysis of variance (ANOVA). Finally, if the SCTs are to have utility as a screening technique, SCT scores must be able to predict subsequent performance in the CRPs. In the absence of a criterion-referenced pass mark, and to approximate a 50% score, the second quartile (i.e. the median) of the total score for the SCTs and of the f-score for the CRPs was chosen as the notional pass mark. Using this cut-off, the number and proportion of students passing and failing the SCTs and CRPs was calculated.
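As an indicative sketch only, the same sequence of tests can be reproduced with standard open-source tools; the simulated data, group sizes and variable names below are placeholders rather than the study data.

```python
import numpy as np
from scipy import stats

# Indicative sketch of the analysis sequence using open-source tools
# (the study itself used SPSS 20). All data here are simulated.

rng = np.random.default_rng(0)
items = rng.normal(size=(74, 10))   # respondents x cases for one test
totals = items.sum(axis=1)

# Distribution and homogeneity of variances
print(stats.shapiro(totals))                   # Shapiro-Wilk
print(stats.levene(totals[:40], totals[40:]))  # Levene's test

# Cronbach's alpha for internal consistency
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                       / totals.var(ddof=1))
print(f"alpha = {alpha:.2f}")

# Construct validity: Pearson correlation between score vectors
sct_totals = rng.normal(size=74)
print(stats.pearsonr(totals, sct_totals))

# Subject-group differences: one-way ANOVA across three groups
print(stats.f_oneway(totals[:12], totals[12:40], totals[40:]))
```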

Results

Of the 17 GPs and 202 students who agreed to participate in the study, CRP and/or SCT responses were received from 12 GPs (71%) and 119 students (59%). In the CRP 1/SCT 2 stream, these comprised eight GPs, 20 Year 4 students and 44 Year 3 students; in the SCT 1/CRP 2 stream, there were four GPs, 22 Year 4 students and 33 Year 3 students. A further 57 sets of SCTs were incomplete and were removed from the analysis. Thus, the final analysis was based on 131 sets of CRPs and 74 sets of SCTs across all subject-groups.

Descriptive statistics

The mean scores, standard deviations and distributions for all sets of CRPs and SCTs are shown in Table 1. The Shapiro-Wilk test for normality, calculated on the combined group scores, indicated that all data, with the exception of the CRP 1 scores, were normally distributed, justifying the use of parametric statistical analyses. Levene’s statistic indicated that variances were homogeneous across tests, again with the exception of CRP 1.

Table 1 Descriptive statistics and distribution over all cohorts

Consequently, one-way ANOVA was used to compare differences between subject-groups (see Table 2). Inter-subject-group differences were significant, or approached significance, for the CRPs but not for the SCTs. Contrast tests between pairs of subject-groups consistently showed significant differences in CRP performance across all scales (d-mark, f-mark and total mark) between the GPs and one or both student groups.

Table 2 Comparison of means by cohort

Reliability

Table 3 shows Cronbach’s alpha coefficient for internal consistency for each group of tests. Over all cohorts, Cronbach’s alpha was 0.61 for CRP 1, 0.56 for CRP 2, 0.36 for SCT 1 and 0.60 for SCT 2. As would be expected, reliability increased when calculated over all 20 cases, to 0.93 for the CRPs and 0.63 for the SCTs. Deleting any single problem from the analysis did not produce a substantial change in reliability.

Table 3 Reliability analyses
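As an illustrative aside (not part of the reported analysis), the effect of doubling test length can be roughly anticipated with the Spearman-Brown prophecy formula; the sketch below applies it using the mean of the two per-set alphas as the 10-case estimate.

```python
# Rough check of how reliability scales with test length; the use of
# the mean per-set alpha as the 10-case estimate is an assumption.

def spearman_brown(alpha: float, n: float) -> float:
    """Predicted reliability when a test is lengthened n-fold."""
    return n * alpha / (1 + (n - 1) * alpha)

print(spearman_brown((0.61 + 0.56) / 2, 2))  # CRPs: ~0.74 (0.93 observed)
print(spearman_brown((0.36 + 0.60) / 2, 2))  # SCTs: ~0.65 (0.63 observed)
```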

Construct validity

The mean scores for all CRP and SCT cases were calculated and Pearson correlation coefficients determined. Correlations between total CRP and total SCT scores ranged from 0.46 to 0.49, and those between CRP f-score and total SCT score from 0.44 to 0.69 (see Table 4). Statistically significant correlations were found between mean CRP 2 f-score and mean SCT 2 score, between mean combined CRP total score and mean combined SCT score, and between mean combined CRP f-score and mean combined SCT score.

Table 4 Correlation analyses between CRPs and SCTs

Using the notional pass mark described above, 11 of 35 students (31%) passed both the SCT and the CRP (f-score) tests, and 9 (26%) failed both. Of the 16 students who failed the SCT, nine (56%) failed the CRP f-score while 7 (44%) passed it; of the 19 students who passed the SCT, 11 (58%) passed the CRP f-score while 8 (42%) failed it (Table 5).

Table 5 Pass-fail comparison based on second quartile SCT total score and second quartile CRP f-score
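For reference, a cross-tabulation of this kind can be reproduced from raw scores as sketched below, using the median (second quartile) as the notional pass mark; the score vectors are simulated placeholders, not the study data.

```python
import numpy as np

# Sketch of the pass/fail cross-tabulation: the second quartile
# (median) of each score distribution is the notional pass mark.
# The score vectors are simulated placeholders.

rng = np.random.default_rng(1)
sct_total = rng.normal(50, 10, 35)
crp_f = rng.normal(50, 10, 35)

sct_pass = sct_total >= np.median(sct_total)
crp_pass = crp_f >= np.median(crp_f)

# Rows: SCT pass/fail; columns: CRP f-score pass/fail
table = np.array([
    [(sct_pass & crp_pass).sum(), (sct_pass & ~crp_pass).sum()],
    [(~sct_pass & crp_pass).sum(), (~sct_pass & ~crp_pass).sum()],
])
print(table)
```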

Discussion

This study has explored the compatibility of two methods of evaluating clinical reasoning, the CRPs and SCTs, for profiling the clinical reasoning characteristics of students and clinicians.

Overall, the results suggest that CRPs discriminate well between levels of expertise; this may be because reflecting on the features considered in generating a diagnostic hypothesis is less difficult once a provisional decision has been made. Interestingly, the SCTs were less able to discriminate between levels of expertise. This finding is difficult to interpret since, for each question within a case, subjects are provided with both a possible diagnosis and related patient information, and are required only to interpret the specific clinical data provided. Part of the explanation may lie in the voluntary nature of the student sample and the relatively low response rate (59%), which may have led to the more able students being disproportionately represented and hence to smaller differences in SCT scores between students and experts than would normally be expected. Alternatively, since the cases were designed to cover a range of patient demographics and presentations, content specificity at all levels of expertise may have been responsible for some of the difficulty in discriminating between subject-groups. A third possibility is that medical students and GP clinicians are more readily able to recall the medical knowledge needed to interpret clinical findings in relation to a diagnosis once that diagnosis has been specified.

The moderate correlation between SCTs and CRPs suggests that the two methods measure overlapping but not identical reasoning characteristics. As would be expected, the correlation was higher between the CRP scale related to data interpretation (the f-score) and the SCT score, as this is the aspect of greatest convergence. Despite this correlation, however, only 20 of 35 students (57%) consistently passed or failed both the SCT and the CRP f-score tests, so that SCT performance was inconsistent with CRP f-score performance for 43% of students. Moreover, only 56% of students who failed the SCT also failed the CRP f-score, indicating that the SCT was not a useful predictor of CRP f-score performance. Small participant numbers are likely to have influenced these results, and a larger sample, combined with a more systematic approach to setting the pass marks for both tests, may improve the agreement between them.

Cronbach’s alpha calculations show that the CRPs have acceptable reliability, taking into account the semi-qualitative nature of the measure. While reliability would likely improve if the number of problems per set were increased, this would extend the assessment period, which may decrease feasibility and participation. The modest reliability of the SCTs found here is puzzling, as it is not consistent with previous studies, which have reported an average alpha coefficient of approximately 0.78 for 30–80 items [19]. Again, a possible explanation is the relatively small number of cases used in the current study; further investigation is required to determine the influence of the number of cases on reliability.

The study’s findings are limited by the small number of subjects (particularly GPs) and by the lack of complete sets of data for some of the analyses. While this is somewhat unavoidable given the ethical requirement that participation by both students and GPs be voluntary, it does mean that larger trials are needed before the reliability and validity of this approach can be firmly established. In hindsight, it may also have been useful to include a self-report measure of clinical reasoning, such as the Diagnostic Thinking Inventory [22], to encourage self-reflection and analysis, thereby increasing individuals’ understanding of their own reasoning processes in relation to those of diagnostic experts. Future work could explore the benefit of incorporating self-report measures to further emphasise the importance of metacognition in diagnostic expertise.

Conclusion

Our findings suggest that using different but complementary methods of evaluating clinical reasoning provides a more detailed and qualitative appraisal than either the CRPs or the SCTs alone. The SCTs are a practical, valid and time-efficient method of assessing the interpretation of clinical data with respect to a given provisional diagnosis in large cohorts, whereas the CRPs provide a more comprehensive picture by evaluating individual ability in diagnostic hypothesis generation and data synthesis as well as data interpretation. While both tests assess data interpretation, this study demonstrates that results can vary depending on how this is done. This, together with the low level of agreement in performance between the two methods, suggests that they are likely to be most useful for teaching rather than assessment purposes. An important feature of both techniques is that they provide immediate quantitative and/or qualitative feedback. Used together, they can provide the more comprehensive analysis of clinical reasoning ability that is necessary to develop customised remediation of specific identified weaknesses in three important aspects of the diagnostic process: hypothesis generation, clinical data synthesis and data interpretation.

In summary, although the findings of this study suggest that a two-stage approach provides a more comprehensive evaluation of clinical reasoning than either the SCTs or the CRPs alone, the choice of methods is critical, particularly if the approach is to be used for assessment purposes.