Define the construct of interest
Clearly defining a construct is essential to assessment development and validation [20, 23]. When considering a broad concept like clinical reasoning, it is important that authors provide an explicit definition of clinical reasoning (or synonym) or a clear description of the target construct used in their assessment. This recommendation does not imply that there is one universally “correct” definition of clinical reasoning [32, 33], but rather it emphasizes the importance of providing a clear description of the specific construct of interest. This clear description will promote appropriate understanding, defensible application, and meaningful integration of the assessment method.
In developing a construct of clinical reasoning for assessment, an important consideration is whether clinical reasoning is being assessed as an outcome, a process, or a combination thereof. This distinction leads to the use of different forms of assessment. For example, an assessment concerned with diagnostic accuracy may treat clinical reasoning as an outcome (i.e., a defined output) while another assessment concerned with data gathering may emphasize clinical reasoning as a process (i.e., a sequence of steps). How well the assessment distinguishes between process and outcome, and how that relates to the construct definition, will have important implications in understanding the strengths and limitations of a given assessment approach.
If an assessment is process-oriented (i.e., focused on clinical reasoning as a process rather than an outcome), authors should clearly define the clinical reasoning subprocesses of interest [3, 14]. Some assessments focus on information gathering and differential diagnosis (e.g., OSCEs); other methods are geared towards the assessment of problem representation (e.g., concept maps). Specifying the clinical reasoning subprocesses or components targeted by a given assessment approach will make clear whether and how the assessment method captures clinical reasoning performance.
Clinical reasoning has been studied using a variety of lenses, and several theories have been mobilized to better understand this complex construct. Given that clinical reasoning can be considered a process or outcome, a deeply cognitive or collaborative activity, we recommend that authors describe the theoretical framework(s) that shape how they understand the construct of clinical reasoning, and the perspectives that ground the assessment and study design. Multiple theoretical frameworks exist that highlight the multiple understandings of clinical reasoning, and different frameworks provide different rationales for selecting certain assessment approaches over others. For example, use of dual-process theory as a conceptual framework may lead researchers to discriminate between and assess both unconscious/automated cognitive processes and conscious/analytical processes. Alternatively, relying on situated cognition theories may lead to emphasizing a contextual dimension to clinical reasoning, treating it as a process that emerges from interactions between physician factors, patient factors, and practice environment factors. If these two different theoretical stances are carried forward to assessment design, one could imagine a situated cognition researcher focusing an assessment approach on how the diagnostic accuracy of a learner is affected by an angry patient, whereas a researcher using dual-process theory might focus on the cognitive processes a learner uses to navigate diagnostic ambiguity. Explicitly stating the theoretical framework and definition of clinical reasoning in an article allows the reader to understand the rationale for the assessment methodology and discussion of the results.
Describe the assessment tool
In addition to an explicit definition or articulated construct, a clearly described assessment development and validation approach is critical to understanding the nature, purpose, quality, and potential utility of the assessment. Clinical reasoning assessments can vary by their setting (classroom versus clinical), type of encounter (simulated versus workplace), clinical reasoning component of interest, and a variety of other factors [14, 37, 38]. Additional important components of an assessment include the instruments used to collect examinee responses (i.e., assessment format), the method used to score those responses, the intended use of the scores generated by the assessment, and, when applicable, the background or training of the raters who conduct the scoring.
In conducting our review on assessments of clinical reasoning, our research group focused on data reported about an assessment that included: the stimulus format (i.e., what necessitated a response), response format (i.e., how a learner could respond to the assessment), and scoring activity (i.e., how a learner response was transformed into an assessment score). We found these elements helpful in structuring our analysis of existing literature and suggest that reporting these characteristics would increase clarity in the development and description of new research on clinical reasoning assessment.
Stimulus format describes the way in which an examinee is presented with a clinical scenario. Examples include a real patient, standardized patient, computer-based virtual patient, or a written clinical vignette. Providing detail on how these stimuli are chosen or constructed, their level of complexity, and their degree of intended ambiguity or uncertainty will help promote understanding of the development, scope, and application of the assessment and contribute to building a validity argument in support of score interpretation.
An examinee’s choices or series of actions in response to the stimulus format need to be recorded, and this component is captured in response format. Responses can be selected, in which an examinee chooses from a list of provided answers, or constructed, in which an examinee responds verbally or in writing to open prompts. Furthermore, constructed responses can exist in different formats such as essays, diagrams, and post-encounter documentation. We recommend authors describe how response instruments were either created or selected to meet the goals of the assessment and, particularly if novel, whether they were piloted to see if examinees understood the question, response options, and mechanics of the collection instrument.
Scoring activity is the process by which examinees’ responses are converted into a performance result. It can be quantitative or qualitative. It refers to both “answer key” generation and application, and in certain assessment approaches includes rater training. Scoring activities can occur pre-, intra-, and/or post-assessment and should be explicitly described. Answer key generation typically occurs pre-assessment, when a group develops a clinical question or scenario and determines the “correct” response. Information on how consensus was achieved for the answer key (if relevant) should be provided.
Intra-assessment scoring occurs primarily during direct observation. Authors should provide details about the scoring tool, with an accompanying rationale for why a particular scoring approach is used (e.g., checklist vs. global rating; complete/incomplete vs. Likert scale). Intra-assessment scoring activity is challenging because of the multi-tasking required of the assessor. Thus, authors should provide details about rater qualifications, experience, training, and inter- or intra-rater reliability. Information should also be provided about the time needed to complete the assessment activity.
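As one illustration of the inter-rater reliability reporting mentioned above, the following is a minimal sketch of Cohen's kappa for two raters who scored the same encounters on a categorical scale. Kappa is only one of several possible reliability statistics and is not prescribed by these recommendations; the function name and the ratings are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters rating the same encounters categorically."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of encounters.")
    n = len(rater_a)
    # Observed agreement: share of encounters given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal rating frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is
        # undefined here, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical complete/incomplete ratings of eight observed encounters.
rater_a = ["complete", "complete", "incomplete", "complete",
           "incomplete", "incomplete", "complete", "complete"]
rater_b = ["complete", "incomplete", "incomplete", "complete",
           "incomplete", "complete", "complete", "complete"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this made-up example the raters agree on six of eight encounters, but kappa is lower than the raw agreement because some of that agreement would be expected by chance.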
Post-assessment scoring can involve grading selected or constructed responses. Depending on the format, it may be automated or performed by hand. As with intra-assessment scoring, providing background information on the scorers and/or technology and the time needed to complete the activity will allow for determinations about the feasibility of an assessment method.
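To make the idea of automated post-assessment scoring concrete, the sketch below applies a pre-specified answer key to one examinee's selected responses. It is a simplified illustration under assumed conditions (one correct option per item, unanswered items scored as zero); the item names, options, and function name are hypothetical.

```python
def score_selected_responses(responses, answer_key):
    """Return per-item 0/1 scores and the total score for one examinee."""
    # An item missing from `responses` is treated as unanswered and scored 0.
    item_scores = {
        item: int(responses.get(item) == correct)
        for item, correct in answer_key.items()
    }
    return item_scores, sum(item_scores.values())


# Hypothetical answer key agreed on before the assessment.
answer_key = {"case1": "B", "case2": "D", "case3": "A"}
# Hypothetical examinee: case1 correct, case2 incorrect, case3 unanswered.
responses = {"case1": "B", "case2": "C"}

item_scores, total = score_selected_responses(responses, answer_key)
print(item_scores, total)
```

Reporting rules like the handling of unanswered items is part of describing the scoring activity, since different choices yield different scores from the same responses.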
In summary, to better support transparent reporting, we suggest papers describing assessments of clinical reasoning include a clear definition of the construct of interest, describe the theoretical framework underpinning the assessment, and describe the stimulus format, response format, and scoring activity.
Collect validity evidence
Validity refers to the evidence provided to support the interpretation of the assessment results and the decisions that follow. Evidence of validity is essential to ensure the defensible use of assessment scores and is an important component of assessment literature. A common misconception that we encountered in our scoping review was that validity is a dichotomous feature (i.e., that an assessment is either valid or not valid). This notion of validity as a characteristic of a test has been reported elsewhere and is not limited to the clinical reasoning assessment literature. In more current conceptualizations of validity, an assessment may have varying strengths of evidence in an argument supporting the use of its scores to make specific inferences about specific populations. In other words, evidence of validity does not “travel” with an assessment tool and should be collected each time an assessment is used in a different context or population and for each different score use (e.g., formative feedback vs. summative decisions). For example, validity evidence collected for an examination to determine whether medical students demonstrate a minimum level of competence in their clinical reasoning to pass the pediatrics clerkship would not be sufficient to justify using the examination to determine whether pediatric residents should be licensed to practice independently. In summary, validity evidence must be collected for a clinical reasoning assessment supporting the intended score use and the decisions that result.
We recommend that authors use an explicit validity framework to collect and report the validity evidence. There are multiple approaches, but in HPE, two major validity frameworks are frequently used: Messick’s unified framework of construct validity and Kane’s validity argument framework. Messick’s framework, as modified by Downing, identifies five categories of evidence to support score interpretation and use. These categories (adapted in the Standards for Educational and Psychological Testing) are: content (e.g., blueprinting an examination or ensuring workplace assessments are performed aligned with the definition of clinical reasoning adopted), response process (e.g., ensuring those being assessed and those assessing understand the assessment as intended), internal structure (e.g., analysis of difficulty and discrimination of items/cases and how that aligns with the definition of clinical reasoning adopted), relationship to other variables (e.g., relationship with another assessment that assesses a similar or different aspect of clinical reasoning), and consequences (e.g., whether decisions made based on results of the assessment are justified for the individual and the institution). Notably, reliability is included as one piece of evidence supporting validity (part of internal structure), rather than treated as a separate construct from validity. In summary, Messick’s approach focuses on the types of evidence that help us determine whether a decision based on assessment data is sound, or, in other words, whether a score interpretation (e.g., pass/fail) is defensible.
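The internal structure evidence mentioned above (difficulty and discrimination of items/cases) can be illustrated with a short classical item analysis. The sketch below assumes dichotomously scored items and uses two common conventions that the source does not prescribe: difficulty as the proportion of examinees answering correctly, and discrimination as the correlation between an item score and the total on the remaining items. The score matrix and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_analysis(score_matrix):
    """score_matrix: one row per examinee, each a list of 0/1 item scores."""
    n_items = len(score_matrix[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in score_matrix]
        # Total on the remaining items, so an item is not correlated with itself.
        rest = [sum(row) - row[j] for row in score_matrix]
        results.append({
            "item": j,
            "difficulty": sum(item) / len(item),    # proportion correct
            "discrimination": pearson(item, rest),  # corrected item-total r
        })
    return results


# Hypothetical scores for five examinees on three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
results = item_analysis(scores)
for row in results:
    print(row)
```

Such statistics are only one strand of internal structure evidence, and they are informative only when interpreted against the definition of clinical reasoning the assessment adopts.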
As described by Schuwirth and van der Vleuten, Kane’s framework involves organizing categories of evidence into a logical argument to justify that the units of assessment can be translated into an inference about the population being assessed. The four links in the logic chain are scoring (translation of an observation into a score), generalization (using a score to reflect performance in a particular setting), extrapolation (using the score to reflect performance in the real-world setting), and implications (using the score to make a decision about the learner). Through this lens, sequential inferences are made about the learner’s clinical reasoning. For example, a workplace-based assessment may begin with a chart-stimulated recall of a resident’s clinical reasoning about a case (a supervisor translates that observation into a score), followed by generalization of that score to make an inference about the resident’s clinical reasoning performance in the outpatient clinic within their patient population, followed by extrapolation to their clinical reasoning ability for outpatients in general, and ending in a decision to allow the resident to make clinical decisions without supervision.
Our recommendations and resources for developing, describing, and documenting an assessment of clinical reasoning are summarized in Tab. 1. Where Messick’s approach focuses on types of evidence, Kane focuses on the types of inference that are made in moving from an assessment to a judgment such as “competent.” Thus, these two frameworks are complementary and have been combined in order to think comprehensively about the validity of assessment score interpretation.
We hope these recommendations can assist investigators in their conception and execution of new studies on the assessment of clinical reasoning. Similarly, these recommendations can guide educators in analyzing the quality of existing studies in order to decide whether to incorporate an assessment approach into their own educational program. Towards this end, these recommendations offer a structured approach to questioning the utility of an assessment paper: whether a clear definition of clinical reasoning has been provided so that readers can evaluate if the study is generalizable to their context; whether a theory was used to shed light on the perspectives and assumptions behind the study; whether sufficient methodological detail is available for the assessment to be reproduced accurately and reliably; and whether the validity evidence provides adequate justification for pass/fail or promotion decisions.
Potential future implications of our reflections and proposed recommendations include the creation of formal guidelines to be used in reporting of studies on the assessment of clinical reasoning, similar to PRISMA guidelines for systematic reviews and meta-analyses. Our recommendations require further review and discussion by experts in clinical reasoning and assessment research as well as additional empirical examination, but we hope this manuscript can serve as an important conversation starter for developing formal guidelines. As the volume of articles on clinical reasoning assessment continues to expand rapidly, we worry that the literature will become even more fragmented, further impeding the ability to synthesize, reproduce, and generalize research in this critical domain. An increasingly fragmented literature may limit our ability to effectively share innovative approaches to assessment, and limit constructive engagement with how to best assess, or how to best combine multiple assessments, to effectively capture clinical reasoning performance.