Introduction

Expert performance results from deliberately engaging in activities that improve and maintain high performance [1]. Repeated practice of tasks and immediate feedback on performance allow individuals to focus on weaknesses while refining other aspects of performance [2]. For surgical trainees, the operating room (OR) has long been the primary site of learning and practice [3, 4]. It is a complex, high-stress workplace in which teams are under time and outcome pressures and trainees are required to simultaneously process considerable information [3, 5]. In the context of an evolving surgical and technological landscape [6,7,8,9,10,11,12,13,14,15,16], training mechanisms have recently emerged that simulate the complex OR environment as well as specific surgical procedures [17]. In both OR-based and simulated experiences, learning is constrained by the limits of working memory.

Cognitive load theory (CLT) is a theory of learning based on our knowledge of human cognitive architecture, in particular the limitations of working memory [18]. It defines three types of cognitive load: intrinsic, extraneous, and germane. Intrinsic load (IL) arises from the information-processing demands associated with the performance of a given task; both task complexity and learner expertise determine the IL [19]. Extraneous load (EL), or distraction from the task, is determined by the way in which the task is presented and how much working memory is spent processing information not essential to the task [20]. For surgical trainees in the OR, EL may include sounds unrelated to the task they are performing (e.g., non-essential communication between team members, anesthesia machine alerts, music), self-induced and instructor-induced performance and time pressures, and concern for patients on the surgical floor, among other sources. Germane load (GL) arises when learners use working memory resources to transfer information to long-term memory by forming and refining cognitive schemas [21]. GL can be increased to maximize learning through appropriate instruction that supports learner schema development [22].

Numerous psychometric approaches, including single-item scales (e.g., Paas’ scale [23]) and multi-item scales, have been used to measure cognitive load [24]. While some of these scales have been used to measure cognitive load among surgeons [25], none were designed specifically for procedural skills. The Surgical Task Load Index (SURG-TLX) was adapted from the NASA Task Load Index (NASA-TLX) to capture the surgical context specifically [26]. However, it does not measure the three types of CL relevant to learning (i.e., IL, EL, and GL); instead, it measures workload during a procedure as defined by the surgeon’s experience of mental fatigue, physical fatigue, feeling rushed, procedural complexity, anxiety during the procedure, and distractions in the operating environment. Such general measures limit educators’ ability to understand the specific etiology of trainee cognitive overload and to effectively adapt training sessions and curricula to enhance learning. To address this gap, the Cognitive Load Inventory for Colonoscopy (CLIC) was developed in 2016 [27]. Its developers followed a rigorous instrument development process and generated validity evidence for the CLIC in measuring IL, EL, and GL during colonoscopy training [27].

We adapted the CLIC to apply to procedural skills more broadly, creating a tool designed specifically for simulated and clinical surgical procedures. Understanding surgical trainee cognitive load in various educational settings and across time may benefit surgical educators and surgical education researchers as they work to assess cognitive load in this population. Practically, this may ultimately enable surgical educators to refine instruction to optimize training and surgical trainees to recognize their individual cognitive processing in procedural contexts. In this study, we aimed to gather preliminary validity evidence for this adapted tool.

Materials and methods

Study design

This is a psychometric study designed to provide validity evidence for the Cognitive Load Inventory for Surgical Skills (CLISS), a self-rating instrument measuring the cognitive load experienced by surgical trainees while learning procedural skills relevant to their practice. Based on the unitary model of validity [28, 29], we obtained evidence from three sources: instrument content, response process, and internal structure. The Institutional Review Board of the University of California San Francisco (UCSF) granted the study exempt status as an educational research study with minimal risk.

Initial development of the Cognitive Load Inventory for Surgical Skills: content validity

A review of the literature, discussion with surgical faculty at UCSF, and input from cognitive load experts revealed that while several instruments measuring cognitive load have been applied to surgeons and surgical trainees performing surgical procedures in simulated and clinical settings [25], none were designed to focus specifically on procedural skills in this population. To support content validity in developing the Cognitive Load Inventory for Surgical Skills (CLISS), the authors started with the Cognitive Load Inventory for Colonoscopy (CLIC), the first self-report cognitive load instrument with validity evidence measuring IL, EL, and GL developed specifically for a procedural skill [27]. The 3-factor structure of the CLIC was retained, but each of the IL, EL, and GL items was modified to reflect the learning of any skill in a procedural setting. Cognitive load experts and the CLIC creators iteratively reviewed drafts to ensure the adjustments adequately retained concepts related to each of the three types of cognitive load, while two expert surgeons reviewed the drafts to ensure the concepts were relevant to surgical training.

Iteration and implementation of the Cognitive Load Inventory for Surgical Skills: response process

To examine the response process of surgical trainees, we conducted cognitive interviews with five surgical residents at the senior author’s institution to assess the clarity of wording and appropriate interpretation of statements. During each interview, trainees explained aloud how they understood each item relative to a hypothetical simulation of the creation of an anastomosis at depth. Minor revisions were made to various items based on this process to optimize brevity, wording, and clarity. The final product was again reviewed with the CLIC creators to ensure concepts related to each of the three types of cognitive load were retained. As the CLISS was designed to accommodate the learning of any surgical skill, the instructions for each item read: “Please answer the following related to [specific task to be input here],” allowing survey administrators to complete the sentence based on the procedural task of interest. CLISS respondents indicated their level of agreement on a 5-point Likert scale (1 = strongly disagree to 5 = strongly agree). The response process was further evaluated in different settings (simulation exercises, operating room cases), relative to different surgical tasks, and with different respondents (residents, fellows, faculty).

Analysis of responses to the Cognitive Load Inventory for Surgical Skills: internal structure

Study participants included surgical residents, fellows, and faculty at UCSF who were asked either in person or by email to complete the CLISS relative to a surgical procedure they had just completed. We included surgical residents, fellows, and faculty in recognition that there is opportunity for growth at every level and that the concept of cognitive load is relevant to individuals across levels. Including respondents from a range of levels would allow us to apply the tool to various forms of education, including trainees at the start of residency or subspecialty training and faculty learning a new modality (e.g., robotic surgery) or pursuing continuing medical education. We set an a priori target of 100 respondents. In the absence of clear consensus around the minimum sample size for assessing internal structure, we arrived at n = 100 based on prior literature [30, 31] as well as an a priori sample size calculator for structural equation models [32].

Table 1 provides a detailed overview of the settings in which the CLISS was administered. Each setting was part of our organization’s standard surgical education curriculum, reflecting either a simulation organized by our Surgical Skills Center or an OR-based experience. Regardless of the invitation mechanism, surveys were distributed and completed using REDCap, a web-based academic research survey platform [33]. Participants were instructed to complete the questionnaire in its entirety based on the instructions included within the survey.

Table 1 Overview of settings in which CLISS was administered

Evidence for the validity of the instrument's internal structure was provided in two forms. First, internal consistency (reliability item analysis) was examined using Cronbach's alpha. Second, confirmatory factor analysis (CFA) was conducted to confirm that relationships among the items were as hypothesized, specifically that items intended to measure each of the three types of CL (IL, GL, EL) would cluster together and have little overlap with items measuring the other CL types. The adequacy of the 3-factor model was examined with three model fit indices: Bentler’s Comparative Fit Index (CFI), the Tucker–Lewis Index (TLI), and the root mean square error of approximation (RMSEA). By convention, CFI and TLI values greater than 0.90 and RMSEA values less than 0.08 were used to indicate acceptable fit [34]. A sensitivity analysis was performed eliminating items whose factor loading estimates (1) were less than 0.40, (2) were not statistically significant, or (3) loaded onto more than 2 factors. Analyses were performed using Stata version 17.0 [35].
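To illustrate, these analyses could be reproduced in Stata with commands along the following lines. This is a minimal sketch, assuming hypothetical item variables named il1–il7, el1–el7, and gl1–gl7 (for illustration, seven items per factor; the actual CLISS items and their factor assignments appear in Fig. 1):

    * Internal consistency (reliability item analysis): Cronbach's alpha
    * with item-level statistics, including alpha if each item is removed
    alpha il1 il2 il3 il4 il5 il6 il7, item
    alpha el1 el2 el3 el4 el5 el6 el7, item
    alpha gl1 gl2 gl3 gl4 gl5 gl6 gl7, item

    * Confirmatory factor analysis with three correlated latent factors;
    * sem treats names beginning with a capital letter as latent variables
    sem (IL -> il1 il2 il3 il4 il5 il6 il7) ///
        (EL -> el1 el2 el3 el4 el5 el6 el7) ///
        (GL -> gl1 gl2 gl3 gl4 gl5 gl6 gl7), standardized

    * Goodness-of-fit indices, including CFI, TLI, and RMSEA
    estat gof, stats(all)

The standardized option reports standardized factor loadings, which are the estimates referenced in the sensitivity analysis described above.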

Results

The 21-item CLISS (Fig. 1) was distributed in 7 settings to 138 participants and yielded 100 responses (72% response rate; Table 1). Of the 100 respondents, 99 (99%) completed the entire CLISS; 1 respondent left a single item unanswered. No respondents raised questions to survey administrators at the time of survey completion, though several noted that the survey was long. Each of the 21 items garnered a range of responses (Table 2).

Fig. 1 21-item Cognitive Load Inventory for Surgical Skills, developed based on the Cognitive Load Inventory for Colonoscopy

Table 2 Descriptive summary of responses to CLISS

The results of the reliability item analysis are shown in Table 3. All 3 types of cognitive load had a Cronbach’s alpha above 0.7. However, several individual items did not correlate with the other items for the same load type, as demonstrated by the improvement in Cronbach’s alpha upon their removal. The results of the CFA are shown in Table 4. The model had the following fit index values: CFI = 0.627, TLI = 0.579, RMSEA = 0.124. These values all fall outside the conventional cutoffs, with the indices agreeing that the initial 3-factor, 21-item model is not a good fit.

Table 3 Results of reliability item analysis
Table 4 Results of confirmatory factor analysis

Eliminating items with factor loading estimates that (1) were less than 0.40, (2) were not statistically significant, or (3) loaded onto more than 2 factors resulted in a revised 11-item tool (Fig. 2). Cronbach’s alpha improved for IL and GL in the revised tool (Table 5). The revised model had the following fit index values: CFI = 0.940, TLI = 0.920, RMSEA = 0.076. These values all improved from the prior model and fall within the conventional thresholds for goodness of fit, with the indices agreeing that the revised 3-factor, 11-item model fits well.
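For illustration, this elimination step could be carried out in Stata as follows, continuing the hypothetical item names from the Methods sketch; the retained subset shown here is invented for illustration, and the actual retained items appear in Fig. 2:

    * After fitting the full 21-item model, review the standardized
    * loadings for estimates below 0.40 or with non-significant z-tests,
    * and inspect modification indices for evidence of cross-loading
    estat mindices

    * Refit the model on the retained items (an illustrative 11-item
    * subset: 4 IL, 3 EL, and 4 GL items) and re-examine fit
    sem (IL -> il1 il2 il4 il6) ///
        (EL -> el2 el3 el7) ///
        (GL -> gl1 gl3 gl5 gl7), standardized
    estat gof, stats(all)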

Fig. 2 Revised 11-item Cognitive Load Inventory for Surgical Skills

Table 5 Results of reliability item analysis, best model fit

We further eliminated an item pertaining to intrinsic load that referred to performing tasks “at depth.” Upon further review, our team felt this item was not aligned with our objective of generalizability to any surgical setting, as there may be procedures or simulations that do not have an at-depth component. This resulted in the current 10-item version of the CLISS, including 3 items pertaining to IL (α = 0.82), 4 items pertaining to GL (α = 0.84), and 3 items pertaining to EL (α = 0.82; Fig. 3). Removing any additional items resulted in lower model fit indices (data not shown).

Fig. 3 Current 10-item Cognitive Load Inventory for Surgical Skills

Discussion

In this study, we adapted the CLIC to measure the three types of cognitive load experienced when performing procedural skills broadly and to apply across different surgical settings. We tested the newly developed instrument, the CLISS, with respondents at different training levels and in multiple settings to generate preliminary validity evidence for the tool.

Our work suggests three conclusions pertaining to the validity evidence for the CLISS. First, content validity is supported by our use of the established CLIC as a starting point [27]. Importantly, we retained the 3-factor structure to measure IL, EL, and GL while adjusting the CLIC, and we iteratively incorporated feedback from the CLIC creators to ensure the adjustments adequately retained concepts related to each type of cognitive load. This 3-factor structure reflects the tenets of CLT, an accepted model for identifying and quantifying the activity of working memory in complex learning settings, such as procedural skills training [27, 36, 37].

Second, the assessment of the response process suggested that the CLISS could be a practical tool for administration after simulated or OR-based experiences. Through cognitive interviews with surgical residents that included think-alouds for each item within the CLISS, we were able to assess the appropriate interpretation of items. Furthermore, administering the tool through different distribution mechanisms (in person vs. via email), on different timelines (immediately after completing the task(s) vs. after a delay), and by different distributors (surgical faculty vs. surgical trainee vs. administrative personnel) allowed us to draw preliminary practical inferences about the response process. Namely, in-person, faculty-initiated distribution of the tool immediately after completion of the encounter containing the task(s) being evaluated may be an effective strategy to maximize response. This is consistent with findings in other settings that rely on survey response, including academic scholarship, marketing research, and event management, which highlight the efficacy of in-person distribution relative to other methods [38, 39], the relevance of who issues the survey [40], and the importance of survey timing [41]. Rigorous study of the relationship between response rate and these factors was outside the scope of the present study and could be pursued in the future to maximize response rates among this time-constrained population.

Third, an assessment of the internal structure of the original 21-item CLISS revealed an opportunity to optimize the tool. Despite reasonable internal consistency [34], CFA identified several items that did not fit their respective factors well, indicating that the 3-factor model with the original 21 items was not a good fit. The systematic elimination of these items resulted in a 3-factor, 11-item model with improved model fit. Notably, 10 items (nearly 50% of the questionnaire) were eliminated through this approach, despite our team conducting cognitive interviews to support the content validity of the questionnaire. This may be due to the use of a single surgical procedure (i.e., simulated anastomosis at depth) as the prompt for the think-alouds during these interviews, which may have prevented the cognitive interview process from revealing items that were not generalizable to all surgical procedures. Moreover, the removal of the additional item that referred to performing tasks “at depth” further supported the content validity of the tool, as this item was felt not to fit the objective of being generalizable to any surgical setting. The multi-factorial process of generating validity evidence (iterative review with surgical educators and experts in CLT, cognitive interviews with residents, administration of the survey in various settings, and statistical assessment of internal structure) resulted in the current 10-item version of the CLISS, which includes 3 items pertaining to intrinsic load, 4 items pertaining to germane load, and 3 items pertaining to extraneous load. Through this rigorous process, our study offers further empirical evidence supporting the 3-factor model of CLT over the more recently proposed dual division of cognitive load into intrinsic and extraneous load [42].

Having an evidence-based instrument designed specifically to measure cognitive load in surgical trainees in any procedural setting may facilitate the implementation and assessment of interventions to enhance surgical education. According to CLT, learning is maximized when IL (inherent task difficulty or complexity) is matched to experience, EL (distractions in the learning environment and ineffective teaching techniques) is minimized, and GL (deliberate use of learning techniques that promote consolidation) is optimized to support schema development [36, 37, 43]. The ability to measure the three types of cognitive load in surgical trainees would enable surgical educators to gain insight into elements of the learner’s education that could be adjusted (i.e., factors that affect EL and GL) and to assess the impact of adjustments on cognitive load. The CLISS was deliberately designed to be general so that it could be applied to any context without being overly onerous for time-constrained respondents. Though this generality limits its ability to identify the specific, contextual aspects of the learning and clinical environment that may be contributing to cognitive load in a particular setting, the CLISS is intended to serve as a starting point for surgical educators to understand the effectiveness of individual teachers, learning activities, curricula, and the learning environment [27]. Findings from the CLISS may be used to prompt a closer examination of particular aspects of a given setting to facilitate implementation of changes. The tool could then be used to assess the impact of those changes and to monitor their continued effectiveness through longitudinal evaluation of cognitive load.

Importantly, a versatile tool that can be applied to any procedure and in any teaching context (i.e., simulation or OR) would allow surgical education researchers to measure cognitive load in various procedural contexts. In its general form, it could enable surgical educators to compare the cognitive load experienced by individual trainees performing different procedures (e.g., superficial and at-depth suturing) within the same setting, and performing the same procedure across different settings (e.g., in a simulation lab and in the OR), to further isolate opportunities to enhance instruction. It could also enable learners to gain insight into their own cognitive load in these settings, which could guide individual educational strategies (e.g., deliberate practice vs. environmental optimization).

Though our study provides preliminary validity evidence to support the use of the CLISS in measuring the three types of cognitive load in surgical trainees in any procedural setting, it has several limitations. First, as with any post-experience survey, responses may have been affected by peak-end memory bias, whereby an experience is judged based on its most emotionally intense moments (whether positive or negative) and its final moments [44]. However, completing the CLISS may actually have offered respondents an opportunity to comprehensively process and reflect on their experiences in a systematic manner. By reflecting on the various moments and events that occurred during the experience (as prompted in the CLISS), respondents may integrate a broader range of emotions and occurrences into their evaluation, which may balance the influence of the emotional peak and the ending. Second, the revised version of the CLISS was derived from an analysis of responses to the original 21-item tool. The current 10-item version should be administered to a new set of participants in diverse settings to confirm acceptable internal structure and good model fit. Third, relationships to other variables were not assessed as part of this study. Future studies comparing CLISS-measured cognitive load with objective measures of learning (e.g., task performance) and mental effort (e.g., functional MRI [45] or psychophysiological measurements [46]) would provide further evidence for the validity and utility of this tool. Finally, the preliminary validity evidence in this study is based on respondents from a single academic center. The generalizability of the tool would be established through a national multi-center study that includes broad geographic, demographic, and post-graduate year representation.

In summary, the tenets of CLT offer insight into optimizing the learning of complex, high-stakes tasks in dynamic, fast-paced settings, such as surgical procedures performed in an operating room or a simulation center. We developed an instrument that measures the three types of cognitive load based on the principles of CLT. Once additional validity evidence is collected, the CLISS could be applied to measure cognitive load across diverse learning settings in surgical education, which we believe will serve as a tool for educators to optimize surgical training.