Overview of study design
The study was approved by the University of Toronto Office of Research Ethics, complied with its requirements, and took place in Toronto, Canada. We developed a single teaching and learning session based on theories of critical reflection, reflexivity, and critical pedagogy, and on the teaching materials and scholarship of the first author and colleagues (Kinsella et al., 2012; Ng, 2012; Ng et al., 2015a, b; Phelan & Ng, 2015; Halman et al., 2017; Baker et al., 2018, 2020; Ng et al., 2018, 2020). To test its effectiveness, we employed an established education research design meant to test the transfer of learning to subsequent learning experiences, rather than simply immediate knowledge acquisition and retention, as outlined in Fig. 1.
This design was intended to make visible how participants engaged with, or saw, a new learning experience following their instructional exposure. Participants in the control and intervention conditions completed an online module about SDoH, followed by different instructional exposures: either a critically reflective dialogue (intervention) or further SDoH discussion (control). Participants in both conditions then experienced two common learning sessions that served as our outcome assessment: one immediately after the instructional exposure and another one week after the initial training. These two common experiences enabled measurement of the outcomes of interest.
The teaching materials used in this study are summarized in Table 1. They included an online module on SDoH (for both control and intervention conditions), a guide for the follow-up SDoH discussion (control), and a guide for the critically reflective dialogue (intervention). An online homecare curriculum and an accompanying debriefing guide served as the common learning resources that generated the outcome measures.
An interprofessional group of students (n = 75) was recruited from: the first two years of the University of Toronto MD program (n = 31); a fourth-year undergraduate pre-clinical service-learning course with students of mixed degree backgrounds (n = 18); the first year of master's-level occupational therapy (n = 6), physical therapy (n = 6), and speech-language pathology (n = 10) programs; and one student each from the dentistry, pharmacy, physiotherapy assistant, and radiation therapy programs (n = 4). We chose pre-clinical and early-year students to minimize the amount of formal health professions education and clinical practice exposure they would have had.
Participants were assigned to sixteen groups of five, ensuring a multiprofessional complement of students per group to enable interprofessional learning. Eight groups were randomly assigned to the control condition and eight to the intervention condition. We began the study with 80 participants, but five (two from the control condition and three from the intervention condition) dropped out prior to participation due to factors external to the study (e.g., weather conditions).
A research coordinator obtained informed consent from students and managed administrative elements of the data collection sessions. Both control and intervention participants completed a short online SDoH curriculum module. Each group of five participants completed this module in a computer lab, working at their own pace and using headphones for privacy, with a research coordinator present to ensure no distractions.
In the same groups of five, learning then diverged as participants proceeded to either an SDoH discussion (control) or a critically reflective dialogue session (intervention). All sessions took place in a small student lounge, with chairs set up in a circle and the discussion/dialogue facilitator seated among the participants. These details align with a critical pedagogy approach, in which the teacher-learner hierarchy is minimized and a learning climate conducive to openly challenging assumptions is emphasized. The SDoH session used a semi-structured facilitation guide ("Appendix 1"). A television screen was connected to a laptop to project supporting slides for the critically reflective dialogue session only ("Appendix 2"). Participants were provided with food and beverages during these sessions. Both the control and intervention sessions were one hour in duration. Sessions were audio-recorded and later transcribed verbatim. Both facilitators held master's degrees and were trained to run their sessions by the first author. To minimize the confounding effect of individual facilitation styles, the facilitators for the SDoH discussion and the critically reflective dialogue session switched roles at the halfway point of data collection. We also included facilitator as a factor in the analysis to account for any potential facilitator effects.
After completing the SDoH discussion or critically reflective dialogue session, participants were given a brief 10-min break before reconvening in the same groups in a small conference room. The setup included a table for participants and the facilitator, and a television screen to display the learning resource, the CACE Homecare Curriculum (http://www.capelearning.ca). A third facilitator ran the homecare curriculum and debriefs. Our aim was not to determine whether the participants learned about homecare or applied SDoH to the homecare context. Instead, we used the homecare curriculum followed by a debrief to uncover whether the critically reflective dialogue session, relative to the SDoH discussion, influenced what topics participants talked about during the curriculum and debrief, as well as how they talked during these experiences (in a critically reflective manner, or not). Participants were instructed to act as though they were directly involved in the patient case module. Each group completed one 35-min module from the curriculum, either "Amrita" (dementia-focused) or "Anne" (delirium-focused), with the facilitator's guidance. The sessions were audio-recorded and transcribed verbatim.
The facilitator then guided a 30-min debriefing with each group immediately after they completed the homecare curriculum. The facilitator was trained to use the Promoting Excellence and Reflective Learning in Simulation (PEARLS)-informed (Eppich & Cheng, 2015) debriefing script (see "Appendix 3"). The debrief prompted participants to discuss their experience of the homecare curriculum, their thoughts on the older adults' situations and needs, and key lessons learned. The debriefing was audio-recorded and transcribed verbatim. The homecare/debrief facilitator also held a master's degree, was trained by the first author, and remained blinded to each group's condition assignment.
One week later, participants returned in their groups to complete a follow-up homecare curriculum module and debrief, to determine whether any effects identified in the previous session persisted or changed. All participants completed the other patient case module (if they received the Amrita module, they now received Anne, and vice versa), followed by a similarly structured debrief ("Appendix 3"). The sessions and debriefs were audio-recorded and transcribed.
Analysis was performed on the transcripts of each group of five interprofessional learners, from the control (SDoH discussion) and intervention (critically reflective dialogue) sessions, as well as from the homecare curriculum sessions and debriefs immediately post-instruction and at follow-up.
Before coding, meaning units were created to ensure consistency in the amount and type of text to which the what and how codes were subsequently applied, and to ensure that our two coders coded the same segments of text. Meaning units were created as follows. Within each transcript, every unique utterance by a participant (with the boundaries of an utterance determined by a change in speaker) was labelled as a meaning unit, unless it met one of the exclusion criteria: facilitator comments; neutral affirmations ("mhmm," etc.); responses to homecare module quiz questions; responses about the quality of the module with no additional comment about content (e.g., "This module is fun"); responses to questions about participants' program/year; clarifying questions (i.e., the facilitator asks a question and the participant asks for clarification); or statements that added no meaning. Every meaning unit was then ready to be coded with at least one what code and one how code.
To create the what coding framework, two researchers, VB (co-author) and the first author, initially coded transcripts inductively to arrive at an agreed-upon set of eight descriptive codes, plus a "no code" code, with associated descriptions for each. They iteratively refined the code definitions until the codes could be applied strictly to name the topic of each meaning unit without further changes. While sub-codes were used to assist the coding process, only the higher-level codes were used in our statistical analyses.
For the how coding framework, the following definition was used to code data as critically reflective or not. Meaning units coded as critically reflective were statements that: move beyond a dominant discourse (e.g., discuss a social rather than medical model of disability); question individual or societal assumptions/beliefs (e.g., a clinician's belief that they have authority over what a school does for a child with a disability); demonstrate awareness of the broader system and how one is situated within it (e.g., recognize that personal support workers may lack resources and training opportunities relative to regulated health professionals); question or challenge structures (e.g., question whether current funding approaches for homecare are limiting possible practices); and resist harmful practices (e.g., speak up upon noticing something concerning) (Ng et al., 2019a, b). Meaning units coded as not critically reflective were: neutral descriptions; narrow views (e.g., illustrated through stigmatizing language); rote following/description of procedures or steps; and blaming or patronizing the patient/caregiver/health worker.
Two blinded coders (VB, co-author, and LN, acknowledged) were trained to apply these codes to the transcripts. We calculated inter-rater agreement on eight transcripts to ensure the coders applied the codes consistently. For what codes, raw rater agreement was 95.9%; for how codes it was 82.1%. We determined that this was sufficient for the two coders to proceed independently, and each subsequent transcript was coded by a single coder.
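Raw percent agreement of this kind reduces to a simple proportion; a minimal sketch in R, with hypothetical vectors holding the two coders' code assignments for the same meaning units:

```r
# Raw percent agreement: proportion of meaning units assigned the same code
# by both coders (coder1_codes and coder2_codes are hypothetical vectors,
# one element per meaning unit).
raw_agreement <- mean(coder1_codes == coder2_codes) * 100
```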
We conducted our analysis in accordance with the aims of this study: to investigate whether teaching for critical reflection influenced what students talked about during a future learning experience and how they talked about it. Thus, we constructed two regression models, one to model the presence of a what code in each meaning unit and one to model the presence of a how code in each meaning unit.
Paradigmatically aligned measurement and analyses were crucial to this study. We constructed, tested, and selected our models under a Bayesian framework, for the following reasons. Frequentist inference addresses the question of how probable a set of observed data would be if there were no effect. We were specifically interested in the question of how likely a code would be to be present in a meaning unit as a result of our intervention, which is an inherently Bayesian question. Bayesian inference directly quantifies uncertainty: it provides probability estimates for parameters and their differences, together with the uncertainty in those estimates and thus in any conclusions drawn. Bayesian inference treats the gathered data as 'fixed' and treats models and their parameters as 'varying' attempts to explain the observed data, while quantifying model and parameter uncertainty. Some in medical education have compared Bayesian statistics to constructivist grounded theory for the way in which it positions models as imperfect best attempts at representing the story of the data, with analyses informed by prior knowledge (Young et al., 2020). For details on Bayesian statistics, particularly the posterior distribution, we recommend McElreath (2020). All models were constructed using the Stan programming language (Carpenter et al., 2017) through the rstan (Stan Development Team, 2020) and brms (Bürkner, 2017, 2018) packages in R statistical computing software (R Core Team, 2019).
We modelled predictive probabilities of what codes with a hierarchical multinomial regression model. Meaning units were categorized as containing any or all of nine codes, based on our inductive coding framework: building on prior knowledge, CanMEDS roles, caregiver, professional expertise, patient, patient-psychosocial, recommendations for practice, social determinants of health, or no code (for meaning units that remained in the codable set despite lacking relevance to any of the eight main codes). The definitions of these codes are included in the codebook within "Appendix 4". Population-level effects in this model included condition (control or intervention), session (initial instruction, initial homecare, initial debrief, follow-up homecare, follow-up debrief), and facilitator (Facilitator 1 or Facilitator 2). We also included interaction terms for: code*condition, code*session, condition*session, and facilitator*session. SDoH discussion group (for control participants) or critically reflective dialogue group (for intervention participants) was entered as a varying effect to adjust for clustering of meaning units within each five-member group. The presence of "no code," at the instruction session, in the control condition, with Facilitator 1, was used as the reference case. We used mildly informative priors on the regression coefficients: αi ~ N(0, 1), βi ~ N(0, 1), σi ~ Cauchy(0, e^1).
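To make this specification concrete, a minimal sketch of such a model in brms follows. This is not the authors' exact code: the data frame mu_data and its column names (code, condition, session, facilitator, group) are hypothetical. In brms's categorical family, each non-reference response category receives its own set of coefficients, which corresponds to the code*condition and code*session interactions described above.

```r
library(brms)

# Minimal sketch (assumed names, not the study's actual code): each row of
# mu_data is one meaning unit, with factor columns for the nine-level what
# code ("no code" as the reference level), condition, session, facilitator,
# and the five-member group.
what_model <- brm(
  code ~ condition * session + facilitator * session + (1 | group),
  data   = mu_data,
  family = categorical(link = "logit"),
  chains = 4, cores = 4, seed = 1
)
# The paper's mildly informative priors (Normal(0, 1) on coefficients, a
# half-Cauchy scale on the group effect) would be supplied via set_prior(),
# targeting each response category's distributional parameter (dpar = "mu<code>").
```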
We used Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO) (Vehtari et al., 2016) to evaluate and compare the fit to the data of our full (hierarchical) model against an "empty" (intercept-only) model and a non-hierarchical model (with the same population-level effects). We found our full model to be a substantially better fit to the data than both (i) the intercept-only model, with a favorable difference in PSIS-LOO expected log predictive density (ELPD) of 1599.8 (standard error 42.2), and (ii) the non-hierarchical model, with a favorable ELPD difference of 93.9 (standard error 13.3).
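A sketch of this comparison using brms's interface to the loo package follows; what_model is the fit sketched above, while what_empty and what_flat are hypothetical names for the intercept-only and non-hierarchical fits.

```r
# Sketch of the PSIS-LOO comparison, assuming three fitted brmsfit objects
# (hypothetical names): the full hierarchical model, an intercept-only
# model, and a non-hierarchical model with the same population-level effects.
loo_full  <- loo(what_model)
loo_empty <- loo(what_empty)
loo_flat  <- loo(what_flat)
loo_compare(loo_full, loo_empty, loo_flat)  # ELPD differences with standard errors
```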
To model the predictive probabilities of the how codes, we constructed a hierarchical binary logistic regression model. Meaning units were categorized dichotomously as either containing or lacking critical reflection. Population-level effects in this model included condition (control or intervention), session (instruction, initial homecare, initial debrief, follow-up homecare, follow-up debrief), and facilitator (Facilitator 1 or Facilitator 2). We also included interaction terms for: condition*session and facilitator*session. As in the what model, SDoH discussion group (for control participants) or critically reflective dialogue group (for intervention participants) was entered as a varying effect to adjust for clustering of meaning units within five-member groups. The instruction session, in the control condition, with Facilitator 1, was used as the reference case. We used mildly informative priors on the regression coefficients: αi ~ N(0, 1), βi ~ N(0, 1), σi ~ Cauchy(0, e^1).
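A minimal brms sketch of this model, under the same hypothetical data frame as above and with an assumed 0/1 indicator column cr for critical reflection, might read as follows; the half-Cauchy scale mirrors the reported prior, read here as scale e:

```r
# Sketch (assumed names): hierarchical logistic regression of the how code.
# mu_data$cr is a 0/1 indicator of critical reflection in each meaning unit.
how_model <- brm(
  cr ~ condition * session + facilitator * session + (1 | group),
  data   = mu_data,
  family = bernoulli(link = "logit"),
  prior  = c(
    prior(normal(0, 1), class = "Intercept"),
    prior(normal(0, 1), class = "b"),
    prior(cauchy(0, 2.718), class = "sd")  # half-Cauchy on the group SD; scale read as e
  ),
  chains = 4, cores = 4, seed = 1
)
```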
As with our what model, we used PSIS-LOO to evaluate and compare the fit to the data of our full (hierarchical) model against an "empty" (intercept-only) model and a non-hierarchical model (with the same population-level effects). We found our full model to be a substantially better fit to the data than both (i) the intercept-only model, with a favorable difference in PSIS-LOO ELPD of 172.6 (standard error 15.5), and (ii) the non-hierarchical model, with a favorable ELPD difference of 18.0 (standard error 6.1).
To evaluate the effect of our intervention, we calculated the distribution of differences in posterior predicted probabilities of codes being present in each session type, and then determined whether the 89% highest-density credible intervals of these difference distributions included zero, where the inclusion of zero was taken as an indication of no difference between conditions. The credible interval is the range of the posterior distribution containing the specified proportion (i.e., 89%) of the parameters of interest (i.e., the differences in posterior predicted probabilities), such that one could say: "given the observed data, the effect has an 89% chance of falling in this range," as opposed to a less intuitive frequentist confidence interval, which would be interpreted as "there is an 89% probability that when computing a confidence interval from data of this sort, the effect falls within this range" (Makowski et al., 2019). The 89% credible interval is recommended by a number of leading Bayesian statistics thinkers for its computational stability relative to 95% intervals (Kruschke, 2014). While 90% was also proposed for this same reason, McElreath (2014, 2020) suggested that 89% makes potentially more sense because 89 is "the highest prime number that does not exceed the already unstable 95% threshold" (Makowski et al., 2019a, b). Because Bayesian analyses yield probability distributions of parameter values (e.g., regression coefficients) calculated directly from observed data, they do not require calculation of p values or their associated confidence intervals. Rather, uncertainty is quantified directly from the calculated probability distributions of parameters (Kruschke & Liddell, 2018).
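For illustration, the following sketch computes such a difference distribution and its 89% highest-density interval for a single session type, using the hypothetical how_model fit sketched above and the hdi() function from the bayestestR package; the newdata factor levels are assumed labels, not necessarily the study's coding.

```r
library(bayestestR)

# Posterior predicted probability of critical reflection in one session type,
# for each condition, marginalizing over groups (re_formula = NA drops the
# varying group effect). All factor levels below are assumed labels.
nd_ctrl <- data.frame(condition = "control", session = "initial debrief",
                      facilitator = "Facilitator 1")
nd_int  <- nd_ctrl
nd_int$condition <- "intervention"

p_ctrl <- posterior_epred(how_model, newdata = nd_ctrl, re_formula = NA)
p_int  <- posterior_epred(how_model, newdata = nd_int,  re_formula = NA)

diff_draws <- as.numeric(p_int - p_ctrl)  # one difference per posterior draw
hdi(diff_draws, ci = 0.89)                # does the 89% HDI include zero?
```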