The term “educational environment” relates to how learners perceive their teaching and learning in the clinical setting. It is defined by the American Medical Association as “a social system that includes the learner (including the external relationships and other factors affecting the learner), the individuals with whom the learner interacts, the setting(s) and purpose(s) of the interaction, and the formal and informal rules/policies/norms governing the interaction”.1 A suitable educational environment is critical for effective knowledge transfer, skills progression, and development in the affective domain. This is especially crucial in the clinically important, time-critical environment of the operating theatre. Routine direct observation of teaching encounters is resource-intensive and not feasible. Educational environment measures (EEMs) are survey instruments administered to learners that have been developed for a variety of clinical settings. Educational environment measures with high reliability and validity may be used as a surrogate for direct evaluation of teaching and learning, enabling continual professional development and quality improvement at departmental and regional levels. A systematic review by Soemantri et al. lists 31 published EEMs in health professions education, nine of which were designed for the postgraduate medical setting.2

There is no contemporary EEM for anesthesia teaching encounters in the operating theatre. Previously developed clinical anesthesia EEMs were either validated only within a specific area or region or were not focused on clinical teaching encounters. Holt and Roff published the Anaesthetic Trainee Theatre Educational Environment Measure with input from focus groups of anesthesiology trainees, educational supervisors, and regional program directors, piloted on 218 trainees in one training region of the United Kingdom (UK).3 Smith and Castanelli developed an EEM with emphasis on the general learning environment rather than in-theatre teaching, piloted on 263 trainees in New Zealand (NZ) and Australia.4 They recently performed a second pilot on 172 trainees from one Australian training region, altering the Likert scale from the first pilot to determine the minimum number of respondents required to maintain the reliability of the measure.5

The objectives of our study were to:

  1. Develop a clinical anesthesia EEM for teaching encounters in the operating theatre that was contemporary, based on the most recent evidence, and generalizable to different training programs.

  2. Interpret the pilot study results to guide future research.

The Measure for the Anaesthesia Theatre Educational Environment (MATE) was developed using a series of methodologies to show validity, with the measure piloted in different training jurisdictions. We chose to focus specifically on the operating theatre because the vast majority of teaching and learning in clinical anesthesia occurs there, providing insight into current practice and a focal point for potential quality improvement. Results from the pilot survey will be used to identify areas of future research.

Methods

We obtained approval from the Awhina Research and Knowledge Centre (Protocol RM13266). We used a literature search to generate the initial item list (see Appendix I; available as Electronic Supplementary Material [ESM]), categorized into four provisional domains that were identified a priori. A modified Delphi approach was used to develop consensus on item inclusion to ensure content validity. The Delphi approach is an established objective method of obtaining expert consensus, allowing a large number of experts to contribute anonymously in a non-adversarial manner in a series of phases, with successive feedback of collective opinion and opportunity for correction.6 The Delphi methods are explicitly described in Appendix I (available as ESM).7

Pilot of MATE draft

We administered the pilot to junior doctors in seven countries. We defined junior doctors as medical practitioners working in clinical anesthesia who required some form of supervision. These included interns, residents, house officers, medical officers, registrars, or fellows, and were not limited to those in vocational training programs. We anticipated that the majority of participants would be vocational trainees, and the term “trainee” was used in department correspondence, with the above definition emphasized. Residency program directors/coordinators in 16/17 Canadian, all 134 United States (US), and all three Singaporean anesthesiology residencies were sent an email with a request to forward the survey link to trainees in their departments. The heads of school or trainee representatives for 23/24 UK Schools of Anaesthesia and two large Hong Kong Special Administrative Region (HK SAR) teaching hospitals were sent a similar email. The Australian and New Zealand College of Anaesthetists (ANZCA) Clinical Trials Network facilitated email invitations to a random sample of 1,000 NZ and Australian trainees, and trainees in the Auckland region of NZ were individually emailed. United States military institutions, one Canadian institution, one UK School of Anaesthesia, and all but two Hong Kong departments were not approached because contact details could not be sourced.

Respondents were asked: “Please rate the following statements as they apply to your perception of teaching in the operating theatres of this department (applies to any site where anesthesia is delivered, including endoscopy or interventional suites)”. A seven-point rating scale was applied to all items (0 = strongly disagree, 6 = strongly agree). Participants were required to have been working in their department for a minimum of eight weeks to ensure adequate exposure to their clinical environment. The survey was administered using an online survey tool (SurveyMonkey) and collected anonymously, with relevant demographic information. Stratification of results by demographic parameters was performed to guide possible future research.

Reliability and exploratory factor analyses (EFA)

Reliability and EFA of a pilot survey enabled refinement of the item list and demonstration of construct validity.8,9 Exploratory factor analysis is used to identify a set of latent constructs underlying a group of observed variables, as measured through items or questions.8,10 A series of mathematical iterations (factor rotations) creates linear combinations to explain the data, with each iteration revealing new information that allows the researcher to examine the relationships between items and factors.8 Redundant items may be removed if they load poorly onto factors or if they cross-load without strong primary loadings. The structure is refined until an efficient, mathematically sound, and theoretically grounded solution is reached.8 As this method of factor analysis is inherently designed to be exploratory,9 a global assessment is required to show real-world constructs for item-factor relationships.
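
As an illustrative sketch (not the SPSS workflow used in the study), the removal rules can be expressed as a filter over a rotated loading matrix. The loading values and the `items_to_remove` helper below are hypothetical; the numeric cutoffs are those given under Statistical analysis.

```python
import numpy as np

# Cutoffs as described under Statistical analysis: items need a primary
# loading of at least 0.4, and a cross-loading of 0.3 or above leads to
# removal unless the primary loading is strong (>= 0.65).
PRIMARY_MIN, CROSS_MAX, STRONG_PRIMARY = 0.4, 0.3, 0.65

def items_to_remove(loadings):
    """Return indices of items failing the retention criteria."""
    remove = []
    for i, row in enumerate(np.abs(np.asarray(loadings, dtype=float))):
        ranked = np.sort(row)[::-1]
        primary, largest_cross = ranked[0], ranked[1]
        if primary < PRIMARY_MIN:
            remove.append(i)          # loads poorly on every factor
        elif largest_cross >= CROSS_MAX and primary < STRONG_PRIMARY:
            remove.append(i)          # cross-loads without a strong primary
    return remove

# Hypothetical loadings for four items on two factors
L = np.array([
    [0.75, 0.10],   # strong primary, negligible cross-loading: keep
    [0.35, 0.20],   # primary below 0.4: remove
    [0.55, 0.45],   # cross-loads and primary below 0.65: remove
    [0.70, 0.35],   # cross-loads but primary is strong: keep
])
flagged = items_to_remove(L)
```

In practice, such a filter would be applied after each rotation, with the rotation rerun on the surviving items until nothing is flagged.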

Statistical analysis

Basic analysis for the Delphi phases was performed using Microsoft Excel (Microsoft, Redmond, WA, USA). Data from the pilot survey were analyzed with IBM SPSS 24 (IBM, Armonk, NY, USA). P ≤ 0.05 was considered statistically significant. We applied the Kolmogorov-Smirnov test to the pilot survey to determine if data were normally distributed. Cronbach’s α coefficients were generated to appraise internal consistency reliability, both prior to and after EFA.
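
As a concrete illustration of the internal consistency step, the following sketch computes Cronbach’s α from a respondents-by-items matrix. This is an illustrative reimplementation with toy data, not the SPSS procedure used in the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: five respondents answering three highly correlated items (0-6 scale)
data = np.array([
    [4, 5, 4],
    [5, 6, 5],
    [3, 4, 3],
    [6, 6, 6],
    [2, 3, 2],
])
alpha = cronbach_alpha(data)   # high alpha, since the items move together
```

Perfectly redundant items (e.g., an item duplicated twice) drive α to 1.0, which is why very high values can indicate item redundancy as well as reliability.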

Suitability for EFA was determined using the Kaiser-Meyer-Olkin Measure of Sampling Adequacy, Bartlett’s Test of Sphericity, and measures of communality. Factor extraction was performed using principal axis factoring (for non-parametric data). Eigenvalue analysis and the scree test were used to determine the number of factors retained—factors with eigenvalues of one or more and factors located above the inflection point on the scree plot.8,9 The eigenvalue describes the variance in the items explained by that factor.8 The scree test is a plot of the same eigenvalues on the y-axis and factor number on the x-axis, and can be open to subjective interpretation. Factor rotation was performed using an oblique method (promax), as we believed that the factors would be related to each other. During successive rotations, we removed items that failed to achieve a primary factor loading of at least 0.4 and items that exhibited cross-loadings of 0.3 or above without a strong primary factor loading (defined as ≥ 0.65). Rotations were performed until no items met the criteria for removal. Finally, a global assessment for accuracy of factor structure was performed to determine if the factors could be related to real-world constructs to determine the final MATE item structure.
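
The eigenvalue criterion can be illustrated with a short sketch (hypothetical correlation matrix; the `kaiser_retain` helper is ours, not part of the study’s SPSS workflow). Sorting the same eigenvalues in descending order gives the values plotted in a scree test.

```python
import numpy as np

def kaiser_retain(corr):
    """Count factors retained under the eigenvalue >= 1 (Kaiser) rule."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))
    return int(np.sum(eigenvalues >= 1.0))

# Hypothetical correlation matrix: four items forming two correlated pairs
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])
n_factors = kaiser_retain(R)   # two eigenvalues exceed 1, suggesting two factors
```

For a diagonal (uncorrelated) matrix, every eigenvalue equals 1 and the rule retains all items as separate factors, which is why the scree plot is examined alongside it.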

Scores for each item were added to determine the overall MATE score, out of 198 (33 items × maximum score of 6). This was converted to a percentage score by dividing the total score by 198. For individual domains, total scores for items in each domain were divided by (number of items in that domain × 6). Respondents’ scores for the overall MATE were included only if they provided responses for all items, and for each domain only if they provided responses for all items within that domain. Demographic group comparisons were carried out using the Kruskal-Wallis test (for non-parametric data). If the Kruskal-Wallis test generated a P value ≤ 0.05, Dunn’s non-parametric pairwise comparisons (two-way comparison between groups in each demographic category) with Bonferroni-adjusted significance were carried out.
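
The scoring arithmetic can be sketched as follows (the `mate_percentage` helper and the item responses are illustrative):

```python
def mate_percentage(responses, max_per_item=6):
    """Percentage score for a set of item responses (None = unanswered).

    Returns None if any item is unanswered, mirroring the rule that
    respondents are scored only when they completed every relevant item.
    """
    if any(r is None for r in responses):
        return None
    return 100 * sum(responses) / (len(responses) * max_per_item)

# A respondent scoring 5 on all 33 items: 165 out of the maximum 198
overall = mate_percentage([5] * 33)

# A hypothetical three-item domain scoring 4, 5, 3: 12 out of 18
domain = mate_percentage([4, 5, 3])

# An incomplete respondent is excluded from the overall score
excluded = mate_percentage([5] * 32 + [None])
```

The same helper serves both the overall score and each domain, since both divide the summed responses by the maximum attainable for the items involved.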

Results

Literature review

The literature search yielded 6,820 results. Seventy-three papers were identified for further scrutiny after review of abstracts and 50 papers after bibliographic review, with seven found suitable for inclusion. These were two previously published anesthesia educational environment measures,3,4 three papers on characteristics of good teachers in anesthesia,11,12,13 and two validated instruments for evaluation of anesthesiologists’ supervision of trainees.14,15 Seventy-three discrete items were identified for the initial item list, grouped into four provisional domains.

Modified Delphi process

Thirty-five individuals were approached after being identified as potential “experts” for our panel, with 28 positive replies received. Four did not fulfill the inclusion criteria, either not having completed vocational training in anesthesiology or not possessing a formal qualification in medical education, resulting in a final figure of 24 experts. Response rates for phases 1-4 were 95.8%, 83.3%, 70.8%, and 66.7%, respectively. The demographic makeup of the panel and their response rates are listed in Table 1. Forty-four items achieved a mean score of ≥ 5 and a standard deviation (SD) of ≤ 1, for inclusion in the draft measure to be piloted (Table 2).

Table 1 Expert panel demographics and response rates
Table 2 Final rating of initial items in modified Delphi approach

Pilot survey response

We received 390 responses. Twenty-six responses were excluded, 16 for not scoring any items and ten because of having worked in their department for under eight weeks, leaving 364 responses available for analysis. We could not calculate the actual response rate as we were unable to determine what proportion of contacts forwarded the invitation email to their junior doctors and, for contacts that did so, how many junior doctors worked in those departments.

Exploratory factor analysis

The Kolmogorov-Smirnov test indicated that the pilot survey data were not normally distributed, and non-parametric statistical tests were henceforth applied. Detailed descriptions of the initial reliability analysis, preliminary analysis for suitability, factor extraction, and factor rotations are located in Appendices II and III (available as ESM). These EFA steps identified a further ten redundant items.

Global assessment for accuracy of factor structure showed that the provisional domains proposed in the draft MATE did not completely conform to the extracted factors, except for items in the provisional “Assessment and feedback” domain all loading to factor 1. Nevertheless, items in each factor could be related to real-world constructs, allowing for four distinct domains to be named and conferring construct validity to the MATE. The identified domains were “teaching preparation and practice” (factor 3), “assessment and feedback” (factor 1), “procedures and responsibility” (factor 4), and “overall atmosphere” (factor 2) (see table for rotation 3 in Appendix III; available as ESM). Minor adjustments were made to ensure consistency and avoid duplication under the new structure. The lowest-loading item under factor 2, “my clinical teachers provide appropriate support when I am performing a procedure for the first time”, was moved to factor 4 (“procedures and responsibility”), and the item “The clinical teachers are easily accessible should I require their help” was removed as it was deemed to be very similar to the highest-loading item in that factor, “My clinical teachers are accessible for advice”. The completed factor analysis resulted in the final refined 33-item MATE survey tool (see Appendix IV; available as ESM).

Post hoc reliability

Overall internal consistency of the MATE was excellent (Cronbach’s α = 0.975). Internal consistency for the new domain labels was 0.945 for “teaching preparation and practice”, 0.964 for “assessment and feedback”, 0.833 for “procedures and responsibility”, and 0.936 for “overall atmosphere”. No improvement in reliability could be gained with the deletion of any item for any of the four domains.

MATE scores

The mean (SD) % of the overall MATE score was 74.6 (15.6), with domain scores as follows: “teaching preparation and practice” [66.6 (19.2)], “assessment and feedback” [71.9 (19.0)], “procedures and responsibility” [85.5 (12.8)], and “overall atmosphere” [81.8 (16.2)]. Scores based on demographic background are listed in Table 3. A significant difference in MATE scores between groups was found in the country and age categories. For the former, post hoc testing using Dunn’s non-parametric pairwise comparisons indicated that this was between Canada and Australia (P = 0.013) and Canada and NZ (P = 0.036). Significant differences between these two country pairs were observed in all MATE domains except “overall atmosphere”. For the age category, only the “assessment and feedback” domain showed a significant difference, with junior doctors aged 30 yr and younger rating the domain higher than those aged over 30 yr [mean (SD): 74.8 (18.9) vs 69.7 (18.9); P = 0.003]. Less experienced junior doctors also rated this domain higher than their more experienced counterparts [75.7 (18.5) vs 70.8 (19.1); P = 0.029], although the overall MATE scores were not significantly different (P = 0.094).

Table 3 MATE scores based on demographic background

Discussion

We have described the development of an instrument to measure the educational environment in the operating theatre for anesthesia, utilizing specific techniques at each stage of development to show different aspects of validity. A systematic literature review identified 73 items, reduced to 44 using a modified Delphi approach and further refined to 33 items using EFA. The reliability and distribution of scores in this final instrument are described, with excellent reliability and a successful pilot across different training programs and jurisdictions. The MATE shares only 11/33 items with a similar tool published 14 years ago,3 justifying the development of an updated measure.

Delphi approach

There is no strong evidence for the required number of panel members or response rates. For a homogeneous population (experts from the same discipline), a panel of 15-30 people is recommended.7 While it is generally accepted that higher response rates are better, at least 70% is recommended for each phase.6 Our 24 panel members achieved this in all but the final phase (66.7%). Combined with the systematic review of the literature for initial item generation, the Delphi approach confers content validity to the development of the MATE.

Three items were excluded at the Delphi stage because of a lack of consensus (SD > 1.0) despite achieving the target mean score. These were “I feel responsible and accountable for the care given to my patients”, “There is no discrimination in this post”, and “I am given relief from duties to participate in formal educational programs”. Free-text comments by expert panel members to justify outlying ratings alluded to the lack of direct relevance to in-theatre teaching. The issue of discrimination, along with sexual harassment and bullying, is an important one. We were comfortable with the removal of the aforementioned item as these issues were likely to be addressed by the retained item, “My clinical teachers promote an atmosphere of mutual respect”.

Interpretation of MATE survey tool findings

Thirty respondents (8.8%) submitted a MATE score of < 50%. The measure developed by Holt and Roff showed 2.3% of respondents with an equivalent score of < 50%,3 while a more recent measure encompassing the overall anesthesia clinical learning environment had 3.4% of respondents submitting a (corrected) score of < 50%.4 At the opposite end, 194 MATE respondents (57.1%) rated their educational environment at > 75% compared with 37.6% and 41.1% in the two previous studies.3,4 This increase in both low and high ratings may be attributed to differences in the rating scale, respondents, survey items, survey context, or other aspects of survey design. There is evidence that full labelling of the rating scale, as done in the two compared studies, results in respondents providing more central ratings and fewer ratings at the extreme ends of the scale.16 Almost half of all respondents in our study provided scores in the 50-80% range. The practice of full labelling vs labelling only at the endpoints is a contentious one. An analysis of 13 surveys by Alwin and Krosnick indicated that fully labelled surveys were more reliable compared with endpoint-only labelling (α = 0.783 vs 0.570),17 but this effect was not observed in our survey instrument (α = 0.975). The use of descriptors such as “moderately agree” or “moderately disagree” with full labelling renders the variables as ordinal data, as one is unable to state with certainty that the intervals between the different anchors are equal. Respondents may also interpret differently what it means to “moderately” agree or disagree with a statement. Endpoint-only labelling results in a continuous rating scale that arguably conveys the idea of equal intervals between each point and is no less valid than fully labelled scales.18 Strictly speaking, fully labelled scales are called Likert scales, although the term is frequently used when referring to continuous rating scales.

Based on our preliminary analysis, we propose that the following structure be used to evaluate scores for the MATE and its four domains: 0-50% = poor, 50.1-60% = below average, 60.1-70% = average, 70.1-80% = good, 80.1-90% = very good, and 90.1-100% = excellent. The use of a descriptive evaluation structure confers concrete meaning to the score generated by the measure, allows for ease of interpretation, and provides targets for quality improvement. Table 4 lists respondents’ scores for the MATE and its constituent domains according to this evaluation structure. Interventions aimed at improving teaching and learning should focus on the “teaching preparation and practice” and “assessment and feedback” domains, as these obtained poor or below average evaluations from 33.5% and 23.9% of respondents, respectively.
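
A minimal sketch of the proposed evaluation structure as a lookup (the `evaluate` helper is illustrative):

```python
def evaluate(score_pct):
    """Map a MATE percentage score to the proposed descriptive band."""
    if score_pct <= 50:
        return "poor"
    if score_pct <= 60:
        return "below average"
    if score_pct <= 70:
        return "average"
    if score_pct <= 80:
        return "good"
    if score_pct <= 90:
        return "very good"
    return "excellent"
```

Under this mapping, for example, the overall mean score of 74.6% falls in the “good” band and the “procedures and responsibility” mean of 85.5% in the “very good” band.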

Table 4 MATE and domain scores according to evaluation structure

Application of this evaluation structure requires an adequate sample size. A measure utilizing a four-point Likert scale recently showed adequate reliability with a minimum of eight respondents from a single department.5 For our study, we are unable to state a minimum sample size for individual departments. Conservatively, a sample size of < 10 may not allow for valid interpretation, and results from samples of 10-20 should be interpreted with caution unless accompanied by a small variance. A standard deviation of 15% or less (0.9 on the 0-6 scale) may be sufficiently precise for a sample size of 10-20. Further research would be required to confirm the appropriateness of this evaluation structure and to determine the minimum sample size for individual departments.

In subgroup analyses, a significant difference was observed between some countries. We caution that firm conclusions cannot be drawn from this evidence alone, as sample sizes are insufficient to be representative of any single country and responses are biased towards participating departments; nevertheless, it is an area that merits further investigation. Possible reasons may include differences in teaching culture, vocational training programs, institutional support, trainee expectations, educational resources, or clinical workload. Younger and less experienced trainees rated their experience of “assessment and feedback” significantly higher than their older and more experienced counterparts. One reason for this may be differences (real or perceived) in the quality and volume of feedback delivered, presumably higher in the younger and less experienced group because of their being at a stage of training that requires closer supervision and active teaching. Older and more advanced trainees may also be better equipped to critically rate a department because of their experience.

Exploratory factor analysis

There is no agreement on minimum sample size requirements for EFA, with figures ranging from 100 to 300.8 Others base minimum sample size recommendations on the subject-to-variable (STV) ratio, with minimum ratios ranging from 5 to 10.8,9 A more contemporary view is that the required sample size depends on the strength of the item-factor relationship.8,9,10 For example, if all factors have at least four strong-loading items, the sample size may be irrelevant. Our study defined a strong loading as 0.65 or above, with other authors quoting as low as 0.59 or as high as 0.7.8,10 If there are ten to 12 items with moderate loadings (0.4-0.6), a sample size of at least 150 is required.8 Factors that have few items and moderate-to-low loadings require a minimum sample size of 300.8 EFA produces unreliable and invalid results if performed with an inadequate sample size.9 With 364 valid responses, our data began with an STV ratio of 8.5, increasing to 11.0 after removal of redundant items. The final factor loading matrix (Appendix II; available as ESM) showed very strong item loading for all but one factor (factor 4), which loaded two items strongly (0.751 and 0.806) and two items moderately (0.492 and 0.543).

Conversion to percentage score

Deriving a percentage score from a rating scale is a common method for presenting EEM scores.4,5,19,20,21,22,23,24 The wider and consistent margins inherent in a 0-100 scale facilitate comparison between different measures or the same measure applied at a different time or place. A requirement for converting from a rating scale to a percentage score is factoring a zero point into the conversion if the lowest value in the original scale is not zero. For example, directly converting a 1-5 rating scale to a percentage score is erroneous because the lowest possible mean or median score is 1/5 or 20%, with a resultant 20-100% score. A 1-5 rating scale should therefore be recalculated as a 0-4 scale prior to conversion to generate a 0-100% score. Failure to adjust for this results in inflated percentage scores that are accentuated at the lower end of the scale, as shown in some studies.4,5,19 This inflation effect is worsened with narrower rating scales and produces inaccurate comparisons with other EEM results. One may also argue that fully labelled scales do not lend themselves to percentage conversion as one cannot be confident of equal intervals between rating points.
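
A short sketch of this rebasing step (illustrative helpers, using the 1-5 example above):

```python
def naive_pct(mean_score, scale_max=5):
    """Erroneous direct conversion that ignores a non-zero scale floor."""
    return 100 * mean_score / scale_max

def rebased_pct(mean_score, scale_min=1, scale_max=5):
    """Correct conversion: rebase so the scale floor maps to 0%."""
    return 100 * (mean_score - scale_min) / (scale_max - scale_min)

lowest_naive = naive_pct(1.0)      # 20.0: the floor is inflated to 20%
lowest_rebased = rebased_pct(1.0)  # 0.0: the floor correctly maps to 0%
```

Both conversions agree at the top of the scale (a mean of 5.0 yields 100% either way), while the naive version inflates a mid-scale mean of 3.0 to 60% instead of 50%, illustrating why the inflation is accentuated at the lower end.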


Potential applications of the MATE in the context of anesthesiology training are numerous. In other settings, EEMs have been used to evaluate interventions designed to improve teaching and learning,25 monitor the impact of curricular change,26,27,28 track longitudinal changes over time and between cohorts,23 and compare differences between training locations.24 In a recent review of a generic postgraduate training EEM, 8/9 studies reported significant differences in overall EEM scores for rural vs urban training locations.29

Residency program directors or education coordinators in individual departments may use the MATE as an educational key performance index to address areas of concern as they are identified. There is evidence of correlation between positive EEM scores and improved academic performance. A study of 206 general medical residents in 21 training hospitals showed a positive correlation between EEM scores and performance in the in-training examination.30 A study of dental undergraduates showed correlation between scores in the perception of learning domain and higher grades, while low scores in three domains were associated with failing grades.31 One study with medical undergraduates showed no differences in overall scores but correlation between high scores in selected domains and superior academic performance.32 Nevertheless, a survey of nursing undergraduates in one institution showed no correlation between EEM scores and academic performance.33 A large study of 1,350 medical students from 22 medical schools showed a positive correlation between overall EEM scores and levels of resilience.34

Bodies responsible for accreditation of vocational training could identify outliers among institutions, learning from well-performing departments and providing assistance or remediation measures for poorly performing ones. While face-to-face interviews with trainees during site visits provide invaluable information for training accreditation decisions, the MATE allows for a more feasible and objective assessment of the educational environment. The measure allows input from all trainees, an impractical task with individual face-to-face interviews. Accreditation bodies may identify potential problems earlier and enable targeted enquiry during site visits. Training institutions, regions, or countries that report uniformly average or less-than-average scores may seek to investigate the underlying reasons.

Limitations

The primary limitation of this study is the inability to determine an accurate response rate due to the method by which the survey was distributed. Certain subgroup comparisons do not allow for firm conclusions because of a lack of a representative sample. Nevertheless, this approach allowed us to obtain a much larger sample size than previously published anesthesia EEMs.3,4 It also allowed for sampling of a heterogeneous population, with the implication that the results are likely to be generalizable to different training regions and systems. Future work comparing differences between training regions or countries should be designed to ensure representative sampling. Our reliability analysis could have been supplemented with a multivariate generalizability analysis to identify and assess the effects of various possible sources of error.

Future research

We invite educational supervisors to utilize the MATE on an ongoing basis. We offer to share (at no cost) a customized electronic survey and subsequent results analysis to any department that wishes to administer the measure. The complete measure is included in Appendix IV (available as ESM). As responses are generated, we aim to add these to a central database with the consent of participating institutions, maintaining anonymity of departments and individual respondents. This will enable participating institutions to compare their results with mean and median scores in their region or country and to track temporal changes or effects of interventions. Using this database, future studies may focus on differences between training regions/countries, changes over time, confirmation of the proposed evaluation structure through qualitative analysis, and determination of minimum sample size for individual departments. Further factor analyses on a new population should be performed to reconfirm the underlying structure. Multivariate generalizability analysis on future samples would identify and assess possible sources of error and determine a minimal sample size for individual departments.

Conclusion

The MATE is potentially a valid and reliable tool to measure the educational environment in the operating theatre, specific to anesthesia. It can be used by individual institutions or vocational training bodies as a key performance index in education or to evaluate effects of interventions in teaching and learning. Further research is required to investigate differences in training countries and possible underlying factors. The authors will maintain a database of responses, preserving the anonymity of respondents and their institutions. Educational supervisors and researchers are invited to administer the measure and collaborate with the authors to enable further investigation in this area.