Rating the quality of teamwork—a comparison of novice and expert ratings using the Team Emergency Assessment Measure (TEAM) in simulated emergencies
Training in teamwork behaviour improves technical resuscitation performance. However, its effect on patient outcome is less clear, partly because teamwork behaviour is difficult to measure. Furthermore, it is unknown who should evaluate it. In clinical practice, experts are obliged to participate in resuscitation efforts and are thus unavailable to assess teamwork quality. Consequently, we sought to determine if raters with little clinical experience and experts provide comparable evaluations of teamwork behaviour.
Novice and expert raters judged teamwork behaviour during 6 emergency medicine simulations using the Team Emergency Assessment Measure (TEAM). Ratings of both groups were analysed descriptively and compared with Mann–Whitney U tests and t tests. We used a mixed effects model to identify the proportion of variance in TEAM scores attributable to rater status and other sources.
Twelve raters evaluated 7 teams rotating through 6 cases, for a total of 84 observations. We found no significant difference between expert and novice ratings for 7 of the 11 items of the TEAM or in the sums of all item scores. Novices rated teamwork behaviour higher on 4 items and overall. Rater status accounted for 11.1% of the total variance in scores.
Experts’ and novices’ ratings were similarly distributed, implying that raters with limited experience can provide reliable data on teamwork behaviour. Novices show a consistent, but slightly more lenient rating behaviour. Clinical studies and real-life teams may thus employ novices using a structured observational tool such as TEAM to inform their performance review and improvement.
Keywords: Teamwork; Non-technical skills; Expert rater; Novice rater; Assessment; Simulation; Resuscitation; Emergency
Abbreviations
GRS: Global rating scale
ICC: Intraclass correlation coefficient
OSCE: Objective structured clinical examination
PCA: Principal component analysis
ROSC: Return of spontaneous circulation
TEAM: Team Emergency Assessment Measure
TRAPD: Translation, review, adjudication, pre-testing, documentation
Medical response to high-urgency situations such as cardiac arrest remains an area for improvement. Depending on their initial rhythm, only around 25% of patients with out-of-hospital cardiac arrest achieve a return of spontaneous circulation (ROSC) and overall survival to discharge lies around 10% [1, 2]. Survival of patients with in-hospital cardiac arrest is higher but still only ranges between 18 and 44% [3, 4].
Besides technical skills such as providing an adequate compression rate, working effectively together in a team is connected to patient outcome in high-urgency patients; therefore, training in teamwork behaviour¹ has the potential to improve survival rates [6, 7]. For example, different studies have shown that training in communication and leadership skills in emergency response teams leads to improved ROSC and survival rates [8, 9, 10]. Findings from experimental investigations suggest that improved team communication and leadership result in a significant reduction of no-flow time and better chest compressions in simulated resuscitations. Further, working together in teams can improve diagnostic accuracy in emergency medicine [11, 12] as well as the quality of care compared to individual performance.
However, what exactly constitutes good teamwork behaviour depends on the task and the role of each team member. Generic rules such as “always practice closed-loop communication” are misleading. For example, one study demonstrated that closed-loop communication initiated by the team leader was associated with a shorter time until the correct diagnosis in an emergency trauma case was made, whereas the same communication pattern delayed the decision significantly if initiated by team members. Also, directive leadership behaviour improved technical performance at the beginning of a resuscitation, whereas in later phases, structuring inquiry (e.g., “What do we know about the patient?”) was associated with improved technical performance. These findings show the need to collect more data on teamwork, investigate specific individual and team behaviours, and take differences in task requirements into account. For this, we need valid and reliable tools with known properties that are feasible to use in real-world settings.
In addition, evidence of improvements in patient outcomes as a result of teamwork interventions is limited to a few small studies, many conducted in simulated emergencies [6, 7, 9, 10, 14]. Fung and colleagues suggested that the lack of an objective measurement of team performance is one reason for this paucity of data. While, for example, chest compression rate and depth can nowadays be tracked and technical solutions help to document resuscitations more precisely, teamwork behaviour is not easy to measure, especially in real-life situations. Such information is not only relevant for research but also necessary to inform debriefings after resuscitation. Consequently, different tools have been developed to assess individuals’ non-technical skills as well as teamwork behaviour. Some of these tools are designed for a specific context, such as the anaesthetists’ non-technical skills behavioural marker system (ANTS) or the observational teamwork assessment for surgery (OTAS) [20, 21, 22]; others are intended to be more generic and independent of context, such as the Ottawa Crisis Resource Management Global Rating Scale.
One tool that has been used in both real-life emergency situations and simulated emergency training is the Team Emergency Assessment Measure (TEAM) [24, 25, 26, 27]. The TEAM was designed for emergency teams and is used in particular to assess teamwork, leadership, and task management in high-urgency situations such as resuscitation [24, 28]. Since its development in 2010, TEAM has been translated into French, Hebrew, and Chinese (available via www.medicalemergencyteam.com) and has been used in real-life resuscitations [27, 28] and simulated environments (in centre and in situ) [24, 29, 30, 31, 32], observing teams of medical and nursing students [24, 31] and of nurses and physicians [25, 27, 30, 32], and comparing teams with different levels of expertise (see Additional file 1). A recent review showed that it has good psychometric properties, in contrast to most other tools for assessing teamwork. In summary, the TEAM has been used in several clinical and simulation-based studies with comparable outcomes (see Additional file 1) and is the most appropriate and valid tool for evaluating teamwork in emergency teams.
While some of the tools meant to quantify non-technical skills and teamwork are intended to be used as self-assessments by practitioners and trainees alike (such as the Mayo High Performance Teamwork Scale), all of the above were designed for raters external to the team they observe [19, 22, 23, 24]. Selecting raters to use such instruments is as important as having a suitable tool, yet empirical evidence is lacking concerning who should or can assess teamwork behaviour in real or simulated emergencies. During training, it is usually the task of expert raters to assess and debrief participants [34, 35]. Until now, most studies using TEAM have employed expert raters; in two cases TEAM was used as a self-rating instrument for experienced team members because logistical constraints did not allow the recruitment of external observers [25, 27]. In practice, it might be even harder to find raters with high clinical expertise to observe resuscitations because of their high workload. Such an approach would also raise ethical problems, especially because expert raters would have broad knowledge of teamwork and emergency medicine (making them experts in this area) but would be restricted to observing. A possible solution for this methodological, ethical, and organisational dilemma could be the use of less clinically experienced raters, such as residents [36, 37].
We therefore compared novices with expert raters, as these two groups represent the widest difference in clinically relevant qualifications. Both types of raters evaluated teamwork behaviour in an extensive emergency simulation using TEAM. Equivalent ratings from the two rater groups would justify the use of less experienced raters, such as residents, in the workplace as well.
Description and translation of TEAM
TEAM consists of 11 items measuring the teamwork behaviour of medical teams dealing with critical situations. The tool consists of 3 subscales: leadership (2 items), teamwork (7 items), and task management (2 items); all items are rated on a Likert scale of 0 (never/hardly ever) to 4 (always/nearly always). A sum score with a possible range of 0 to 44 can be calculated. Furthermore, overall performance is rated on a global rating scale (GRS) of 1 to 10.
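The scoring rules just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the published instrument: the function name `score_team`, the `TeamScore` container, and in particular the assignment of item numbers to subscales (items 1–2, 3–9, and 10–11) are assumptions, since the text above states only how many items each subscale contains.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# ASSUMPTION: item numbers per subscale (1-based). The article gives only the
# item counts (2, 7, 2); this particular ordering is illustrative.
SUBSCALES: Dict[str, Tuple[int, ...]] = {
    "leadership": (1, 2),
    "teamwork": (3, 4, 5, 6, 7, 8, 9),
    "task_management": (10, 11),
}


@dataclass(frozen=True)
class TeamScore:
    subscale_sums: Dict[str, int]
    sum_score: int  # possible range 0 to 44
    grs: int        # global rating scale, 1 to 10


def score_team(items: Sequence[int], grs: int) -> TeamScore:
    """Validate one rater's TEAM form and compute subscale sums and sum score."""
    if len(items) != 11:
        raise ValueError("TEAM has exactly 11 items")
    if any(not 0 <= value <= 4 for value in items):
        raise ValueError("each item must be rated from 0 to 4")
    if not 1 <= grs <= 10:
        raise ValueError("the global rating must be from 1 to 10")
    subscale_sums = {
        name: sum(items[number - 1] for number in numbers)
        for name, numbers in SUBSCALES.items()
    }
    return TeamScore(subscale_sums, sum(items), grs)


if __name__ == "__main__":
    # Hypothetical ratings, invented purely for illustration.
    print(score_team([3, 3, 2, 2, 2, 2, 2, 2, 2, 4, 4], grs=7))
```

Keeping the GRS separate from the sum score mirrors the description above, where the global rating is an additional judgement rather than a twelfth item.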
Although a French version exists that confirmed the excellent psychometric properties of the original English version, a German version of TEAM is currently lacking. Addressing this gap, our research team has translated TEAM into German using the TRAPD (translation, review, adjudication, pre-testing, and documentation) methodology. A pre-study was conducted to check feasibility and inter-rater reliability and showed excellent results.
The study was conducted at Charité Universitätsmedizin Berlin during an emergency medicine simulation for final year medical students. During this simulation, the participants acted in teams of 5 and rotated through 6 cases (duration about 30 min each; see Additional file 2: Table S2 for details), in which they had to deal with common emergencies including 1 resuscitation. These cases were realized using simulated patients and high-fidelity simulation. For every case, 1 participant was declared team leader; leadership changed after every case.
Two groups of raters, one of novices and one of content experts, evaluated participants’ teamwork behaviour throughout each case. For the novice raters, we recruited tutors from the local skills lab. They were advanced medical students with emergency medicine experience through clinical electives and/or work experience as paramedics. Expert raters were physicians and psychologists with broad experience in emergency medicine and/or expertise in rating and teaching teamwork during simulation-based education.
Before using TEAM to rate the teams’ performances, all raters participated in a rater training, which included an introduction to TEAM as a rating instrument, information about common rating errors, and a frame-of-reference training, where videotaped examples of teamwork were rated and discussed. Novice and expert raters received the same training (same length, content, etc.). For organisational reasons, they were trained on two separate occasions. Neither the experts nor the novices had any previous experience with the TEAM as a rating instrument.
Data were analysed using SPSS 24 (Armonk, NY: IBM Corp.) and R, version 3.4.4. Different descriptive measures were computed separately for the ratings given by novice and expert raters. To analyse the measurement properties of the German version of TEAM, we calculated its reliability (Cronbach’s α), the item-total-score correlation, and the correlation of all items plus the sum score with the GRS. As a measure of construct validity, we conducted a principal component analysis (PCA). In a PCA, the objective is to analyse the structure of a data set and to combine a number of observed variables into one factor. We used PCA to check if the items of the German TEAM could be combined into one general component, as was shown for the original version [24, 25]. All results were compared to other studies using TEAM.
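To make the reliability measure concrete, the following is a minimal sketch of Cronbach’s α computed from a table of item ratings, using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the sum score). It is not the authors’ SPSS or R code; the function name `cronbach_alpha` and the example data are hypothetical.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(ratings: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for ratings[i][j] = score of observation i on item j.

    Uses sample variances (n - 1 in the denominator) throughout.
    """
    n_items = len(ratings[0])
    if n_items < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 items and 2 observations")
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every observation must rate every item")
    # Variance of each item across observations.
    item_variances = [
        variance([row[j] for row in ratings]) for j in range(n_items)
    ]
    # Variance of the sum score across observations.
    total_variance = variance([sum(row) for row in ratings])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical ratings of 3 observations on 2 items (illustration only).
    example = [[1, 2], [2, 1], [3, 3]]
    print(cronbach_alpha(example))
```

For the 11-item TEAM, each row would hold one rater’s 11 item scores for one observed performance.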
Inter-rater reliability between novice and expert raters was calculated (using the intraclass correlation coefficient, ICC) to explore the agreement between these 2 groups. Additionally, their ratings were compared using Mann–Whitney U tests (for the 11 single items) and t tests (for the sum score and GRS).
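For readers unfamiliar with the item-level comparison, the Mann–Whitney U statistic can be sketched directly from its definition: count, over all pairs of one novice and one expert rating, how often the first exceeds the second, with ties counted as one half. The sketch below computes only the statistic, not the p value, and the function name and data are hypothetical.

```python
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U_x, U_y) for two independent samples.

    U_x counts the pairs (a, b), a from x and b from y, in which a > b;
    tied pairs contribute 0.5. U_x + U_y always equals len(x) * len(y).
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    u_x = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


if __name__ == "__main__":
    # Hypothetical ratings on one 0-4 item (illustration only).
    novice_ratings = [3, 4, 4]
    expert_ratings = [1, 2, 4]
    print(mann_whitney_u(novice_ratings, expert_ratings))
```

Because single TEAM items take only five ordinal values, ties are frequent, which is why the tie handling is written out explicitly here.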
We used a mixed effects model to identify the sources of variance in TEAM’s global rating scale. Mixed effects models are an extension of the ordinary linear regression model that allow for estimating one or more variance components (i.e., random effects) in addition to the residual variance term. In this study, we estimated variance components for teams, raters, rater status (novice or expert), cases, and their first-order interactions.
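In generic notation, a variance components model of this kind can be written as follows. This is a sketch under stated assumptions rather than the exact specification fitted in the study: the index k runs over the estimated random effects (teams, raters, rater status, cases, and their first-order interactions), and the symbols are introduced here for illustration only.

```latex
% y_i: the i-th rating; \mu: grand mean;
% u_{k[i]}: level of random effect k that applies to rating i.
y_i = \mu + \sum_{k=1}^{K} u_{k[i]} + \varepsilon_i ,
\qquad u_k \sim \mathcal{N}(0, \sigma_k^2),
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% Percentage of total variance attributable to source k:
\mathrm{share}_k = 100 \times
  \frac{\sigma_k^2}{\sum_{j=1}^{K} \sigma_j^2 + \sigma_\varepsilon^2}
```

Under this reading, a statement such as "rater status accounted for 11.1% of the total variance" refers to the share of the corresponding variance component in the sum of all components plus the residual.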
Characteristics of the novice and expert raters
Experts: 5 medical doctors, 1 psychologist; 3.5 to 10 years of teaching experience (clinical teaching, simulation-based education, faculty development)
Novices: 1–2.5 years of teaching experience (student-assisted learning); clinical experience through internships (up to 120 days)
[Table: Measurement properties of the German translated version of TEAM. Columns: German TEAM expert rating; German TEAM novice rating. Rows include inter-item correlation (Spearman’s rho) and inter-rater reliability (ICC).]
Combination of TEAM items into a general component
We conducted the PCA to examine the degree to which the individual TEAM items could be combined into a general component. Prior to conducting the PCA, the adequacy of the observed correlation matrix was evaluated using three related statistical criteria. First, the inter-item correlations ranged from r = 0.29 to 0.73 for experts and from r = 0.42 to 0.75 for novices. Second, the Kaiser–Meyer–Olkin (KMO) criterion summarizes the extent to which the observed variables share common variance and thus might be combined into a single factor. The KMO was 0.87 for both expert and novice ratings and therefore exceeded the commonly recommended cut-off of 0.6. Third, the Bartlett test of sphericity was statistically significant (p < 0.001) for both experts and novices, suggesting that the correlation matrix differs from an identity matrix (that is, a correlation matrix in which only the diagonal entries are of substantial magnitude).
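As an illustration of the third criterion, Bartlett’s test statistic can be computed from a correlation matrix R of p items and n observations as χ² = −(n − 1 − (2p + 5)/6) · ln|R| with p(p − 1)/2 degrees of freedom. The sketch below uses only the standard library; the helper names are hypothetical, the example matrix is invented, and the p value lookup is left out.

```python
import math
from typing import List, Sequence, Tuple


def determinant(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a: List[List[float]] = [[float(v) for v in row] for row in matrix]
    size = len(a)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
    return det


def bartlett_sphericity(
    corr: Sequence[Sequence[float]], n_obs: int
) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    p = len(corr)
    det = determinant(corr)
    if det <= 0:
        raise ValueError("correlation matrix must be positive definite")
    chi_square = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(det)
    return chi_square, p * (p - 1) // 2


if __name__ == "__main__":
    # Hypothetical 2-item correlation matrix, 84 observations (illustration only).
    print(bartlett_sphericity([[1.0, 0.5], [0.5, 1.0]], n_obs=84))
```

For an identity matrix the determinant is 1 and the statistic is 0, which is the null hypothesis the test evaluates; with the 11 TEAM items the test has 55 degrees of freedom.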
Taken together, the items in the TEAM were sufficiently inter-related to conduct a PCA. The PCA was, again, conducted independently for novice and expert raters. Results were largely comparable: for both experts and novices, a dominant first component was found, which explained 59% and 65% of the observed variance, respectively.
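The share of variance explained by the first component equals the largest eigenvalue of the correlation matrix divided by the number of items. A minimal sketch of this calculation, assuming power iteration as the eigenvalue method (the statistical packages used in the study rely on full eigendecompositions), could look as follows; the function name and the example matrix are hypothetical.

```python
import math
from typing import Sequence


def first_component_share(
    corr: Sequence[Sequence[float]], iterations: int = 1000, tol: float = 1e-12
) -> float:
    """Proportion of variance explained by the first principal component.

    Power iteration from a uniform start vector; this suits correlation
    matrices with predominantly positive entries, where the first component
    loads on all items in the same direction.
    """
    p = len(corr)
    vector = [1.0 / math.sqrt(p)] * p  # unit-length start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [
            sum(corr[i][j] * vector[j] for j in range(p)) for i in range(p)
        ]
        # Rayleigh quotient of the current unit vector.
        estimate = sum(v * w for v, w in zip(vector, product))
        norm = math.sqrt(sum(w * w for w in product))
        vector = [w / norm for w in product]
        converged = abs(estimate - eigenvalue) < tol
        eigenvalue = estimate
        if converged:
            break
    return eigenvalue / p


if __name__ == "__main__":
    # Hypothetical 3-item matrix with all correlations 0.4 (illustration only).
    corr = [[1.0, 0.4, 0.4], [0.4, 1.0, 0.4], [0.4, 0.4, 1.0]]
    print(first_component_share(corr))
```

With equal correlations r among p items, the largest eigenvalue is 1 + (p − 1)r, so the example matrix yields a share of 1.8/3.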
Agreement between expert- and novice-based ratings
We calculated the inter-rater reliability between novice and expert raters based on the sum scores of TEAM and found an intraclass correlation of ICC = 0.66 (considered moderate to good). This resemblance between the ratings is also reflected by the finding that expert and novice raters agreed by and large on the worst and best performing teams for a given case. That is, ratings of experts and novices were consistent in 75% of cases when comparing which teams received the 2 highest and the 2 lowest scores for each case. Furthermore, ratings of experts and novices were compared on the item level using U tests. On 7 of 11 items, no statistically significant difference was found (items 1, 4, 5, 6, 7, 9, 11; p = .06–.86). However, on 4 of 11 items novices rated teamwork behaviour higher than experts on average (items 2, 3, 8, 10; p = .004–.04). Across cases, we found no statistically significant difference between the TEAM sum scores for experts and novices (novices: M = 30.4, SD = 8.6; experts: M = 27.0, SD = 8.4; t(82) = 1.8, p = .08). Finally, for the GRS, we found that novices (M = 7.1, SD = 1.6) gave generally higher ratings than experts (M = 6.1, SD = 1.9). This difference was statistically significant with t(82) = 2.5 and p = .02.
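The reported t statistic for the sum scores can be approximately reproduced from the summary statistics alone. The sketch below assumes a pooled-variance (Student) t test and 42 observations per rater group; both are assumptions, the latter inferred from the 82 degrees of freedom and the 84 observations in total. The function name is hypothetical.

```python
import math
from typing import Tuple


def pooled_t(
    mean1: float, sd1: float, n1: int, mean2: float, sd2: float, n2: int
) -> Tuple[float, int]:
    """Two-sample t statistic with pooled variance, from summary statistics."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


if __name__ == "__main__":
    # TEAM sum scores as reported above; group sizes of 42 are an assumption.
    t_value, df = pooled_t(30.4, 8.6, 42, 27.0, 8.4, 42)
    print(f"t({df}) = {t_value:.1f}")  # prints "t(82) = 1.8"
```

Because the published means and standard deviations are rounded, a recomputation of this kind can match the reported statistics only approximately.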
Sources of variation of TEAM scores across stations
[Table: Variance components and percentage of variance for TEAM scores. Columns: source of variance; percentage of variance. Rows include Case × Team and Rater Status × Case.]
The aim of this study was to compare the rating behaviour of novices and experts using the previously established TEAM instrument. The idea of using novices to assess practical skills is not new, though we could find only one study that examined novices evaluating teamwork behaviour. Sevdalis and colleagues compared the ratings of an expert/expert pair to a novice/expert pair assessing surgical teamwork to analyse the construct validity of the OTAS tool and found relevant differences between expert and novice ratings on almost all items. It is important to note, though, that in that study the terms expert and novice referred to experience in using the tool, and that both participating experts and the novice had backgrounds in psychology/human factors and were experienced in observing and rating behaviour. The present study, in contrast, defines experts and novices in terms of their content knowledge about teamwork and their practical experience. None of our raters had used TEAM before and they all received a rater training before the simulation.
When focussing on novice raters as raters who are new to or rather inexperienced in a certain area, the literature is generally in favour of novices (even students) being able to assess their peers, although the similarity to expert ratings depends on what skill is assessed and how [47, 48, 49]. A recent review on peer assessment in objective structured clinical examinations (OSCE) showed that students awarded consistently higher ratings to their peers than experts when using GRS. Our study shows similar results when comparing the GRS scores, as novices rated the team behaviour on average 1 point higher than experts did (scale: 1–10); on some single items, novices rated significantly higher than experts, whereas in the majority of cases, including the sum score of all 11 items, there was no difference. In this context it is important to note the large positive correlation of the sum scores of experts and novices as well as their consistent ratings of the best and worst performances, which justify the use of novices as raters. Novice raters’ tendency to give better ratings might be explained by a lower standard against which they compared their peers. From the experts’ point of view, it seems plausible that experts are more aware of potentially serious consequences of bad teamwork because of their work experience and therefore rated more strictly [51, 52]. The moderate ICC of 0.66 is connected to this discrepancy between experts and novices. The 2 rater groups seem to have had different baselines, although all raters underwent the same training and anchoring process. The results of the z standardization of GRS and TEAM sum scores support this interpretation. When each rater group’s scores were transformed to have a mean of 0 and a standard deviation of 1, their ratings showed very similar distribution patterns (similar range/interquartile range).
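The z standardization mentioned above can be sketched as follows. The example illustrates why it supports the idea of different baselines: if one group’s ratings equal the other’s plus a constant, the standardized scores coincide. The function name and the data are hypothetical, and the use of the sample standard deviation is an assumption, as the text does not specify it.

```python
from statistics import mean, stdev
from typing import List, Sequence


def z_standardize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to a mean of 0 and a (sample) standard deviation of 1."""
    centre = mean(scores)
    spread = stdev(scores)
    if spread == 0:
        raise ValueError("scores must not all be identical")
    return [(score - centre) / spread for score in scores]


if __name__ == "__main__":
    # Hypothetical GRS ratings; the second group is uniformly 1 point higher.
    strict_group = [5, 6, 7]
    lenient_group = [6, 7, 8]
    print(z_standardize(strict_group))
    print(z_standardize(lenient_group))
```

A constant offset between groups disappears after standardization, so any remaining difference in the standardized distributions would point to more than a shifted baseline.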
Unexpectedly, the teams themselves were only a very small source of the variance in performance scores (3%) and the interaction of team and case was by far the biggest source of variance (43%). In other words, a team’s performance varied considerably between the different cases and there were no superb or incapable teams per se. Importantly, since team leadership changed across cases, the 2 components (team leader and case) are confounded and thus cannot be disentangled statistically. Therefore, it is not clear whether variation in performances across cases is attributable to team leadership or the specific task. Still, our results suggest that a team’s performance depends to a considerable extent on the specifics of the situation. This finding has several implications. Firstly, it suggests that the recurrent finding of context specificity in clinical decision making of the individual is also relevant at the team level. Secondly, this further emphasises the importance of a close investigation of what teamwork behaviour by whom is beneficial in exactly what situation, as opposed to generic rules meant to characterize ‘good teamwork’. Future training should abandon statements such as ‘practice closed-loop communication’ in favour of advice such as ‘During the first minutes of cardiopulmonary resuscitation (CPR), closed-loop communication initiated by the directive team leader is beneficial for CPR quality’ [6, 14]. Thirdly, TEAM scores should not be compared across different cases. The absence of clear benchmarks and the uncertain connection of TEAM scores and objective criteria remain problems when rating teams [25, 27].
As a beneficial side effect of our study, we validated the German version of TEAM, which is now available for clinical use (Additional file 5: Figure S1). Psychometric properties were comparable to those of the English original [24, 25, 27, 30, 31, 54] and the French translation. The internal consistency for both novice and expert ratings was very high, the inter-rater reliability can be considered moderate, and the PCA confirmed 1 underlying component.
This study has several limitations. Firstly, it was a single-centre study with a small sample size. Although our number of observations (84) is similar to or even higher than in other studies using TEAM, our results are based on the ratings of 6 novice and 6 expert raters and each scenario was only observed by 2 of those 12 raters. Secondly, this study took place in a simulation setting that included different cases and changing team structure. Thirdly, our raters only observed monoprofessional teams, consisting of final year medical students. As our study is one of the first to use TEAM outside of typical resuscitation scenarios, more research is needed to decide how suitable TEAM is for rating teamwork behaviour in situations other than CPR and how to set performance benchmarks.
Teamwork behaviour can be assessed with TEAM by novices just as well as by clinically experienced raters, though novices tend to rate slightly more leniently than experts do. Further research is needed on the comparability of TEAM scores across different cases. The German TEAM is a reliable and valid tool to assess teamwork performance that closes a gap in measuring teamwork behaviour in German-speaking countries.
¹ In this study, we use the term teamwork behaviour to highlight that we treat non-technical skills such as communication and leadership skills at the team level as a kind of ‘collective’ non-technical skill; we did not evaluate team members individually.
The authors would like to acknowledge Simon Cooper for his help in translating TEAM and Hanno Heuzeroth, David Steinbart and Dorothea Eisenmann for their support in conducting the prestudy. Furthermore, we thank all participants and raters for participating in this study. Our manuscript was revised by Anita Todd, a language editor who was paid by the Max Planck Institute for Human Development, Berlin, the host institution of the last author.
JF and FS are funded by the German Federal Ministry of Research and Education (BMBF). The sponsor did not interfere with the conception and conduct of the study, data analysis, or production of the manuscript.
Availability of data and materials
All data generated and analysed during this study are included in Additional file 6.
JF and FS designed the study, analysed and interpreted the data, and drafted the manuscript. SS contributed to data analysis and interpretation and helped to revise the manuscript. WEH and JEK contributed to the design of the study, the interpretation of the findings and revised the manuscript. All authors have read and approved of the final version of this manuscript.
Ethics approval and consent to participate
The ethics committee (EA2/172/16) and the institutional office for data protection (AZ 737/16) at Charité Universitätsmedizin approved the study. All participants and raters consented orally and in written form.
Consent for publication
Not applicable, since the manuscript does not include any individual person’s data.
WEH received financial compensation for educational consultancy from the AO Foundation, Zurich, Switzerland and research funding from Mundipharma Medical, Basel, Switzerland. All other authors report no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
16. Bobrow BJ, Vadeboncoeur TF, Stolz U, Silver AE, Tobin JM, Crawford SA, et al. The influence of scenario-based training and real-time audiovisual feedback on out-of-hospital cardiopulmonary resuscitation quality and survival from out-of-hospital cardiac arrest. Ann Emerg Med. 2013;62:47–56.e1.
23. Kim J, Neilipovitz D, Cardinal P, Chiu M, Clinch J. A pilot study using high-fidelity simulation to formally evaluate performance in the resuscitation of critically ill patients: the University of Ottawa critical care medicine, high-fidelity simulation, and crisis resource management I study. Crit Care Med. 2006;34:2167–74.
28. Cooper S. Teamwork: what should we measure and how should we measure it? Int Emerg Nurs. 2017;32:1–2.
31. Bogossian F, Cooper S, Cant R, Beauchamp A, Porter J, Kain V, et al. Undergraduate nursing students’ performance in recognising and responding to sudden patient deterioration in high psychological fidelity simulated environments: an Australian multi-centre study. Nurse Educ Today. 2014;34:691–6.
38. Dorer B. Round 6 translation guidelines. Mannheim: European Social Survey, GESIS; 2012.
42. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2017.
48. Falchikov N. Improving assessment through student involvement: practical solutions for aiding learning in higher and further education. New York: Routledge; 2005.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.