To study the correlation between SAR and BOS ratings, a controlled team performance quasi-experiment was set up with eight virtual teams, each exposed to a series of tasks that were assessed using both SAR and BOS ratings. The tasks required communication and collaboration via online collaboration tools. With eight teams conducting six separate challenges each, a total of 48 challenge sessions were recorded. Two of the recordings had to be discarded due to methodological errors, leaving 46 usable session recordings, each with three pairs of SAR–BOS data, for a total of 138 pairs of individual teamwork assessments in the dataset.
Each virtual team consisted of one coordinator, interacting with one nearby casualty, and two medical experts assisting the coordinator remotely through online collaboration tools (see Fig. 1). At each of the three locations, an observer was colocated with the local team member. The observers monitored and assessed the performance of the team based on observations of the single individual located at their respective site and on the interactions between that individual and the rest of the team.
All six challenges related to chemical, biological, radiological, and nuclear (CBRN) incidents. The challenges were paired into three scenarios of two parts each: first a diagnosis challenge and then a treatment challenge. The three scenarios were given to the teams in permuted order. The scenarios were designed to be similar in an attempt to reduce any impact that scenario ordering might have on the results.
All three scenarios were inspired by actual CBRN incidents, with one case of radiation poisoning, one of cyanide poisoning, and one of nerve gas poisoning. The scenarios were developed to give a sense of realism, with scripted vital readings extracted from real medical cases or studies. In every scenario, the coordinator encountered a casualty of an unknown chemical or radiological incident. The casualty was role-played by the study controllers at the Naval Postgraduate School site, who fed the coordinator scripted readings in response to particular examinations. As the coordinator lacked the medical expertise to diagnose and treat the victim, assistance was given by the two remotely located medical experts. To enforce a need for collaboration, the two medical experts were given customized handbooks at the start of the study, each containing partial information on how to diagnose and treat twelve different CBRN cases. The two handbooks contained some overlapping, and some unique, information. The handbooks were designed so that the only way for the team to successfully diagnose and treat the patient was to combine information from both handbooks, based on the coordinator's examination reports. The teams were restricted to the provided communication tools: an online video conferencing service and a textual chat service.
As the coordinator was the only person able to see and interact with the casualty, the experts communicated instructions for acquiring the specific information they needed to diagnose the patient, e.g., blood pressure, oxygen saturation, pulse, respiratory rate, and pupil dilation. When the experts had acquired enough information, or after a maximum of 6 min, the team had an additional 2 min to discuss and agree on a single diagnosis. Before time ran out, the teams provided their primary diagnosis and up to two alternative hypotheses that they had not been able to rule out.
After completion of the diagnosis questionnaire, the correct diagnosis was revealed, giving all teams the same ability to proceed based on the correct diagnosis regardless of the outcome of the preceding challenge. Thereafter, the treatment challenge began. The procedure for the treatment challenge was similar to that of the diagnosis: the medical experts had incomplete sections of the treatment manual and needed to combine their knowledge to instruct the coordinator to conduct the right steps to treat the casualty. The number of tasks the team had to complete varied between six and nine, depending on the diagnosis.
For each challenge, the time to completion (or failure to complete) was registered, as well as the outcome of the tasks. These results constitute an objective metric of the teams' task performance. The objective scores were used for cross-referencing and correlation analysis with team performance, as a positive task outcome was expected to correlate with good team performance.
The study was designed to ensure ecological validity through medical accuracy and realistic scenarios. The virtual teams consisted of one forward agent (the coordinator) and multiple remote reachback experts (the medical experts). The setup thus resembles the US Navy battlefield medical scenarios where remote collaboration and distributed command and control have been investigated (e.g., Bordetsky and Netzer 2010). Similar remote just-in-time expertise has been proposed as a potential life-saver in the emergency management domain, exemplified, e.g., by the Tokyo subway attack in 1995, in which a remote physician was able to identify that the perceived bomb attack was actually a chemical attack and could consequently prevent further disaster by informing the emergency services before they entered the area (MacKenzie et al. 2007).
Eight teams of three members were selected from student volunteers. For the coordinator role, US military students at the Naval Postgraduate School (NPS) in Monterey, California, with documented leadership skills and experience from military teamwork in the Navy, Army, and Marines, were selected. Two of these eight students were female, and six were male. Two NPS students reported previous experience from activities relating to CBRN incidents. The age range of these students was 25–40 years. The reachback roles were staffed by senior nursing students at Hartnell College in Salinas, California. This group of students was mixed gender, and they ranged between the ages of 20 and 30. As senior students in a relatively small college class, they did have prior experience of working together. None of the 16 nursing students reported any prior experience or knowledge of CBRN diagnosis and treatment.
The observer roles at the reachback sites were staffed by one nursing dean and two nursing instructors from Hartnell College, all female and with extensive experience in assessing nursing teamwork. Since they had several years of documented experience from teaching and grading students in nursing tasks involving teamwork, they were deemed well-equipped to assess the teams' performances in this study. At the Hartnell site, two of the three available observers were actively involved in each challenge. At the Naval Postgraduate School site, one observer was used for the duration of the study. This observer was a male graduate student at the NPS Information Sciences department with a special interest in studying military team behavior and a high proficiency in using online collaboration tools for virtual teams.
Four main data sources were used during the study: (1) team members’ self-assessments, (2) observers’ ratings, (3) communication recordings, and (4) outcomes-based task score.
The self-assessments were conducted as a 16-question survey after each challenge, in which the team members used five-point Likert scales to rate their own interactions with the team and the task. The same 16 questions were repeated after each of the six challenges.
The observers were instructed to monitor team performance (explained as a combination of teamwork and task-based outcomes) and to continuously take notes during the challenges. Following each challenge, they answered post-challenge surveys with a set of 14 BOS questions. These questions were formulated in the same way as the self-assessment questions, also using a five-point Likert scale. It should be noted that the observers were allowed to read the questions prior to the study, in order to increase their understanding of where to focus their attention while monitoring the challenges. The challenges were kept to a maximum of 8 min and performed back-to-back in one session. During the short breaks between challenges, the participants completed the post-challenge surveys immediately after finishing their tasks, in order to reduce the risk that they would forget or neglect performance trends (Kendall and Salas 2004). Each observer survey was paired with the corresponding team member survey after completion to allow pairwise comparisons.
After each team had completed the sixth challenge, both the team members and the observers answered an additional survey of post-study questions to complete the data collection. In addition, a pre-study survey was completed by each participant prior to the first scenario, in order to collect some minimal demographics. The timing of each survey in relation to the scenarios is depicted in Fig. 2 below.
The post-challenge survey consisted of 16 and 14 questions, for team members and observers, respectively. All questions were inspired by the Crew Awareness Scale (McGuinness and Foy 2000), NASA task load index (TLX) (Hart and Staveland 1988), and Bushe and Coetzer’s survey instrument (1995). These instruments were selected as a baseline for the surveys both because they were available and familiar to the study controllers, and because they were considered feasible to use for assessment, in terms of effort.
Table 1 presents the survey items for both the team members (labeled x) and for the observers (labeled y). Question x0 used an 11-point Likert scale, whereas all other questions used five-point Likert scales. The scale was designed so that a high score is expected to correlate with strong team performance for all items, except items 1, 2, and 10, where the opposite relationships were expected. All questions relate to individual or team performance. As the table shows, there is considerable overlap between what the team members and the observers were asked to assess.
In addition to the collected survey data, intra-team communication (audio and text chat) was recorded for the entire duration of each challenge. These data were collected by tapping into the collaboration tool that the teams used to communicate during the challenges. The recorded interactions may prove useful for future analysis; however, they have not been included in this study and are mentioned here for the sake of completeness only. Likewise, x14 and x15 have not been used in this study. For each challenge, an observer assessment score of the overall performance (z_obs) was calculated as the mean of the y16 ratings for that challenge.
The fourth and final data source is the objective outcomes-based task performance measure, which consists of a record of whether the challenge was successfully completed and the time it took the team to complete the task. For the diagnosis challenges, the teams were given a base score of 10 points if their main hypothesis was the correct diagnosis. Three points were deducted per erroneous alternative hypothesis (n_a) that they could not rule out (max. 3), and thus task score = 10 − 3 · n_a. If their primary hypothesis was incorrect, but their alternative hypotheses contained the correct diagnosis, they were instead given 3 points minus the number of erroneous alternative hypotheses: task score = 3 − n_a. A complete failure to diagnose the patient resulted in a task score of zero. In each diagnosis scenario, the base score was adjusted by a time score calculated as time score = (1 − t_D/120) · 10, where t_D corresponds to the time needed (in seconds) for the team to agree on a primary diagnosis after the time to interact with the patient had run out (max. 120 s).
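The diagnosis scoring rules can be sketched as a pair of small functions. This is an illustrative rendering of the formulas above, not code from the study; the function names are invented for clarity.

```python
def diagnosis_task_score(primary_correct, n_a, alt_contains_correct=False):
    """Task score for a diagnosis challenge.
    n_a -- number of erroneous alternative hypotheses not ruled out."""
    if primary_correct:
        return 10 - 3 * n_a        # base 10, minus 3 per leftover hypothesis
    if alt_contains_correct:
        return 3 - n_a             # correct diagnosis only among alternatives
    return 0                       # complete failure to diagnose

def diagnosis_time_score(t_d):
    """Time score; t_d is the seconds needed to agree on a diagnosis (max 120)."""
    return (1 - t_d / 120) * 10

# Correct primary diagnosis, one leftover alternative, agreement after 60 s:
total = diagnosis_task_score(True, 1) + diagnosis_time_score(60)  # 7 + 5.0 = 12.0
```

Note that the time score rewards fast agreement linearly: using the full 120 s yields 0 extra points, while an instant decision would add the full 10.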
The treatment challenges consisted of a set of tasks that the experts needed to convey to the coordinator, who then had to perform them on the patient. The base performance score for the treatment challenge was 10 points, with 1 point deducted for every error that occurred (n_e), where an error was a failure to conduct a task, an incomplete or failed attempt at a task, or completion of a task that was not part of the scripted treatment program: task score = 10 − n_e. The treatment challenge base score was then moderated by time just as for the diagnosis challenges: time score = (1 − t_T/360) · 10, where t_T represents the time needed to complete the treatment challenge (max. 360 s).
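The treatment scoring mirrors the diagnosis scoring and can be sketched the same way; again, the function name is illustrative and not taken from the study materials.

```python
def treatment_total_score(n_e, t_t):
    """Total score for a treatment challenge.
    n_e -- number of errors (failed, incomplete, or unscripted tasks)
    t_t -- seconds needed to complete the treatment (max 360)"""
    task_score = 10 - n_e
    time_score = (1 - t_t / 360) * 10
    return task_score + time_score

# Two errors, completed in 180 s: (10 - 2) + (1 - 0.5) * 10 = 13.0
score = treatment_total_score(2, 180)
```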
For both the diagnosis and the treatment challenges, the total score (z_task) was calculated as task score + time score. The aggregated score represents a speed/accuracy trade-off that is commonplace in both work situations and sports (Fairbrother 2010), with the caveat that the designed metric does not necessarily correspond to the prioritizations made by the teams. These data were captured by the experiment leaders and kept for reference and benchmarking.
Before any analysis, the scale was inverted for questions 1, 2, and 10 in both sets, to obtain a homogeneous dataset in which high ratings consistently indicate strong team performance. The team members' self-assessment scores were compared to the observers' ratings for all process variables (variables 1–13 in Table 1). The comparison was done by calculating the bivariate correlation coefficients for each pair of variables. Correlation strengths have been categorized according to Table 2 below. t tests were then computed to see whether there are significant differences in how the choice of method affects the results, at α = .05. The p values were adjusted using the Holm–Bonferroni method to compensate for the multiple comparisons problem and reduce the risk of false positives (Holm 1979).
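The Holm–Bonferroni adjustment is simple enough to sketch directly. The following is a generic step-down implementation of the method described by Holm (1979), not the study's actual analysis script.

```python
def holm_bonferroni(p_values):
    """Holm (1979) step-down adjustment.
    Returns adjusted p values in the original order; an adjusted
    value is compared to alpha (here .05) like an unadjusted one."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min((m - rank) * p_values[i], 1.0)
        running_max = max(running_max, adj)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Four raw p values -> adjusted values, preserving input order
adj = holm_bonferroni([0.01, 0.04, 0.03, 0.005])  # ≈ [0.03, 0.06, 0.06, 0.02]
```

Unlike the plain Bonferroni correction, the step-down procedure multiplies the smallest p value by m, the next by m − 1, and so on, which makes it uniformly more powerful while still controlling the family-wise error rate.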
The objective team performance score reflects only task-based aspects of team performance. As a contrast to this measurement, the subjective y16 variable represents the observers' overall assessments of team performance. The aggregation of y16 over all observers relating to one team represents a more comprehensive take on team performance, with the caveat that the ratings are subjective. The task scores have been compared with the aggregated observer scores to determine whether there is indeed a positive correlation between team performance as interpreted by the observers and task performance. This correlation test was done at the team level, since the team is the unit of analysis that both metrics were designed for. Thus, each completed challenge produced one tuple of data, for a total of N = 46 samples.
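The team-level test amounts to a bivariate (Pearson) correlation over the 46 challenge tuples. A minimal sketch, using invented illustrative values rather than the study's data:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Bivariate (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative team-level tuples (objective z_task, aggregated observer score):
z_task_vals = [12.0, 8.5, 15.0, 6.0, 10.5]
z_obs_vals = [4.0, 3.0, 4.5, 2.5, 3.5]
r = pearson_r(z_task_vals, z_obs_vals)  # strongly positive for this toy data
```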
The independent SAR and BOS variables were then fitted to both the z_task and the z_obs scores through multiple regression. The glmulti package for R (Calcagno and de Mazancourt 2010) was used to identify the best-fit regression model using exhaustive search, optimizing on the Akaike information criterion (AIC) (Akaike 1974).
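The exhaustive search that glmulti automates can be illustrated in a few lines. This sketch uses a Gaussian-likelihood AIC and takes the model-fitting step (fit_rss) as a parameter, so it is a generic stand-in for the procedure, not a reimplementation of the R package.

```python
from itertools import combinations
from math import log

def aic(n, rss, k):
    """Gaussian-likelihood AIC up to an additive constant:
    n * ln(RSS / n) + 2k, with k parameters (including the intercept)."""
    return n * log(rss / n) + 2 * k

def best_subset(predictors, fit_rss, n):
    """Exhaustive search over all non-empty predictor subsets.
    fit_rss(subset) must return the residual sum of squares of an
    OLS fit on that subset (e.g., via numpy.linalg.lstsq)."""
    best = None
    for r in range(1, len(predictors) + 1):
        for subset in combinations(predictors, r):
            score = aic(n, fit_rss(subset), len(subset) + 1)
            if best is None or score < best[0]:
                best = (score, subset)
    return best

# Toy RSS values for N = 46 samples; the richer model wins on AIC here
# because the RSS drop outweighs the 2-point penalty per extra parameter:
rss_table = {('x1',): 100.0, ('x2',): 80.0, ('x1', 'x2'): 60.0}
best_aic, best_model = best_subset(['x1', 'x2'], rss_table.__getitem__, 46)
```

With more predictors the subset count grows as 2^p − 1, which is why glmulti also offers a genetic-algorithm mode for large candidate sets; for the modest number of variables here, exhaustive search is tractable.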
All results in this study were computed using IBM SPSS and R.