1 Introduction

Human-centered computing (HCC) studies the human in relation to computing devices. It differs from human–computer interaction (HCI) in that it deals with how the human (an individual, a team, or a society) relates to computers and to other humans. Its focus is on the multi-faceted nature of humans, including emotions, social skills, attitudes, and so on. According to Clarkson et al. [21] and Canny [19], HCC is about studying “computational artifacts in support of human endeavours” and the “implications of computing in a task-directed way”, spanning several disciplines such as Computer Science and the Social Sciences. One of the open challenges in Computer Science, specifically in Social Signal Processing and Affective Computing, is conceiving and building computing systems that seamlessly support humans in team activities. This process has to be informed by the Social Sciences, which have a long tradition of studying the socio-affective phenomena occurring in teams.

When people work together, their affective, behavioural and cognitive interaction often contributes to the emergence of dynamic processes called “team emergent states” [46, 55]. One of them is the transactive memory system (TMS), a cognitive team emergent state related to the specific knowledge owned by each team member. The term “transactive” highlights the relevance of exchanging information about members’ knowledge and expertise. TMS combines each member’s personal field of knowledge (e.g., Robert has a mathematical background, while Susan has a history in arts) with the awareness of the others’ fields (e.g., Robert knows that Susan is specialised in arts, whereas Susan knows that Robert is good at maths) [81]. In this work, we share the conceptualisation of TMS given in [52]. That is, TMS differs from other forms of socially shared cognition in several respects. First, it depends not only on understanding “who knows what” but also on the degree to which a team’s knowledge is differentiated. In addition, it includes the dynamic interplay between an organised store of knowledge (the TMS structure) and a set of knowledge-relevant transactive processes (encoding, storage, and retrieval) that occur among team members.

It is well known that developing TMS within a team can significantly improve performance, productivity and, therefore, efficiency [3, 46, 51, 61, 63], by enabling work sharing and thus reducing the individual cognitive load [35]. Recent findings suggest that TMS is strongly linked to affective outcomes, such as team trust, efficacy and satisfaction [84].

While there is no general agreement on how TMS emerges within a team, all the theories about it stress the relevance of interpersonal verbal and nonverbal communication.

Individuals communicating with each other are better able to select the information they are willing to learn [33] (e.g., as John is a mechanical specialist, he will be interested in car engines only). Similarly, information retrieval is facilitated if communication happens during learning [33]. This aligns with Pavitt’s theory [68], which considers communication as a context for learning. Moreover, communication enables a better understanding among members [14] and prevents the team from applying stereotypes about each other’s expertise [36].

Although several previous studies show that nonverbal communication can predict some emergent states (e.g., cohesion) [47], to our knowledge no studies have explored how nonverbal behaviours and TMS are related. Consequently, no work focuses on how computing systems can deal with the nonverbal behaviours characterising TMS within teams. The development of such systems would enable HCC applications that facilitate team problem-solving, and the definition of computational models for effectively supporting team collaboration.

The work presented in this paper is a first step towards this goal. Its main contribution is to investigate which nonverbal features, already exploited in studying other emergent states (e.g., leadership and cohesion, see Sect. 3), can predict TMS, both unimodally and multimodally (see Sect. 6). An overview of our approach is shown in Fig. 1. From our findings, we provide insights into the development of HCC systems leveraging TMS.

Fig. 1

An overview of our approach. From the multimodal dataset WoNoWa (see Sect. 4), we extract nonverbal multimodal features: audio, movement and spatial arrangement of teams. The choice of these features is mainly inspired by those previously found to be relevant for estimating the team’s emergent states related to TMS. We compute team features and team scores from the extracted team member’s features and self-assessed TMS (see Sect. 5). Finally, we analyse the role of team features in modelling and predicting TMS by running multiple linear regression analyses (see Sect. 6)

2 Theoretical background

According to Moreland, a team includes at least 3 individuals sharing knowledge, activities and so on [62]. Unlike dyadic interactions, team interactions are more complex since they include both one-to-one and one-to-many exchanges. As a consequence, this complexity also applies to team “emergent states”, defined as “cognitive, affective, and motivational states of teams that are dynamic and vary as function of team context inputs, processes, and outcomes” [55, 71]. Three categories of emergent states have been identified [30, 46]. Cognitive emergent states relate to the management of the collective knowledge affecting team performance (e.g., Shared Mental Models [22] and the Transactive Memory System). Behavioural emergent states relate to the activities and interactions between team members (e.g., processes related to planning, monitoring, coordination and decision-making). Finally, Emotional or Affective emergent states (e.g., cohesion and trust) include psychological states relating to feelings, attitudes, and emotions of the team members [71]. While behavioural emergent states can be directly measured from the team’s behaviours, for example by automatically extracting behaviours through sensors, and some efforts are being made towards measuring emotional emergent states from team dynamics [73, 80], cognitive emergent states, such as TMS, have so far only been measured through indirect means such as questionnaires and recall [52, 53, 64].

2.1 The transactive memory system

The transactive memory system (TMS) is an extension of an individual’s memory to the team level. In other words, transactive memory refers to the awareness of one’s own knowledge and skills. TMS develops when each team member is also aware of the knowledge and skills of the others. Members thus build a mental representation of how knowledge is distributed among them (i.e., “who knows what”), allowing them to extend their individual knowledge [81].

TMS is a multidimensional construct consisting of: (i) Credibility, that is, the trust that the knowledge possessed by any of the other members is correct and accurate; (ii) Knowledge Specialisation, that is, the differentiation of knowledge between the team members; (iii) Coordination, that is, the ability of the members to work together smoothly [51, 63]. Credibility and Coordination are key factors in the affective outcomes of TMS [59, 83].

The development of TMS follows the 3 phases characterising any memory system: Encoding, Storage, and Retrieval. During the Encoding phase, the team members infer “who knows what” by having multiple information exchanges. For example, a student group has to complete a project assignment. They have known each other since the 1st year and know that Robert is good at planning, Susan is very creative and Alice is good at programming. In the Storage phase, the team members distribute the incoming information according to each other’s expertise [54]. For example, when the professor communicates the deadlines for the deliverables of the project, Robert will be particularly attentive to this information since he is the one in charge of the planning. Acceptance and shared awareness of expertise are needed in this phase. Finally, in the Retrieval phase, the team members know from whom they can obtain the knowledge they need [54]. For example, Alice is programming the software and asks Susan for hints for designing the user interface. Here, knowledge distribution in the team is necessary.

2.2 Interpersonal communication and TMS

Interpersonal communication in a team can be both voluntary and involuntary [27], and does not always imply verbal exchanges [60]. Previous work highlighted the role of nonverbal behaviours in team communication, including cues such as spatial arrangement, management of inter-member distances, speaking turn patterns, interruptions, etc. [23, 32, 37]. Interpersonal communication is crucial for TMS [69], being among the factors that precede [72] and support its development through all 3 phases. In particular, some studies have shown that the use of nonverbal and para-linguistic cues in face-to-face communication allows members to signal and combine their knowledge more effectively than in non-face-to-face (e.g., computer-mediated) communication [34]. Communication during team training also facilitates the collective recall of information [64].

Other authors focused on the role of communication in the 3 dimensions of TMS (i.e., Credibility, Specialisation, Coordination). Kleanthous et al. [44] investigated how each dimension varies over time in a team navigating a 3D virtual environment collaboratively. They showed the important role of communication on Coordination and of gesticulation on Credibility.

Yoo and Kanawattanachai [82] and Rahimpour [70] noted that the amount of communication plays a crucial role in establishing TMS, after which its role gradually becomes less relevant. For example, to build a TMS, Yoo and Kanawattanachai asked teams to communicate remotely to manage a company’s finances and thus to share different areas of expertise (marketing, finance, production, operations and human resources). They found a positive influence of communication on the development of the TMS, which stopped once it was built. Argote et al. [2] highlighted that the influence of communication on TMS changes according to the presence of a team leader. They showed that teams without a leader develop a more robust TMS over time, which leads to better team performance. This result can be explained by the increased communication taking place in teams without a leader.

3 Related work

As mentioned above, previous work on nonverbal behaviours and emergent states that could be exploited in HCC has neglected TMS, preferring behavioural and emotional emergent states, e.g., emergent leadership, cohesion and trust in a team. This could be explained by the fact that TMS, being a cognitive emergent state, deals with abstract knowledge (meta-memory), making it more difficult to investigate through concrete cues (nonverbal behaviours). In this paper, we build on features that have already been shown to perform well in predicting other emergent states (e.g., leadership and cohesion). We hypothesise that some of them could also be related to TMS, since psychological models in the literature show a relationship between TMS and leadership [4, 48], as well as a predictive effect of task cohesion on TMS in the context of football teams [49].

Most of the work on automatic assessment of group dynamics focused on the multimodal analysis of team meeting corpora, such as the ICSI [38], AMI [41], ATR [17], NTT [67] and ELEA [75] corpora. While some works focused on the prediction of individual dimensions such as personality [57] or individual performance [50], in the following we focus on the extraction of nonverbal features and their relevance in inferring behavioural and emotional emergent states. Sanchez et al. analysed the correlation between the emergence of individual leadership in team interactions (measured through questionnaires about team members’ perception of each other) and acoustic [73], body/head [74] and attention [76] features. Their results suggest that emergent leaders are those who talk the most, have more speaking turns and interrupt the most. They also show that body activity and motion are important in the perception of emergent leadership and that the combination of acoustic and visual information performs better than single modalities. Finally, they show that visual attention features are not better estimators of leadership than speaking activity.

More recent approaches to emergent leadership detection investigate other features and apply more complex machine learning models. Beyan et al. propose several approaches. First, they model emergent leadership by using features related to visual attention only (from head and body activity) [6]. They extract the same features used in [76] together with a set of new ones, leveraging a different and more accurate method based on head pose estimation. The authors then describe an approach for extracting features based on 2D pose estimation [8]. These features perform better in emergent leadership detection than the existing visual features. In a more recent work [9], they propose a sequential approach based on unsupervised deep learning generative models. Other works investigate the expression of different leadership styles [7, 24,25,26, 39].

Another emergent state is team cohesion, whose investigation in HCC was initiated by Gatica-Perez and Hung [37]. They automatically extracted multimodal features (audio, visual and audio-visual) to infer low/high cohesion in task-based team meetings through machine learning techniques (e.g., SVM). Results indicate a particular relevance of turn-taking patterns. Nanninga et al. extended this work by adding para-linguistic mimicry features and separately observing the social and task dimensions of cohesion [66]. A more recent study considers features of 3 categories: nonverbal (e.g., gaze, laughter, and so on), dialogue acts and interruptions [42]. These features are studied separately and then combined. Results show a positive correlation with cohesion of, among others, mutual gaze and laughter, as well as the number of speaking turns, overlaps and interruptions. More interestingly, certain behaviours that are not associated with cohesion when analysed separately do have an impact when combined with cues of different modalities (e.g., dialogue acts with head nods). Walocha et al. explore the dynamics of the task and social dimensions of cohesion, grounding on motion-capture features only [80]. They predict the decrease of cohesion over time, using self-reported annotations of team cohesion as labels. Their results highlight a (positive or negative) impact of the maximum distance between team members, the overall posture expansion and the amount of facing between each person. In addition, some features are found to be correlated with both task and social cohesion.

To summarise, the works described above show the effectiveness of using nonverbal behaviour in addressing emergent states. In particular, multimodal approaches seem to generally perform better than unimodal ones.

4 The WoNoWa dataset

WoNoWa (Who kNows What) is a multimodal (audio and video) dataset of interactions within 15 teams, performing several activities [11]. The dataset includes automatically extracted features and manual annotations of team members’ nonverbal behaviours, as well as self-assessment measures of TMS.

WoNoWa was designed to address the 3 phases of TMS, i.e., Encoding, Storage and Retrieval (see Sect. 2). In the Encoding phase, the team was given a list with 3 fields of expertise and each member could choose the preferred one. These fields were: Logistical, Mathematical and Manual expertise. In the Storage phase, each team member watched a brief tutorial about the chosen field of expertise.

We focus, here, on the interactions related to the Retrieval phase. During this phase, the team members were together in the Interaction Area shown in Fig. 2. The Retrieval phase consisted of three steps, after each of which every participant filled out a TMS questionnaire (see Sect. 4.2). At the beginning of this phase, each team member was asked to accomplish a task related to the chosen field of expertise: setting up the table by following the rules described in the tutorial (Logistical expertise); computing conversions between the Imperial and the International System (Mathematical expertise); making origami (Manual expertise). Then, as a team, they were asked to modify the setup of the table and to make new origami, this time following a list of dimensions (given in the Imperial system). The participants were only provided with measuring tools in the International System (meters), so mathematical expertise was needed to accomplish the task.

Finally, in the last step of the Retrieval phase, the step on which we focus in this work, the members were free to self-assign the same 3 tasks in any way they wanted (but they could not choose the one they had just performed). So, the members needed each other’s expertise, resulting in collaboration and interaction between them. We focus on this step of the Retrieval phase, hereinafter called “interaction”, for two main reasons: (1) it is the last one, so the team had time to develop TMS through the previous ones (as confirmed by the high scores of Specialisation and Coordination, as well as higher scores of Credibility compared to the previous steps of the Retrieval phase [11]); (2) the team members are engaged in a collaborative task requiring a high level of interaction.

Fig. 2

On the left: plan of the Interaction Area, measuring 3.90 by 3.87 m. On the top right: view from the video camera placed in the North-East corner of the area (V1). On the bottom right: view from the video camera placed in the South-West corner of the area (V2). In each view, the ArUco markers [29] (M1, M2, M3) that can be viewed by the corresponding camera are shown. Each table corresponds to one of the 3 fields of expertise: Logistical (E1), Mathematical (E2) and Manual (E3)

4.1 Technical setup

WoNoWa was collected in an experimental room depicted in Fig. 2. A table was placed in the center of the Interaction Area, while two more tables were placed in the corners of the room. Team members performed the tasks related to the different fields of expertise as indicated in Fig. 2. The team interaction was recorded via 3 video cameras at \(1920 \times 1080\) resolution, progressive scan, 50 fps. Two of them were installed at a height of 3 m in opposite corners of the room, so that each member could be viewed by at least one camera at all times. However, each video camera could capture only a part of the room, so camera view fusion had to be performed, as described in Sect. 5.2.1. An additional frontal video camera was positioned at a lower height to provide a global view of the area, facilitating the manual annotations (see Sect. 5.3). The video cameras were calibrated to correct the white balance and compensate for the lens distortion.

For tracking the team members’ positions in the room, we used ArUco markers, fiducial markers based on a seven-by-seven binary grid [29]. Three reference markers were positioned on the floor so that they were visible to both cameras and remained constant throughout the experiment (see M1, M2 and M3 in Fig. 2). A unique ArUco marker was also placed on the baseball hat worn by each team member (see Sect. 5.2.1). Each participant wore a wireless microphone headset recording at 44.1 kHz on a separate channel. They also wore t-shirts of different colours to facilitate the extraction of the upper body silhouette used to compute the movement features (see Sect. 5.2.2).

4.2 Self-assessment scores of TMS

The team members were asked to fill out a questionnaire about their perception of TMS in the team, after each step of the Retrieval phase. The questionnaire contained Lewis’ items [51] measuring the 3 dimensions of TMS (i.e., Credibility, Specialisation, Coordination). For French participants, the French translation of Lewis’ questionnaire, validated by Michinov [58], was used. All the scores were given on a 5-point Likert scale, where 1 stands for “I totally disagree” and 5 stands for “I totally agree”.

Table 1 Descriptive statistics of the features, organised in 3 categories: Audio (A), Movement (M), Spatial (S)

For each TMS dimension, Cronbach \(\alpha \) was computed to measure the reliability of the items. Two items from the Coordination sub-scale were discarded since they were negatively correlated with the others belonging to the same sub-scale, indicating that the team members did not interpret them correctly. The \(\alpha \) computed on the remaining items indicated acceptable to very good reliability (0.83 for Credibility, 0.78 for Specialisation and 0.67 for Coordination). The scores of the items of the same sub-scale were then averaged to obtain one score per member.

To obtain one score per team and per TMS dimension (i.e., one score for Credibility, one for Specialisation, and one for Coordination), we checked whether the team members agreed on the score they assigned to their team for each TMS dimension. ICCs (two-way, average measures) with the consistency definition [45] were computed for each team, revealing fair to excellent agreement (fair for 2 teams, good for 2 teams and excellent for 10 teams, all \(p<0.001\)) [20], except for one team (\(\textrm{ICC}=-0.66\), \(p=0.97\)), which was excluded from the analyses. Finally, for each TMS dimension and each team, we computed the mean of the team member scores.
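As a concrete reference for these reliability and agreement measures, the sketch below implements Cronbach’s \(\alpha \) and the two-way, average-measures, consistency ICC from their standard formulas with NumPy. It is an illustration of the statistics we report, not the exact scripts used in this study; the matrix layouts (respondents × items, targets × raters) are assumptions.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_var / total_var)

def icc_consistency_avg(ratings):
    """Two-way, average-measures ICC with the consistency definition
    (ICC(C,k)), for an (n_targets x k_raters) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()    # between targets
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows
```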

5 Nonverbal features extraction

As mentioned in Sect. 4, we focus on nonverbal features extracted from data collected in the last step of the Retrieval phase.

In the remainder of this Section, we describe the nonverbal features organised according to the modality they belong to: Audio, Movement and Spatial. Table 1 summarises their descriptive statistics.

5.1 Audio features

Audio features were extracted from the audio recordings and relate to vocal turn-taking, which plays an important role in developing social dimensions like competition and collaboration [32, 40]. Vocal turn-taking includes silences, overlaps, and interruptions. In particular, interruptions are a relevant cue in face-to-face conversations: they can be considered as turn-taking violations [5], reflecting interpersonal attitudes (e.g., dominance or cooperation) as well as engagement in the interaction [65].

5.1.1 Pre-processing

To compute the audio features relating to the team members’ turn-taking activity, we applied a series of transformations to the raw audio files. The audio recordings of each team member were manually synchronised with the videos by referring to a clap that the experimenter performed at the beginning and at the end of each recording. The raw files were normalised and compressed, and a noise reduction filter was applied in Audacity. Additionally, about 10% of the files were processed to reduce specific noises, e.g., breathing, electromagnetic interference, and so on. To detect speaking activity, the Silence Finder function of Audacity was applied to automatically detect and mark segments exceeding a defined sound threshold. The segments were manually checked and tuned to ignore irrelevant sounds, e.g., impacts with the microphone, objects falling on the ground and nonverbal vocal behaviour (sighing, laughing, self-talking, etc.). The resulting segments were binarised, with 1 representing speech and 0 non-speech.

5.1.2 Output

From the binary segmentation, we computed the following features per team member over the whole video, taking inspiration from previous work on team analysis [37, 74]:

  • Total Speaking Turns (number of turns per minute, per member m) - \(\textit{TST}_m\): the number of speaking turns, normalised by the interaction length in minutes;

  • Total Speaking Length (number of seconds per minute, per member m) - \(\textit{TSL}_m\): the total speaking time, in seconds, divided by the interaction length in minutes;

  • Average Speaking Turn (per member m) - \(\textit{AST}_m\): the average speaking turn duration, in seconds, with \(\textit{AST}_m = \textit{TSL}_m / \textit{TST}_m\);

  • Total Attempted Interruptions (per minute, per member m) - \(\textit{TAI}_m\): the number of attempted interruptions, normalised by the interaction length. An attempted interruption occurred if a team member started speaking while another one was already speaking, resulting in an overlap;

  • Total Successful Interruptions (per minute, per member m) - \(\textit{TSI}_m\): the number of successful interruptions, normalised by the interaction length. A successful interruption occurred if (1) a team member started speaking while another one was already speaking, resulting in overlap, and, consequently, (2) that team member stopped speaking before ending their turn;

  • Successful Interruptions Percentage (per member m) - \(\textit{SIP}_m\): the percentage of successful over attempted interruptions, with \(\textit{SIP}_m = \textit{TSI}_m / \textit{TAI}_m \times 100\).

The above features, computed on each team member, were then averaged to obtain the following team audio features: \(\textit{TST}\), \(\textit{TSL}\), \(\textit{AST}\), \(\textit{TAI}\), \(\textit{TSI}\), \(\textit{SIP}\).
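The computation of these turn-taking features from the binarised segmentation is straightforward; the sketch below is a minimal Python version. It assumes one binary array per member, all sampled at the same rate `fs`, and approximates a successful interruption by checking whether the merged speech of the other members falls silent before the interrupter’s turn ends; `fs` and the data layout are illustrative choices, not the paper’s exact implementation.

```python
import numpy as np

def speaking_turns(speech):
    """(start, end) index pairs of contiguous speech in a binary
    1 = speech / 0 = non-speech array."""
    padded = np.concatenate(([0], speech, [0]))
    starts = np.where(np.diff(padded) == 1)[0]
    ends = np.where(np.diff(padded) == -1)[0]
    return list(zip(starts, ends))

def audio_features(speech_by_member, fs=100):
    """speech_by_member: member id -> binary array sampled at fs Hz."""
    n = len(next(iter(speech_by_member.values())))
    minutes = n / fs / 60.0
    per_member = {}
    for m, s in speech_by_member.items():
        turns = speaking_turns(s)
        # merged speech activity of all the other members
        others = np.clip(sum(o for k, o in speech_by_member.items() if k != m), 0, 1)
        tai = tsi = 0
        for start, end in turns:
            if others[start]:                     # overlap at onset: attempted
                tai += 1
                if not others[start:end].all():   # others stop first: successful
                    tsi += 1
        tsl_s = s.sum() / fs                      # total speaking time (s)
        per_member[m] = {
            "TST": len(turns) / minutes,
            "TSL": tsl_s / minutes,
            "AST": tsl_s / len(turns) if turns else 0.0,
            "TAI": tai / minutes,
            "TSI": tsi / minutes,
            "SIP": 100.0 * tsi / tai if tai else 0.0,
        }
    # team features: average each measure over the members
    keys = next(iter(per_member.values()))
    team = {k: np.mean([f[k] for f in per_member.values()]) for k in keys}
    return per_member, team
```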

5.2 Movement features

The following Movement features are computed: Head Velocity (\(\textit{HV}\)), Head Distance (\(\textit{HDist}\)), Head Directness (\(\textit{HDir}\)), Entropy of HV (\(\textit{HVE}\)), Quantity of Motion (\(\textit{QoM}\)), and Entropy of QoM (\(\textit{QoME}\)). The selection of these features is inspired by previous work on social interaction [16, 28, 79].

5.2.1 Head position features

Team members’ head position and rotation were tracked through a marker-based approach. Each team member wore a cap with an ArUco marker [29] attached on top. The three reference markers on the floor (M1, M2 and M3 in Fig. 2, see Sect. 4.1) were used as references to compute the position of the members’ head markers in the room. Since each camera performed a separate head tracking, the resulting tracks had to be merged before use. The processing was carried out via a Python script using the OpenCV library [15]. We applied linear interpolation and average smoothing in case of missing frames. For each video frame and team member, the following data were extracted: the 3D head position (meters) \(\textit{HP} = (\textit{HP}_x, \textit{HP}_y, \textit{HP}_z)\) and the 3D head rotation (radians) \(\textit{HR} = (\textit{HR}_x, \textit{HR}_y, \textit{HR}_z)\).
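For illustration, a minimal per-frame detection step with OpenCV’s ArUco module might look as follows. This is a sketch assuming the legacy `opencv-contrib-python` API (`Dictionary_get`, `DetectorParameters_create`, renamed in OpenCV ≥ 4.7) and an illustrative 7×7 dictionary; it stops at 2D marker centres, after which mapping to room coordinates via the floor markers, fusing the two camera tracks and interpolating gaps would follow.

```python
import cv2

# Illustrative dictionary choice; the paper only specifies a 7x7 binary grid.
ARUCO_DICT = cv2.aruco.Dictionary_get(cv2.aruco.DICT_7X7_50)
PARAMS = cv2.aruco.DetectorParameters_create()

def detect_markers(frame):
    """Return {marker_id: 2D centre in image coordinates} for one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, ARUCO_DICT, parameters=PARAMS)
    if ids is None:
        return {}
    return {int(i): c[0].mean(axis=0) for i, c in zip(ids.flatten(), corners)}
```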

Head velocity (HV) We computed Head Velocity HV as the magnitude of the 1st derivative of the head position:

$$\begin{aligned} HV = \sqrt{\left( \frac{d\textit{HP}_x}{d\textit{t}}\right) ^2 + \left( \frac{d\textit{HP}_y}{d\textit{t}}\right) ^2 + \left( \frac{d\textit{HP}_z}{d\textit{t}}\right) ^2} \end{aligned}$$
(1)

To reduce noise, we applied a Savitzky–Golay low-pass filter (order 1, frame size 75).
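A compact version of this computation, assuming the dataset’s 50 fps and applying the filter to the resulting speed signal (SciPy’s `savgol_filter` with the stated order and frame size), could read:

```python
import numpy as np
from scipy.signal import savgol_filter

def head_velocity(hp, fps=50):
    """hp: (n_frames, 3) head positions in metres. Returns the per-frame
    speed (m/s) of Eq. (1), smoothed with a Savitzky-Golay filter
    (polynomial order 1, frame size 75, as in the paper)."""
    vel = np.gradient(hp, 1.0 / fps, axis=0)   # first derivative per axis
    speed = np.linalg.norm(vel, axis=1)        # magnitude of the velocity
    return savgol_filter(speed, window_length=75, polyorder=1)
```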

Head distance (HDist) For each team member, we averaged the Euclidean distance between their HP and the HP of each of the other 2 team members, obtaining \(\textit{HDist}_i\), \(\textit{HDist}_j\) and \(\textit{HDist}_k\) for the three team members, respectively. We then averaged \(\textit{HDist}_i\), \(\textit{HDist}_j\) and \(\textit{HDist}_k\) for each team to obtain the team feature \(\textit{HDist}\) for each frame.

Head directness (HDir) The Directness of movement estimates how direct or indirect a trajectory is [1, 16]. We computed Head Directness on the HP trajectory over 15 s long moving windows, with 3 s overlap:

$$\begin{aligned} \textit{HDir} = \frac{||\textit{HP}_{W-1} - \textit{HP}_{0}||}{\sum _{f = 0}^{W-1} ||\textit{HP}_{f+1} - \textit{HP}_{f}||} \end{aligned}$$
(2)

where W is the window length (in frames) and \(||\,||\) is the Euclidean distance between the head positions \(\textit{HP}\) in two frames of the window. So, \(\textit{HDir}\) tends to 1 if the length of the head trajectory in the time window tends to be equal to the distance between the head positions in the first and the last frame of the window (i.e., the head trajectory is direct); it tends to 0 if the length of the head trajectory is much greater than that distance (i.e., the head trajectory is indirect).
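Equation (2) translates directly into a sliding-window computation; the sketch below assumes 50 fps and a 12 s hop (15 s windows with 3 s overlap), our reading of how the windows advance.

```python
import numpy as np

def head_directness(hp, fps=50, win_s=15, overlap_s=3):
    """hp: (n_frames, 3) head positions. Returns one HDir value (Eq. 2)
    per 15 s window, windows overlapping by 3 s."""
    w, hop = int(win_s * fps), int((win_s - overlap_s) * fps)
    hdir = []
    for start in range(0, len(hp) - w + 1, hop):
        seg = hp[start:start + w]
        net = np.linalg.norm(seg[-1] - seg[0])                     # straight-line displacement
        path = np.linalg.norm(np.diff(seg, axis=0), axis=1).sum()  # travelled path length
        hdir.append(net / path if path > 0 else 1.0)
    return np.array(hdir)
```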

5.2.2 Silhouette blob features

To exploit colour thresholding for detecting upper body (head, torso, and arms) movement features, the team members wore coloured t-shirts and baseball hats. For each video frame, the upper body Silhouette Blob (SB) was extracted as the binary threshold of the HSV pixel data, and a median filter was applied to remove noise [18].
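In OpenCV terms, this extraction amounts to an HSV in-range test followed by a median filter; the sketch below assumes per-member colour bounds tuned by hand (the bounds and kernel size are illustrative).

```python
import cv2
import numpy as np

def silhouette_blob(frame, hsv_low, hsv_high, ksize=5):
    """Boolean upper-body Silhouette Blob (SB) of one member, obtained by
    thresholding the HSV pixel data on their t-shirt colour and removing
    noise with a median filter."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_low), np.array(hsv_high))
    return cv2.medianBlur(mask, ksize) > 0
```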

Fig. 3

The movement feature extraction framework. From top to bottom, with time flowing from left to right: video frames are read, Head Positions (HP) and Silhouette Blobs (SB) are extracted, movement features (HV, QoM, HVE, QoME, HDir and HDist) are computed

Quantity of motion (QoM) Quantity of Motion (QoM) is a 2D measure of the amount of performed movement [18]. First, we computed the area of the binary image resulting from the XOR between 2 consecutive SBs (the XOR image area). Then, QoM is the ratio between the XOR image area and the area of the binary image resulting from the OR between the same 2 consecutive SBs. So, QoM tends to 0 if team members are still, and is greater than 0 if they are moving (the upper limit being 1).
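With boolean silhouettes, the XOR/OR ratio is a two-line computation; a minimal sketch:

```python
import numpy as np

def quantity_of_motion(sb_prev, sb_curr):
    """QoM between two consecutive boolean Silhouette Blobs:
    area(XOR) / area(OR); 0 when still, approaching 1 for large movements."""
    moved = np.logical_xor(sb_prev, sb_curr).sum()
    covered = np.logical_or(sb_prev, sb_curr).sum()
    return moved / covered if covered else 0.0
```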

5.2.3 Head velocity and quantity of motion entropy

As detailed in [79], Sample Entropy (SampEn) is a non-linear entropy extraction technique developed to quantify behaviour regularity by taking into account the “recent” movement history. Higher values of SampEn are associated with higher disorder, while smaller values indicate regularity.

We used SampEn to estimate the degree of regularity of a team member’s HV and QoM, which we consider an approximation of team coordination (one of the components of TMS, see Sect. 2). We adopted the SampEn Matlab implementation described in [56] with parameters: Embedding Dimension \(m = 3\), Tolerance \(r = 0.2\).

SampEn was computed on moving time windows of 15 s with 3 s overlap.
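For reference, a compact NumPy variant of SampEn is sketched below: it follows the standard definition (negative log of the ratio of (m+1)- to m-length template matches under a Chebyshev distance, excluding self-matches) and, as is common, scales the tolerance r by the signal’s standard deviation. It is an illustrative re-implementation, not the Matlab code of [56]; here x would be the HV or QoM series restricted to one 15 s window.

```python
import numpy as np

def sample_entropy(x, m=3, r=0.2):
    """SampEn of a 1D signal with embedding dimension m and tolerance
    r * std(x). Higher values indicate more irregular behaviour."""
    x = np.asarray(x, dtype=float)
    tol = r * x.std()

    def matches(dim):
        # rows of `emb` are consecutive subsequences of length `dim`
        emb = np.array([x[i:i + dim] for i in range(len(x) - dim + 1)])
        # pairwise Chebyshev distances between all templates
        d = np.abs(emb[:, None, :] - emb[None, :, :]).max(axis=2)
        return (d <= tol).sum() - len(emb)   # drop self-matches

    a, b = matches(m + 1), matches(m)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf
```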

Figure 3 illustrates the movement feature extraction by providing a high-level representation of the process and by highlighting the main data and features computed at each step.

5.2.4 Output

The resulting movement features per team member m computed over the whole video are:

  • Head Velocity Mean (\(\textit{HV}_m\)) and Standard deviation (\(\textit{HVSD}_m\));

  • Head Directness Mean (\(\textit{HDir}_m\)) and Standard Deviation (\(\textit{HDirSD}_m\));

  • Quantity of Motion Mean (\(\textit{QoM}_m\)) and Standard deviation (\(\textit{QoMSD}_m\));

  • Head Velocity Entropy Mean (\(\textit{HVE}_m\)) and Standard Deviation (\(\textit{HVESD}_m\));

  • Quantity of Motion Entropy Mean (\(\textit{QoME}_m\)) and Standard Deviation (\(\textit{QoMESD}_m\)).

Similarly to the audio ones, the above features computed on each team member were averaged to obtain the following team movement features: \(\textit{HV}\), \(\textit{HVSD}\), \(\textit{HDir}\), \(\textit{HDirSD}\), \(\textit{QoM}\), \(\textit{QoMSD}\), \(\textit{HVE}\), \(\textit{HVESD}\), \(\textit{QoME}\), \(\textit{QoMESD}\).

Additionally, Head Distance Mean (\(\textit{HDist}\)) and Standard Deviation (\(\textit{HDistSD}\)) were also computed over the whole interaction.

5.3 Spatial features

People’s arrangement in the physical space (also called F-formation) can reflect their roles in the team and the ongoing interaction [23, 43]. Studies also show that interpersonal distance changes according to the degree of closeness among people [31]. For this reason, WoNoWa includes manually annotated features (produced by 2 raters, more details in [11]) related to the team arrangement in the experimental room and to how the members occupy the different areas of the room while performing the experiment tasks.

5.3.1 F-formations

The most frequent F-formations emerging from a visual analysis were the Semi-circular and the Triangular ones. The least frequent arrangements, that is, the L-shape and the Side-by-side ones, were merged into a category called Other. An example of each F-formation is shown in Fig. 4. Two identical F-formations separated by less than 5 s were considered a single uninterrupted occurrence, as in the sketch below.
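A possible merging rule over the raters’ time-stamped annotations, under the assumption that they are stored as sorted (start, end, label) triples in seconds:

```python
def merge_fformations(segments, gap_s=5.0):
    """Merge identical consecutive F-formations separated by less than
    gap_s seconds into a single uninterrupted occurrence.
    segments: list of (start_s, end_s, label), sorted by start time."""
    merged = []
    for start, end, label in segments:
        if merged and merged[-1][2] == label and start - merged[-1][1] < gap_s:
            merged[-1] = (merged[-1][0], end, label)   # extend previous segment
        else:
            merged.append((start, end, label))
    return merged
```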

Fig. 4

An example of the F-formations annotated in the WoNoWa dataset: a Triangular, b Semi-circular, c L-shape, d Side-by-side. Since their frequency was low, c, d were merged into a category called Other

5.3.2 Task-related area occupation

We considered 3 main categories of Task-related Area Occupation for each team member: when the member worked on their own task (Personal Area); when the member was in the area related to a different task (Others Area); and when the member did common tasks not related to a particular expertise, such as reading instructions, checking the table, thinking, and so on (Common Area).

5.3.3 Output

For each F-formation and Task-related Area Occupation category, we computed, over the whole video: its frequency (i.e., the number of occurrences per minute); the mean duration of each occurrence (the total time spent in that category divided by its number of occurrences); and the percentage of time in which the team was engaged in that category during the task (the total time spent in that category divided by the overall length of the interaction). A sketch of these computations is given after the list below.

The resulting final features were:

  • Semi-circular F-formations Frequency (\(\textit{SCffF}\)), Mean Time (\(\textit{SCffT}\)) and Percentage (\(\textit{SCffP}\));

  • Triangular F-formations Frequency (\(\textit{TrffF}\)), Mean Time (\(\textit{TrffT}\)) and Percentage (\(\textit{TrffP}\));

  • Other F-formations Frequency (\(\textit{OthffF}\)), Mean Time (\(\textit{OthffT}\)) and Percentage (\(\textit{OthffP}\));

  • Personal Area Occupation Frequency (\(\textit{PAF}\)), Mean Time (\(\textit{PAT}\)) and Percentage (\(\textit{PAP}\));

  • Others Area Occupation Frequency (\(\textit{OAF}\)), Mean Time (\(\textit{OAT}\)) and Percentage (\(\textit{OAP}\));

  • Common Area Occupation Frequency (\(\textit{CAF}\)), Mean Time (\(\textit{CAT}\)) and Percentage (\(\textit{CAP}\)).
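The sketch announced above shows how the three statistics could be derived for any category from the merged (start, end, label) segments; the data layout is the same illustrative one used for the F-formation merging.

```python
import numpy as np

def category_stats(segments, category, total_s):
    """Frequency (occurrences/min), mean occurrence duration (s) and
    percentage of interaction time for one F-formation or area category."""
    durations = [end - start for start, end, label in segments if label == category]
    return {
        "frequency": len(durations) / (total_s / 60.0),
        "mean_time": float(np.mean(durations)) if durations else 0.0,
        "percentage": 100.0 * sum(durations) / total_s,
    }
```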

6 Analyses and results

This work aims to model the three TMS dimensions as a linear combination of nonverbal multimodal features. We also seek explainable models whose variables have a meaningful interpretation (that is, one that is also meaningful for a human observer). We adopt multiple regression analysis since this method enables identifying significant relationships between a dependent variable (a TMS dimension) and independent variables (nonverbal features). In addition, it enables computing the strength of the impact of multiple independent variables on the dependent variable.

First, we check whether the data meet the assumptions for multiple linear regression. We remove the features causing multi-collinearity issues, that is, those that are highly correlated with each other (\(r>0.8\)). These features are: \(\textit{TAI}\); \(\textit{OAP}\); \(\textit{OthffP}\); \(\textit{OAT}\); \(\textit{TrffP}\); \(\textit{SCffP}\); \(\textit{HVSD}\); \(\textit{HDirSD}\). Then, for each dependent variable (i.e., TMS dimension), we remove the features violating the linearity assumption, as follows:

  • Dimension 1—Credibility: \(\textit{TSL}\), \(\textit{OthffF}\), \(\textit{SCffF}\), \(\textit{QoM}\), \(\textit{QoME}\);

  • Dimension 2—Specialisation: none;

  • Dimension 3—Coordination: \(\textit{TSL}\), \(\textit{OthffF}\), \(\textit{OthffT}\), \(\textit{SCffF}\).

Finally, we perform regression diagnostics to check the normality and the homoscedasticity of the residuals, by running a Shapiro–Wilk and a Breusch–Pagan test, respectively. For all the results presented in this Section, these two assumptions are met (all \(p>0.05\) for the Shapiro–Wilk and Breusch–Pagan tests; all correlations between observed residuals and expected residuals under normality \(\ge 0.9\)). Since the data fit the assumptions, we use multiple linear regression.

As each member gives only a single TMS self-assessment score at the end of each task, we do not have continuous assessment scores. Moreover, the number of features is significantly higher than the number of teams and, consequently, than the number of assessment scores. Thus, we follow a stepwise approach to select the best predictors for each target variable.

We fit regression models with 1, 2, or 3 modalities (i.e., Audio, Movement, or Spatial only, 2 of them, or all 3 modalities together) and, at most, 4 features (due to the small number of teams compared to the number of features). Then, from all the significant regression models (i.e., those having a p-value \(<0.05\) for every feature), we identify the best ones (i.e., those having the highest \(R^2\) score) for each number of features. We then check the predictive performance of these models by running a 10-run 5-fold cross-validation. These values were chosen according to previous work [10].
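To make the selection procedure concrete, the sketch below enumerates feature subsets of up to 4 predictors, keeps the models whose coefficients are all significant and whose residuals pass the Shapiro–Wilk and Breusch–Pagan checks, ranks them by \(R^2\), and cross-validates the selected predictors. It uses statsmodels and scikit-learn and is a simplified reconstruction of the procedure described above (e.g., it does not reproduce the per-modality grouping), not the authors’ original code.

```python
import itertools
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

def significant_models(X, y, max_features=4, alpha=0.05):
    """Enumerate OLS models over feature subsets of X (a DataFrame);
    keep those with all predictors significant and residuals passing
    normality (Shapiro-Wilk) and homoscedasticity (Breusch-Pagan)."""
    kept = []
    for k in range(1, max_features + 1):
        for cols in itertools.combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(cols)])).fit()
            if ((fit.pvalues.drop("const") < alpha).all()
                    and shapiro(fit.resid)[1] > alpha
                    and het_breuschpagan(fit.resid, fit.model.exog)[1] > alpha):
                kept.append((fit.rsquared, cols))
    return sorted(kept, reverse=True)   # best R^2 first

def cv_r2(X, y, cols):
    """10-run 5-fold cross-validated R^2 of the selected predictors."""
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
    return cross_val_score(LinearRegression(), X[list(cols)], y,
                           scoring="r2", cv=cv).mean()
```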

Table 2 reports the selected regression models for each dimension: Credibility, lines 1–8; Specialisation, lines 9–17; and Coordination, lines 18–22.

Table 2 The significant regression models for each TMS dimension, according to the number of features and modality: Audio (A), Spatial (S), Movement (M)
Table 3 \(\beta \) and I values for the regression models with the highest \(R^2\) discussed in Sect. 7

7 Discussion

For each modality and TMS dimension, we discuss here the regression models with the highest \(R^2\). Since the feature values vary over different ranges, the \(\beta \) values cannot be directly compared with each other. So, the reported \(\beta \) values only provide the direction of the correlation with the corresponding TMS score.

Let us consider a feature \(F_1\) that varies in [0, 1000] and has \(\beta = 0.001\): \(F_1\) has a contribution of 0.001 per unit on the dependent variable (one of the TMS dimensions), that is, on average, it contributes \(500 \times 0.001 = 0.5\). Let us now consider another variable \(F_2\) that varies in [0, 1] and has \(\beta = 1\): \(F_2\) has a contribution of 1 per unit on the dependent variable, that is, on average, it contributes \(0.5 \times 1 = 0.5\). So, the two variables, despite having highly different \(\beta \) values (0.001 vs 1), cause the same amount of change in the dependent variable (0.5).

To quantify the impact of each feature on the TMS dimension scores, and thus enable comparison, we also report a coefficient I, representing the contribution of each feature to the dependent variable, computed by multiplying \(\beta \) by the mean of the feature. The \(\beta \) and I coefficients are reported in Table 3.
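In code, the impact coefficient is a single multiplication (a trivial sketch, with illustrative names):

```python
import numpy as np

def impact(beta, feature_values):
    """I = beta * mean(feature): the average contribution of a feature
    to the predicted TMS score, comparable across feature ranges."""
    return beta * np.mean(feature_values)
```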

7.1 Dimension 1: credibility

Credibility was defined as the trust that the knowledge possessed by any of the other members is correct and accurate (see Sect. 2.1). Audio is the only modality that yields significant unimodal models of this dimension. The feature that best models Credibility (model 1) is \(\textit{TST}\) (total speaking turns). The feature is negatively correlated with Credibility (\(\beta = -0.31\), \(I= -2.06\)), which could indicate that with high Credibility participants need to ask about and discuss the task less, i.e., they trust each other.

When looking at 2 modalities together, the best models combine Spatial with either Audio or Movement.

Concerning Audio and Spatial, with 2 features (model 2), \(\textit{TST}\) is significant (\(\beta = -0.33\), \(I= -2.19\)), as well as \(\textit{OAF}\) (others area occupation frequency), which is positively correlated with Credibility (\(\beta = 0.16\), \(I= 0.40\)). This result could be explained by the presence of helping behaviour and the team members trusting each other’s expertise: a member who needs help enters the area of the person with the needed expertise.

Considering 3 features (model 3), \(\textit{TST}\) and \(\textit{OAF}\) are still significant (\(\beta = -0.36\), \(I= -2.34\) and \(\beta = 0.17\), \(I= 0.43\), respectively), and Credibility is also negatively correlated with \(\textit{PAF}\) (personal area occupation frequency, \(\beta = -0.09\), \(I= -0.24\)). In line with what we observed with the 2-feature model, it could mean that people working alone do not seek the help of the other team members.

Moving to Spatial and Movement, the best model is the one with 4 features (model 6): \(\textit{HV}\) (\(\beta = -4.37\), \(I= -0.74\)), \(\textit{HDist}\) (\(\beta = -0.56\), \(I= -0.69\)), \(\textit{QoMESD}\) (\(\beta = 455.08\), \(I= 1.36\)) and \(\textit{SCffT}\) (\(\beta = -0.02\), \(I= -0.47\)). All these features except \(\textit{QoMESD}\) are negatively correlated with Credibility. That is, team members tend to interact in a calm and steady manner.

The best model for Credibility is obtained when combining the 3 modalities together and considering 4 features (model 8). In particular, the significant features are: \(\textit{TST}\) (\(\beta = -0.33, I= -2.19\)), \(\textit{OAF}\) (\(\beta = 0.17, I= 0.43\)), \(\textit{PAF}\) (\(\beta = -0.13, I= -0.35\)) and \(\textit{HVE}\) (head velocity entropy mean, \(\beta = 157.28, I= 0.08\)). This result is similar to one of the models involving the 2 modalities described above.

To summarise, results show that we can estimate Credibility by looking at how much the team members communicate with each other in a confident way, which results in a low number of speaking turns and movements.

7.2 Dimension 2: specialisation

Specialisation was defined as the differentiation of knowledge between the team members (see Sect. 2.1). Spatial is the only modality that yields significant unimodal models of this dimension. The features that best model Specialisation (model 9) are \(\textit{PAP}\) (personal area occupation percentage) and \(\textit{TrffT}\) (triangular F-formation mean time), which are both positively correlated with Specialisation (\(\beta = 0.0004\), \(I= 0.003\) and \(\beta = 0.008\), \(I= 0.96\), respectively). So, the impact of \(\textit{PAP}\) is very low, while the higher impact of \(\textit{TrffT}\) indicates that in a specialised team the members tend to arrange themselves in space following triangular configurations.

The best model using Audio and Spatial features (model 13) includes \(\textit{AST}\) (average speaking turn, \(\beta = 0.32\), \(I= 0.94\)), \(\textit{CAP}\) (common area occupation percentage, \(\beta = 0.015\), \(I= 0.5\)), \(\textit{OthffT}\) (other F-formation mean time, \(\beta = 0.015\), \(I= 0.07\)) and \(\textit{PAF}\) (personal area occupation frequency, \(\beta = -0.13\), \(I = -0.35\)). This result could mean that when Specialisation is high, the team members spend more time together in the common area of the room, engaging in longer speaking turns (e.g., for explaining tasks).

The best model using Movement and Spatial features (model 15) includes \(\textit{HVESD}\) (head velocity entropy standard deviation, \(\beta = 445.29\), \(I= 1.33\)), \(\textit{HDir}\) (head directness mean, \(\beta = -6.76\), \(I= -1.21\)), \(\textit{OAF}\) (\(\beta = 0.26\), \(I= 0.65\)) and \(\textit{PAF}\) (\(\beta = -0.2\), \(I= -0.5\)). The positive correlation between Specialisation and \(\textit{HVESD}\), and the negative one with \(\textit{HDir}\), might indicate that team members’ movements constantly vary, following non-linear trajectories. These results could mean that members with high expertise go and help other members in their area of expertise.

Considering 3 modalities, the best model (model 17) includes: \(\textit{AST}\) (\(\beta = 0.57\), \(I= 1.68\)), \(\textit{OAT}\) (others area occupation mean time, \(\beta = -0.03\), \(I= -0.08\)), \(\textit{SCffT}\) (semi-circular F-formation mean time, \(\beta = -0.015\), \(I= -0.35\)) and \(\textit{HDist}\) (head distance mean, \(\beta = -0.67\), \(I= -0.83\)). This result is complementary to the findings about Credibility, indicating that people go to others’ areas to share their expertise.

Results show that the movements of the team members across the different areas can be used to estimate Specialisation. In particular, for high values of Specialisation, the team members’ movements continuously vary, following non-linear trajectories.

7.3 Dimension 3: coordination

Coordination was defined as the ability of the members to work together smoothly (see Sect. 2.1). Audio is the only modality that yields significant unimodal models of Coordination. The Audio feature that best models Coordination (model 18) is \(\textit{AST}\) (\(\beta = 0.72\), \(I= 2.13\)). So, in a highly coordinated team, speaking turns last longer.

Moving to Audio and Movement, we obtain 2 significant models, with 2 and 3 features, respectively. The former (model 19) includes \(\textit{TST}\) (\(\beta = -0.48\), \(I= -3.19\)) and \(\textit{HDist}\) (\(\beta = -1.19\), \(I= -1.48\)); the latter (model 20) includes \(\textit{AST}\) (\(\beta = 0.71\), \(I= 2.1\)), \(\textit{TSI}\) (total successful interruptions, \(\beta = -0.52\), \(I= -0.88\)) and, again, \(\textit{HDist}\) (\(\beta = -1.13\), \(I= -1.4\)). Results show that high Coordination corresponds to a small number of speaking turns between the team members and a decreased distance between them.

Another significant model with 2 modalities (model 21), combines Movement with Spatial features. In particular, it includes \(\textit{HDist}\) (\(\beta = -1.34, I= -1.66\)) and \(\textit{CAF}\) (common area occupation frequency, \(\beta = 0.32, I= 1.19\)), similarly to what we obtained by combining Audio and Movement.

The best model for Coordination is obtained by combining 3 modalities (model 22). In line with the results obtained by considering models with 2 modalities, the significant features are \(\textit{TST}\) (\(\beta = -0.37, I= -2.46\)), \(\textit{CAF}\) (\(\beta = 0.22, I= 0.82\)) and \(\textit{HDist}\) (\(\beta = -1.37, I= -1.70\)).

On the whole, results show that, in a highly coordinated team, members engage in fewer and longer speaking turns, with few interruptions. Moreover, the members tend to stay close to each other and perform activities related to the coordination of the tasks (by working in the common area).

8 Conclusion and perspectives

This paper provides the first insights into how to automatically model the three dimensions of TMS (Credibility, Specialisation and Coordination) as a linear combination of nonverbal features of small teams. More specifically, we focus on features of 3 modalities: audio, movement and spatial arrangement.

Linear regression was chosen to obtain explainable models: we aim at high predictive performance while maintaining the readability of the models. We envision that such knowledge could be applied to the development of human-centered applications that monitor a team’s TMS and provide real-time feedback to improve its performance and affective outcomes on collaborative tasks. For example, an intelligent agent could monitor the interactions of a team performing a brainstorming task and intervene if a decrease in Specialisation, Credibility or Coordination between the members is detected. Previous studies showed that the intervention of an agent playing the role of team leader is perceived as potentially improving the TMS of a team [12, 13]. In this case, if, for example, a lack of Coordination is detected, the agent could mediate the interaction and suggest ways to find a common agreement between the members. The features we found to be most relevant in estimating the TMS dimensions can easily be computed in real-time and used by the agent to decide when and how to intervene.

Similarly to previous work on the automatic analysis of team emergent states, we found that turn-taking features are good estimators of TMS. For example, a small number of speaking turns per minute may reflect trust between team members (i.e., they do not need to reply to each other) and so can be used to estimate Credibility. A small number of interruptions, which in turn relates to a longer average speaking length, may reflect fluid interaction and can therefore be used to estimate Coordination. Features related to spatial arrangements are also good estimators of the TMS dimensions, as they might reflect the tendency of team members to seek (Credibility) and provide (Specialisation) help according to their expertise, as well as the fluidity of the interaction (Coordination). In addition, results show that, in general, combining multiple modalities (i.e., audio, movement and spatial) yields better performance than unimodal and bimodal models with the same number of features.

The difficulty in automatically modelling Coordination could be related to the lower reliability of its self-reported scores, indicating that this assessment is difficult for humans too. This result could also be linked to the difficulty, for the team members, of self-assessing Coordination, which could be more easily estimated by external observers. In the future, we will consider collecting additional annotations of TMS given by external observers.

Our work faces the following limitations. First, the WoNoWa dataset contains a relatively small number of observations compared to the set of available features. This often occurs when dealing with human behaviour analysis. We show, however, that the 3 dimensions of TMS can be effectively modelled as a linear combination of multimodal features. Using simple models such as multiple linear regressions also allowed us to interpret the role of each feature. Second, the generalisability of our findings may be limited to tasks similar to those realised in the WoNoWa dataset (i.e., knowledge-based tasks, or “process” tasks according to the classification given in [52]). The TMS dimensions could be better modelled using different nonverbal cues in different tasks, such as decision-making or problem-solving. Additionally, the self-reported scores provided by participants show relatively low variability, which is not desirable when running regression models. Finally, we analysed nonverbal behaviour by averaging feature values over large time windows. As a future perspective, our work could be improved by modelling temporal dynamics, for example by computing histograms of co-occurrences [77, 78]. The previous steps of the Retrieval phase, which were not considered in this work, could also be included in the analyses to investigate the development of TMS over time.