Interaction analysis has become central to psychological research and practice. Every social interaction is characterized by basic rules and principles of interpersonal communication, a dynamic process in which verbal and non-verbal messages overlap and which is governed by principles of reciprocity and interdependence (Burgoon et al. 2016; Gouldner 1960). Regarding groups, for example, interaction describes the simultaneous and sequential verbal and non-verbal behavior of the group members acting in relation to each other and to the respective task and its goal (McGrath and Altermatt 2001). Following psychological theories such as the Luhmannian approach (e.g., Luhmann 2018), interpersonal communication is key to understanding how decisions evolve above and beyond individual perceptions and therefore needs to be put into the focus of research (cf. Cooren and Seidl 2019). Interaction analysis is the systematic observation of such interactions between individuals and makes it possible to gain access to mediating interaction processes by breaking down the interaction into its smallest units (e.g., the statement of a person). Usually, a predefined coding system is used as a basis, with which each observable behavioral unit is assigned to a category (e.g., the category “solution naming”) of the coding system (McGrath and Altermatt 2001; Meinecke and Kauffeld 2016).

In both group and dyadic interactions (e.g., team meetings, appraisal interviews, or therapeutic sessions), interaction analysis can be used to analyze actual behavior recorded on video or audio in a very detailed and systematic way, which can be achieved neither via self-reports such as questionnaires nor via unsystematic observations, not even by experts, due to observer errors (Meinecke and Kauffeld 2016; see also Eck et al. 2008). However, despite the immense added value of interaction-analytical observation of behavior patterns in various contexts, the analysis effort with current methods is very laborious and time-consuming compared to subjective methods. As a result, only a few researchers actually use the methodology of interaction analysis, and even fewer practitioners apply it in their organizational practice (Meinecke and Kauffeld 2016; Kauffeld and Meinecke 2018). Considering this circumstance, it is desirable to keep up with the technical advancements in digitalization and automation, automating these processes step by step in order to provide researchers and practitioners with the best possible support via low-threshold tools that are easy and quick to use and thus encourage them to practice (more) interaction analysis.

Although there have been great advances in signal processing and machine learning, the popular annotation programs do not offer automated processes and analyses (e.g., speech segmentation, transcription, emotion recognition, or gesture detection) to support researchers and practitioners with the annotation process, which would result in enormous time savings. Among the long list of available annotation software, to the best of our knowledge, only the ELAN tool (Wittenburg et al. 2006) contains some automated processes for audio and video signals, such as speech segmentation (Auer et al. 2010; Matheja et al. 2013). However, the algorithms of the ELAN tool are developed for single-channel audio recordings, which, in the case of a group interaction, is still a challenging task for audio processing and limits both the quality and the complexity of the applicable analysis methods.

Therefore, in this work we present a group interaction annotation tool (GIAnT), which we developed explicitly for the analysis of face-to-face group interactions based on multichannel high-quality audio recordings, but which is not limited to those. GIAnT offers the possibility to integrate arbitrary coding schemes and includes an algorithm for speaker activity detection that automatically segments the individual speech utterances of all persons involved in the interaction. This can lead to significant time savings and improves the accuracy of the marked speech segments compared to manual segmentation. Moreover, with the aid of high-quality audio recordings, more powerful analysis tools can be applied to the data, improving the annotation and analysis process with respect to both quality and time investment.

In the following, we briefly introduce the idea and benefits of an interaction analysis and point out the various challenges of a typical interaction analysis process. Afterwards, the concept of GIAnT, including possible solutions, is showcased, followed by conclusions that include further necessary steps in this domain.

1 Interaction analysis in psychological research and practice

An interaction analysis is a powerful and systematic method to shed light on communication processes. For both research and practice, interaction analysis results can reveal important implications. Regarding groups, for example, interaction analysis helps researchers to investigate the underlying processes in team meetings, which might mediate the relationships between input (e.g., the persons’ attributes) and outcome (e.g., performance) variables. For practitioners involved in team development, identifying patterns in a team’s meeting processes, such as circles of dysfunctional statements that hinder solution-focused processes in the respective team, can help to elaborate specific actions for improvement together with the team (cf. Meinecke and Kauffeld 2016). Beyond group interactions, an interaction-analytical approach makes it possible to analyze what happens in interpersonal interactions, how people talk to each other, how certain constructive or destructive conversational behavior is initiated, how information is exchanged, or how decisions are made (cf. Kauffeld and Meinecke 2018).

1.1 General procedure for interaction analysis

The basis for an interaction analysis is usually a coding system that assigns a category to each observation unit, e.g., act4teams (Kauffeld and Lehmann-Willenbrock 2012; Kauffeld et al. 2018) for the analysis of team meetings. After defining the behaviors that are of interest in the respective research or practice field and selecting useful sense units to which a unique function can be assigned (from speaker turns to single statements) on the basis of an existing or newly created coding scheme, interaction data are usually gathered via video or audio recordings. Subsequently, trained and sufficiently consistent coders have to annotate the entire video and speech material, which is typically done manually, and therefore very time-consuming, by means of (commercial) annotation software (e.g., Interact, Mangold 2017; Observer XT, Zimmerman et al. 2009; for an extensive review of interaction coding/analyzing software applied to video, audio, and/or text interaction data see Glüer 2018). After the annotation process, the single behaviors can be analyzed, e.g., by means of sequence, pattern, or cluster analyses to identify recurring interaction patterns, or they can be summarized (e.g., as frequencies) and associated with other variables such as team success (Glüer 2018; Meinecke and Kauffeld 2016).
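To make this last step concrete, the following minimal sketch (in Python; the column names “speaker”, “start_s”, “end_s”, and “code” are illustrative assumptions and do not reproduce the export format of any particular annotation software) summarizes coded sense units into category frequencies, overall and per speaker:

```python
# Minimal sketch: turning a table of coded sense units into per-category frequencies.
# The column names and example codes are assumptions for illustration only.
import pandas as pd

# One row per coded sense unit, e.g., as exported from the annotation software.
annotations = pd.DataFrame({
    "speaker": ["A", "B", "A", "C", "B"],
    "start_s": [0.0, 4.2, 9.8, 15.1, 21.3],
    "end_s":   [4.0, 9.5, 14.9, 20.8, 27.0],
    "code":    ["problem", "solution", "solution", "complaint", "solution"],
})

# Frequencies per category (overall and per speaker), which can subsequently be
# related to outcome variables such as team success in a separate analysis.
freq_total = annotations["code"].value_counts()
freq_by_speaker = annotations.groupby(["speaker", "code"]).size().unstack(fill_value=0)

print(freq_total)
print(freq_by_speaker)
```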

1.2 Typical speaker detection challenges (… which could be automated)

The general procedure for (preparing) the analysis, whether in research or practice, requires that all audio/video material be viewed and annotated (coded). In this process, the first step at the highest level is the segmentation and allocation of the individual speech components to the corresponding persons (i.e., speaker turns). Particularly in group interactions, however, it can be difficult to distinguish the voices of many people.

For example, in Bavendiek et al. (2015), we observed a team of five students over a period of four months during a product development project in order to examine their interaction. The first step for the interaction analysis was to cut the material into speaker units (before cutting it into sense units for each speaker). The main challenge when relying just on the video signals was to detect who was actually speaking. The similar voices of the all-male team members were often hard to distinguish, and moreover, the team members occupied the entire room in order to fulfill their task. This led to limitations in the recorded speech quality and, even worse, not all faces were always visible on the videotape, so that the actual speaker was not unambiguously identifiable (cf. Fig. 1).

Fig. 1

Overview of GIAnT including example data of five persons (5 microphone channels) and a visualization of the current scene. All speech portions of each speaker are segmented. The colors yellow, green, and red denote non-annotated segments, annotated segments, and the selected segment, respectively. GIAnT also offers a mute and solo function for each channel, as shown by the muted channel 5

In addition, exact cutting, which would allow simultaneous speaking or silence to be analyzed, as in Meinecke et al. (2016), is often omitted for reasons of time. Moreover, while the content interpretation and categorization of individual statements (sense units) must be carried out by trained experts, the preceding segmentation and allocation of statements to persons is a necessary step that does not require any content expertise. As such, this segmentation of speaker turns is a step that can be automated.

2 Group Interaction Annotation Tool (GIAnT)

Motivated by the previously described benefits of a detailed interaction analysis and the disproportionate time investment it currently requires, we developed GIAnT, which we introduce in the following.

2.1 Goals and development

GIAnT is not only a piece of software for the transcription and coding of a meeting; it also comprises a recording strategy with the aim of improving both the quality and the speed of the evaluation of a meeting as well as the quality of the audio signals for further processing. In the following, we describe the three main contributions of GIAnT, which are also summarized in Table 1.

Table 1 Overview of the main functionalities of GIAnT and the corresponding problem descriptions

2.1.1 Recording setting

In order to improve the audio quality and simplify the coding of a meeting, GIAnT is designed for multichannel close-talk audio recordings. In practice, it is quite common to use only one or two video cameras and perhaps an additional table-top microphone for the data acquisition of meetings, but this can lead to several problems. Persons can be hidden by a suboptimal positioning of the camera, which makes the attribution of spoken utterances to the associated speaker more complicated. Furthermore, additional information (e.g., the content of a “side-talk”) gets lost due to overlapping speech or low audio quality, which makes a transcription, especially during double-talk situations, difficult for humans and almost impossible for machines.

Therefore, GIAnT is designed for multichannel audio recordings with headsets or lapel microphones, which are assigned to each participant and recorded in a separate channel. Thus, we obtain speech signals of high audio quality from each person, which facilitates transcription and comprehension for humans, even during double-talk or “side-talks”, and allows the use of further analysis tools (e.g., for automatic transcription or automatic emotion recognition). The prerequisites for using GIAnT are that each microphone is closest to its assigned participant and that the recorded signals are synchronized in time. The latter can be achieved with a multitrack recorder (e.g., Zoom H6). We further recommend using a sampling rate of at least 16 kHz and omnidirectional headsets (e.g., Shure MX53), since they ensure a robust recording of the speech signals and prevent user errors. However, any other kind of headset or lapel microphone can be used as well, as long as the two prerequisites are met.
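The following minimal sketch shows, under the assumption that the multitrack recorder delivers a single time-synchronized multichannel WAV file (the file name “meeting.wav” and the channel order are purely illustrative), how such a recording can be split into one time-aligned mono file per participant for further processing:

```python
# Minimal sketch, assuming one synchronized multichannel WAV file with one
# channel per participant. File names and channel order are illustrative.
import soundfile as sf

audio, sample_rate = sf.read("meeting.wav")   # shape: (num_samples, num_channels)
assert sample_rate >= 16000, "a sampling rate of at least 16 kHz is recommended"

# Split the recording into one mono file per participant, keeping all channels
# time-aligned so that crosstalk-aware processing remains possible later on.
for channel_index in range(audio.shape[1]):
    sf.write(f"speaker_{channel_index + 1}.wav", audio[:, channel_index], sample_rate)
```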

2.1.2 Interaction overview

Most coding software is based on video signals; a few tools additionally use a single audio channel. A video recording is very helpful in order to see what is (visually) going on in a meeting and of course improves the quality of coding. However, coding based only on a video signal withholds the important dimension of time. Fig. 1 depicts an example of a segmentation and annotation of a five-person conversation in GIAnT. With the aid of close-talk recordings and by depicting each person in a separate audio channel, GIAnT provides a screen view of around 30 s of the dialog, so that, in combination with a video, it is easy to capture the course of the meeting and to get a comprehensive overview of the current scene (e.g., “Who is talking?”, “Who is interrupting?”). This, in combination with the segmented speech components of each speaker, makes it much easier to capture the current state of the considered meeting (cf. Fig. 1).
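Purely as an illustration of this kind of stacked per-channel view (this is not GIAnT’s own rendering code; the file name and window position are assumptions), a roughly 30-second window of all microphone channels can be plotted as follows:

```python
# Illustrative sketch only: rendering a roughly 30-second window of all
# microphone channels as stacked waveforms, similar in spirit to the
# per-channel overview described above.
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

audio, fs = sf.read("meeting.wav")       # (num_samples, num_channels), assumed multichannel
start_s, window_s = 120.0, 30.0          # show 30 s starting at minute 2 (arbitrary example)
segment = audio[int(start_s * fs):int((start_s + window_s) * fs), :]
t = np.arange(segment.shape[0]) / fs + start_s

fig, axes = plt.subplots(segment.shape[1], 1, sharex=True, figsize=(12, 6))
for m, ax in enumerate(axes):
    ax.plot(t, segment[:, m], linewidth=0.5)   # one waveform row per microphone channel
    ax.set_ylabel(f"ch {m + 1}")
axes[-1].set_xlabel("time [s]")
plt.tight_layout()
plt.show()
```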

2.1.3 Automatic speaker activity detection

The most time-consuming process for which no knowledge about psychology or coding is needed is the segmentation process, in essence: “Who is talking at which time?”. For this purpose, GIAnT includes a multichannel automatic speaker activity detector (Meyer et al. 2018), which marks coherent spoken words of each speaker as a segment, as illustrated in the conversation example in Fig. 1.

In this context, the multichannel recordings pose a particular challenge. Even if the use of (wireless) headsets or lapel microphones delivers high-quality speech recordings, it is well known that the speech of a given speaker does not only couple into the assigned microphone, but also into all other microphone channels. This effect is known as crosstalk, and its energy level depends on the positions of the speakers, the loudness of the spoken speech, and the microphone level settings. Microphone channel number 5 in Fig. 1, which contains no speech portions of speaker 5, demonstrates the effect of crosstalk in a real recording of a group conversation. As a consequence, crosstalk complicates the segmentation process, since a common single-channel voice activity detection does not achieve suitable results. For that reason, we developed a multichannel speaker activity detection (MSAD), which considers not only one specific microphone channel but all recorded microphone channels together. With the aid of the MSAD, we can obtain a suitable pre-segmentation of the speech turns of each speaker. Fig. 2 shows an example of the performance of the developed MSAD in a challenging three-person crosstalk scenario. For a detailed description and evaluation of the MSAD, please refer to Meyer et al. (2018); a deliberately simplified sketch of the underlying idea is given below.

Fig. 2

Example of the multichannel speaker activity detection (MSAD) for a three-person conversation including single-, double-, and triple-talk. Blue denotes the target speech signal \(s_{m}(n)\) of each channel \(m=\{1,2,3\}\) and time index \(n\), while black denotes the real microphone signals \(y_{m}(n)\), which also contain crosstalk from the other active speakers. The results are depicted by means of the colors of the background areas: true speech activity (green), true speech pause (white), wrong speech activity (gray), and wrong speech pause (red)
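To convey the basic idea, though not the actual MSAD algorithm of Meyer et al. (2018), the following deliberately simplified sketch marks a channel as active in a given frame only if its frame energy is both above an absolute threshold and dominant relative to the other channels, so that crosstalk alone does not trigger activity:

```python
# Deliberately simplified illustration of a multichannel speaker activity
# detection idea; this is NOT the MSAD of Meyer et al. (2018). A channel is
# marked active only if its frame energy is above an absolute threshold and
# dominant compared to the other channels, so weaker crosstalk is rejected.
import numpy as np

def simple_msad(audio, fs, frame_s=0.02, abs_thresh=1e-4, rel_thresh=0.5):
    """audio: (num_samples, num_channels) time-synchronized close-talk signals."""
    frame_len = int(frame_s * fs)
    num_frames = audio.shape[0] // frame_len
    num_channels = audio.shape[1]
    activity = np.zeros((num_frames, num_channels), dtype=bool)
    for k in range(num_frames):
        frame = audio[k * frame_len:(k + 1) * frame_len, :]
        energy = np.mean(frame ** 2, axis=0)            # per-channel frame energy
        dominant = energy >= rel_thresh * energy.max()  # suppress weaker crosstalk
        activity[k] = dominant & (energy > abs_thresh)  # require a minimum level
    return activity  # boolean speaker-activity matrix, frames x channels

# Example usage (with "audio" and "fs" as in the recording sketch above):
# activity = simple_msad(audio, fs)
```

The real MSAD is considerably more elaborate; this sketch merely shows why evaluating all channels jointly helps to reject crosstalk.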

By means of the MSAD, GIAnT offers a precise pre-segmentation of all recordings, improving both the quality and the speed of the speaker segmentation process. Since no automatic method is perfect, the user can verify and edit the pre-segmentation easily via drag and drop or shortcuts. At this point, the overview provided by GIAnT, based on the multichannel recordings, again simplifies the correction of the pre-segmentation.

2.2 Application and access

GIAnT is developed for a fast segmentation and annotation of multichannel audio recordings with a focus on the psychological analysis of face-to-face group interactions (cf. Fig. 2). A single video signal, which is automatically time-aligned with the audio signals, can additionally be integrated into the tool to support the annotation. Inspired by Roy and Roy (2009), GIAnT includes an automatic multichannel speech segmentation that offers a pre-segmentation, which can easily be verified and corrected by the user. It is possible to select, edit, and play back a segment both via shortcuts and via drag and drop, which saves time. Furthermore, the annotation of a segment can again be done with the aid of shortcuts or by direct input into the annotation field. For this purpose, GIAnT offers the opportunity to define and use customized coding schemes, whose categories can be grouped in classes and arbitrarily assigned to keys on the keyboard. In addition, GIAnT verifies whether an annotation is valid by comparing it with the defined coding scheme and also allows a comment to be written for each segment. Annotated segments are colored green, while non-annotated segments are colored yellow in order to provide a better overview. Apart from that, it is possible to select the closest segment that has not yet been annotated via a shortcut. The time code is depicted in hours, minutes, seconds, and milliseconds. The data can be exported as an XLSX file to obtain a tabular overview in Microsoft Excel (see the sketch below), or the results can be imported into analysis software (e.g., Interact) for subsequent interaction analysis such as cluster, pattern, or sequence analyses. Since plenty of professional software for statistical analyses exists, GIAnT itself does not offer any analysis methods.
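As an illustration of this kind of tabular export (the exact column layout of GIAnT’s XLSX file is not reproduced here; all column names and values are made up), one row per annotated segment can be written to an Excel file as follows:

```python
# Sketch of a tabular segment export; column names and values are illustrative
# assumptions and do not reproduce GIAnT's actual XLSX layout.
import pandas as pd

segments = pd.DataFrame({
    "speaker": [1, 2, 1],
    "start":   ["00:00:03.120", "00:00:07.840", "00:00:12.400"],
    "end":     ["00:00:07.500", "00:00:12.050", "00:00:15.900"],
    "code":    ["problem", "solution", "question"],
    "comment": ["", "interrupts speaker 1", ""],
})

segments.to_excel("annotated_segments.xlsx", index=False)  # requires openpyxl
```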

GIAnT is developed and implemented for noncommercial use. The software is free of charge; the source code and a pre-compiled Windows version are available. For access, please refer to GitHub (https://github.com/ifnspaml/GIAnT).

3 Conclusions and future work

In this paper, we presented a group interaction annotation tool (GIAnT), which contains an automated pre-segmentation of the speech portions of all participants and allows single speech components to be annotated with existing or customized coding schemes. The main focus of the GIAnT concept is on the use and recording of multichannel high-quality audio signals of interpersonal interactions, since this facilitates the annotation and analysis process for both humans and machines. Moreover, good audio and video recordings form a promising basis for further automated analysis methods that improve the quality of annotation and evaluation while decreasing the time investment of researchers and practitioners. With today’s technology, e.g., with wireless, attachable microphones, it is possible to produce good recordings without disturbing the focal subjects, which makes this approach practicable in the field as well. In future work, we will additionally integrate methods for multichannel speaker interference reduction into GIAnT to further enhance the quality of the recorded speech signals, especially during double-talk situations, by eliminating the crosstalk in each microphone channel. Thus, further analyses of the speech signals (e.g., automated emotion recognition or general voice analysis) can be significantly improved. Due to the provided audio signals with high speech quality, analysis tools such as PRAAT can be used in combination with GIAnT as well, whereby GIAnT can also support the detection of relevant speech turns for these analyses by means of the MSAD.

However, GIAnT as well as ELAN represent only a first step towards allowing practitioners and researchers deeper insights into group interactions with a justifiable expenditure of time. The good news is that the automatic analysis of audio-visual meetings has meanwhile become a vital research field in machine learning, in which algorithms are developed for aspects such as automatic meeting transcription, audio-visual emotion recognition, relationships and attention of the participants, gestures, eye contact, and many more. Nevertheless, even though significant progress has been made in this area in recent years, the analysis of meetings is still one of the most challenging tasks for audio-visual signal processing. Therefore, it is very important for users to improve their data acquisition with the aim of providing high-quality audio and video signals, since these form the basis of all further processing.