Motion and voice capture setup and calibration
A group of 20 actors was selected and grouped into ten pairs: five experienced pairs (at least five years of acting experience) and five non-experienced pairs (no acting experience). The mean duration of acting experience for the experienced actors was 9.68 years (range 5–25 years), and all of them reported having practised improvisation as an essential part of their acting training. Ma et al. (2006) and Rose and Clarke (2009) argued that experienced actors tend to systematically exaggerate emotional expressions, a trait that emerges from their theatrical training. Roether et al. (2009), by contrast, found no differences between experienced and inexperienced actors in terms of acting quality. Still, Ma et al. (2006) highlighted that exaggerated behaviour can be part of natural expression and that it is sometimes difficult to draw the line between genuine expression and exaggeration. However, Busso et al. (2008) argued that experienced actors typically perform better than inexperienced actors in scripted scenarios. We used both experienced and inexperienced actors in order to address some of these ambiguities in the existing studies regarding actors' experience.
All the actors were English-speaking, UK-born males, with a mean age of 26.12 years (range 17–43 years). Our goal was to capture wide interpersonal variance in emotional actions rather than to explore inter-gender effects, so we recorded only male dyadic interactions. Two actors participated in each session; they knew each other moderately well (e.g., they were colleagues but not partners) and were paid for their time. Before each session the actors were briefed on the purpose of the study and signed a consent form.
Motion capture took place in the School of Psychology at the University of Glasgow, using 12 Vicon MXF40 cameras (Vicon, 2010), which offer online monitoring of 3D motion signals. The system recorded at 120 frames per second (fps) throughout. Audio was captured with a Tascam HD-P2 two-channel digital audio recorder connected to an AKG D7S supercardioid dynamic microphone, recording at a 44.1 kHz sampling rate with 24-bit resolution. During the recording, the audio capture was fully synchronised with the motion capture via the Vicon Analogue Card (Vicon, 2010). The entire capture setup, including floor measurements and the locations of the cameras, microphone and actors, is illustrated in Fig. 1. Vicon Nexus 1.3 (Vicon, 2010) was used for most of the capture operations, including calibration, capturing, storage, and post-processing of the raw capture data.
After calibration of the motion capture system, each capture session began with taking the actors' body measurements and placing 39 retroreflective, 14 mm spherical markers on specific anatomical locations on their bodies. These anatomical locations were defined by the Plug-in Gait model (black dots in Fig. 2a), which is based on the widely accepted Newington-Helen Hayes gait model. The model uses a defined marker set and a set of subject measurements to output joint kinematics and kinetics for each gait-analysis participant (Kadaba et al. 1990; Davis et al. 1991). Supplementary Table 3 describes the exact anatomical locations of the markers.
During the capture session the actors were positioned facing each other, at a distance of approximately 1.3 metres specified by a marked position on the floor. This interpersonal distance varied between 1 and 1.6 metres (Fig. 1) and changed flexibly during the capture trials, depending on how much the actors moved when interacting. At the beginning of each capture trial the actors were asked to return to the start position marked on the floor. The overall interaction space was limited to around 2.5 × 2 metres (Fig. 2b), but since the participants remained within the comfortable personal space defined by Hall (1966), we expected that their natural interaction would not be affected by proxemics.
We captured three types of emotional interaction: angry, happy and neutral. Angry and happy interactions were captured at three intensity levels: low, medium and high. Actors were given relative freedom in expressing the emotions during interactions (Rose and Clarke 2009). They were encouraged to act naturally, but they were instructed to avoid touching each other, and we were careful to give them only verbal instructions rather than performing actions ourselves (Clarke et al. 2005; Ma et al. 2006; Roether et al. 2009). Touch was excluded because it is a powerful affective channel in its own right: people typically use touch to share their feelings with others and to enhance the meaning of other forms of verbal and non-verbal communication (Gallace and Spence 2010), and touch appears very early in human development, naturally becoming a powerful indicator of affect on its own (Harlow 1958).
To help the actors convey angry and happy emotions at different levels of intensity, they were given short and simple emotional scenarios and asked to imagine themselves in those situations. Supplementary Table 1 describes the exact scenarios given to the actors. The order of the scenarios and of the emotions to be conveyed was randomised for each pair. Actors were also instructed to recall any past situations they associated with the relevant emotional scenario to help them induce the emotion. The hypothetical scenarios were based on simple, common situations (Scherer and Tannenbaum 1986). The neutral condition served as a control; here the actors were asked to interact in a neutral, emotionless manner. In all other conditions, the actors received a verbal explanation of what emotion they should play in a specific scenario. We took care to avoid using any symbolic gestures or other non-verbal suggestions. All actors had a short practice period of up to one minute (if required) to refine their actions before each recording.
Actors were asked to use the same dialogues in every capture trial (i.e., whether happy, angry or neutral, at any intensity level), with the dialogues taking the form of either an inquiry (question and answer; actor 1: “Where have you been?”; actor 2: “I have just met with John”) or a deliberation (two affirmative sentences; actor 1: “I want to meet with John”; actor 2: “I will speak to him tomorrow”). We purposely chose inquiry and deliberation, as specified in Krabbe and Walton (1995), as the two dialogue formats because we wanted to ascertain whether the format of dialogue influenced observers’ identification of the emotional interaction between the actors. We also picked relatively neutral words for the dialogues, so that they were easy to articulate in either a happy or an angry manner.
Each capture trial lasted no longer than 6–10 seconds. In each trial, the recording started around 1 second before the actors were given the signal to begin the interaction; the start of each capture trial was signalled by a 1-second-long digital square-wave sound. Recording stopped around 2–3 seconds after the actors finished their interaction. For each pair of actors we completed 10 practice trials before the capture trials. Practice trials were included to give the actors more time to prepare and adjust to their roles, and to let us check that the motion and voice capture system had been calibrated correctly. Immediately after the practice trials we initiated the capture trials, during which we collected the material used for creating the stimulus set. For each actor pair we obtained 84 capture trials, comprising 2 emotions (happy, angry) × 3 intensities (low, medium, high) × 2 dialogue versions (inquiry, deliberation) × 2 actor orders × 3 repetitions, plus 12 neutral conditions (2 dialogue versions × 6 repetitions of each action). This resulted in a total of 756 film clips for all nine couples. There were another 100 data trials from the 10 practice captures for each couple, but these practice captures were excluded from further post-processing.
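As a quick check of the design described above, the factorial breakdown can be reproduced with a few lines of arithmetic. The short MATLAB sketch below (the variable names are ours and purely illustrative) recovers the per-pair and total trial counts reported in the text.

```matlab
% Check of the per-pair and total capture-trial counts described above.
nEmotional    = 2 * 3 * 2 * 2 * 3;       % emotions x intensities x dialogues x actor orders x repetitions = 72
nNeutral      = 2 * 6;                   % dialogue versions x repetitions = 12
trialsPerPair = nEmotional + nNeutral    % 84 capture trials per pair
totalTrials   = trialsPerPair * 9        % 756 film clips across the nine couples
```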
There were five main stages of post-processing: (1) calculating the 3D position data from the 2D camera data; (2) automatically labelling the reconstructed markers based on the Plug-in Gait model; (3) automatically interpolating missing data points; (4) exporting raw coordinates and creating point-light displays in MATLAB 2010 (Mathworks, 2010); and (5) exporting raw audio dialogues for processing in Adobe Audition. The first three operations were executed automatically in Vicon Nexus 1.3 (Vicon, 2010). Creating the final point-light displays required a few additional steps. From the trajectories of the 39 original markers, we computed the locations of ‘virtual’ markers positioned at the major joints of the body. The 15 virtual markers used for all subsequent computations were located at the ankles, knees, hips, wrists, elbows and shoulders, at the centre of the pelvis, on the sternum, and at the centre of the head (white dots in Fig. 2a). The commercially available biomechanical modelling software Vicon BodyBuilder (Vicon, 2010) was used for these computations. A similar approach has been used in the past by Dekeyser et al. (2002), Troje (2002) and Ma et al. (2006). The advantage of this procedure was that it provided a quick and automated way of creating the virtual joint centres for both actors without the need for manual adjustments (Dekeyser et al. 2002; Ma et al. 2006).
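To illustrate the virtual-marker step (the actual computations were performed with the Plug-in Gait model in Vicon BodyBuilder, which also uses the subject measurements), the MATLAB sketch below approximates two of the 15 virtual markers as simple averages of surrounding Plug-in Gait markers; the marker trajectories here are random placeholders rather than real Nexus exports.

```matlab
% Illustrative sketch only: approximate two virtual joint-centre markers as
% averages of surrounding Plug-in Gait markers. The BodyBuilder model used in
% the study additionally applies subject measurements (e.g. knee and ankle
% widths) to offset lateral markers towards the true joint centres.
T = 120;                                  % one second of data at 120 fps
LFHD = rand(T,3); RFHD = rand(T,3);       % placeholder head-marker trajectories
LBHD = rand(T,3); RBHD = rand(T,3);       % (real trajectories would come from Vicon Nexus)
LASI = rand(T,3); RASI = rand(T,3);       % placeholder pelvis-marker trajectories
LPSI = rand(T,3); RPSI = rand(T,3);
headCentre   = (LFHD + RFHD + LBHD + RBHD) / 4;   % virtual marker: centre of the head
pelvisCentre = (LASI + RASI + LPSI + RPSI) / 4;   % virtual marker: centre of the pelvis
```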
After the virtual markers were attached, their 3D (x, y, z) position coordinates were exported from Vicon Nexus 1.3 (Vicon, 2010) as tab-delimited text files. These coordinate files were formatted so that the position coordinates were represented in columns and each frame of data was represented in a row. The coordinate files were imported into MATLAB 2010 (Mathworks, 2010), where an algorithm was applied to generate the final point-light displays. The algorithm was based on that used by Pollick et al. (2001); it converted the 15 virtual markers of each actor into point-light displays, rendered as white dots on a black background from the side view, as seen in Fig. 2c. The algorithm exported the point-light displays in the Audio Video Interleave (AVI) format, with a frame size of 800 by 600 pixels. The frame rate of the exported displays was reduced from the original 120 fps to 60 fps, because MATLAB 2010 (Mathworks, 2010) and Adobe Premiere 1.5 (Adobe Systems, 2004), which were used for creating the final displays, only allowed editing of movies up to 60 fps.
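A minimal MATLAB sketch of this rendering step is given below. It is not the original algorithm: it assumes a hypothetical tab-delimited file coords.txt with one frame per row and consecutive x, y, z columns per marker, plots the markers as white dots on a black 800 × 600 canvas, and drops every other frame to go from 120 fps to 60 fps; the actual column ordering, viewpoint and AVI export of the original code may differ.

```matlab
% Minimal sketch of the point-light rendering step (not the original algorithm).
xyz = dlmread('coords.txt', '\t');        % frames x (nMarkers*3) coordinate matrix
xyz = xyz(1:2:end, :);                    % 120 fps -> 60 fps by keeping every other frame
nMarkers = size(xyz, 2) / 3;
figure('Color', 'k', 'Position', [100 100 800 600]);    % black 800 x 600 canvas
for f = 1:size(xyz, 1)
    pts = reshape(xyz(f, :), 3, nMarkers)';             % one x, y, z row per marker
    plot(pts(:, 2), pts(:, 3), 'w.', 'MarkerSize', 18); % assumed side view (y-z plane)
    axis equal; axis off;
    drawnow;                              % each frame could be grabbed here and written to an AVI file
end
```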
The audio dialogues recorded with the Vicon Analogue Card were all saved by Vicon Nexus 1.3 (Vicon, 2010) in the Audio Interchange File Format (AIFF), and each audio dialogue was automatically linked with the corresponding capture trial. Adobe Audition 3 (Adobe Systems, 2007) was used to post-process the dialogues. Every audio dialogue was first amplified by 10 dB and then noise reduction was applied. Following this, all audio dialogues were normalised to create a consistent level of amplitude and to obtain an average volume of around 60 dB. Finally, each audio dialogue was exported as a Waveform Audio File Format (WAV) file with a 44.1 kHz sampling rate and 24-bit resolution.
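The chain above was applied in Adobe Audition; purely as an illustration of the gain and normalisation steps, a rough MATLAB analogue might look like the sketch below. The file name dialogue.aif is hypothetical, the noise-reduction step is omitted, and simple peak normalisation stands in for Audition's loudness-based normalisation.

```matlab
% Rough, illustrative analogue of the audio post-processing chain.
[x, fs] = audioread('dialogue.aif');           % AIFF dialogue exported from Nexus
x = x * 10^(10/20);                            % apply +10 dB of gain
% (noise reduction omitted: Audition's algorithm has no simple one-line equivalent)
x = 0.9 * x / max(abs(x(:)));                  % peak-normalise to a consistent level
audiowrite('dialogue.wav', x, fs, 'BitsPerSample', 24);   % 24-bit WAV at the original sampling rate
```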
Creation of final stimulus set
Adobe Premiere 1.5 (Adobe Systems, 2004) was used to create the final stimulus set. The AVI point-light displays produced by MATLAB 2010 (Mathworks, 2010) were imported into Adobe Premiere 1.5 (Adobe Systems, 2004) together with the corresponding WAV dialogues post-processed with Adobe Audition 3 (Adobe Systems, 2007). Initially, each point-light display was combined with its corresponding WAV dialogue. The start of the recorded interaction was signalled by a sound (a one-second-long, square-wave buzzer signal), and the end of the recording occurred 500 ms after the end of the actors’ dialogue. The length of the final, truncated display varied between 2.5 and 4.5 seconds. Thus the edited clip represented the original interaction of the actors in its entirety, after shortening the initial and final segments in which nothing occurred. The eliminated segments before and after the selected clip had been recorded only to give us longer ‘technical margins’ for adjusting the length of the final displays during post-processing.
All displays with truncated start/end points were exported to AVI format in three versions: auditory-only (dialogues), visual-only (point-light displays) and audio-visual (dialogues combined with point-light displays). The final, non-validated stimulus set was composed of 238 unique displays, which consisted of 9 actor couples, 2 emotions (happy and angry), 3 intensities (low, medium, high), 2 dialogue versions (inquiry, deliberation) and 2 repetitions, plus 26 neutral displays. Each display was created in three modality formats: visual point-light displays, auditory dialogues, and a combination of point-light displays and dialogues. The final count of all displays in the stimulus set, across the three modality formats, was therefore 714.
The stimulus set (Supplementary Material 2) can be downloaded from the following source: http://motioninsocial.com/stimuli_set/. The stimulus set is organised into nine folders, each labelled with a letter representing a different actor couple. Within each folder, every display is represented by five files:
three AVI files in audio-visual, auditory, and visual versions;
one TXT file with unprocessed motion capture coordinates for the corresponding display (Supplementary Material 1 in the Appendix includes an R routine with an exact description of what each column represents);
one WAV file with unprocessed dialogue capture for the corresponding display.
Supplementary Table 3 includes detailed characteristics of each display. For easier browsing, this table is also available as a Microsoft Excel XLS file together with Supplementary Material 2.
It is worth noting that we created only 238 unique clips from the 756 original captures because of the technical quality of the motion and voice recordings and the quality of the acting. The most common issues during motion capture were marker occlusion, audio distortion, visual noise caused by ambient light in the capture volume, and errors made by the actors in the dialogues. These issues lowered the quality of the displays or made their further processing for the final stimulus set impossible. We were aware prior to the experiment that such issues might occur, and hence recorded the same interactions six times to maximise the number of usable, high-quality displays.