Multisensory integration of musical emotion perception in singing

We investigated how visual and auditory information contributes to emotion communication during singing. Classically trained singers applied two different facial expressions (expressive/suppressed) to pieces from their song and opera repertoire. Recordings of the singers were evaluated by laypersons or experts, presented to them in three modes: auditory, visual, and audio–visual. A manipulation check confirmed that the singers succeeded in manipulating the face while keeping the sound highly expressive. Analyses focused on whether the visual difference or the auditory concordance between the two versions determined perception of the audio–visual stimuli. When evaluating expressive intensity or emotional content, a clear effect of visual dominance emerged. Experts made more use of the visual cues than laypersons. Consistency measures between uni-modal and multimodal presentations did not explain the visual dominance. The evaluation of seriousness served as a control: the uni-modal stimuli were rated as expected, but the multisensory evaluations converged without visual dominance. Our study demonstrates that long-term knowledge and task context affect multisensory integration. Even though singers' orofacial movements are dominated by sound production, their facial expressions can communicate the emotions composed into the music, and observers do not fall back on the auditory information instead. Studies such as ours are important for understanding multisensory integration in applied settings.
Supplementary Information: The online version contains supplementary material available at 10.1007/s00426-021-01637-9.


I) … were averaged to build the composite score of emotion expression.
V) Figure S1: Histograms for the ratings of crossmodal stimuli in the expressive face condition for (A) laypersons and (B) experts. Figure S2: Histograms for the composite score for laypersons and experts.

Results Section
VII) Report of the three-way ANOVAs taking all three presentation modes into account at the same time (Tables S4, S5, S6).
VIII) Further calculations of the reliability of evaluations (ICC, inter-rater agreement; Shrout & Fleiss, 1979) (Tables S7, S8, S9).

Stimulus list (see Appendix Table S1): Composer | Piece | Selection (Bars) | Opus | Singer.

Composite-score table: rows are the content scales (e.g., Sadness), columns are the stimuli (No. 1–15); an "X" marks an included cell. Note. All cells per column containing an "X" were included in the composite score, whereas empty cells were excluded. See Appendix Table S1 for a list of stimuli.
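A minimal sketch of this masked averaging, assuming the ratings and the "X" pattern are available as two same-shaped matrices; the object names scores and included are hypothetical and not taken from the original analysis scripts:

```r
## Hypothetical objects:
##   scores   - numeric matrix, content scales (rows) x stimuli 1-15 (columns)
##   included - logical matrix of the same shape, TRUE where the table shows an "X"
scores[!included] <- NA                       # exclude cells without an "X"
composite <- colMeans(scores, na.rm = TRUE)   # average the included scales per stimulus
```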

V) Figure S1
Histograms for the ratings of crossmodal stimuli in the expressive face condition for laypersons and experts. The distribution of expressive intensity was right-skewed, but the content-based emotion categories showed high numbers of "not-at-all" ratings.

VII) Report of the Three-Way ANOVAs Taking All Three Presentation Modes into Account at the Same Time
As a supplement, we provide here results that take all data depicted in Figure 3 of the main text into account, fitted with a three-way ANOVA with the factors presentation mode (A, V, AV), facial expression (expressive, suppressed), and expertise (laypersons, experts).
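A minimal sketch of how such a mixed three-way ANOVA could be specified in base R, assuming a data frame d with one aggregated rating per rater and cell; the column names rating, rater, mode, face, and expertise are illustrative assumptions, not the authors' actual code:

```r
## Hypothetical data frame `d`: one aggregated rating per rater, presentation
## mode (A, V, AV), and facial expression (expressive, suppressed);
## `expertise` (layperson, expert) is a between-subject factor.
d$rater     <- factor(d$rater)
d$mode      <- factor(d$mode)
d$face      <- factor(d$face)
d$expertise <- factor(d$expertise)

## Mixed ANOVA: mode and face vary within raters, expertise varies between raters.
fit <- aov(rating ~ mode * face * expertise + Error(rater / (mode * face)), data = d)
summary(fit)
```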

VIII) Reliability of Evaluations
We provide here information on the reliability of evaluations (Shrout & Fleiss, 1979; ICC(2,1)). We used different ways of calculating ICCs simply to make our results comparable to other studies; however, we consider the first account the most appropriate one. For the first account, we calculated ICCs to estimate inter-rater agreement, with k raters and 15 objects (stimuli), separately for each mode of presentation, each of the two interpretations (expressive, suppressed facial expression), and each scale (eleven content scales, one intensity scale). ICCs were based on individualized z-scores of the raw ratings. We decided to separate the ratings because of the nested structure of the data (full repeated-measures design).
This account results in separate ICCs for each item of the scale in the different conditions (presentation mode, facial expression). We also report the mean for the specific conditions across the eleven content-based items and the means for specific ratings across the different conditions (presentation mode, facial expression). Second, we calculated ICCs without taking the nested structure into account: ICCs were calculated across all scales and stimuli, but separately for each condition of the full 3-by-2 (presentation mode by facial expression) design. All ICCs were computed in R (R Core Team, 2019) with the irr package (Gamer, Lemon, Fellows, & Singh, 2019) as two-way random-effects models, with reliability defined as inter-rater agreement. Table S5 is analogous to Table S4. Although not reported here, the confidence intervals were rather large for the data of both groups, and the numerical differences between laypersons and experts are mostly within the confidence ranges of the estimates. Some commonalities appear in both data sets: some variables result in higher agreement (1−anger, 5−pain, 9−contempt, 10−desperation, 11−sadness) and others in lower agreement (2−cheekiness, 8−joy); in this respect, negative emotions seem to be easier to decode than positive emotions. Reliability seems to be higher when expressive faces are presented (V1, A1V1) than when expressions are suppressed (V0, A0V0), but is about the same for visible expressive faces (V1, A1V1) and the auditory stimuli (A0, A1). The content-based items (R1 to R11) seem to have higher overall reliability than the intensity rating (R12). Compared with Tables S4 and S5, the reliability measures in Table S6 are slightly higher than the mean (R1−R11) and more similar between conditions and groups.
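A minimal sketch of this first account, assuming a rating matrix with the 15 stimuli in rows and the k raters in columns; the object names ratings and z_ratings are illustrative and not taken from the original analysis scripts. In the irr package, icc() with model = "twoway", type = "agreement", and unit = "single" corresponds to ICC(2,1) defined as inter-rater agreement:

```r
library(irr)  # Gamer, Lemon, Fellows, & Singh (2019)

## Hypothetical matrix `ratings`: 15 stimuli (rows) x k raters (columns),
## raw ratings of one scale in one condition (e.g., sadness in A1V1).
## Individualized z-scores: standardize each rater's ratings (here within the
## example matrix; in the original analysis the standardization may have been
## computed across a rater's full set of ratings).
z_ratings <- scale(ratings)

## ICC(2,1): two-way random-effects model, absolute agreement, single rater.
icc(z_ratings, model = "twoway", type = "agreement", unit = "single")
```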