1 Introduction

Multisensory experiences have mostly been studied in psychology, where interactions between smell and taste, among others, have been explored. Multisensory integration occurs between two or more sensory modalities, including touch, sound, vision, smell and taste. Mulsemedia (Multiple Sensorial Media) incorporates more than visual and audio information: it includes new media types such as haptics, olfaction and gustation. It has opened up new opportunities as well as challenges in research, academia, industry, and for immersive technologies [18, 68].

In this emerging field, there have been several explorations of the practicality and feasibility of integrating different media types into applications. Thanks to the advent of novel technologies and devices that artificially produce sensory effects, along with systems able to deliver this kind of experience to users [57], the addition of multiple sensory effects has become essential to improving immersion and presence in the user’s environment.

Ghinea et al. [17] argue that this can be achieved through the user’s perception of sensation, described as the result of a complex set of processes in which biological sensors send structured electrical signals to the brain (except for specific chemoreceptors), which in turn frames unconscious sensation patterns. These patterns help to determine whether an upcoming sensory input is authentic. Additionally, Möller and Raake [41] consider that perception goes through two stages before being completely realized: (i) conversion of stimuli by the respective sensory organ into neural signals, and (ii) processing and transmission of these neural signals from the central nervous system to the cerebral cortex, resulting in specific perceptions in the person’s perceptual world. All of this shapes what is called the users’ Quality of Experience (QoE).

QoE stems from the combination of the fulfillment of users’ expectations regarding utility and the level of enjoyment given their personalities and current state [4]. Users exposed to multisensory experiences have reported a noticeable increase in QoE [27, 42, 46, 47, 49, 71, 76, 79, 80]. Although there have been studies in the cognitive and digital worlds regarding the perception of individual senses, there are hitherto unsettled questions when it comes to crossmodal correspondences. In a crossmodal correspondence, a stimulus in one modality can be associated with a stimulus in another. For example, in the non-digital world, the smell of lemon and high-pitched audio can be associated with sharp objects [21, 64]. However, it is not yet clear whether the multisensorial effect of component modalities generated from such crossmodal associations would enhance users’ QoE in the digital world.

In this article, we report on an experiment designed to explore whether cross-modally mapped multisensorial effects (olfaction, sound, and auto-generated haptic) from visual features of videos enhance the users’ QoE. We hypothesize that taking into account crossmodal mappings whilst creating mulsemedia systems could lead to more immersive and effective experiences for the users.

This article is organized as follows. Section 2 reviews related work, focusing on auditory-visual crossmodal correspondences research in psychology, computer graphics and human-computer interaction, mulsemedia and QoE. Section 3 presents the user study on QoE in crossmodal mulsemedia. Section 4 reports the results and discusses the work. Finally, Section 5 provides a concluding summary and outlines topics for future investigation.

2 Related work

2.1 Auditory-visual and olfactory-visual crossmodal correspondences research in psychology

Past experiences shape unconscious sensation patterns, which in turn influence the way humans perceive upcoming experiences. Thus, a new stimulus in one modality might be associated with one in another modality; for instance, pitch in audition can be associated with visual features such as brightness. Outside the digital world, crossmodal correspondences have been observed between different sensory modalities such as vision, sound, touch, smell, and taste [7, 15, 58, 60, 63, 77].

Non-arbitrary crossmodal correspondence mappings between auditory and visual stimuli have been found through experimental approaches on simple stimulus dimensions such as loudness and brightness, as well as on more complex stimuli such as shapes/images and words. Marks [37] detected an association of lighter colors with higher pitches and louder sounds. Sound has also been linked to other compound characteristics such as shape: in the same study, Marks [37] gathered evidence that high-pitched tones are related to angular shapes and low-pitched sounds to rounder shapes. Hagtvedt and Brasel [20] found an association between the frequency of music and the lightness of a colored object. With the help of an eye tracker, they concluded that visual attention was steered toward light-colored objects under the influence of high-frequency sounds. This makes it evident that sound can be employed to steer users’ attention.

Nonetheless, over the last decades, researchers have also started to document the existence of crossmodal correspondences between olfactory and visual stimuli. For instance, the authors of [19] provided one of the first examples of olfactory-visual correspondences, showing that there are strong correlations between odors and colors: bergamot smell was associated with yellow, cinnamon with red, pine with green, and so on. In [31], the authors investigated how color lightness varies with perceived odor intensity and found an inverse correlation. The pleasantness and quality of odors were also analyzed in studies such as [54, 66]. In [11], the authors investigated the robustness of these crossmodal associations for a random sequence of odors (strawberry vs. spearmint) and color patches (pink vs. turquoise) and found the correspondences both systematic and robust. In [58], the authors took a different approach and investigated the crossmodal associations between abstract symbols designed to represent an odor and the corresponding odor. They showed that the matching exists and is mediated by the hedonic valence of the cues. In [10], participants were asked to select a color they associated with an odor. The authors observed that when odors were described in abstract terms it was less likely to find a color match, whereas when participants described the odor with a source-based term (“smells like banana”) their color choices reflected the odor source more accurately. This and other studies, such as [26, 62], show that the mechanisms underlying these associations could be related to semantics, emotions or natural co-occurrence.

Whereas synaesthesia is unidirectional, crossmodal correspondences are bidirectional: e.g., hearing a high-pitched sound is matched with small objects and seeing small objects is paired with high-pitched sounds. The fact that crossmodal correspondences are bidirectional might mean that at least some of them are also transitive, which again differs from synaesthesia [12]. Although the multidimensionality of the percepts at stake seems to indicate the possibility of predicting the relationship between different attributes, transitivity should not be expected in every case. For instance, we know that louder sounds correspond to bigger objects and that lower pitch corresponds to larger size; by transitivity, louder sounds should therefore correspond to lower pitch. However, this was not observed in related studies [12].

2.2 Auditory-visual and olfactory-visual crossmodal correspondences research in computer graphics and human computer interaction

There has been little work on crossmodal correspondences between visual and auditory media beyond the area of cognitive sciences. The studies of Mastoropoulou et al. [39] and Mastoropoulou [38] on the effect of auditory stimuli on visual perception pointed out that when only sound-emitting objects are delivered in high quality and the rest of the scene in lower quality, the perceived visual quality is not impacted.

In [3], the authors focused on a different pair of senses for investigating crossmodal correspondences: sight and olfaction. They found that the scent of fresh-cut grass can distract viewers from the task of identifying the animation quality of a flyover of a grass terrain. Hulusić et al. [24] aimed at discovering the influence of beat rates in static scenes and found that lower beat rates affect the perception of low frame rates. They subsequently investigated how camera movement speed and sound influence the perceived smoothness of an animation [25]. Ramic-Brkic et al. [50] examined how viewers perceive graphics quality in the presence of distinct modalities such as auditory, olfactory, and ambient temperature, and observed that strong perfume, high temperature, and audio noise influence the users’ perceived rendering quality. Apart from selective rendering, Tanaka and Parkinson [35] studied the crossmodal mapping between digital audio and the haptic domain for audio producers with visual impairments. To do so, they created a device called Haptic Wave, an input/output interface that renders audio data as kinesthetic information. In [35], the authors explored the impact of audio on haptics to improve the eating experience of denture users; they built a device that augments perceived food texture using sound. Ranasinghe et al. [52] applied crossmodal perception to create Vocktail, a system that introduces flavor as a digitally controllable medium involving color, smell, and taste modalities. In [23], the authors drew on associations reported in the literature between sweetness and red rounded shapes, and between sourness and green angular shapes with fast animation speed, and found that specific combinations of visualizations and animation types influence the perceived taste of yogurt. Tag et al. [70] explored crossmodal correspondence between haptic and audio output for meditation support, where the haptic/audio design aimed to guide the user into a particular rhythm of breathing. In [28], the authors discuss the effect of scented material on physical creations, showing that an odor-shape correspondence exists in an active, free-association creation session. This also indicates the potential of using crossmodal correspondences in HCI for the design of future interactive experiences.

The multisensory user experience is also a semiotic process [29], and designing for it can take different stances depending on the experimental goals. A positive emotional outcome depends on the context of the design, and its appraisal is strongly connected to multisensory integration. Expectations play an important role in HCI; thus, crossmodal correspondences could be one of the underlying dynamics of a positive experience [53]. As can be seen, studies on crossmodal correspondences in computer graphics and human-computer interaction provide insights into sensory replacement/combination under different circumstances. These mappings have promising potential for designing interfaces and displays that tap into a user’s mental model [72]. Thus, we believe that crossmodal mappings could reveal insightful information in other contexts, helping to understand users’ perception and thereby improve human-computer interaction.

2.3 Mulsemedia and QoE

There has been an increasing interest in creating multimedia applications augmented with media beyond the traditional audio-video (AV) content [18]. Such applications stimulate senses other than sight and hearing, such as touch [14], smell [16] or taste [51, 52], with the aim of increasing the user’s QoE and exploring novel methods of interaction [44]. Accordingly, the term mulsemedia refers to the use of at least three different media types, that is, traditional multimedia plus at least one non-traditional medium [18].

Mulsemedia systems generally follow a workflow of (i) production, (ii) distribution, and (iii) rendering [6]. First, sensory effect metadata are authored or automatically generated in synchronization with the AV content. This process can be performed by a human, acquired through various sensors (e.g., camera, microphone, motion capture) that capture real-world information, or synthesized using computers (e.g., a virtual 3D space in a game) [56]. Many tools have been developed to aid this process, such as SEVino [75], SMURF [32], RoSE Studio [5], and Real 4D studio [59]. The works of Kim et al. [33] and Oh and Huh [48] are endeavors to automatically produce mulsemedia metadata. Although haptic effects can be captured [9], making a reliable and lasting record of taste and smell from the real world is still a challenge.
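To make the production step concrete, below is a minimal sketch of how sensory effect metadata could be represented as a time-aligned cue list for a 120 s clip. The `SensoryEffect` fields and the helper `effects_active_at` are illustrative assumptions and do not follow any particular standard or the authoring tools cited above.

```python
from dataclasses import dataclass

@dataclass
class SensoryEffect:
    """One sensory effect cue, time-aligned with the AV content."""
    effect_type: str   # e.g. "scent", "vibration", "wind"
    start_ms: int      # onset relative to the video timeline
    duration_ms: int   # how long the effect is rendered
    intensity: float   # normalized 0.0-1.0
    payload: str = ""  # e.g. scent cartridge name

# A hypothetical timeline for a 120 s clip: a scent cue plus a vibration cue.
timeline = [
    SensoryEffect("scent", start_ms=0, duration_ms=120_000, intensity=0.8, payload="bergamot"),
    SensoryEffect("vibration", start_ms=15_000, duration_ms=2_000, intensity=0.5),
]

def effects_active_at(t_ms: int, cues: list[SensoryEffect]) -> list[SensoryEffect]:
    """Return the cues that should be rendered at playback time t_ms."""
    return [c for c in cues if c.start_ms <= t_ms < c.start_ms + c.duration_ms]

print(effects_active_at(16_000, timeline))  # both the scent and the vibration are active
```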

Following that, the mulsemedia effects can be encoded for transport, processed and emitted for distribution to providers, distributed to the end users, decoded by their systems, and finally rendered by different devices that deliver the effects to them. Mulsemedia players and renderers to be used with other multimedia applications have also been created to reproduce and deliver mulsemedia experiences, notably SEMP [75] and PlaySEM [55], which are open source. A mulsemedia system entails weaving multiple technologies together to connect different entities, distribute the sensory signals, and render sensory effects appropriately [56]. When developing mulsemedia systems, it is crucial to have ways of delivering different sensory content consistently, and it is of paramount importance to be aware of the challenges that might arise when delivering mulsemedia [57]. The main motivation behind adding mulsemedia components is to augment users’ level of immersion and QoE [44].

QoE is defined as the level of delight or displeasure a user feels whilst experiencing a computer application or service, taking into account mainly subjective factors such as their personality and current state. It can be assessed through subjective surveys [2, 76, 78, 79] or objective evaluation [13, 30]. In addition, technical recommendations such as ITU-R BT.500-13 and ITU-T P.910 have been used alongside these methods. Therefore, evaluations of mulsemedia systems can lead to a high degree of qualitative differentiation in terms of QoE. Although objective evaluations are low-cost and faster to carry out than subjective ones, they might put researchers on the wrong track if only a few parameters are considered. For instance, researchers should know whether a user has heart-related problems before measuring the user’s heart rate, as this can lead to misleading conclusions. Thus, taking current emotional states into consideration from different perspectives could reveal useful insights. The work of Egan et al. [13] is an example of combining objective and subjective QoE evaluations. They correlated the results of both and found that high values of heart rate and electrodermal activity were related to physiological arousal, one of the factors associated with user QoE. Another work [30] showed the potential and benefits of using these objective metrics as indicators of user QoE for immersive experiences in augmented reality applications. Indeed, if used appropriately, physiological measures can be useful in affective state monitoring, chiefly in a multimodal setup [34].

By satisfying users’ expectations and increasing the utility/enjoyment of applications or services, mulsemedia has contributed to QoE not only directly but also indirectly, as shown in the studies of Yuan et al. [79], Yuan et al. [80], and Ademoye et al. [2]. These have pointed out that mulsemedia can partially mask a decreased quality of the AV sequence as well as synchronization skews, thus enhancing the user’s perceived QoE. Furthermore, mulsemedia has the capacity to aid memory [1], to improve virtual realism, to convey information between physical and digital environments more easily [81], and to contribute to pattern recognition [67].

The question of how to improve the user experience in immersive systems remains open. According to the literature, adding sensory modalities seems to be a reasonable approach. However, it is also important to pay attention to crossmodal correspondences, which have seldom been considered when designing mulsemedia systems even though our perceptual experiences are affected by them. Very little is known about how senses combine in the digital world and what happens when one stimulus is stronger than the others. Indeed, crossmodal interactions could be handy for overcoming a specific sensory deprivation or situational impairment, such as seeing or feeling something in darkness [22]. Given this, mulsemedia appears to be a promising scenario in which to extend knowledge on crossmodal correspondences, hitherto limited to setups based on traditional multimedia. In turn, understanding crossmodality as applied to mulsemedia systems could help in designing effective mulsemedia experiences.

3 User study: Quality of experience in crossmodal mulsemedia

The experiment we designed aims to investigate the potential influence of using crossmodal correspondence concepts in designing mulsemedia on the QoE experienced by users. More specifically, we used six videos characterized by dominant visual features: color (blue, yellow), brightness (low, high), and shape (round, angular). Participants viewed these videos enhanced with crossmodally matching sound while wearing a haptic vest with vibration motors. We chose a vibrotactile display because the literature has shown that participants exhibit an increased emotional response to media with haptic enhancement [73].

3.1 Participants

Twelve participants (7 males, 5 females) took part in the experiment and were randomly assigned to one of two equal-sized groups: an Experimental Group (EG) and a Control Group (CG). Participants were aged between 18 and 41 years and hailed from diverse nationalities and educational backgrounds (undergraduate and postgraduate students as well as academic staff). All participants spoke English and self-reported being computer literate.

3.2 Experimental apparatus

The videos were displayed on a computer monitor with a resolution of 1366x768 pixels and a viewing area of 1000x700 pixels in the center of the screen. An EyeTribe eye tracker controlled by custom-written Java code was employed to record eye-gaze patterns on a Windows 10 laptop with 8 GB RAM powered by an Intel Core i5 processor. The viewing screen was placed between 45 and 75 cm from the participants’ eyes, as this is the recommended distance for EyeTribe calibration. We chose the EyeTribe eye tracker because it has been demonstrated to be sufficiently accurate in studies on gaze points and fixations [8]. Participants sat in a chair without armrests facing the screen. All participants wore i-shine headphones, a vibrotactile KOR-FX gaming vest, and a Mio Link heart rate wristband. To provide the vibrotactile experience, we chose the KOR-FX gaming vest, which uses 4DFX acousto-haptic signals to deliver haptic feedback to the upper chest and shoulder regions. The vest connects wirelessly to a control box that accepts the standard sound output of a computer’s sound card.

The olfactory display was an Exhalia SBi4 device, which previous research considered more reliable and robust than comparable devices [45]. It was placed 0.5 m from the assessor, allowing her/him to detect the smell within 2.7–3.2 s, as shown in [44]. The SBi4 can hold up to four interchangeable scent cartridges at a time, but we used a single slot in our experiments to prevent the mixing of scents. The cartridges contain scented polymer through which air is blown by four built-in fans. The synchronized presentation of the olfactory data was controlled through a program built with Exhalia’s Java-based SDK. Users of this type of device obtain additional information about environmental factors while becoming more immersed/involved in their experience [43]. A snapshot of the experimental setup is shown in Fig. 1.
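The synchronization program itself was built with Exhalia’s Java-based SDK, whose API we do not reproduce here. The Python sketch below only illustrates the general pattern of triggering a scent at a pre-defined offset into video playback; `OlfactoryDevice` and its `start`/`stop` methods are hypothetical placeholders, not Exhalia API calls.

```python
import threading

class OlfactoryDevice:
    """Hypothetical stand-in for the real scent emitter driver."""
    def start(self, scent: str) -> None:
        print(f"fan on: emitting {scent}")

    def stop(self) -> None:
        print("fan off")

def schedule_scent(device: OlfactoryDevice, scent: str, onset_s: float, duration_s: float) -> None:
    """Start the scent onset_s seconds after playback begins and stop it after duration_s."""
    threading.Timer(onset_s, device.start, args=(scent,)).start()
    threading.Timer(onset_s + duration_s, device.stop).start()

# Example: emit bergamot for the full 120 s clip, starting with playback.
schedule_scent(OlfactoryDevice(), "bergamot", onset_s=0.0, duration_s=120.0)
```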

Fig. 1 Experimental setup. The users were wearing: (1) i-shine headphones and (2) the KOR-FX haptic vest; their eye gaze was captured with (3) the EyeTribe eye tracker, while their heart rate was measured with (4) the Mio Link; olfactory effects were diffused using the Exhalia device (5)

3.3 Audio visual olfactory content

As illustrated in Table 1, six videos were selected based on their dominant visual features: color, brightness and angularity of objects. The olfactory content consisted of six scents: bergamot, lilial, clear lavender (low intensity), lavender (high intensity), lemon and raspberry. All videos in our experiment were 120 s long. For the EG, the audio was adjusted to a frequency of 328 Hz (high pitch condition) or 41 Hz (low pitch condition).

Table 1 Snapshots from the six videos used during the experiment with their themes, dominant visual cues and the conditions for the EG in each case. The CG experienced only visual content, without any type of crossmodally generated content (olfactory, auditory or vibrotactile)
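The paper does not detail how the pitch-adjusted audio tracks were produced. As a minimal illustration of the two pitch conditions mentioned above, the sketch below synthesizes pure tones at the stated 328 Hz and 41 Hz frequencies and writes them to WAV files; this is purely an assumption for illustration, not the authors’ actual audio pipeline.

```python
import numpy as np
from scipy.io import wavfile

def make_tone(freq_hz: float, duration_s: float = 120.0, sr: int = 44100) -> np.ndarray:
    """Synthesize a sine tone with a short fade-in/out to avoid clicks."""
    t = np.arange(int(sr * duration_s)) / sr
    tone = 0.5 * np.sin(2 * np.pi * freq_hz * t)
    fade = int(0.05 * sr)  # 50 ms fade
    envelope = np.ones_like(tone)
    envelope[:fade] = np.linspace(0, 1, fade)
    envelope[-fade:] = np.linspace(1, 0, fade)
    return (tone * envelope).astype(np.float32)

sr = 44100
wavfile.write("high_pitch_328hz.wav", sr, make_tone(328.0, sr=sr))  # high pitch condition
wavfile.write("low_pitch_41hz.wav", sr, make_tone(41.0, sr=sr))     # low pitch condition
```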

The accompanying auditory and olfactory content was modified in line with principles of auditory-visual and olfactory-visual crossmodal correspondences previously reported in the literature. The video with dominant yellow images (V1) was accompanied by high pitch sounds and bergamot odor, while the dominantly blue one (V2) was accompanied by low pitch sounds and lilial odor [19, 61, 69].

In V3, where low brightness was the dominant visual cue, low pitch sounds and low intensity lavender odor were delivered concurrently to the users, while in V4, where brightness was high, the auditory content consisted of high pitch sounds and the olfactory content of high intensity lavender odor, based on [19, 36]. Finally, V5, the video displaying angular shapes, was matched with high pitch sounds and lemon odor, whilst V6, where the dominant shape was round, was delivered with low pitch sounds and raspberry odor [21, 64].

3.4 Procedure

Pre-experiment study

Before the experiments, we carried out a small pilot study with two participants to gather feedback on the experimental process and research instruments employed, as well as on their thoughts and experience while trying our system. Since participants reported that the high pitch audio was loud, we lowered its volume to enhance user comfort during the experiment.

Conditions

There were two conditions that differed in the provided content:

  1. In the experimental condition (associated with the EG), users were exposed to altered audio (modified pitch) that matched the corresponding dominant visual features. The dominant visual cue was also accompanied by crossmodally corresponding olfactory cues.

  2. In the placebo condition (carried out by the CG), users were only exposed to the visual content. Thus, although they wore headphones and a haptic vest and the fan of the olfactory device was running, no auditory, vibrotactile or olfactory content was delivered to them.

Eye-tracking calibration

At the beginning of the experiment, participants underwent an eye-tracking calibration exercise in which they were asked to focus on 9 equally spaced points arranged on a 3 × 3 grid. Participants were randomly divided into two groups of 6 each, and both the EG and CG watched the six videos in a random order. All participants used the devices identified in Fig. 1. The experimental sessions were conducted individually and lasted between 24 and 37 min.
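For readers wishing to reproduce a similar calibration routine, the sketch below generates 9 equally spaced targets on a 3 × 3 grid for the 1366 × 768 screen used here; the screen margin is an assumed value, and the actual calibration UI was provided by the EyeTribe software.

```python
import numpy as np

def calibration_points(width: int = 1366, height: int = 768, margin: int = 100):
    """Return 9 equally spaced (x, y) calibration targets on a 3 x 3 grid, inset by a margin."""
    xs = np.linspace(margin, width - margin, 3)
    ys = np.linspace(margin, height - margin, 3)
    return [(int(x), int(y)) for y in ys for x in xs]

print(calibration_points())
```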

Collected data

For each participant we collected two objective measures:

  • Gaze points - as a measure of visual attention and interest. These were collected as a set of (x,y) pixel co-ordinates, with a sampling frequency of 30 Hz, matching the frame rate of the videos.

  • Heart rate - as a measure of user emotional arousal whilst experiencing the system. The Mio Link wristband consists of an optical heart rate module (OHRM) that utilizes photoplethysmography (PPG) to measure continuous heart rate, alongside an accelerometer unit to measure and correct for movement artifacts [74]. Accelerometer data assessing a user’s movement is entered into an algorithm that compensates for movement artifacts in the optical signal. The raw data provided comprised heart rate readings sampled once every second (a minimal alignment sketch for these two data streams follows this list).
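Since the gaze stream (30 Hz) and the heart rate stream (1 Hz) run at different rates, any joint analysis requires aligning them on a common timeline. The sketch below is a hypothetical pre-processing step (not part of the reported analysis, which treated the streams separately) that attaches the most recent heart rate reading to each gaze sample using pandas; the simulated values stand in for the recorded data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated streams for one 120 s clip (real data would come from the devices).
gaze = pd.DataFrame({
    "t": np.arange(0, 120, 1 / 30),            # 30 Hz gaze samples
    "x": rng.integers(0, 1000, 3600),          # pixel coordinates in the viewing area
    "y": rng.integers(0, 700, 3600),
})
heart = pd.DataFrame({
    "t": np.arange(0, 120, 1.0),               # 1 Hz heart rate readings
    "bpm": rng.integers(60, 110, 120),
})

# Attach to each gaze sample the most recent heart rate reading (nearest earlier timestamp).
aligned = pd.merge_asof(gaze, heart, on="t", direction="backward")
print(aligned.head())
```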

Participants also completed a subjective questionnaire (Table 2) at the end of the experiment. Each question was answered on a 5-point Likert scale, anchored at one end with “Strongly Disagree” and at the other with “Strongly Agree”.

Table 2 Self-reported QoE questions

4 Results and discussion

In this section, we present the analysis and discussion of the data obtained from the eye tracker, the heart rate monitor, and the on-screen QoE questionnaire (Table 2). Data were analyzed with the IBM Statistical Package for the Social Sciences (SPSS) for Windows, release 23.0. An analysis of variance (ANOVA), suitable for testing significant differences among three or more categories, was applied together with a one-sample t-test, suitable for checking whether a sample mean is statistically different from a hypothesized population mean, and an independent samples t-test, suitable for identifying significant differences between two categories [65]. A significance level of p < 0.05 was adopted for the study.

4.1 Analysis of eye-gaze data

The eye gaze data was collected at approximately the same sampling rate as the frame rate, hence we obtained a total of 3600 eye gaze locations (30 gaze points/s × 120 s) per video clip. As mentioned in Section 3.2, the viewing area for the videos measures 1000 × 700 pixels and is centered on a 1366 × 768 pixel screen.

$$ \sum_{i=1}^{N} \left| \Delta \mathit{Gaze}_i \right|, \quad \text{where } 1 \le i \le N,\ N = 400 \text{ viewing cells/frame} $$
(1)

For analysis purposes, this viewing area is partitioned into 20 equal segments along both the X and Y axes, resulting in a total of 400 eye gaze cells of 50 × 35 pixels each. For each such cell of a particular video frame, we first counted the number of individuals, in the CG and EG respectively, whose eye gaze fell into it. We then calculated, for each video frame, the sum of the absolute differences in eye gaze count between the EG and CG across all cells, as shown in Eq. (1).
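As a minimal sketch of this per-frame computation (with random coordinates standing in for the recorded gaze points), the difference count of Eq. (1) can be obtained by binning each group’s gaze coordinates into the 20 × 20 grid and summing the absolute cell-wise differences:

```python
import numpy as np

def gaze_difference_count(eg_gaze: np.ndarray, cg_gaze: np.ndarray,
                          width: int = 1000, height: int = 700, bins: int = 20) -> int:
    """Sum of absolute per-cell differences in gaze counts between EG and CG for one frame.

    eg_gaze, cg_gaze: arrays of shape (n_participants, 2) with (x, y) pixel coordinates.
    """
    edges_x = np.linspace(0, width, bins + 1)
    edges_y = np.linspace(0, height, bins + 1)
    eg_counts, _, _ = np.histogram2d(eg_gaze[:, 0], eg_gaze[:, 1], bins=[edges_x, edges_y])
    cg_counts, _, _ = np.histogram2d(cg_gaze[:, 0], cg_gaze[:, 1], bins=[edges_x, edges_y])
    return int(np.abs(eg_counts - cg_counts).sum())

# Example with 6 participants per group on one frame (random coordinates for illustration).
rng = np.random.default_rng(0)
eg = rng.integers(0, [1000, 700], size=(6, 2))
cg = rng.integers(0, [1000, 700], size=(6, 2))
print(gaze_difference_count(eg, cg))  # lies between 0 and 12 for 6 + 6 participants
```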

In this regard, the minimum and maximum eye gaze difference count between the EG and CG are Min ∆ = 0 and Max ∆ = 12, respectively. For example, Fig. 2 shows the eye gaze count at the 50th frame of video 1 observed from participants in both CG and EG.

Fig. 2 Points where the participants gazed at the 50th frame of video V1 (X ∈ EG, O ∈ CG)

The eye gaze data for all the videos is represented as heat maps in Fig. 3, split into EG (left side) and CG (right side), with the videos sequenced in rows from V1 to V6. As can be seen for V1, the EG seemed to explore the scene whereas the CG focused on diverse points. In contrast, EG participants had broader scan patterns in V2. In V3 and V4, which contain low brightness and high brightness respectively, the EG focused on the lower part of the viewing area where white stands out, even though V3 mostly presents a dark area. V5 presents angular shapes in dynamic sequences, meaning they were spread out; here, CG participants examined the video with more dispersed gaze patterns compared to the EG. The heat map suggests that the latter were more focused when exposed to angular shapes, high pitch, and lemon. Finally, in V6, both groups focused their attention on the circular shapes in different positions on the screen.

Fig. 3 General heat map across the video clips. Red means most viewed and most fixated on. Yellow refers to some views, but less fixation. Green indicates fewer views and fixations. Blue suggests least viewed and hardly any fixations. White indicates hardly any views and no fixations

In order to analyze the eye gaze data, a one-sample t-test of the eye gaze difference count was performed, with results shown in Table 3. The results reveal statistically significant differences in eye gaze between the EG and CG for all six videos (p < 0.05). However, as the differences between the groups were the audio soundtrack (the CG had no soundtrack, whilst the EG had a mapped high/low pitch sound) and the smell effects (the CG had no smell, whilst the EG had a congruent smell), we cannot deduce whether the difference in eye gaze count is due to the audio (and the haptic effect derived from it), the smell, or both. Further analysis is therefore provided in the subsequent sections to identify the impact of each.
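For illustration, the sketch below reproduces the shape of this analysis in Python on simulated difference counts; the hypothesized population mean (0 here) is an assumption, since the test value used in SPSS is not stated in the text.

```python
import numpy as np
from scipy import stats

# Hypothetical per-frame gaze difference counts for one video (3600 frames);
# the real values come from Eq. (1) applied to every frame.
rng = np.random.default_rng(1)
diff_counts = rng.integers(0, 13, size=3600)

# One-sample t-test; the test value of 0 is an assumption made for this sketch.
t_stat, p_value = stats.ttest_1samp(diff_counts, popmean=0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```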

Table 3 One sample t-test of eye gaze difference count

4.2 Analysis of heart rate data

As a physiological metric, we employed heart rate data, collected at a rate of one reading per second and measured in beats per minute (bpm). Accordingly, we collected 120 heart rate readings per video. The heart rate readings from the CG varied between 60 bpm and 100 bpm whilst, for the EG, they ranged between 75 bpm and 110 bpm, with the means for each video illustrated in Fig. 4.

Fig. 4 Average heart rate data for all video clips

In Fig. 5 we present the mean heart rate gathered every second for each of the six videos in both the CG and EG. We observe a tendency towards a higher heart rate in the EG for the whole duration of the videos. In order to understand whether this tendency is statistically significant, we performed an independent samples t-test, the results of which are shown in Table 4. The results evidence a statistically significant difference between the heart rates of the two groups for all the videos. This indicates that the two groups experienced a different mood in the two setups: (i) the one using crossmodally matching sound and smell (EG) and (ii) the one where no sound and smell accompanied the dominant visual features (CG). We remind the reader that the sound also served as the input for the vibrotactile feedback.
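A comparable analysis can be sketched in Python as follows, using simulated per-second heart rate series in place of the recorded Mio Link data; this mirrors the independent samples t-test reported in Table 4 but is not the authors’ SPSS procedure.

```python
import numpy as np
from scipy import stats

# Hypothetical per-second heart rate readings (bpm) for one video, 120 s each,
# averaged over the 6 participants of each group; real values come from the Mio Link.
rng = np.random.default_rng(2)
hr_eg = rng.normal(loc=92, scale=6, size=120)
hr_cg = rng.normal(loc=80, scale=6, size=120)

# Independent samples t-test between EG and CG heart rates for this video.
t_stat, p_value = stats.ttest_ind(hr_eg, hr_cg)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```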

Fig. 5 Average heart rate data (bpm) of the participants for each video

Table 4 Independent samples t-test of heart rate data

4.3 Analysis of self-reported QoE

Participants self-reported QoE by answering a series of 20 Likert-scale questions, as shown in Table 2. For the analysis, we converted the scores of each negatively-phrased question (Q2, Q3, Q5, Q6, Q9, Q10, Q11, Q16, and Q17) to the equivalent score of a positively-phrased counterpart.
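The conversion itself is not spelled out in the text; assuming the standard reversal for a 1-5 Likert scale (reversed score = 6 - raw score), it can be sketched as:

```python
# Reverse-score negatively-phrased questions on a 1-5 Likert scale so that higher
# always means a more positive response (question numbering as in Table 2).
NEGATIVE_QUESTIONS = {"Q2", "Q3", "Q5", "Q6", "Q9", "Q10", "Q11", "Q16", "Q17"}

def harmonize(question: str, score: int) -> int:
    """Map a raw 1-5 Likert score to its positively-phrased equivalent."""
    return 6 - score if question in NEGATIVE_QUESTIONS else score

print(harmonize("Q2", 5))  # 1: strong agreement with a negative item becomes a low score
print(harmonize("Q1", 5))  # 5: positively-phrased items are unchanged
```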

Initially, we performed a two-way ANOVA with group type and video type as independent variables and the responses to the 20 QoE questions as the dependent variables; the results are presented in Table 5. As can be seen, there is a statistically significant difference between the EG and CG (Group) for all questions except Q3, Q5, Q14, and Q17, whereas the difference in QoE between the videos (Video) is statistically insignificant. Table 5 also shows that the interaction of the independent variables (Group*Video) has a statistically insignificant effect on the self-reported QoE (dependent variable) for all questions except Q15. Accordingly, a post hoc Tukey test was conducted on all questions (except Q15), which also resulted in statistically insignificant values.
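As an illustrative analogue of this SPSS analysis, the sketch below runs a two-way ANOVA with a Group × Video interaction on simulated responses for a single question using statsmodels; the long-format data frame layout is an assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format responses for a single question (e.g. Q15):
# 12 participants x 6 videos, 1-5 Likert scores (random here for illustration).
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": np.repeat(["EG", "CG"], 6 * 6),
    "video": np.tile([f"V{i}" for i in range(1, 7)], 12),
    "score": rng.integers(1, 6, size=72),
})

# Two-way ANOVA with Group, Video and their interaction, analogous to Table 5.
model = ols("score ~ C(group) * C(video)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```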

Table 5 ANOVA multivariate test result for each question

The mean and standard deviation of the self-reported QoE are 3.07 and 1.18 for the EG, and 2.91 and 1.16 for the CG, respectively. Further explanation for each of the questions, with respect to the results in Table 5 and Fig. 6, is presented next.

  • In the case of Q1, the mean response is significantly higher (2.69) in the EG than in the CG. This implies that respondents noticed the relevance of the various smells rendered for the respective video clips.

  • In Q2, the mean is significantly higher (3.83) in the CG, which shows that there was intensity variation in the rendering of the smell effect across the video clips.

  • The mean of Q3 is slightly higher (statistically insignificant) in the CG, which means that the smells were generally not very distracting.

  • In Q4, the mean response is significantly higher (3.22) in the EG than in the CG. This implies that the smell was consistent across the videos.

  • The average value of Q5 for the EG is slightly higher (3.25) than for the CG. This implies that the smells were perceived as quite pleasant.

  • In Q6, the mean is significantly higher in the CG, which means that the lingering effect of the smells was noticeable for the EG compared to the CG.

  • The mean response for Q7 is significantly higher (3.25) in the EG than in the CG, which means that the smell effects (congruent smells) made a significant contribution to the overall QoE when viewing the video clips.

  • In Q8, the mean response is significantly higher (3.31) in the EG than in the CG. This indicates that respondents noticed the relevance of the high/low pitch audio for the respective video clips.

  • The mean response for Q9 is significantly higher (4.03) in the CG, which means that there were noticeable loudness variations of the sound across the video clips.

  • In Q10, the mean is significantly higher (3.78) in the CG, which shows that the high/low pitched sounds were generally not very distracting.

  • The mean of Q11 for the CG is significantly higher (3.64) than for the EG. This implies that the high/low pitched sounds were generally not found annoying by the experimental participants.

  • In Q12, the average response is significantly higher (3.11) in the EG than in the CG. This means that the high/low pitched sound (which was congruent with the visual features of the video clips) triggered a sense of reality that significantly enhances the overall QoE.

  • The mean answer for Q13 is significantly higher (3.17) in the EG, which signifies that the sound effect contributed to the overall QoE when viewing the video clips.

  • In Q14, the average response of the EG is slightly higher (3.31) than that of the CG. This denotes that the haptic effects, automatically generated from the content-congruent sound, contributed to the enjoyment.

  • The mean score of the EG for Q15 (2.86) is significantly different from that of the CG, which shows that respondents noticed the relevance of the haptic effect for the respective videos.

  • In Q16, the mean QoE is significantly higher (3.69) in the CG than in the EG, which means that the vibrations on the chest while wearing the haptic vest had a certain distracting effect.

  • The mean of Q17 is slightly higher (3.61) in the CG. This implies that the haptic effects generated from the high/low pitched sound were generally not significantly annoying.

  • In Q18, the mean is significantly higher (3.08) in the EG, which indicates that the haptic effect (generated from the high/low pitched sound congruent with the visual features of the videos) significantly enhanced the sense of reality while watching the video clips.

  • The mean value for Q19 is significantly higher (3.25) in the EG than in the CG. This means that the haptic effects generated from the content-congruent sound made a significant contribution to the overall QoE when viewing the video clips.

  • In Q20, the mean is significantly higher (3.86) in the EG than in the CG. This implies that the combined multisensorial effect of the content-congruent smell, sound, and auto-generated haptics contributed to the enjoyment of watching the video clips.

Fig. 6 Average QoE for the EG and CG

Because the interaction of the independent variables (Group*Video) for Q15 showed a statistically significant value, we conducted a simple main effects analysis (Table 6). V3 and V5 showed statistically significantly lower scores for the EG compared to their CG counterparts (F(1,60) = 12.140 and F(1,60) = 14.448, respectively, p < .05), which implies that the haptic effects generated from the content-congruent sound were significantly less relevant to the video clips with darker and more angular features (V3 and V5) than to the other four video clips. However, in the case of V1, V2, V4, and V6, the differences in participant scores between the two groups were not significant.

Table 6 Simple main effects analysis (Q15)

The results corresponding to most of the self-reported QoE questions indicate that the content-congruent smell and sound and the auto-generated haptic effects enhanced the users’ QoE while watching the video clips. This is substantiated by the mean responses of the EG and CG across all questions (3.07 and 2.91, respectively) and by the statistically significant differences for most of the questions in Table 5, which implies that the cross-modally mapped (overall) multisensorial setting enhanced the QoE.

In general, our analysis of the difference in eye gaze count (Table 3) and the heat maps of the eye gaze patterns (Fig. 3) showed that the cross-modally mapped multisensorial effects significantly influenced the users’ perception. Significantly higher heart rate recordings were also observed due to the introduction of multisensorial effects for the EG participants (Table 4, Fig. 5). Additionally, the analysis of the self-reported QoE corroborated the eye gaze and heart rate results, revealing that the multisensorial effects involving content-congruent high/low pitch sound, smell, and haptics significantly enhanced the QoE.

The findings also indicate that the positive impact of multisensorial effects on users’ QoE is substantiated when cross-modally mapped component effects are integrated in a mulsemedia context. This implies that a noticeable crossmodal correspondence exists in the digital world between the visual features of videos and audio pitches, which substantiates the studies in [19, 61, 69]. Similarly, such a correspondence exists between the visual features of the videos and smell effects [19, 21, 61, 64, 69].

5 Conclusions

This paper presents an exploratory study that begins to establish how crossmodal correspondences could be systematically explored for multisensory content design. In our study, we examined the impact on user QoE of crossmodal mappings between visual features and auditory media, and between visual features and olfactory media. These mappings were previously shown to be favorable for designing interfaces and displays that tap into users’ mental models, leading to more immersive and effective experiences [40].

By employing multimedia video clips, an eye tracker, a haptic vest and a heart rate monitor wristband in our experiment, we gathered results from both subjective surveys and objective metrics. The eye tracker revealed significant differences between the EG and CG. Gaze heat maps showed that the EG was more focused when experiencing mulsemedia, except when exposed to the combination of yellow, high pitch and bergamot smell. Although we cannot draw strong conclusions based on the participants’ gaze patterns, we observe that when the olfactory content is crossmodally congruent with the visual content, the users’ visual attention seems to shift towards the corresponding visual feature (e.g., exploration of and focus on the blue sky for V1; a wider exploration area for the round shapes, i.e. more balls, for V6).

The heart rate responses were also significant. This could be because the users experienced different moods, not only because the heart rate was much higher in the EG than in the CG. One possible reason is that the use of high vs. low pitch may have affected the users’ viewing experience, whereas in the CG there was no sound, limiting both immersion and the experience. Comparing both groups shows that the use of sound and smell did have a positive effect and increased users’ QoE to a certain degree.

The self-reported responses support the eye gaze and heart rate results, revealing that the multisensory effects involving crossmodally mapped (content-congruent) smell, sound (high/low pitch), and auto-generated haptics enhanced the QoE compared to a visual-only condition. This also implies that a noticeable crossmodal correspondence exists from visual features to audio pitches and smell effects.

Overall, our results might be indicative of a causal link between visual attention and the presence of additional content matching the dimensions meant to be attended to, but further work is needed to validate this. Indeed, one limitation of this study is that it does not examine differences between the effects of content created using crossmodal principles and other types of multisensory content (e.g., where correspondences are semantic). Thus, although we show that attention and QoE benefit from the multisensory content, it is not obvious whether this is caused by employing crossmodal principles. Another limitation is the relatively small number of participants, which makes it unclear how our findings would generalize to other setups. Moreover, the study reported here is an exploratory one, which has raised many interesting paths for future investigation. Among these, worthy of mention are repeating the experiment with videos accompanied by non-coherent (neutral) stimuli as well as by non-congruent stimuli. All are valuable future pursuits. Further work could also be done to explore what content is more appealing to users: categorizing the content into different topics and carrying out a pilot study amongst a few users would indicate what types of media content they prefer to watch. Moreover, odors influence mood, work performance, and many other forms of behavior, and this was evidenced in our study. We intend to investigate further in the future by comparing the original sound with altered high and low pitch versions, as well as by employing additional, different odors for crossmodal matching.