1 Introduction

The field of affective computing, understood as the process of estimating human emotions by means of computational tools [35], is becoming progressively more relevant in the video game industry owing to the inherent relationship between the emotions evoked in the player and the overall user experience [30], the objective being to create personalized game scenarios for each player [25]. Video games are currently considered to be possibly the main Human-Computer Interaction environment in which users are most open to an alteration in their emotional state in order to enhance their own experience [49]. The actual estimation of the user’s emotional state is, therefore, a key element in the process of providing proper feedback to the system [7].

In this regard, the scientific literature reports several proposals with which to estimate such emotional states and engagement levels. It is generally possible to divide these proposals into two main families: (i) the so-called invasive techniques, in which this process is carried out using sensing devices, most typically data from electroencephalography (EEG) headsets [6], although electrocardiogram, facial electromyography, electrodermal activity, respiration, and arterial data have also been considered [13]; and (ii) non-invasive approaches, in which the premise is to gather the information required without introducing external elements, such as by tackling the problem as a facial expression recognition task [3], since the face is considered the most expressive part of the body [32, 37], or by considering the use of eye-tracking technologies [23].

Of all the different emotions a video game user may experience, frustration is one of the key elements to estimate because it is significantly correlated with engagement success [19]. Frustration appears when the user is not able to achieve a goal and, if not properly monitored, may lead the user to disregard not only the goal but also the actual game [12]. The study of this particular problem is, therefore, clearly beneficial for the video game industry.

Despite the aforementioned relevance of quantifying a user’s frustration level, this particular task has been poorly addressed, and the few works that can be found in the literature basically differ as regards the principle used to estimate this emotion. For instance, Miller and Mandryk [28] studied this problem by assuming that the players’ affective state is related to the touch pressure they apply to game controllers. Other authors, however, address this task by relying on the use of audio recordings, video captures, and combinations of both sources of information obtained from the actual game users [12, 41, 42].

Regardless of the actual proposal used to estimate user emotions, most approaches analyze the data collected by considering Machine Learning (ML) within a general framework. Furthermore, the Deep Learning paradigm, represented by Deep Neural Networks (DNNs), is the current trend in most recent emotion-estimation proposals owing to its demonstrated effectiveness and great capacity for generalization in highly disparate tasks [11, 21, 36, 45].

In this regard, it is necessary to point out that the use of DNNs to solve the frustration recognition problem is not new, since state-of-the-art approaches already consider them [41, 42]. Nevertheless, these proposals do not take advantage of the complete potential of these learning techniques, signifying that there is significant room for improvement that should be further explored and studied.

This work, therefore, proposes a non-invasive multimodal approach based on DNNs that detects frustration by making use of audiovisual data. More precisely, we consider the separate exploitation of the audio and video data in order to take advantage of the nuances of each modality so as to then explore different policies with which to synergistically combine both sources of information. We specifically propose two fusion modes with the aim of improving the results of the single-source methods, which deal with audio and video separately, along with the results provided by the existing approaches described in Section 2. Our results for real-world benchmarking data for frustration detection show that this multimodal approach is beneficial as regards addressing the proposed task and that it outperforms existing state-of-the-art proposals.

While it could be argued that this proposal is possibly limited in terms of the number of data modalities considered and the sophistication of the neural architectures, it should be noted that, as mentioned previously, the results are already better than those of state-of-the-art strategies. In this regard, and as will be discussed below, additional sources of information, along with other learning schemes, may further improve these results, thus making them of great relevance for future work.

The present work is organized as follows: Section 2 provides a literature review on emotional recognition, focusing on the particular case of frustration, and also on multimodal learning systems. Section 3 then goes on to describe the multimodal approach proposed in this work, while Section 4 provides details on the corpus and metrics considered for the evaluation of the proposal. Section 5 presents the results of the experimentation carried out, and finally, Section 6 concludes the work and proposes future lines with which to further study the topic.

2 Background

This section provides the background required for the remainder of this work. It first explores the topic of emotion analysis so as to then concentrate on that of frustration recognition, after which the topic of multimodal exploitation of information is presented.

2.1 Emotion recognition

The emotion recognition task is highly complex, and the scientific literature, therefore, provides a wide range of methods with which to tackle it [5]. Of these, speech analysis constitutes quite a common framework in which to perform this task. Kwon et al. [20] proposed a combination of specific features, such as pitch, energy, and Mel Frequency Cepstral Coefficients (MFCCs), among others, to be later processed by a Support Vector Machine (SVM) classifier in order to recognize the emotion present in the audio recording. Another example is the work by Yang et al. [48], which detects emotions in songs by using regression algorithms, obtaining the best results with Support Vector Regression (SVR).

As mentioned previously, more recent works rely on DNN architectures. For example, the work by Wootaek et al. [24] presents an approach based on a Convolutional Recurrent Neural Network (CRNN) that automatically extracts potential features to be used in the classification of emotions obtained from recordings of speech. Another similar example is the work by Mirsamadi et al. [29], which biases the feature extraction process by using a weighted-pooling strategy to promote those features that best represent the emotions in question.

Although speech is commonly employed to study emotions, other sources of information have also been explored. For instance, the work by Ebrahimi et al. [8] presents an approach based on Recurrent Neural Networks (RNNs) in order to analyze facial expressions and classify them according to a set of predefined categories. In their work, Bahreini et al. [2] similarly recognize facial emotions but employ a fuzzy logic approach. Finally, as stated above, other invasive approaches base their performance on the acquisition of additional data such as brain activity by using EEG devices and processing them [16, 33, 39], but they have the constraint of specific hardware requirements.

2.1.1 Analysis of frustration

Since frustration is a particular type of emotion, the general frameworks presented above can be adapted to its sole detection and recognition. However, given the relevance of this topic, several strategies have been especially devised in order to tackle this problem. For example, Fernandez and Picard [10] proposed a method based on Hidden Markov Models in order to recognize frustration in speech signals. More recently, Malta et al. [26] presented a work in which the frustration of drivers was detected by using a Bayesian network, considering the correlation between frustration and several inputs, such as speech, video recordings and the driver’s use of pedals.

In the particular context of this work, works that analyze frustration in video games are, nevertheless, scarce. Song et al. [42] proposed a multimodal approach with which to estimate the frustration level by combining audio and video inputs through the use of neural networks. However, the authors relied on hand-crafted facial features extracted from the video and on MFCCs, which may arguably not be the best descriptors for the task in hand. The approach uses a standard Long Short-Term Memory (LSTM) model to process both the audio and the video features, and does not, therefore, completely exploit the capabilities of either DNNs or the different sources of information available. This work was further improved by Meishu et al. [41], who employed more complex neural networks with residual connections but relied exclusively on speech data, thus ignoring the information provided by the video images.

2.2 Multimodal audiovisual analysis

Multimodality [44] is the trend in ML of exploiting different sources of data and then combining them in a certain manner, which results in a more robust and proficient model. Of all the different combination possibilities, known as fusion policies, the following are highlighted:

  • Early fusion: this combines the data sources before they are processed by the learning-based model. Its main advantage is that only one model has to be trained, but it requires a proper preprocessing stage so that the data sources can be combined. A high degree of source variability, therefore, hinders the creation of a proper model that is able to correctly classify the data.

  • Late or decision fusion: this is based on the processing of each data source by an independent model, after which their individual classification decisions are combined. In contrast to early fusion, each model learns a specialized set of features, which is much easier to achieve. This strategy is typically used when the sources are significantly different from each other.

  • Intermediate or feature fusion: this is a feature-level fusion of the learning models, and is typically carried out by concatenating the features obtained before the final decision is made. This allegedly makes it possible to obtain more robust classifiers. However, this scheme increases the complexity of the model, since it consists of a single model with several inputs and one output.

Early and late fusion are fairly common in the literature. For example, Snoek et al. [38] compare both strategies in semantic video analysis, concluding that late fusion tends to provide a slightly better performance. Another comparison is presented in the work by Gunes and Piccardi [15], which combines facial and body-gesture features in order to recognize emotions by employing two traditional ML algorithms, namely Decision Trees and Bayesian Networks. These authors conclude that fusion modalities are better than unimodal models, and that feature fusion performs particularly well. Another example is the work by Wimmer et al. [46], which proposes the use of an SVM classifier with a low-level combination of features extracted from audiovisual data for emotion recognition, thus improving performance with regard to unimodal scenarios. Also worth highlighting is the work by Pantic et al. [34], in which an adaptive neural network classifier is presented and assessed in different case studies, such as the combination of hand gestures and facial features or the combination of speech and video features. Güçlütürk et al. [14] studied the fusion of audiovisual and textual information for first impression analysis using Deep Residual Networks. All these works endorse the benefits of multimodal fusion approaches. Further related works can be found in the work of Wu et al. [47], which covers multiple strategies for multimodal emotion recognition.

This work further explores the idea of frustration recognition by considering multimodal strategies in Game-Play scenarios using audiovisual data. More precisely, in contrast to previous works tackling this task, we propose the use of DNNs in order to automatically extract a set of meaningful descriptors from the audio and video sources of information so as to then assess the synergistic capabilities of different data fusion policies. This particular strategy is a new approach in this respect and, as stated in Section 5.2, outperforms the results achieved by state-of-the-art unimodal speech-based [41] and audio-and-video multimodal [42] approaches to a remarkable extent.

3 Methodology

In this section, we present our multimodal proposal for frustration detection during Game-Play, which considers the information from both audio and video recordings. We, therefore, first describe the approach considered for the audio source of information, and then do so for the video input. Finally, we introduce the proposal employed to combine both sources of information in order to determine the presence of frustration.

3.1 Audio classification

With regard to the audio data, previous work [42] proposes an approach based on LSTM to process a set of MFCCs previously extracted from the raw signal. While MFCCs have been considered to a great extent in audio speech analysis [31], it has been proved that a configuration based on Convolutional Neural Networks (CNNs) applied to an initial time-frequency representation of the signal is a more appropriate way in which to find suitable features with which to detect frustration in audio recordings [41]. Our audio analysis will, therefore, also consider this idea rather than that of using hand-crafted features.

Formally, let \(\mathcal {X}_{a}\) be an audio recording in raw format and let \(\mathcal {S}_{a}\) be its associated time-frequency representation. We consider a neural network architecture based on CNN layers in order to process \(\mathcal {S}_{a}\) and automatically extract the most suitable features for the frustration classification. This scheme is shown in Fig. 1.

Fig. 1 Scheme of the unimodal frustration classifier approach based on speech data

The main difference between the audio processing performed in [42] and our method is that we propose the use of convolution layers to automatically extract the features that are most appropriate from the point of view of the neural network. Our premise is that this will benefit the classification task, since the neural network itself is responsible for the feature extraction. In contrast, [42] directly uses the features provided by the MFCCs which, although they may appear adequate according to visual perception, may not be the most suitable for processing by the neural network.

Since the use of time-frequency representations is common in audio processing, note that different representations other than the MFCCs can also be considered. In Section 5.1.1, therefore, we shall study the input representation, along with other parameters, such as sample rate or the specific classifier model to be used, in order to discover the most suitable configuration for the task in hand.
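As an illustration of this scheme, the following is a minimal Keras sketch of a CNN-based audio classifier operating on \(\mathcal {S}_{a}\); the layer sizes are illustrative stand-ins for the much larger ResNet50 and Xception backbones evaluated in Section 5.1.1, and the two-neuron softmax output corresponds to the frustration/non-frustration labels.

```python
import tensorflow as tf

def build_audio_cnn(input_shape, num_classes=2):
    # CNN stack that maps a time-frequency representation S_a (e.g. a Mel
    # spectrogram of shape (n_mels, n_frames, 1)) to a two-class decision.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```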

3.2 Video images classification

Let \(\mathcal {X}_{v}\) be a video composed of a sequence of N image frames such that \(\mathcal {X}_{v} = \{x_{v,i} \mid 1 \leq i \leq N\}\).

As shown in Fig. 2, the objective of the proposed method is to detect frustration in the video \(\mathcal {X}_{v}\). However, since only the facial expressions are useful as regards this detection, our system preprocesses each frame \(x_{v,i}\) in order to obtain its trimmed and resized version \(s_{v,i}\) and subsequently create a new video \(\mathcal {S}_{v}~=~\{s_{v,i}~|~1~\leq {}~i~\leq {}~N\}\) in which only the face is present. This makes it possible to automatically extract features with a CNN rather than having to employ hand-selected features. The details of this preprocessing are described in Section 5.1.2.

An RNN model then provides the decision concerning the presence of frustration in the trimmed version \(\mathcal {S}_{v}\) of the video. This kind of architecture receives a time-correlated sequence of data, in this case each single image frame \(s_{v,i}\), and makes the classification decision after processing all the frames in the sequence.

Owing to the nature of the data in question and the relatively high sampling resolution, the particular facial expression is barely modified in consecutive frames, signifying that a model able to learn these long-term dependencies is required. We consequently considered the use of the well-known LSTM architecture [43], since it is capable of modeling such dependencies and is also considered in the work by Song et al. [42].

Please note that, in this latter work, the set of hand-crafted features is a manual selection of features from the Facial Action Coding System [9], which defines a series of facial features through the use of specific action units (AUs). The method selects 18 of these AUs and performs the extraction for each frame \(x_{v,i}\) of the initial video data \(\mathcal {X}_{v}\); these features are then handled by an LSTM in order to perform the frustration classification task. Nevertheless, as explained for the audio case in the previous section, since these features may not be the most representative for the task in hand, we propose to learn them automatically from the trimmed version \(\mathcal {S}_{v}\) by employing CNN layers, since we assume that features estimated for the actual task in question may provide a better overall performance.
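The following is a minimal sketch of this CNN-plus-LSTM idea in Keras; the specific layer configuration is an assumption made for illustration purposes only and does not reproduce the \({\mathscr{M}}_{1}\) to \({\mathscr{M}}_{3}\) architectures detailed in Table 4.

```python
import tensorflow as tf

def build_video_cnn_lstm(num_frames, height=64, width=64, channels=3, num_classes=2):
    # A small CNN is applied to every trimmed frame s_{v,i} via TimeDistributed,
    # and an LSTM models the long-term dependencies across the sequence S_v.
    frame_encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
    ])
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_frames, height, width, channels)),
        tf.keras.layers.TimeDistributed(frame_encoder),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```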

3.3 Multimodal fusion

The key aspect of our proposal is the fusion of both audio and video information as described above. In this regard, we considered two particular types of combinations to be studied in the experiments: decision fusion and feature fusion.

The decision fusion case (see Fig. 3a) is based on the combination of the individual decisions made by the audio and video models through the use of a weighting factor α. This fusion can be mathematically represented as

$$ P_{f}(y|\mathcal{S}_{v},\mathcal{S}_{a}) = \alpha P_{v}(y|\mathcal{S}_{v}) + (1-\alpha) P_{a}(y|\mathcal{S}_{a}), $$
(1)

where \(P_{a}\left (y|\mathcal {S}_{a}\right )\) and \(P_{v}\left (y|\mathcal {S}_{v}\right )\) represent the unimodal scores obtained by the DNNs over the time-frequency audio representation \(\mathcal {S}_{a}\) and the face-based trimmed video data \(\mathcal {S}_{v}\), respectively, while α ∈ [0,1] is a weighting factor that balances the contribution of the two modalities (α = 0 relies solely on the audio model and α = 1 solely on the video model).
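A minimal NumPy sketch of this combination is shown below; p_audio and p_video are hypothetical names for the per-class probabilities produced by the two unimodal classifiers, and the numerical values are merely illustrative.

```python
import numpy as np

def decision_fusion(p_audio, p_video, alpha=0.5):
    # Weighted combination of the unimodal per-class probabilities, following (1):
    # alpha = 0 relies solely on the audio model, alpha = 1 solely on the video model.
    p_audio, p_video = np.asarray(p_audio), np.asarray(p_video)
    return alpha * p_video + (1.0 - alpha) * p_audio

# Fused decision for one 10-second excerpt (class index 1 assumed to be frustration).
p_fused = decision_fusion([0.30, 0.70], [0.60, 0.40], alpha=0.5)
prediction = int(np.argmax(p_fused))
```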

Fig. 3 Multimodal fusion schemes with which to compute \(P(y \mid \mathcal {S}_{v}, \mathcal {S}_{a})\)

The feature fusion alternative (see Fig. 3b) consists of designing an actual neural architecture with two inputs (audio and video) and one output with the classification result. The idea is based on two parallel streams that process \(\mathcal {S}_{a}\) and \(\mathcal {S}_{v}\) separately in order to obtain a particular representation of each modality by employing the neural network. These neural-based features, denoted as \(\mathcal {F}_{a}\) and \(\mathcal {F}_{v}\) for the audio and video streams, respectively, are then concatenated before making the final decision. Note that, in contrast to the previous case, this feature fusion is performed implicitly, given that the neural network is trained simultaneously with \(\mathcal {S}_{a}\) and \(\mathcal {S}_{v}\).
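The following Keras sketch illustrates this two-input design, assuming that audio_encoder and video_encoder are the unimodal networks of Sections 3.1 and 3.2 with their final classification layers removed, so that they output the intermediate features \(\mathcal {F}_{a}\) and \(\mathcal {F}_{v}\):

```python
import tensorflow as tf

def build_feature_fusion_model(audio_encoder, video_encoder, num_classes=2):
    # Two parallel streams process S_a and S_v; their features F_a and F_v are
    # concatenated and a single dense layer produces the final decision.
    audio_in = tf.keras.Input(shape=audio_encoder.input_shape[1:], name="audio_input")
    video_in = tf.keras.Input(shape=video_encoder.input_shape[1:], name="video_input")
    f_a = audio_encoder(audio_in)
    f_v = video_encoder(video_in)
    fused = tf.keras.layers.Concatenate()([f_a, f_v])
    output = tf.keras.layers.Dense(num_classes, activation="softmax")(fused)
    return tf.keras.Model(inputs=[audio_in, video_in], outputs=output)
```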

Both fusion modes considered will be evaluated and compared with their unimodal versions and other state-of-the-art results in Section 5.

4 Materials and metrics

This section presents the experimental setup considered in order to assess the frustration detection method proposed. More precisely, we shall describe the corpus used and the set of evaluation metrics.

In a technical sense, we considered the following libraries and toolkits for the proposed experimentation:

  • Python: The programming language in which the research was carried out (v. 3.6.9).

  • TensorFlow [1]: Framework used for the implementation of the DNN models (v. 2.3.1).

  • Keras: Collection of functions that make it possible to design architectures for neural networks. It also includes the tools employed to train the models and use them to evaluate performance (v. 2.4.3).

  • NumPy: Open-source Python library for the creation of vectors and multidimensional matrices, providing powerful data structures with which to carry out vector operations easily and efficiently. In this research, it is used to handle the data structures fed to the different neural architectures (v. 1.19.5).

  • librosa library [27]: Audio analysis library used for the extraction of part of the time-frequency representations considered (v.0.8.0).

  • Cassani toolkit [4]: Audio analysis toolkit used for the extraction of the Modulation-Spectral representation (v.0.1).

  • dlib library [17]: Image analysis library used as a face detector when trimming the initial video data (v.19.18.0).

4.1 Corpus

For the evaluation of our approach, we have considered the Multimodal Game Frustration Database [42] of real-world recordings used by both the unimodal [41] and multimodal [42] state-of-the-art methods. The database comprises over 5 hours of 960 × 540-pixel video recordings at 30 frames per second split into 10-second excerpts and annotated as either representing or not representing frustration. This corpus was created thanks to the participation of 67 students from the Shanxi Province in China, with ages ranging from 12 to 16.

The dataset is provided in three separate partitions for benchmarking purposes: a training set of 3,979 videos, a validation partition of 1,326 videos, and a test set of 1,328 elements. This configuration is maintained in our experiments for comparison purposes. Table 1 provides a summary of the details of the corpus.

Table 1 Number of audiovisual recordings in the corpus considered, split into training, validation and testing partitions, according to the presence of frustration

4.2 Metrics

Since this work tackles an imbalanced scenario, as shown in Table 1, the evaluation requires a metric that is able to avoid any bias towards a particular class. We have, therefore, considered the use of the F-measure (F1). In a two-class classification problem, as in our case, F1 is described as

$$ F_{1} = \frac{2\cdot\text{TP}}{2\cdot\text{TP} + \text{FP} + \text{FN}} $$
(2)

where TP represents the True Positives or correctly classified positive elements, FP represents the False Positives or type I errors, and FN represents the False Negatives or type II errors.

Nevertheless, since the works compared consider the recall metric, we shall also introduce it for comparison purposes. In the same terms as the F1, the recall \(\mathcal {R}\) metric can be defined as

$$ \mathcal{R} = \frac{\text{TP}}{\text{TP}+\text{FN}} $$
(3)

Moreover, since it may provide some additional insights into the performance of our proposal, we also consider the precision \(\mathcal {P}\) metric, which is defined as:

$$ \mathcal{P} = \frac{\text{TP}}{\text{TP}+\text{FP}} $$
(4)

Finally, in order to be consistent with the other works dealing with this problem, all these metrics will be computed by taking the minority class, i.e., the frustration class as the positive class.
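For reference, the three figures of merit can be computed as in the following NumPy sketch, in which the frustration class is taken as the positive one (label 1 is assumed here purely for illustration):

```python
import numpy as np

def frustration_metrics(y_true, y_pred, positive=1):
    # Recall, precision and F1 with the frustration (minority) class as positive.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return recall, precision, f1
```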

5 Experimentation

This section presents the different results obtained after considering several DNN models in several scenarios. We first carry out an initial stage in which the parameters of our proposed models are correctly adjusted, in order to then assess their performance on the test partition and compare them with the results reported by the state-of-the-art works in this field.

Please note that the models were trained for up to 115 epochs, keeping the weights of the epoch that maximized the results on the validation set. We used the well-known Adam optimizer [18] and the categorical cross-entropy loss function. As regards the batch size, we took the maximum allowed by the memory restrictions, up to 32.
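A sketch of this training configuration in Keras is shown below; the choice of validation accuracy as the monitored quantity is an assumption made for illustration, since any validation metric could be used to select the best epoch.

```python
import tensorflow as tf

def train_model(model, x_train, y_train, x_val, y_val, epochs=115, batch_size=32):
    # Adam optimizer and categorical cross-entropy; the checkpoint keeps the
    # weights of the epoch with the best validation performance.
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_weights.h5", monitor="val_accuracy", save_best_only=True)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=epochs, batch_size=batch_size, callbacks=[checkpoint])
    model.load_weights("best_weights.h5")
    return model
```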

5.1 Model optimization

As mentioned previously, in this first part of this section our objective is to correctly optimize the different parameters of the model proposed. The stand-alone audio and video models are, therefore, first analyzed and optimized in order to subsequently study the multimodal cases proposed. Note that this optimization study considers only the training and validation partitions of the data in question.

5.1.1 Audio model

With regard to the implementation of the audio processing, as described in Section 3, our approach first processes the audio in order to extract the corresponding time-frequency representation. We considered the use of the well-known Mel spectrogram, as in the work by Meishu et al. [41], in which its effectiveness for this particular task was proven. We additionally included two other representations widely applied in the field of audio-based emotion recognition: the Modulation-Spectral representation [50] and the MFCCs [22].

Both the Mel spectrogram and the MFCCs were obtained using the librosa library, whereas the Modulation-Spectral representation was attained by employing the Cassani toolkit. With regard to the neural architectures, we selected ResNet50 and Xception, since both have been used in the reference works considered [40, 41]. Note that, since we are modeling a two-class problem (frustration/non-frustration), the output of these models consists of two neurons representing each of these labels.

As mentioned above, we trained the models using the training and validation partitions of the corpus. The hyper-parameter tuning took place in a two-step fashion: we first studied the most suitable type of time-frequency representation for the task in hand and we then analyzed different values of hyper-parameters in order to eventually optimize the classification results.

For the first analysis, we fixed a sample rate (sr) of 22,050 Hz and a hop length of 512 samples, which resulted in temporal and frequency resolutions of 23.2 ms and 43 Hz, respectively. When the input corresponded to the MFCCs, we used a total of 39 coefficients: 13 MFCCs, 13 delta-MFCCs, and 13 second-order delta-MFCCs.
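These representations can be extracted with the librosa library as in the following sketch; the function name and argument defaults are ours, while the parameter values match those stated above (the Modulation-Spectral representation, which relies on the Cassani toolkit, is omitted for brevity).

```python
import librosa
import numpy as np

def extract_audio_representation(path, kind="mel", sr=22050, hop_length=512):
    # Load the raw recording X_a and compute its time-frequency representation S_a.
    y, sr = librosa.load(path, sr=sr)
    if kind == "mel":
        spec = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
        return librosa.power_to_db(spec)
    # 39 coefficients: 13 MFCCs plus their first- and second-order deltas.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    return np.concatenate([mfcc,
                           librosa.feature.delta(mfcc),
                           librosa.feature.delta(mfcc, order=2)], axis=0)
```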

The results obtained with the validation partition using the aforementioned conditions are shown in Table 2.

Table 2 Results on the validation partition in terms of \(\mathcal {R}\), \(\mathcal {P}\), and F1 measured in % for audio classification according to the type of input data representation and the neural network model

Note that, although both ResNet50 and Xception attain high-performance figures, the Mel spectrogram is the scenario with the best recall results, with \(\mathcal {R} = 93.7\%\), when compared to the MFCCs case, which obtains a maximum recall of \(\mathcal {R} = 91.4\%\), and the Modulation-Spectral one, which obtains \(\mathcal {R} = 82.2\%\), all of which correspond to the Xception model. While ResNet50 also attains high-performance results, Xception outperforms it in all cases. For instance, upon considering the MFCCs input, the recall decreases from the \(\mathcal {R} = 91.4\%\) obtained by the Xception model to the \(\mathcal {R} = 90.9\%\) obtained with ResNet50; the Modulation-Spectral case also shows a reduction in recall from \(\mathcal {R} = 82.2\%\) to \(\mathcal {R} = 78.0\%\). With regard to the Mel spectrogram, which provided the best results for all the metrics considered, the same tendency was observed, with the recall decreasing from the \(\mathcal {R} = 93.7\%\) obtained by Xception to the \(\mathcal {R} = 85.3\%\) obtained by ResNet50. Note also that the \(\mathcal {P}\) and F1 metrics are highly correlated with the previous recall analysis, with the best result in all cases being obtained for the Mel spectrogram representation processed by the Xception model. All of the above eventually led us to select the Mel spectrogram as the input data for the audio classifier.

The second step was the optimization of the hyper-parameters of the input representation, i.e., the hop length and sample rate parameters. For this analysis, we considered three different sample rate values, along with two different hop lengths. Table 3 shows the results obtained from the previously selected input representation for the two neural models considered in the work.

Table 3 Results obtained with the validation partition for the Mel spectrogram hyper-parameter tuning

One relevant difference between this and the previous experiment was that the best results for all the metrics considered were not consistently attained by one particular DNN and hyper-parameter configuration. In general, the Xception model outperformed the ResNet50 for all metrics and hyper-parameter configurations considered. More specifically, according to the \(\mathcal {P}\) and F1 metrics, Xception achieved the best overall results for the task in hand with a hop length of 512 samples and a sample rate of 22,050 Hz, attaining \(\mathcal {P} = 87.8\%\) and F1 = 90.6%. With regard to ResNet50, the best results were \(\mathcal {P} = 79.3\%\) and F1 = 81.0%; the first was obtained with a hop length of 1,024 samples and a sample rate of 22,050 Hz, whereas the second was obtained with a hop length of 512 samples and a sample rate of 11,025 Hz. Since the baseline state-of-the-art approaches [41, 42] reported their results in terms of \(\mathcal {R}\), we consequently decided to focus on this particular metric in order to determine the configuration for the final experiments. The hyper-parameter optimization using the validation partition shows that Xception attained the best recall results (\(\mathcal {R} = 94.3\%\)) when considering a hop length of 512 samples and a sample rate of 11,025 Hz, which resulted in temporal and frequency resolutions of 46.4 ms and 21.5 Hz, respectively. We, therefore, considered this configuration for the final comparison with the state-of-the-art proposals.

5.1.2 Video model

As described in Section 3.2, our proposal first trims the frames in order to obtain smaller images focused on the face of the individual, using the face detector of the aforementioned dlib library. The resulting images were resized to 64 × 64 pixels for reasons of simplification.
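A minimal sketch of this preprocessing step follows; OpenCV (cv2) is assumed here only for the resizing operation, as it is not among the toolkits listed in Section 4.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def trim_face(frame, size=64):
    # frame: one video frame x_{v,i} as an H x W x 3 uint8 array.
    # Returns the cropped face s_{v,i} resized to size x size, or None if no face is found.
    faces = detector(frame)
    if not faces:
        return None
    face = faces[0]
    top, left = max(face.top(), 0), max(face.left(), 0)
    crop = frame[top:face.bottom(), left:face.right()]
    return cv2.resize(crop, (size, size))
```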

The classifier used in this case is a combination of CNN and LSTM, whose details are shown in Table 4. We considered three variations of the model, henceforth denoted as \({\mathscr{M}}_{1}\), \({\mathscr{M}}_{2}\) and \({\mathscr{M}}_{3}\), with an increasing number of layers, respectively.

Table 4 Architecture of the neural networks considered for the video classification task

One of the most important hyper-parameters to adjust in this section is the number of frames of the video data to be introduced into the network. As mentioned previously, the data collection contains 10-second video excerpts recorded at 30 frames per second, which results in videos of 300 frames. However, since the difference between consecutive frames in terms of expression is usually almost imperceptible, some of them could be ignored.

On that premise, in our experiments we subsampled the number of frames so as to decrease the complexity of the learning task. More precisely, we experimented with two particular subsampling rates, taking one frame out of every five or ten of the initial frames. This preprocessing results in excerpts of 60 and 30 frames, respectively.
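In code, this subsampling reduces to keeping every fifth or tenth frame of each excerpt, as in the following sketch (the function name is ours):

```python
import numpy as np

def subsample_frames(frames, step=5):
    # frames: array of shape (300, H, W, 3) for a 10-second excerpt at 30 fps.
    # step = 5 keeps 60 frames; step = 10 keeps 30 frames.
    return np.asarray(frames)[::step]
```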

The results obtained with the different network models proposed for each of the subsampling policies considered are shown in Table 5.

Table 5 Results obtained for the validation set of the video classification for the different models and subsampling policies considered

We discovered that the most complex model—\({\mathscr{M}}_{3}\)—provided the best results with the 60-frame policy for all the metrics computed. With regard to the other models, \({\mathscr{M}}_{2}\) attained better results than \({\mathscr{M}}_{1}\) for both subsampling policies, with a maximum recall of \(\mathcal {R} = 90.1\%\) for \({\mathscr{M}}_{1}\) with 30 frames and \(\mathcal {R} = 93.7\%\) for \({\mathscr{M}}_{2}\) with 30 frames. However, since \({\mathscr{M}}_{3}\) with 60 frames obtained the best overall recall results with \(\mathcal {R} = 94.2\%\), this particular configuration was selected for the final experiments. The other metrics were found to follow a similar trend, and it consequently became clear that \({\mathscr{M}}_{3}\) was superior to the other alternatives for this task.

5.1.3 Fusion model

Having optimized the individual models for audio and video classification, we then combined them to build the multimodal audiovisual approach. As mentioned in Section 3.3, we considered two possible combinations: decision fusion and feature fusion.

With regard to the decision fusion, according to (1), it is necessary to study the optimal value of α for our multimodal method. Figure 4 shows the result of this combination with α ∈ [0,1] and a granularity of 0.1. Note that α = 0 corresponds to the unimodal audio model, while α = 1 represents the unimodal video model.

Fig. 4 Effect of the hyper-parameter α on the multimodal decision fusion when compared with the unimodal approaches in terms of \(\mathcal {R}\) on the validation partition

In this graph, the red curve represents the performance of our multimodal proposal in terms of the recall metric. The method clearly attains its best performance when both modalities are weighted equally, i.e., α = 0.5, obtaining a result of \(\mathcal {R} = 97.6\%\). When this value is compared with the second-best result, 95.5% when α = 0.4, an important difference between them becomes apparent: the error rate is reduced by over 46%, from 4.5% (α = 0.4) to 2.4% (α = 0.5), thus proving the need to properly adjust this hyper-parameter. Furthermore, the frustration detection of our multimodal proposal outperforms both of the unimodal models considered. After this analysis, we therefore eventually selected α = 0.5 for the final experiments. It is, however, worth highlighting the wide range of values of α for which our multimodal approach based on decision fusion outperforms the unimodal methods, particularly when α ∈ [0.1,0.8]. Moreover, the experiment shows that this combination never provides worse figures than the unimodal approaches, the worst case being α = 0.9, in which the performance of the fusion model equals that of the unimodal video model. These results reinforce the idea that our combination of audio and video data may be beneficial for the detection of frustration, a premise that, in fact, holds true for the majority of values of α.
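The selection of α can be reproduced with a simple grid search over the validation partition, as sketched below; p_audio and p_video denote the per-class probabilities of the two unimodal models and y_true the reference labels, with the frustration class assumed to be label 1.

```python
import numpy as np

def select_alpha(p_audio, p_video, y_true, positive=1):
    # Sweep alpha over [0, 1] with a 0.1 step and keep the value that maximizes
    # the recall of the frustration class on the validation partition.
    best_alpha, best_recall = 0.0, -1.0
    y_true = np.asarray(y_true)
    for alpha in np.arange(0.0, 1.01, 0.1):
        fused = alpha * np.asarray(p_video) + (1.0 - alpha) * np.asarray(p_audio)
        y_pred = np.argmax(fused, axis=1)
        tp = np.sum((y_pred == positive) & (y_true == positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        recall = tp / (tp + fn)
        if recall > best_recall:
            best_alpha, best_recall = alpha, recall
    return best_alpha, best_recall
```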

The second fusion scenario is the feature fusion. This consists of concatenating the intermediate descriptors extracted by the two unimodal streams described above in order to make a single decision about the presence of frustration. Table 6 shows a comparison between the best decision fusion model obtained (α = 0.5) and this second type of multimodal fusion.

Table 6 Comparison between the two fusion modalities considered on the validation partition of the corpus

Focusing on the recall metric, the decision fusion provides a performance of \(\mathcal {R} = 97.6\%\), whereas the feature fusion remains at \(\mathcal {R} = 95.6\%\). The decision fusion also attained a superior performance for the other metrics evaluated.

5.2 Final results and comparison with the state of the art

In this section, we show the results obtained for the test partition of the corpus considered and compare them with those obtained by the state-of-the-art approaches mentioned above. As commented on previously, the reference works in the literature that address the frustration detection task report only the recall metric, and we shall, therefore, consider only this particular measure for comparative purposes.

Table 7 shows the results obtained with the test partition of the corpus considered for all the different unimodal and multimodal proposals studied in this work, along with the results reported by the reference state-of-the-art methods.

Table 7 Results regarding the test partition of the corpus

Upon studying the results reported for the state-of-the-art methods, it will be observed that the multimodal approach by Song et al. [42] provides the worst recall results, with a value of \(\mathcal {R} = 60.3\%\), while the unimodal method [41] outperforms those results with \(\mathcal {R} = 93.1\%\). This remarkable difference between the scores obtained suggests that the aforementioned multimodal approach may not be properly exploiting its available sources of information, as a unimodal approach clearly outperforms it. Note that [41] performed this classification task through the use of more complex models, thus exploiting the capabilities of the neural networks, but that only audio was considered for the experiments, signifying that a valuable amount of information that could have been useful in the classification task was missed.

With regard to the unimodal audio and video strategies proposed in this work, note that they also attain competitive results, with recall values of \(\mathcal {R} = 91.7\%\) and \(\mathcal {R} = 89.6\%\), respectively. Note also that, when compared with [41], our audio-based model attains slightly worse results, since the recall decreases from \(\mathcal {R} = 93.1\%\) to \(\mathcal {R} = 91.7\%\). Although our unimodal approaches do not achieve the results obtained by the unimodal state-of-the-art method, they significantly outperform the existing multimodal method. The benefit of the automatic feature learning performed by the convolutional layers, as compared with hand-crafted feature extraction, is therefore demonstrated. It is important to state that, while the audio architecture proposed in the literature was reproduced here for the sake of comparability, our results are slightly lower than those reported in the reference work.

However, when analyzing the two multimodal approaches proposed in this work, it is evident that they are consistently better than the figures attained by the unimodal architectures. With regard to feature fusion, it yields a recall score of \(\mathcal {R} = 93.8\%\), which is thus better than the results of both \(\mathcal {R} = 60.3\%\) by [42] and \(\mathcal {R} = 93.1\%\) by [41]. The decision fusion attains the best overall recall value, with a score of \(\mathcal {R} = 95.9\%\).

While the improvements made may appear to be relatively limited, it should be noted that decision fusion obtained only a 4.1% recall error which, when compared with the 39.7% error provided by the state-of-the-art multimodal approach, implies a relative improvement of 89.7% for this figure of merit. A similar analysis comparing the decision fusion with the audio-based state-of-the-art result also yields a relative improvement of slightly more than 40%. This shows that our multimodal method is much more reliable as regards providing feedback about users’ game-play experience.

Although the feature fusion does not achieve the best performance, its high recall of 93.8% is also worth highlighting: from the point of view of the error made, this corresponds to an absolute recall error of 6.2%, or, in other words, a relative reduction in the error of over 10% with respect to the best state-of-the-art method, i.e., the audio-based one. These results reinforce the premise on which this research is based, i.e., that a multimodal approach may be a more appropriate means of carrying out the classification task than unimodal models, since it is able to leverage the information provided by the two data sources involved, audio and video, in order to make remarkable improvements to the detection of frustration when compared with single-source models.

The results obtained, therefore, confirm that both multimodal fusion methods presented in this work are considerably better than unimodal approaches, which tackle only either audio or video information. Moreover, the results obtained also outperform those attained by the state-of-the-art works addressing this same task, including both the multimodal and the unimodal proposals. Finally, note that the experimentation presented validates our proposed multimodal strategies, as they attain the best recall scores of all the benchmarked methods, with the decision fusion scheme being that which obtains the best overall results.

6 Conclusions

Frustration detection addresses an emotion of particular interest to the video game industry, since it is directly correlated with the users’ engagement. However, its estimation and tracking remain an open research question, especially when invasive tracking devices are not considered. In this context, this work introduces a new approach with which to detect frustration in non-invasive scenarios by considering multimodal strategies that fuse the information extracted from the different individual data sources by means of a feature-learning stage based on Deep Neural Networks (DNNs). More precisely, when considering audiovisual data, the idea is to extract meaningful descriptors from the audio and video sources of data and combine them in order to eventually perform frustration detection. Note that this fusion synergistically exploits the capabilities of DNNs to obtain a suitable set of features with which to detect the frustration emotion for each particular data source.

We specifically propose two multimodal approaches with which to merge the audio and video pieces of information: a decision-level approach, which combines the individual decisions made with each data source, and a feature-level policy, which combines the individual features extracted by the DNNs from each type of data in order to then make a single decision. The experiments reveal that the two proposed multimodal fusion methods outperform unimodal strategies, while also providing better results than the state-of-the-art schemes found in the related literature. The best results were specifically obtained with the decision-level fusion, with a recall score of 95.9%, thus reducing the error rate by almost 90% in comparison to the multimodal state-of-the-art approach, and by over 40% when compared to that of the unimodal audio-based method.

The remarkable improvement obtained with our approach validates not only the use of multimodal approaches as regards merging different sources of information in a synergistic manner, but also the use of DNNs as feature extractors for emotion recognition tasks other than those related to frustration. However, this proposal still has considerable constraints, such as the limited number of information sources or the simple neural architectures considered. In this respect, future work should consider the inclusion of other complementary data, e.g., eye gaze or information related to playing time. We also aim to further study other fusion modalities, such as early fusion, or to explore feature fusion methods in greater depth. Finally, a further objective is that of exploring other neural architectures based on residual connections since, as proved by other works in the literature, they may further improve the results obtained.