Speech QoE can be assessed either subjectively or objectively [12, 27]. Subjective testing typically involves user interviews, ratings, and surveys to obtain insights into the end-user’s perception, opinion, and emotions regarding speech quality and their overall experience, thus forming the ‘ground truth’. Objective assessment, on the other hand, replaces the listener with a computational algorithm that has learned complex mappings between several key factors and previously recorded subjective ratings. Existing objective methods have been “technology-centric”, relying mostly on technological and contextual factors [12]. To develop more complete QoE assessment methods, however, human influence factors (HIFs) also need to be incorporated. More recently, neurophysiological monitoring tools have been used to develop objective models that aim to estimate this ground truth. With this in mind, in this section we describe the methodology and experimental setup used in our study.
Subjective assessment methods
Quantitative subjective assessment methods typically involve questionnaires with rating scales, surveys, and user studies, which can be conducted in either laboratory or “real-world” settings. The International Telecommunication Union (ITU), for example, has developed subjective study guidelines for perceptual speech quality evaluations. Recommendation P.800 [28] describes how to conduct the widely-used mean opinion score (MOS) listening test. The human-computer interaction domain has also produced guidelines on subjective testing methods for speech interface quality evaluation [29]. For speech intelligibility assessment, subjective tests are conducted that explore syllable, word, or sentence recognition.
In the affective computing domain, in turn, human affect is considered to manifest itself through multifaceted verbal and non-verbal expressions. Therefore, one common approach is to categorize affective factors along two broad dimensions, valence (V) and arousal (A), on two-dimensional plots [30]. Valence refers to the (un)pleasantness of an event, whereas arousal refers to the intensity of the event, ranging from very calming to highly exciting [31, 32]. Using the valence-arousal (VA) model, various emotional constructs have been developed, as depicted in Fig. 2 [30, 32, 33]. To quantitatively characterize these two emotional primitives, the Self Assessment Manikin (SAM) pictorial system is commonly used, as shown in Fig. 3 [30, 32]. As can be seen, the SAM for valence ranges from a smiling, happy manikin to a frowning, unhappy one. For arousal, in turn, the SAM ranges from a very excited, eyes-open manikin to a sleepy, eyes-closed one [34]. It is important to emphasize that a third dimension, dominance, has also been proposed and refers to the controlling/dominant nature of the felt emotion [31]. While dominance has been shown to be useful in characterizing emotions felt by subjects viewing pictures [17] and watching movies [35], it has shown limited use with speech stimuli and is thus omitted from our studies.
Objective assessment methods
Objective assessment methods are also often referred to as instrumental measures. QoE insights are normally estimated either using technology-centric speech metrics or, more recently, via neurophysiological monitoring tools (i.e., aBCIs), as detailed below.
Technology-centric speech metrics
Technology-centric models replace the human rater with a computer algorithm developed to extract relevant features from the analyzed signal (speech, audio, image, or video) and map a subset/combination of such features into an estimated QoE value. For speech technologies, models can be further categorized as full-reference (also known as double-ended or intrusive) or no-reference (single-ended, non-intrusive), depending on whether or not a clean reference signal is required, respectively. The ITU, for example, has standardized several objective models over the last decade, such as PESQ (recommendation P.862 [36]) and POLQA (recommendation P.863 [37]) as full-reference models and ITU recommendation P.563 [38] as a no-reference model.
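To illustrate the full-reference paradigm concretely, the minimal sketch below computes a PESQ score for a degraded file against its clean reference. It assumes the third-party `pesq` Python package as a stand-in for ITU-T P.862 and hypothetical file names; it is not part of the toolchain used in this study.

```python
# Minimal full-reference sketch, assuming the third-party 'pesq' package
# (an open-source implementation of ITU-T P.862); file names are hypothetical.
from scipy.io import wavfile
from pesq import pesq

fs, reference = wavfile.read("clean.wav")      # clean reference signal
_, degraded = wavfile.read("degraded.wav")     # same sampling rate assumed

# Full-reference (double-ended): both the reference and degraded signals are needed.
mos_lqo = pesq(fs, reference, degraded, 'nb')  # 'nb' = narrowband mode (8 kHz)
print(f"PESQ MOS-LQO: {mos_lqo:.2f}")
```

A no-reference model such as P.563, in contrast, would operate on the degraded signal alone.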
For hands-free speech communications, one non-intrusive method, the reverberation to speech modulation energy ratio (RSMR), has been shown to outperform the abovementioned standardized algorithms and is therefore used in our studies. A description of the metric is beyond the scope of this paper; the interested reader is referred to [39, 40] for more details. Moreover, for TTS systems, studies have shown the importance of signal-based metrics [41], such as prosody and articulation [42]. Recently, two quantitative parameters were shown to be useful [42] and are thus used in our TTS study: the slope of the second order derivative of the fundamental frequency (\(sF0''\)) and the absolute mean of the second order mel frequency cepstrum coefficient (\(MFCC_{2}\)). While the \(sF0''\) feature models the macro-prosodic or intonation-related properties of speech, \(MFCC_{2}\) models articulation-related properties [42]. In our experiments, the openSMILE toolbox [43] was used to extract these features using the default window length of 25 ms and frame shift of 12.5 ms.
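For readers who wish to reproduce comparable features without openSMILE, the sketch below approximates the two TTS metrics using librosa. The windowing matches the 25 ms/12.5 ms settings above, but the pitch tracker, MFCC indexing convention, and function names are assumptions rather than the exact openSMILE configuration used in the study.

```python
# Approximate extraction of sF0'' and |mean(MFCC_2)|, assuming librosa as a
# stand-in for openSMILE; indexing and pitch-tracking choices are assumptions.
import numpy as np
import librosa

def tts_signal_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.0125 * sr)                         # 12.5 ms frame shift
    win = int(0.025 * sr)                          # 25 ms analysis window

    # Fundamental frequency contour (voiced frames only).
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                                 frame_length=2048, hop_length=hop)
    f0 = f0[voiced & ~np.isnan(f0)]

    # sF0'': slope (linear fit) of the second-order derivative of the F0 contour.
    f0_dd = np.diff(f0, n=2)
    sF0_dd = np.polyfit(np.arange(len(f0_dd)), f0_dd, 1)[0] if len(f0_dd) > 1 else 0.0

    # MFCC_2: absolute mean of the second mel-frequency cepstral coefficient
    # (row index 2 here, with row 0 the energy-like MFCC_0; convention assumed).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=hop)
    mfcc2 = np.abs(np.mean(mfcc[2]))

    return sF0_dd, mfcc2
```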
aBCI features
Typical EEG-aBCI features involve the calculation of specific EEG frequency sub-band powers, such as the delta, theta, alpha, beta, or gamma sub-bands, as well as their interactions [44]. To characterize human affective states, activity over the human prefrontal cortex (PFC) region has been widely used. Seminal studies have shown differential involvement of the right and left hemispheres in emotional processing, where the right hemisphere is linked with unpleasant emotions and the left with pleasant emotions [45, 46]. As such, an asymmetry index has been developed which measures the difference in alpha-band (8–12 Hz) EEG activity between the right and left hemispheres; the index has been shown to be correlated with the valence emotional primitive [47, 48]. Moreover, the beta frequency band (12–30 Hz) power at the medial prefrontal cortex (MPC) has been associated with arousal [49].
Therefore, in order to objectively characterize affective factors, two features were extracted, namely an alpha-band asymmetry index (AI) and the MPC beta power (MBP), as correlates of valence and arousal, respectively. More specifically, the AI feature was computed as the difference between the natural logarithm of the alpha power of the left (\(\alpha _{AF3}\)) and right frontal electrodes (\(\alpha _{AF4}\)), as highlighted in the electrode map depicted by Fig. 4 and suggested by [47]:
$$\begin{aligned} AI=\ln (\alpha _{AF4})-\ln (\alpha _{AF3}). \end{aligned}$$
(1)
The MBP feature, in turn, was computed as the beta-band power in the AFz position (central electrode highlighted in Fig. 4), as suggested by [17, 50].
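As a practical illustration, the sketch below computes both features from a single artefact-free epoch using Welch band-power estimates from SciPy. The 256 Hz sampling rate follows the pre-processing described later; the function names and Welch segment length are assumptions made for this sketch, not a description of the exact implementation used in the study.

```python
# Sketch of the AI (Eq. 1) and MBP features, assuming pre-processed,
# artefact-free epochs sampled at 256 Hz; Welch settings are illustrative.
import numpy as np
from scipy.signal import welch

def band_power(x, fs, band):
    """Integrated power of signal x within the given frequency band (Hz)."""
    freqs, psd = welch(x, fs=fs, nperseg=2 * fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.trapz(psd[mask], freqs[mask])

def abci_features(af3, af4, afz, fs=256):
    """af3, af4, afz: 1-D arrays holding one epoch from electrodes AF3, AF4, AFz."""
    ai = np.log(band_power(af4, fs, (8, 12))) - np.log(band_power(af3, fs, (8, 12)))
    mbp = band_power(afz, fs, (12, 30))      # medial prefrontal beta power
    return ai, mbp
```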
Experimental setup: dataset 1 (hands-free communications)
Participants
Fifteen naive subjects participated in this study (eight female, seven male; mean age = 23.27 years; SD = 3.57; range = 18–30); all were fluent English speakers (i.e., English was their first or second language). All participants reported normal auditory acuity and no medical problems. Participants gave informed consent and received monetary compensation for their participation. The study protocol was approved by the Research Ethics Office at INRS-EMT and at McGill University (Montreal, Canada).
Stimuli
As stimulus, a clean double-sentence speech file created from the TIMIT database [51] was used. The clean file was then convolved separately with room impulse responses typical of three practical environments. The first represented a living room environment with a reverberation time (RT) of approximately 400 ms, the second a classroom environment (\(RT=1.5\) s), and the third a large auditorium (\(RT=2\) s). Higher RT values indicate rooms with greater reverberation levels, which, in turn, are more detrimental to perceived speech quality. For consistency, all files were normalized to \(-\)26 dBov using the ITU-T P.56 voltmeter [52]. The sentence was uttered by a male speaker and digitized at an 8 kHz sampling rate with 16-bit resolution. Speech files representative of the four hands-free conditions (clean plus the three reverberant versions) were presented to the participants over several trials, as detailed in the sections to follow. More details about this database can be found in [53].
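The stimulus-generation step can be summarized by the sketch below, which convolves the clean file with each room impulse response and applies a level normalization. File names are hypothetical, and the level adjustment uses a plain RMS approximation of \(-\)26 dBov rather than the ITU-T P.56 active-speech-level voltmeter used for the actual dataset.

```python
# Illustrative reverberant stimulus generation; file names are hypothetical and
# the -26 dBov normalization is approximated with a simple RMS measure.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs, clean = wavfile.read("timit_double_sentence.wav")    # 8 kHz, 16-bit
clean = clean.astype(np.float64) / 32768.0

for name in ["rir_livingroom", "rir_classroom", "rir_auditorium"]:
    _, rir = wavfile.read(f"{name}.wav")
    rir = rir.astype(np.float64) / 32768.0

    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Approximate normalization to -26 dBov (RMS relative to digital full scale).
    target_rms = 10.0 ** (-26.0 / 20.0)
    reverberant *= target_rms / (np.sqrt(np.mean(reverberant ** 2)) + 1e-12)

    wavfile.write(f"{name}_stimulus.wav", fs,
                  (np.clip(reverberant, -1.0, 1.0) * 32767).astype(np.int16))
```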
Experimental protocol
The experiment was carried out in two phases. In the first phase, participants were asked to fill out a demographic questionnaire and to report their perceived QoE for each file using a 5-point MOS scale (1 = bad, \(\ldots\), 5 = excellent), as well as their perceived arousal and valence affective states using a continuous 9-point SAM scale, as shown in Fig. 3. Stimuli were repeated thrice for each speech quality condition. In the second phase, participants were placed in a listening booth and 64-channel EEG data were collected using an ActiveTwo Biosemi device with electrodes arranged in the modified 10–20 standard system (see Fig. 4). Four electrodes were used for electro-oculography (EOG) and two mastoid electrodes (right and left) were used for referencing. The test consisted of an oddball paradigm, where the clean speech served as the standard stimulus and the reverberant files served as deviants. Clean and reverberant speech files were delivered in a pseudo-randomized order, forcing at least one standard stimulus to be presented between successive deviants, in sequences of 100 trials. Stimuli were presented with an inter-stimulus interval varying from 1000 to 1800 ms. Participants were seated comfortably and were instructed to press a button indicating whether they detected the clean stimulus or one of the deviants. Stimuli were presented binaurally at the individual’s preferred listening level through in-ear headphones.
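The constraint on the pseudo-randomized order (at least one standard between successive deviants) can be implemented as in the short sketch below. The randomization routine, deviant probability, and labels are assumptions for illustration only; the study does not specify the exact procedure.

```python
# Illustrative pseudo-randomization for the oddball paradigm: 100-trial
# sequences with at least one standard (clean) stimulus between deviants.
# Deviant probability and labels are assumed for this sketch.
import random

def oddball_sequence(n_trials=100, p_deviant=0.25, seed=0):
    deviants = ("reverb_400ms", "reverb_1500ms", "reverb_2000ms")
    rng = random.Random(seed)
    seq, last_was_deviant = [], False
    for _ in range(n_trials):
        if not last_was_deviant and rng.random() < p_deviant:
            seq.append(rng.choice(deviants))
            last_was_deviant = True
        else:
            seq.append("clean")
            last_was_deviant = False
    return seq

# Example: one 100-trial sequence with a jittered inter-stimulus interval per trial.
trials = oddball_sequence()
isi_rng = random.Random(1)
isis = [isi_rng.uniform(1.0, 1.8) for _ in trials]   # 1000-1800 ms, in seconds
```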
Experimental setup: dataset 2 (TTS systems)
Participants
Twenty-one fluent English speakers (eight females) with an average age of 23.8 (\(\pm\)4.35) years were recruited for the study. None of them reported having any hearing or neurophysiological disorders. Insert earphones were used to present the speech stimuli to the participants at their individually preferred volume levels. The study protocol was approved by the INRS Research Ethics Office, and participants consented to participate and to make their de-identified data freely available online. The participants were also compensated monetarily for their time.
Stimuli
Table 1 lists the speech stimuli used for this study along with certain important aspects. The stimuli consisted of four natural voices and seven synthesized voices obtained from commercially available systems, namely: Microsoft, Apple, Mary TTS (unit selection and HMM), vozMe, Google, and Samsung. The tested systems cover a range of different concatenative and hidden Markov model (HMM) based systems. A non-identifying code is provided for each of the seven TTS systems in Table 1. Speech samples were generated from two sentence groups (A and B), each comprising four sentences. Thus, the total number of stimuli used in this study was forty-four [(4 natural voices + 7 synthesized voices) \(\times\) 4 sentence sets]. The speech stimuli included both male and female voices for five of the seven TTS systems. The speech stimuli were presented to listeners at a sampling rate of 16 kHz and a bitrate of 256 kbps. Table 1 also details the duration range of the speech stimuli for each system. More details about this database can be found in [54].
Table 1 Description of the stimuli used for the listening test in dataset 2
Experimental protocol
The experimental procedure was carried out in accordance with the ITU-T P.85 recommendation [55]. Participants were first comfortably seated in front of the computer screen inside a soundproof room. Participants were then fitted with 62 EEG electrodes (AF7 and AF8 were not used) using a compatible EEG cap. Insert earphones were placed comfortably inside the participants’ ears to deliver the speech stimuli. The experiment was then carried out in two phases: a familiarity phase and an experimental phase. In the familiarity phase, participants were presented with a sample speech file followed by the series of rating questions, thus illustrating the experimental procedure and giving them the opportunity to report any problems and/or concerns. Next, the experimental phase consisted of several steps, as shown in Fig. 5. First, baseline data were collected for 1 min, during which the participants were advised to focus only on the cross in the middle of the screen and not think about anything else. This was followed by a 15-s rest period and then the presentation of the randomized speech stimuli, one sentence set (approximately 20 s long) at a time. The rest period was provided to allow neural activity and cerebral blood flow to return to baseline levels prior to TTS stimulus presentation. Moreover, following each stimulus, participants were presented with rating questions on the screen, where they scored the stimulus using a continuous slider on the 5-point MOS scale and the 9-point SAM scales for valence and arousal. This rest-stimulus-rating combination is referred to as an experimental ‘block’. The procedure was repeated 44 times, with each block corresponding to one of the 44 speech stimuli available in the dataset.
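For clarity, the block structure of the experimental phase can be summarized by the following timing skeleton. The presentation software used in the study is not specified, so the callables (present_fixation, play_stimulus, collect_ratings) are hypothetical placeholders.

```python
# Bare-bones timing skeleton of one session (dataset 2); the presentation
# callables are hypothetical placeholders, not the software used in the study.
import random

BASELINE_S, REST_S = 60, 15   # 1-min baseline, 15-s rest before each stimulus

def run_session(stimuli, present_fixation, play_stimulus, collect_ratings, seed=0):
    order = list(stimuli)
    random.Random(seed).shuffle(order)      # randomized order of the 44 stimuli

    present_fixation(BASELINE_S)            # baseline: fixate the on-screen cross
    ratings = []
    for stim in order:                      # block = rest -> stimulus -> rating
        present_fixation(REST_S)
        play_stimulus(stim)                 # one sentence set (~20 s)
        ratings.append(collect_ratings())   # MOS (1-5), valence/arousal SAM (1-9)
    return ratings
```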
EEG data processing
For data analysis, the MATLAB-based EEGLAB toolbox was used [56]. Data were recorded at 512 Hz but down-sampled to 256 Hz and band-pass filtered between 0.5 and 50 Hz for offline analysis. All channels were re-referenced to the ‘Cz’ channel. For the first dataset, continuous EEG data were divided into epochs of 3000 ms, time-locked to the onset of the stimuli, with a 200 ms pre-stimulus baseline. For the second dataset, the EEG data were divided into epochs whose lengths corresponded to the speech stimulus duration, with a 300 ms pre-stimulus baseline. To remove artefacts from the EEG signals (e.g., eye blinks), a combination of visual inspection and independent component analysis was performed. Features were then extracted from the artefact-free segments.
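Although the study used EEGLAB, an equivalent pre-processing chain can be sketched in MNE-Python as shown below (dataset 1 parameters). The file name, channel naming, event codes, and number of ICA components are assumptions for illustration.

```python
# Equivalent pre-processing sketch in MNE-Python (the study used EEGLAB);
# file name, channel naming and ICA settings are illustrative assumptions.
import mne

raw = mne.io.read_raw_bdf("subject01.bdf", preload=True)   # Biosemi recording
raw.resample(256)                                           # 512 Hz -> 256 Hz
raw.filter(l_freq=0.5, h_freq=50.0)                         # band-pass 0.5-50 Hz
raw.set_eeg_reference(["Cz"])                               # re-reference to Cz

events = mne.find_events(raw)                               # stimulus onsets
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=3.0,       # 200 ms baseline,
                    baseline=(None, 0), preload=True)       # 3000 ms epochs

# ICA-based artefact removal; components are flagged via visual inspection.
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(epochs)
ica.exclude = [0]                       # hypothetical eye-blink component index
epochs_clean = ica.apply(epochs.copy())
```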
QoE model performance assessment
In order to assess QoE model performance, three tests were conducted for each study. First, we explored the goodness-of-fit (\(r^2\)) achieved by using only the technology-centric speech metric as a correlate of the QoE score reported by the listeners (denoted as \(QoE_{Tech}\)). Second, we investigated the gains obtained by including HIFs into the QoE models. Here, we measured the \(r^2\) obtained from a linear combination of the technology-centric speech metric combined with the subjective valence and arousal (‘ground truth’) ratings reported by the listeners (denoted as \(QoE_{HIF}\)). Gains in the goodness-of-fit metric should indicate the benefits of including HIFs into QoE perception models. Lastly, we replaced the ground truth HIFs by the aBCI features that are used as correlates of the listener’s emotional states (denoted as \(QoE_{aBCI}\)). It is expected that the \(r^2\) achieved will lie between those achieved without and with HIFs, thus signalling the importance of aBCIs in QoE perception modelling.
Towards this end, the goodness-of-fit measures were obtained by developing a linear regression equation for each of the three proposed tests (\(i=1,\ldots ,3\)). Linear regression model \(i\) has dependent variable \(y_{i}\) expressed as a linear combination of \(p\) independent variables (or regressors, \(x_{ip}\)) weighted by regression coefficients (\(\beta _{p}\)), plus an error term (\(\epsilon _{i}\)). The linear regression is formulated as follows:
$$\begin{aligned} y_{i} = \beta _{1}x_{i1} + \beta _{2}x_{i2} + \cdots + \beta _{p}x_{ip} + \epsilon _{i} = \varvec{x}_{i}^{T}\varvec{\beta } + \epsilon _{i}. \end{aligned}$$
(2)
The regression coefficients \(\beta\) are estimated using least squares fitting on training data.
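The three tests can be reproduced with the short sketch below, which fits each model by ordinary least squares and reports the training-data \(r^2\). The arrays are random placeholders standing in for the per-stimulus feature values and ratings; names and shapes are assumptions for illustration only.

```python
# Sketch of the three goodness-of-fit tests; the placeholder arrays stand in
# for the per-stimulus metrics/ratings and carry no real data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def goodness_of_fit(X, y):
    """Least-squares fit of y on X; return r^2 on the training data."""
    model = LinearRegression().fit(X, y)
    return r2_score(y, model.predict(X))

rng = np.random.default_rng(0)
n = 60                                          # number of rated stimuli (placeholder)
tech = rng.normal(size=(n, 1))                  # technology-centric metric (e.g., RSMR)
val, aro = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))   # subjective SAM ratings
ai, mbp = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))    # aBCI correlates
mos = rng.normal(size=n)                        # reported QoE (MOS)

r2_tech = goodness_of_fit(tech, mos)                          # QoE_Tech
r2_hif  = goodness_of_fit(np.hstack([tech, val, aro]), mos)   # QoE_HIF
r2_abci = goodness_of_fit(np.hstack([tech, ai, mbp]), mos)    # QoE_aBCI
```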