The effectiveness of the ViSQOL model is demonstrated through a performance evaluation comprising five experiments that cover both VoIP-specific degradations and general quality issues. Experiment 1 expands on the clock drift and warp detection results presented in [5] and adds a comparison with subjective listener data. Experiment 2 evaluates the impact on objective quality assessment of the small playout adjustments introduced by jitter buffers. Experiment 3 builds upon this to further analyze an open question from [28,42], where POLQA and ViSQOL show inconsistent quality estimates for some combinations of speaker and playout adjustment. Experiment 4 uses a subjectively labeled database of VoIP degradations to benchmark model performance for clock drift, packet loss, and jitter. Finally, Experiment 5 presents benchmark tests with other publicly available speech quality databases to evaluate the effectiveness of the model against a wider range of speech quality issues.
Experiment 1: clock drift and temporal warping
The first experiment tested the robustness of the three models to time warping. Packet loss concealment algorithms can effectively mask packet loss by warping speech samples with small playout adjustments. Here, ten sentences from the IEEE Harvard Speech Corpus were used as reference speech signals [43]. Time warp distortions caused by low-frequency clock drift between the signal transmitter and receiver were simulated: the 8-kHz sampled reference signals were resampled to create time-warped versions for resampling factors ranging from 0.85 to 1.15. This test corpus was created specifically for these tests, and a subjective listener test was carried out with ten subjects (seven male and three female) in a quiet environment using headphones. They were presented with 40 warped speech samples and asked to rate them on a MOS ACR scale. The test comprised four versions of each of the ten sentences, and ten resampling factors were tested, including the non-resampled factor of 1.
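As an illustration, time-warped test signals of this kind could be generated with a resampling sketch like the following (Python; the file names and the selection of factors are illustrative, and the convention that a factor above 1 lengthens the signal is an assumption rather than a detail from [5]):

```python
import numpy as np
from fractions import Fraction
from scipy.io import wavfile
from scipy.signal import resample_poly

def simulate_clock_drift(signal, warp_factor):
    """Time-warp a signal by resampling with a rational approximation
    of the warp factor; played back at the original rate, the result
    runs warp_factor times the length of the reference."""
    frac = Fraction(warp_factor).limit_denominator(1000)
    return resample_poly(signal, frac.numerator, frac.denominator)

fs, reference = wavfile.read("reference_8k.wav")  # hypothetical 8-kHz file
for factor in [0.85, 0.90, 0.95, 0.99, 1.01, 1.05, 1.10, 1.15]:
    warped = simulate_clock_drift(reference, factor)
    # Cast back to the source dtype before writing (sketch-level handling).
    wavfile.write(f"warped_{factor:.2f}.wav", fs, warped.astype(reference.dtype))
```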
The reference and resampled degraded signals were evaluated using PESQ, POLQA, and ViSQOL for each sentence at each resampling factor. The results are presented in Figure 8, with the subjective listener test results in the top plot and the predictions from the objective measures below. The resampling factors from 0.85 to 1.15 along the x-axis are plotted against narrowband mean opinion scores (MOS-LQSn) for the subjective tests and narrowband objective mean opinion scores (MOS-LQOn) for the quality predictions of the three metrics.
The number of subjects and the range of test material in the subjective tests (40 samples rated by ten listeners) make a detailed analysis of the impact of warp on subjective speech quality infeasible. However, the strong visible trend does allow comparison of, and comment on, the predictive capabilities of the objective metrics.
The subjective results show a large perceived drop-off in speech quality for warps of 10% to 15%, whereas warps of less than 5% suggest a perceptible change but not a large drop in MOS-LQSn score. There is an apparent trend indicating that warp factors below 1 yield better quality scores than those above 1, but further experiments with a range of speakers would be required to rule out voice variability.
The most notable results can be highlighted by examining the plus and minus 5%, 10%, and 15% warp factors. At 5%, the subjective tests point towards a perceptible change in quality, but one that does not greatly alter the MOS-LQSn score. ViSQOL predicts a gradual drop in quality between 1% and 5%, and POLQA predicts no drop. Either result would be preferred to that of PESQ, which predicts a rapid drop to just above 1 MOS-LQOn for a warp of 5%.
At 10% to 15%, the subjective tests indicate that a MOS-LQSn of 2 to 3 should be expected, and ViSQOL predicts this trend. However, both POLQA and PESQ have saturated their scales and predict the minimum MOS-LQOn score of 1 from 10% warping onwards. Warping of this scale does cause a noticeable change in voice pitch relative to the reference speech, but the gentle decline in quality scores predicted by ViSQOL is more in line with listeners’ opinions than those of PESQ and POLQA.
The use of jitter buffers is ubiquitous in VoIP systems and often introduces warping to speech. The use of NSIM for patch alignment, combined with estimating similarity from warp-adjusted patches, provides ViSQOL with a promising warp estimation strategy for speech quality estimation. Small amounts of warp (around 5% or less) are critical for VoIP scenarios, where playout adjustments are commonly employed. Unlike PESQ, where small warps cause large drops in predicted quality, both POLQA and ViSQOL exhibit an insensitivity to warps of up to 5% that reflects the listener quality experience.
Experiment 2: playout delay changes
Short network delays are commonly dealt with using per-talkspurt adjustments, i.e., inserting or removing portions of silence periods, to cope with time alignment in VoIP. Work by Pocta et al. [42] used sentences from the English-speaking portion of the ITU-T P Supplement 23 coded-speech database [44] to develop a test corpus of realistic delay adjustment conditions. One hundred samples (96 degraded and four references; two male and two female speakers) covered a range of 12 realistic delay adjustment conditions. The adjustments were a mix of positive and negative adjustments summing to zero (adding and removing silence periods). The conditions comprised two variants (A and B), with the adjustments applied towards the beginning or the end of the speech sample. The absolute sum of adjustments ranged from 0 to 66 ms. Thirty listeners participated in the subjective tests, and MOS scores were averaged for each condition.
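As a concrete illustration, the Python sketch below applies such a zero-sum pattern of signed silence adjustments; the helper name, the adjustment positions, and the use of zeros for inserted silence are illustrative assumptions, and in practice the insertion points would be chosen from detected silence periods:

```python
import numpy as np

FS = 8000  # narrowband sample rate (Hz)

def apply_playout_adjustments(signal, adjustments):
    """Apply signed playout adjustments given as (time_s, delta_ms) pairs:
    a positive delta inserts silence, a negative delta removes samples.
    Applied last-to-first so earlier indices remain valid."""
    out = signal
    for time_s, delta_ms in sorted(adjustments, reverse=True):
        idx = int(time_s * FS)
        n = abs(delta_ms) * FS // 1000
        if delta_ms >= 0:
            out = np.concatenate((out[:idx], np.zeros(n, out.dtype), out[idx:]))
        else:
            out = np.concatenate((out[:idx], out[idx + n:]))
    return out

# Dummy 3-s "speech" signal and a zero-sum pattern: +20 ms, then -20 ms.
speech = np.random.randn(3 * FS).astype(np.float32)
degraded = apply_playout_adjustments(speech, [(0.8, 20), (2.1, -20)])
assert len(degraded) == len(speech)  # net length change is zero
```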
Where Experiment 1 investigated time warping, this experiment investigates a second VoIP factor, playout delay adjustments. The two factors are investigated and presented here in isolation rather than combined in a single test: in a real VoIP system they would occur together, but analyzing them separately is a practical compromise.
The adjustments used are typical, in extent and magnitude, of those introduced by VoIP jitter buffer algorithms [45]. The subjective test results showed that speaker voice preference dominated the results more than the duration or location of the playout delay adjustments [42]. By design, full-reference objective metrics, including ViSQOL, do not quantify speaker voice preference, which reduces their correlation with the subjective tests.
The test conditions were compared to the reference samples for the 12 conditions, and the results for ViSQOL, PESQ, and POLQA were compared to those from the subjective tests. These tests and the dominant subjective factors are discussed in more detail in [28,42].
This database is examined here to investigate whether realistic playout adjustments that were shown to be imperceptible from a speech quality perspective are correctly disregarded by ViSQOL, PESQ, and POLQA.
The per-condition results previously reported [42] showed poor correlation between subjective and objective scores for all metrics tested, but this was a result of the playout delay changes not being a dominant factor in the speech quality. The results were analyzed for PESQ and POLQA [42] and subsequently for ViSQOL [28], showing MOS scores grouped by speaker and variant instead of playout condition. The combined results from both studies are presented in Figure 9. In the plot of listener test results, MOS-LQS is plotted on the y-axis against speaker/variant on the x-axis. It is apparent from the 95% confidence interval bars that condition variability was minimal and that there was little difference between variants. The dominant factor was voice quality, i.e., the inherent pleasantness of the talker’s voice, rather than any transmission factor. Hence, as voice quality is not accounted for by the full-reference metrics, maximum scores should be expected for all speakers. PESQ exhibited variability across all tests, indicating that playout delay was impacting its quality predictions, as clearly shown in [42]. The results for ViSQOL and POLQA are much more promising apart from some noticeable deviations, e.g., Male 1, Variant A (M1A) for ViSQOL and Female 1, Variant B (F1B) for POLQA.
Experiment 3: playout delay changes II
A follow-up test was carried out to establish the cause of the variability in the results from Experiment 2. This test focused on two speech samples from Experiment 2 for which ViSQOL and POLQA predicted quality to be much lower than was found in subjective testing.
For this experiment, two samples were examined: in the first, a silent playout adjustment is inserted in a silence period, and in the second, it is inserted within an active speech segment. The start times for the adjustments are illustrated in the lower panes of Figure 10. The quality was measured for each test sentence containing progressively longer delay adjustments, with the delay increased from 0 to 40 ms in 2-ms increments. The upper panes present the results, with the duration of the inserted playout adjustment on the x-axis plotted against the predicted MOS-LQOn from POLQA and ViSQOL on the y-axis.
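A minimal sketch of this sweep is given below; the insertion position and the commented metric call are placeholders rather than an actual ViSQOL or POLQA API:

```python
import numpy as np

FS = 8000                                            # sample rate (Hz)
speech = np.random.randn(3 * FS).astype(np.float32)  # dummy reference
idx = int(1.2 * FS)  # assumed insertion point (silence or active speech)

# Sweep a single silent insertion from 0 to 40 ms in 2-ms steps.
for delay_ms in range(0, 42, 2):
    gap = np.zeros(FS * delay_ms // 1000, dtype=speech.dtype)
    degraded = np.concatenate((speech[:idx], gap, speech[idx:]))
    # score = predict_mos(speech, degraded)  # hypothetical metric call
```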
ViSQOL displays a periodic variation of up to 0.5 MOS for certain adjustment lengths. Conversely, POLQA remains consistent in the second test (aside from a small drop of around 0.1 for a 40-ms delay), while in the first test, delays from 4 up to 14 ms cause a rapid fall in predicted MOS, with a maximum drop in MOS-LQOn of almost 2.5. These tests highlight the fact that not all imperceptible signal adjustments are handled correctly by either model.
The ViSQOL error is due to the spectrogram windowing and the alignment of patches. The problems highlighted by these examples occur only in specific circumstances where the delays are of certain lengths. Also, as demonstrated by the results of the previous experiment, the problem can be alleviated by a canceling effect between multiple delay adjustments, where positive and negative adjustments balance out the misalignment.
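This behavior is consistent with a frame-alignment effect: an inserted gap shifts every subsequent spectrogram frame by the gap length modulo the hop size, so only gaps near a multiple of the hop leave the frame grid intact. The sketch below illustrates the idea with an assumed 128-sample hop (16 ms at 8 kHz); ViSQOL's actual window parameters may differ:

```python
FS = 8000   # sample rate (Hz)
HOP = 128   # assumed spectrogram hop size: 16 ms at 8 kHz

# Residual frame offset caused by each inserted gap length; the offset
# cycles back to zero whenever the gap is a multiple of the hop.
for delay_ms in range(0, 42, 2):
    gap = FS * delay_ms // 1000
    print(f"{delay_ms:2d} ms gap -> {gap % HOP:3d} samples off the frame grid")
```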
Together with warping, playout delay adjustments are a key factor in VoIP quality assessment. Flagging these two imperceptible temporal adjustments as quality issues could mask other factors that actually are perceptible. Although both have limitations, ViSQOL and POLQA again perform better than PESQ under these conditions.
Experiment 4: VoIP specific quality test
A VoIP speech quality corpus, referred to in this paper as the GIPS E4 corpus, contains tests of the wideband iSAC codec [46] with super-wideband references. The test was a MOS ACR listening assessment performed in native British English. Within these experiments, the iSAC wideband codec was assessed with respect to speech codec and condition. Each processed sentence pair was scored by 25 listeners. The sentences are from ITU-T Recommendation P.501 [47], which contains two male and two female (British) English speakers sampled at 32 kHz.
For these tests, all signals were down-sampled to 8-kHz narrowband signals. Twenty-seven conditions from the corpus were tested with four speakers per condition (two male and two female). Twenty-five listeners scored each test sample, resulting in 100 votes per condition. The breakdown of conditions was as follows: 10 jitter conditions, 13 packet loss conditions, and four clock drift conditions. The conditions cover real-time, 20-kbps, and 32-kbps versions of the iSAC codec. Details of the conditions in the E4 database are summarized in Table 1. While the corpus supplied test files containing the four speakers’ sentences concatenated together for each condition, they were separated and tested individually with the objective measures. This dataset contains examples of some of the key VoIP quality degradations that ViSQOL was designed to estimate accurately, as jitter, clock drift, and packet loss cause the time-alignment and signal-warping problems that are specifically handled by the model design.
The results are presented in Figure 11. The scatter of conditions highlights that PESQ tended to under-predict and POLQA tended to over-predict the MOS scores for the conditions while the ViSQOL estimates were more tightly clustered. Correlation scores for all metrics are presented in Table 2.
Table 2. Statistics for Experiments 4 and 5
Experiment 5: non-VoIP specific quality tests
A final experiment used two publicly available databases to give an indication of ViSQOL’s more general speech quality prediction capabilities.
The ITU-T P Supplement 23 (P.Sup23) coded-speech database was developed for the ITU-T 8 kbit/s codec (Recommendation G.729) characterization tests [44]. The conditions are exclusively narrowband speech degradations but are useful for speech quality benchmarking and remain actively used for objective VoIP speech quality models, e.g., [48]. It contains three experimental datasets with subjective results from tests carried out in four labs. Experiment 3 in [44] contains four speakers (two males and two females) for 50 conditions covering a range of VoIP degradations and was evaluated using ACR. The reference and degraded PCM speech material and subjective scores are provided with the database. The English language data (lab O) is referred to in this paper as the P.Sup23 database. As stated in Section 4.3, the subjective results from the other labs (i.e., A, B, and D) were used in the model design for the similarity score to objective quality mapping function.
NOIZEUS [49] is a narrowband 8-kHz sampled noisy speech corpus originally developed for the evaluation of speech enhancement algorithms. Mean opinion scores (MOSs) for a subset of the corpus were obtained using the ITU-T Recommendation P.835 [50] methodology for subjective evaluation. It uses three ratings for each speech sample: the quality of the speech signal alone on a 5-point scale, the intrusiveness of the background noise on a 5-point scale, and the overall signal quality as a MOS ACR. This method was designed to reduce a listener’s uncertainty as to the source of a quality issue, e.g., whether the speech signal itself has been muffled or otherwise impaired, whether background noise is present, or a combination of both. Further work by Hu and Loizou studied the correlation between objective measures and the subjective quality of noise-suppressed speech [29], comparing PESQ with a range of segmental SNR, LPC-based, and distance metrics. For the experiments in this paper, only the overall MOS scores were analyzed. Speech subjected to enhancement algorithms, as in the NOIZEUS database, falls outside the validated scope of both POLQA and PESQ; however, while the PESQ specification explicitly excludes voice enhancement, the POLQA specification does not [25].
Four noise types from the full NOIZEUS corpus were tested: babble, car, street, and train. Each noise type was tested with 13 speech enhancement algorithms plus the noisy non-enhanced speech at two SNR levels (5 and 10 dB). This gave a total of 112 conditions (four noise types, 14 enhancement variations and two SNR levels). Thirty-two listeners rated the overall quality for each condition with 16 sentences. The MOS scores were averaged for listeners and sentences across each condition. For objective metric testing, the results were calculated in a corresponding manner, with a mean score for the 16 sentences calculated per condition.
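As an illustration of this per-condition averaging, the short sketch below groups per-sentence objective scores by condition; the data layout and column names are assumptions (real runs would have 16 sentences per condition):

```python
import pandas as pd

# Hypothetical long-format results: one objective score per
# (condition, sentence) pair.
scores = pd.DataFrame({
    "condition": ["babble_5dB", "babble_5dB", "car_10dB", "car_10dB"],
    "sentence":  ["sp01", "sp02", "sp01", "sp02"],
    "mos_lqo":   [2.1, 2.3, 3.0, 3.2],
})
per_condition = scores.groupby("condition")["mos_lqo"].mean()
print(per_condition)
```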
Hu and Loizou [29] used the NOIZEUS database to evaluate seven objective speech quality measures. They also investigated composite measures, combining other measures in a weighted manner with PESQ, as they did not expect simple objective measures to correlate highly with signal/noise distortion and overall quality. The methodology in this work follows the same experiment design and performance evaluation as Hu and Loizou [29], who measured Pearson’s correlation coefficient across the 112 conditions for each measure as well as the standard deviation of the error. For predicting overall quality, they found that PESQ produced the highest correlation of the metrics tested. The absolute value of Pearson’s correlation coefficient, |ρ|, can be calculated using
$$ \rho=\frac{\sum_{i}(o_{i}-\bar{o})(s_{i}-\bar{s})}{\sqrt{\sum_{i}(o_{i}-\bar{o})^{2}}\sqrt{\sum_{i}(s_{i}-\bar{s})^{2}}} \tag{6} $$
where i is the condition index, o is the objective metric score, s is the subjective quality rating (MOS) score, and \(\bar {o}\) and \(\bar {s}\) are the mean values of o and s, respectively. The standard deviation of the error, \(\hat \sigma _{e}\), was also measured as a secondary test,
$$ \hat\sigma_{e}=\hat\sigma_{s}\sqrt{1-\rho^{2}} \tag{7} $$
where \(\hat\sigma_{s}\) is the standard deviation of the subjective quality scores s, and ρ is the correlation coefficient. The Spearman rank correlation was also computed by replacing the quality scores o and s in (6) with their ranks. Hu and Loizou [29] split their data for training and testing; a subsequent evaluation by Kressner et al. [51] repeated the experiments using the full dataset of 1,792 speech files, which is the approach adopted in this study.
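For reference, the three statistics can be computed from per-condition score vectors as in the following sketch; the function name and the example values are illustrative:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluation_stats(objective, subjective):
    """|rho| (Eq. 6), error standard deviation (Eq. 7), and Spearman
    rank correlation over per-condition objective/subjective scores."""
    o = np.asarray(objective, dtype=float)
    s = np.asarray(subjective, dtype=float)
    rho = abs(pearsonr(o, s)[0])                       # Eq. (6)
    sigma_e = np.std(s, ddof=1) * np.sqrt(1 - rho**2)  # Eq. (7)
    rho_rank = abs(spearmanr(o, s)[0])                 # rank correlation
    return rho, sigma_e, rho_rank

# Example with dummy per-condition MOS-LQO and MOS-LQS vectors.
rho, sigma_e, rho_rank = evaluation_stats([1.8, 2.5, 3.1, 4.0],
                                          [2.0, 2.4, 3.3, 3.9])
```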
The NOIZEUS and P.Sup23 corpora were tested with ViSQOL, PESQ, POLQA, and two additional simple objective metrics, LLR and fwSNRSeg (details of which can be found in [29]). Results were averaged by condition and compared to the average MOS scores per condition. Figure 12 shows the results for each objective quality measure. The scatter shows 112 NOIZEUS conditions and 50 P.Sup23 conditions. The statistical analysis is summarized in Table 2.
As noted by Hu and Loizou in their tests [29], the two less complex metrics, LLR and fwSNRSeg, performed almost as well as PESQ in estimating quality for the range of background noises evaluated. While they exhibit good correlation for the NOIZEUS tests, their correlation with MOS quality scores for the P.Sup23 and E4 databases is much lower (see Table 2). As these are simple measures, it is understandable that, while they may perform well for background noise, even when it is not homogeneous, they perform poorly when quantifying more subtle and temporally short quality degradations such as packet loss or jitter. LLR and fwSNRSeg are simple distance metrics that perform no signal alignment, only signal comparison: they apply no temporal alignment, leveling, or other pre-processing steps before comparing signals. They were included in this test to highlight their limitations for VoIP speech quality conditions, and the lack of correlation in the Figure 12 scatter plots illustrates the performance variability between the different datasets.