1 Introduction

With the rapid development of high-speed wireless communication technologies, multimedia services, especially high-quality video services over wireless channels, have become increasingly popular in recent years [18]. Packet loss is the real bottleneck in the transmission of multimedia traffic over wireless networks. However, the same amount of packet loss does not always lead to the same loss of video quality. On the other hand, due to the limitations of network bandwidth and storage devices, video compression technology has also been widely used. The whole compression and transmission process introduces a variety of distortions, lowering user-perceived quality, so quality evaluation is becoming increasingly important. In this work, we aim to determine the relationship between objectively measurable features and predicted subjective quality, based on Partial Least Squares Regression (PLSR) [3].

In related work, packet loss is identified as the most influential factor for the perceived quality of streaming video. The majority of such studies only consider a single parameter, i.e. the packet loss rate [15,22]. Furthermore, few studies consider the complexity of packet loss in wireless networks. In [10,25], the packet-loss distribution, for both dispersed and bursty packet loss, is investigated, but the packet encoding type is not taken into account. The author of [12] showed that packet loss affects visual perception differently depending on the encoding type, and analyzed the difference in video quality induced by the loss of I, B and P packets and of packet headers. Considering the position of lost packets in the video clips, the work in [11] put forward a full-reference metric that measures the video quality degradation introduced by both packet loss and compression. However, full-reference video quality assessment is not always possible, because the reference signal is unavailable at a relay or at the receiving end. The work in [21] proposed a no-reference (NR) metric to estimate video quality from the macroblocks (abbreviated as MB hereinafter; 16 × 16 pixels) that contain errors. Yamagishi et al. proposed an NR metric in [24] using quality features derived from both received packet headers and video signals, but only simple spatial and temporal pixel activity was used. In [19], two representations of the decoded picture, i.e. the compressed bitstream and the baseband signal, are analyzed to improve the accuracy of subjective quality estimation. In [5], Farias et al. considered blocking and blurring effects in addition to bitstream features, but only for videos in CIF resolution. Developing a mixed video quality indicator is a focus of the Video Quality Experts Group (VQEG) [17]. In [4], F. Zhang et al. proposed the additive log-logistic model (ALM) to capture this multidimensional nonlinear problem by combining bitstream-based features and pixel-layer information.

Our study is distinct from prior modeling studies in its feature selection and its use of a PLSR model. In this paper, we first formulate new functional forms to better represent the relation between visual quality and features of the visual signal. Then, we propose a new NR video quality metric incorporating both bitstream and network-based information in a hybrid manner (i.e., from both the compressed video and the network conditions). The rest of this paper is organized as follows. Section 2 describes how the video quality metric is built with PLSR. Section 3 discusses the extraction of bitstream and network features and details the design of the hybrid NR assessment model. Section 4 describes and explains the experimental results obtained with the NS-2 network simulator. Finally, Section 5 summarizes our findings.

2 Building quality metric with partial least squares regression (PLSR)

In our approach, we do not assume a priori a specific function between visual quality and the bitstream/network-based features, but derive the function by analysing the available data. First, a data matrix X is constructed, in which the rows correspond to data from individual sequences and the columns represent the network-based or bitstream features. With n sequences and m features, X is an n × m matrix; the n × 1 column vector Y holds the subjective visual quality scores. Our purpose is to find the unknown m × 1 regression weight vector b mapping the features to the visual quality: Y = Xb.

In this paper, Partial Least Squares Regression (PLSR), an extension of principal component regression (PCR), is used to estimate the weights b by minimizing the difference between Y and Xb. Because both X and Y are projected to new spaces, the PLS family of methods is known as bilinear factor models. If Y is categorical, Partial Least Squares Discriminant Analysis (PLS-DA) is used. PLS1 considers a single dependent variable at a time, so Y is a single vector; PLS2 handles multiple dependent variables, so Y is a whole matrix.
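As an illustration, such a regression can be fitted with scikit-learn's PLSRegression; the following minimal sketch uses synthetic placeholder data and an assumed number of latent components, not values from our experiments.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, m = 78, 10                      # n sequences, m features (placeholders)
X = rng.normal(size=(n, m))        # rows: sequences, columns: features
Y = X @ rng.normal(size=m) + 0.1 * rng.normal(size=n)  # synthetic quality scores

pls = PLSRegression(n_components=4)  # number of latent components (assumed)
pls.fit(X, Y)
Y_hat = pls.predict(X).ravel()       # predicted visual quality
```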

Note that \( y_n \) can be predicted if \( y_{n-1} \) and \( y_{n+1} \) are in the training data. To obtain a realistic error estimate and the optimal number of latent components, the data should be divided into a training set and a validation set. For example, in 5-fold cross-validation, the first validation set consists of samples n = 1, 2, ..., N/5 (N is the total number of sequences in the database), the next one of samples n = N/5 + 1 to 2N/5, and so on.

If there are sufficient training data, we can also use sequence-wise division. Taking training data of 20 sequences as an example, we can first use sequences 1–2 as validation data and sequences 3–20 as training data. In the next round, sequences 3–4 serve as validation data and the model is trained on sequences 1–2 and 5–20, and so on.
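A minimal sketch of this sequence-wise division, assuming a block of two sequences is held out per round and PLSRegression stands in for the trained model:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def sequence_wise_cv(X, Y, fold_size=2, n_components=4):
    """Hold out `fold_size` consecutive sequences per round; train on the rest."""
    n = len(Y)
    mse = []
    for start in range(0, n, fold_size):
        val = np.arange(start, min(start + fold_size, n))
        train = np.setdiff1d(np.arange(n), val)
        model = PLSRegression(n_components=n_components).fit(X[train], Y[train])
        pred = model.predict(X[val]).ravel()
        mse.append(np.mean((pred - Y[val]) ** 2))
    return float(np.mean(mse))  # cross-validated mean squared error
```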

We now consider the PLS2 algorithm used in this study, assuming that X and Y are data matrices. As suggested by Ref [16], one may use PLS1 separately for each response (Y-column), which allows a separate optimal model to be constructed for each response. However, it can be helpful to include information from the other responses when predicting any specific one. This may be done by constructing an overall model describing Y as a function of X, and we use the PLS2 method for this purpose.

Algorithm: PLS2

For initialization, PLS2 sets \( j = 1 \), \( X_1 = X \) and \( Y_1 = Y \), and then finds the first g terms as follows:

1. Initialize the vector \( \mu_j \) to an arbitrary column of \( Y_j \).

2. \( \omega_j = X_j^T \mu_j / \| X_j^T \mu_j \| \).

3. \( t_j = X_j \omega_j \).

4. \( q_j = Y_j^T t_j / \| Y_j^T t_j \| \).

5. \( \mu_j = Y_j q_j \).

6. If \( \mu_j \) remains unchanged, go to Step 7; otherwise, return to Step 2.

7. \( \widehat{c}_j = t_j^T \mu_j / t_j^T t_j \).

8. \( p_j = X_j^T t_j / t_j^T t_j \).

9. \( X_{j+1} = X_j - t_j p_j^T \) and \( Y_{j+1} = Y_j - \widehat{c}_j t_j q_j^T \).

10. If j = g, stop; otherwise, set j = j + 1 and go back to Step 1.

Using the columns \( \omega_j \), \( t_j \), \( q_j \), \( \mu_j \) and \( p_j \), we form the matrices W, T, Q, U and P, respectively. We can also construct the g × g diagonal coefficient matrix \( \widehat{C} \) from the diagonal elements \( \widehat{c}_j \). With the features extracted from the videos, we can then estimate the regression matrix \( \widehat{B} \), which maps the features of a specific video sequence to the quality vector Y. The quality estimation can therefore be described as:

$$ \widehat{Y}={\widehat{b}}_0+ X\widehat{B}+ e $$
(2.1)

The algorithm ensures that (2.1) is satisfied.
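A direct NumPy transcription of the ten steps might look as follows; this is a sketch for illustration only, with the regression matrix assembled as in Eq. (3.3) of the next section, not the exact implementation used in our experiments.

```python
import numpy as np

def pls2_nipals(X, Y, g, tol=1e-10, max_iter=500):
    """Minimal NIPALS-style PLS2: extract g components from X (n x m), Y (n x p)."""
    Xj, Yj = X.astype(float).copy(), Y.astype(float).copy()
    W, Q, P, c = [], [], [], []
    for _ in range(g):
        u = Yj[:, 0].copy()                       # Step 1: arbitrary column of Y_j
        for _ in range(max_iter):
            w = Xj.T @ u
            w /= np.linalg.norm(w)                # Step 2
            t = Xj @ w                            # Step 3
            q = Yj.T @ t
            q /= np.linalg.norm(q)                # Step 4
            u_new = Yj @ q                        # Step 5
            if np.linalg.norm(u_new - u) < tol:   # Step 6: mu_j unchanged
                u = u_new
                break
            u = u_new
        c_j = (t @ u) / (t @ t)                   # Step 7
        p = Xj.T @ t / (t @ t)                    # Step 8
        Xj = Xj - np.outer(t, p)                  # Step 9: deflate X and Y
        Yj = Yj - c_j * np.outer(t, q)
        W.append(w); Q.append(q); P.append(p); c.append(c_j)
    W, P, Q = np.column_stack(W), np.column_stack(P), np.column_stack(Q)
    B = W @ np.linalg.inv(P.T @ W) @ np.diag(c) @ Q.T   # cf. Eq. (3.3)
    return B                                      # m x p regression matrix
```

Prediction then follows Eq. (2.1) as \( \widehat{Y} = \widehat{b}_0 + X\widehat{B} \); in practice X and Y are mean-centred before fitting, with \( \widehat{b}_0 \) restoring the offset.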

3 Feature extraction and hybrid no-reference (NR) assessment model

With the H.264 codec, a video is first encoded into a bitstream slice by slice, then encapsulated into data packets, and finally transmitted, usually over the user datagram protocol (UDP). Visual quality degradation is caused both by compression loss and by transmission errors in the network. Therefore, we extract compressed-domain features as well as network characteristics.

3.1 Compressed-domain feature extraction (\( \widehat{Y}_B \))

In the first step, we extract features from the video bitstream (an H.264/AVC bitstream at the current stage of our work, although the framework can be extended to other coding schemes), describing the properties of the encoded video sequence.

3.1.1 Quantization parameter (QP)

The QP has a direct impact on video encoding quality and bit rate: the lower the QP value, the less information is lost and the higher the resulting video quality [14]. Since the QP value may differ from one MB to another, we also collect summary statistics of it, namely the average (denoted QPavg), the median (QPme), the standard deviation (QPsd), the minimum (QPmin) and the maximum (QPmax). The QP reflects well the relative visual quality of compressed videos sharing the same content, but it cannot describe the influence of the content itself on visual quality. At the same time, visual distortion in a simple scene is more likely to be objectionable to the human eye than the same distortion in a complex scene. We therefore use Content Unpredictability (abbreviated CU) [4] to quantify content complexity. For each MB, its CU in the luminance channel is the variance of the residuals:

$$ C{U}_r=\frac{1}{K_{MB}}\sum_{k=1}^{K_{MB}}{\left({I}_{r,k}-\frac{1}{K_{MB}}\sum_{k=1}^{K_{MB}}{I}_{r,k}\right)}^2 $$
(3.1)

where \( I_{r,k} \) represents the k-th pixel residual in the r-th MB, and \( K_{MB} \) is the number of pixels in the MB, which is 256. In general, a smaller CU after intra prediction indicates lower spatial complexity, while a smaller CU after inter prediction indicates lower temporal complexity.
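Both the QP statistics listed above and the CU of Eq. (3.1) reduce to a few NumPy calls; a minimal sketch with placeholder per-MB data:

```python
import numpy as np

qp = np.array([26, 28, 30, 26, 32, 30])        # per-MB QP values (placeholder)
qp_stats = {"QPavg": qp.mean(), "QPme": np.median(qp),
            "QPsd": qp.std(), "QPmin": qp.min(), "QPmax": qp.max()}

def content_unpredictability(residual_mb):
    """CU of one MB: variance of its K_MB = 256 luminance residuals, cf. Eq. (3.1)."""
    r = np.asarray(residual_mb, dtype=float).ravel()  # 16 x 16 -> 256 residuals
    return np.mean((r - r.mean()) ** 2)               # population variance
```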

3.1.2 Intra-predicted blocks

The H.264/AVC codec has two block prediction modes: intra prediction and inter prediction. The former predicts pixel values from spatial neighbours within the same frame, while the latter predicts pixel values from frames before and after the current frame. If a frame has detailed areas, the I 4 × 4 mode is preferred; that is, the I 4 × 4 mode suits complex texture, at the cost of an increased bitstream size. Conversely, the I 16 × 16 mode is more suitable for frames with homogeneous areas. The block prediction mode has an important influence on video coding quality [9].

3.1.3 Skipped MBs

If an MB does not need to be encoded, it is dropped and counted as a skipped MB [13]. Since the number of skipped MBs reflects the information change between the frames before and after the current one, the percentage of skipped MBs is adopted as a parameter of our proposed metric.

3.1.4 I-slices

Bitstreams encoded by H.264/AVC contain a number of slice types, for example B slices, P slices, I slices, switching P slices and switching I slices. In an H.264/AVC encoded bitstream, the first slice is the instantaneous decoding refresh (IDR) picture, which must be an I slice. The higher the proportion of I slices in a bitstream, the better the video quality. Therefore, our NR metric also takes the percentage of I-slices as a parameter.

3.2 Network feature extraction (\( \widehat{Y}_P \))

For networked video applications, bandwidth, delay and jitter, and packet loss ratio are the three major parameters affecting perceived visual quality.

3.2.1 Bandwidth

Bandwidth refers to the data transmission rate [1], which is important for the performance of networked video playback. If the bandwidth is low, users will experience networked video of poor quality; adequate bandwidth is therefore a prerequisite for a good user experience.

3.2.2 Delay and jitter

Transmission delay is a significant metric often used to evaluate the performance of telecommunication networks [2]. The delay of a network specifies the time a packet takes to travel across the network from the ingress to the egress node. A long delay may cause frequent pauses in video streaming and broadcasting; thus we consider it an important factor influencing the quality of service (QoS) of video stream transmission. Jitter measures the variation of packet delays across a network over time [8] and is calculated as the average deviation from the mean delay. When jitter is large, the transmission buffer must be enlarged, which in turn lengthens the sojourn time of packets in the buffer. Therefore, to evaluate the QoS of video stream transmission, we take jitter as another parameter of the proposed metric.
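As a sketch, with placeholder per-packet delays, the jitter definition above (average deviation from the mean delay, taken here in absolute value so that the deviations do not cancel) amounts to:

```python
import numpy as np

delays = np.array([40.2, 41.0, 39.8, 44.5, 40.9])  # one-way delays in ms (placeholder)
mean_delay = delays.mean()
jitter = np.mean(np.abs(delays - mean_delay))      # average |deviation| from mean delay
```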

3.2.3 Packet loss

Line faults and network congestion are the two primary causes of packet loss. Video data are generally voluminous. If the transmission control protocol (TCP), a reliable transport protocol, is used to transmit multimedia data, the source retransmits lost packets, which may result in unacceptably long delays. In practice UDP is the better choice, since the user can still play real-time video with acceptable distortion as long as the packet loss ratio (PLR) stays within an acceptable range; packet loss beyond a certain level, however, may interrupt playback.

In summary, the PLR significantly influences the quality of video transmission. Packet loss frequency (PLF), average PLR, centralized PLR and packet error rate (PER) are crucial parameters for wireless networks. In our research we have simulated various kinds of packet loss on a video-on-demand system, analyzed the network packet information in detail, and paid attention to the types of the lost packets. In addition, the effect of access network parameters such as block error rate (BLER) and mean burst length (MBL) varies with the visual content [7]; that is, the influence of network parameters on visual quality is content dependent.
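For illustration, PLR and PLF can be derived from a received-packet trace as sketched below; the sequence numbers are placeholders, and counting PLF as the number of distinct loss bursts is an assumption on our part:

```python
n_sent = 10                               # packets sent (placeholder)
received = {0, 1, 2, 5, 6, 9}             # received sequence numbers (placeholder)
lost = [i for i in range(n_sent) if i not in received]

plr = len(lost) / n_sent                  # packet loss ratio: 0.4
plf = sum(1 for i in lost if i - 1 not in lost)  # loss bursts: 2 ({3,4} and {7,8})
```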

3.3 More on feature selection

In this study, feature selection involved a series of experiments. As mentioned above, the compressed-domain features include QP, QPavg, QPme, QPsd, QPmin, QPmax, CU, I 4 × 4, I 16 × 16, skipped MBs and I-slices; the network parameters include bandwidth, delay, jitter, PLR, PLF, average PLR, centralized PLR, PER, BLER and MBL, together with the duration of intervals damaged despite error concealment, the ratio of intra-predicted MBs, the average number of MB partitions for motion estimation, and the mean of the motion vectors (MV).

Let x denote a feature (e.g., QP). Its transformed version h(x) can take the form of one of the following functions:

1) Linear: \( h(x) = x - x_l \) or \( x_h - x \), where \( x_h \) and \( x_l \) denote the upper and lower bounds of x. Taking H.264 encoded video as an example, QP is no greater than 51; furthermore, we find that 51 − QP is more appropriate to use than QP itself.

2) Logarithm: \( h(x) = \log(x - x_l + 1) \) or \( \log(x_h - x + 1) \). The logarithmic function compresses the dynamic range between the minimum and maximum of x and therefore reduces its fluctuation. For example, log(CU + 1) performs better than CU itself.
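The two transform families can be sketched directly; the bound of 51 on QP follows the H.264 specification, and log(CU + 1) is the variant found most effective above:

```python
import numpy as np

def h_linear(x, x_high):
    """h(x) = x_h - x, e.g. 51 - QP for H.264-encoded video."""
    return x_high - np.asarray(x, dtype=float)

def h_log(x, x_low=0.0):
    """h(x) = log(x - x_l + 1), compressing the dynamic range of x."""
    return np.log(np.asarray(x, dtype=float) - x_low + 1.0)

qp_feature = h_linear([26, 40, 51], 51.0)   # -> [25., 11., 0.]
cu_feature = h_log([0.0, 10.0, 1000.0])     # -> log(CU + 1)
```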

The compression features are extracted from compressed video without packet loss, while the slicing features and the error-concealment slicing mode are extracted from lightly compressed videos. For initialization, we start with an empty feature set. In the first round of feature selection, the most important features are identified among the single-type candidates and regarded as the critical elements. In the second round, we found that log(CU + 1) is more important than any other candidate feature, so it is kept as a covariate. In the third round, we found that the remaining features are not significant and can be dropped.

Once feature extraction from the network and the bitstream is finished, we use data analysis to produce a visual quality prediction model, as shown in Fig. 1.

Fig. 1 PLSR model building with bitstream features and network features

First, after extracting the features from the networked video, the quality estimation can be expressed as

$$ \widehat{Y}={\widehat{b}}_0+ X\widehat{B}+ e $$
(3.2)
$$ \widehat{B}= W{\left({P}^T W\right)}^{-1}\widehat{C}{Q}^T $$
(3.3)

The PLS2 algorithm is applied to calculate the weight coefficients, yielding b 0, b 1 and the feature weights ω for the n bitstream and m network features. The quality prediction \( \widehat{Y}_B \) from the bitstream model and the quality prediction \( \widehat{Y}_P \) from the network model of the proposed NR metric based on PLS2 can then be expressed as follows:

$$ \widehat{Y}_B={b}_0+\sum_{i=1}^{n}{\omega}_i{X}_{B,i} $$
(3.4)
$$ \widehat{Y}_P={b}_1+\sum_{j=1}^{m}{\omega}_j{X}_{P,j} $$
(3.5)

We integrate \( \widehat{Y}_B \) and \( \widehat{Y}_P \):

$$ \widehat{Y}=0.35\,\widehat{Y}_B+0.65\,\widehat{Y}_P $$
(3.6)

The weights in Eq. (3.6) are determined by linearly regressing the predictions of the two models against the subjective scores of the training data, in which both bitstream features and network parameters from packet headers are used (with PLS2), and then averaging the regression coefficients. Lastly, to capture the nonlinear characteristics of the subjective experiment results, a sigmoid nonlinear correction is applied:

$$ \widehat{Y}_S=\frac{1}{1+{e}^{\left(0.5-\widehat{Y}\right)/0.2}} $$
(3.7)
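Putting Eqs. (3.4)–(3.7) together, the final fusion step can be sketched as below; the logistic form of the sigmoid is our reading of Eq. (3.7), and the input predictions are placeholders:

```python
import numpy as np

def hybrid_quality(Y_B, Y_P):
    """Fuse bitstream and network predictions (Eq. 3.6), then apply the
    sigmoid correction (Eq. 3.7)."""
    Y = 0.35 * np.asarray(Y_B) + 0.65 * np.asarray(Y_P)
    return 1.0 / (1.0 + np.exp((0.5 - Y) / 0.2))

Y_S = hybrid_quality([0.7, 0.3], [0.8, 0.4])  # placeholder model outputs
```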

4 Evaluation and results

To evaluate the proposed hybrid metric, we first used k-means clustering to select the video sequences in Table 1. The selected sequences cover content with low to high spatio-temporal (ST) features. The standard-definition (SD) sequences tree, dance and football were used for training, and a different set of sequences, tempete, horses and juggler, for validation of the model. Each of the training sequences represents typical content offered by network providers. The subjective viewing tests for all video sequences used in this work are presented in Section 4.2 below.

Table 1 Video content description

4.1 Test environment

NS-2 [7] is an event-driven simulator providing broad support for network protocols such as TCP and UDP, over wired and wireless networks, including local and satellite networks. To simulate video encoding and decoding, we integrated the EvalVid framework (as presented in [16]) into NS-2. All experiments in this paper were carried out in this combined simulator.

In our simulation, H.264/AVC, the recommended codec for wireless video transmission, is adopted to encode the source videos. We selected six different video sequences of 576i and 576p resolution. Each GOP is encoded with the three frame types I, P and B: I frames are independently encoded, P frames are predicted from the preceding I or P frame, and B frames are predicted from both preceding and future I or P frames. All the selected test videos were transmitted over the simulated network. Other test conditions were: frame rates of 10, 20 and 30 fps; QP values of 5, 8, 12, 18, 25, 30, 36 and 40; a random loss model; and zero-motion error concealment at the decoder (i.e., a lost MB is replaced directly by the co-located MB of a reference frame).
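The zero-motion error concealment used by the decoder can be sketched as follows, assuming lost MBs are given as (row, column) MB-grid coordinates in a luminance frame:

```python
import numpy as np

def conceal_zero_motion(frame, ref_frame, lost_mbs, mb=16):
    """Replace each lost MB with the co-located MB of the reference frame."""
    out = frame.copy()
    for r, c in lost_mbs:  # MB grid coordinates (row, col)
        out[r * mb:(r + 1) * mb, c * mb:(c + 1) * mb] = \
            ref_frame[r * mb:(r + 1) * mb, c * mb:(c + 1) * mb]
    return out
```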

4.2 Visual quality prediction test

In the simulated wireless network, a base station and several mobile stations are deployed. During the simulation, we generated 78 sequences for training and another 53 sequences for model validation. The MOS (Mean Opinion Score) is a subjective indicator for measuring video quality: a subjective score of human visual quality usually ranges from 1 to 5, from the worst to the best perceived quality.

Figure 2 compares the bitstream-only metric, the network-parameter-only metric and the proposed hybrid metric, demonstrating the advantages of the hybrid metric. Table 2 provides not only the Pearson and Spearman correlations but also the root mean squared error between predicted and actual visual quality (RMSE). As comparative video quality metrics, SSIM (Structural Similarity) [20] and PSNR (Peak Signal-to-Noise Ratio), which are employed by most existing studies, are also evaluated. Figure 2 lists the results of our bitstream-based-only and packet-based-only no-reference metrics; for the packet-based-only case, a single coding configuration with the same quantization parameter is considered. The Pearson correlation results indicate the significance of introducing the additional features when constructing the video quality model in this study.
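The three performance figures reported in Table 2 can be computed with SciPy; a sketch with placeholder score vectors:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos = np.array([4.2, 3.1, 2.5, 4.8, 1.9])    # subjective scores (placeholder)
pred = np.array([4.0, 3.3, 2.2, 4.6, 2.1])   # model predictions (placeholder)

pearson, _ = pearsonr(mos, pred)
spearman, _ = spearmanr(mos, pred)
rmse = float(np.sqrt(np.mean((mos - pred) ** 2)))
```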

Fig. 2 Prediction results of three conditions for the SD video: (a) bitstream metric, (b) network parameters metric, (c) hybrid metric

Table 2 Performance of the quality prediction

Figure 3 shows the scatter plot of subjective visual quality against the prediction of the proposed model. As the figure shows, we obtained a correlation of 95.5 % on the validation sequence set. The results also show that the prediction accuracy of the proposed metric exceeds that of the model in [23], which also combines bitstream-based and network-based features.

Fig. 3 Scatter plot of subjective video quality against quality prediction from the model

We can also observe in Fig. 3 that for the sequences containing more temporal variation (Tree, Dance and Football), the relation between MOS and predicted quality is closer to linear. The simulations show that our hybrid metric outperforms both the bitstream-only and the network-only metric in Pearson and Spearman correlations, verifying the effectiveness of combining the network-based features. However, Figs. 2 and 3 also show that the prediction is less accurate at the lower end of the quality scale, because training data are sparse in that region. Even though the results are not directly comparable, since multiple video datasets are used, we note that the prediction accuracy of our hybrid metric is superior in terms of Pearson correlation to the existing relevant metrics.

4.3 Cross-database validation

The robustness of a metric is usually verified by cross validation, during which the data are divided into a training set and a validation set.

A visual quality indicator must be tested on a variety of visual content and distortion types before meaningful conclusions about its performance can be drawn. For cross-database validation, we selected the five public HD (High Definition) databases available from VQEG [6]. These databases can serve as an extended conformance test for implementations of the model. They contain a total of 278 distorted video sequences with their associated mean opinion scores (MOS). The cross-database validation helps assess the robustness of the presented scheme on untrained data; we can again see that the presented scheme tracks the MOS very well.

Figure 4 shows the results on the HD database with MOS for the proposed hybrid metric, demonstrating the model's high accuracy for high-definition video quality assessment. As shown in Fig. 4, the data points lie close to the logistic fitting curve with low scatter around it; since low scatter implies good prediction performance, the prediction accuracy of our hybrid metric is good for all the distortions. Overall, our model works over a wide range of situations (i.e., from QCIF to SD and HD video sequences).

Fig. 4 Scatter plot for the HD database: MOS vs the proposed hybrid metric

Because public video databases are used, the simulations presented in this paper can easily be replicated for future studies and benchmarking.

5 Conclusion

We have proposed an NR visual quality prediction metric that combines bitstream-based features with packet-header information reflecting the network parameters. As the major contribution of this work, we have proposed using partial least squares regression (PLSR) to establish the relation between perceived video quality and hybrid features (covering both visual content and transmission). In addition, we have extracted compressed-domain features and network characteristics, and built a hybrid video quality estimation model that assesses the overall (source coding and transmission) video quality. Finally, we have verified the effectiveness of the proposed model using SD and HD video sequences, and conclude that the model predicts with high accuracy using the proposed PLSR algorithm.

The proposed PLSR-based model provides a compact, objective and computationally manageable prediction of subjective video quality. In essence, we use multi-type impairments to evaluate the distortion of transmitted video. The model can evaluate video quality over wired and wireless networks for SD and HD formats. We have used different databases to validate the proposed model and compared it with existing relevant algorithms.