1 Introduction

Multimedia technology and digital visual signal processing have developed rapidly during recent decades. Digital images and videos are very easy to create, transmit, store, and share. Owing to these developments, the design of reliable video quality assessment (VQA) algorithms has attracted considerable attention. Consequently, VQA has been the focus of many research studies and patents. Furthermore, the vast volume of user-created digital video content has led to the development of numerous VQA applications, which require reliable and effective quality monitoring [39].

Visual signals can undergo a wide variety of distortions after their capture during compression, transmission, and storage. Human observers are the end users of visual content; thus, the quality of visual signals should ideally be evaluated in subjective user studies in a laboratory environment involving specialists. During these user studies, subjective quality scores are collected from each participant. Subsequently, the quality of a visual signal is given a mean opinion score (MOS), which is calculated as the arithmetic mean of all the individual quality ratings. In most cases, an absolute category rating is applied, which ranges from 1.0 (bad quality) to 5.0 (excellent quality). Other standardized quality ratings also exist, such as a continuous scale ranging from 1.0 to 100.0, but Huynh-Thu et al. [11] noted that there are no statistical differences between the different scales used for the same visual stimuli.

However, subjective VQA is expensive, time consuming, and labor intensive, which prevents its application in real-time systems. Moreover, the results obtained by subjective VQA depend on the physical condition, emotional state, personality, and culture of the observers [27]. As a consequence, there is an increasing need for objective VQA. VQA algorithms are classified according to the availability of the original (reference) signal. If no reference signal is available, an algorithm is regarded as a no-reference VQA (NR-VQA) method. NR algorithms can be divided into two further groups: so-called distortion-specific algorithms assume that a specific distortion type is present in the visual signal, whereas general-purpose (or non-distortion-specific) algorithms operate on various distortion types. Reduced-reference methods retain only part of the information from the reference signal, whereas full-reference algorithms have full access to the complete reference medium when predicting quality scores.

Deep learning is now applied widely in industry and research, and with great success in the fields of image processing and computer vision [7, 8, 44]. Thus, recently developed NR-VQA algorithms have employed deep learning techniques, such as neural networks [47], convolutional neural networks (CNNs) [2], and deep belief networks [5].

It has been shown that the features extracted using a pretrained CNN are rich and effective for a wide range of computer vision and image processing tasks, such as content-based image retrieval [35], NR image quality assessment [2], and medical image classification [10]. The main contribution and novel aspect of the present study is that we obtain possible solutions for NR-VQA using only the deep features extracted from pretrained CNNs (Inception-V3 [32] and Inception-ResNet-V2 [31]), without depending on manually selected features. In particular, for a given video sequence that needs to be assessed, frame-level deep features are extracted from each video frame with a pretrained CNN. Subsequently, these frame-level features are temporally pooled to compile a video-level feature vector that characterizes the video sequence. Finally, the temporally pooled video-level feature vectors are mapped onto perceptual quality scores with a support vector regressor (SVR). Furthermore, our architecture was trained on the recently published Konstanz natural video quality database (KoNViD-1k) [9], which, in contrast to other publicly available databases, contains video sequences with authentic rather than artificial distortion. Moreover, KoNViD-1k contains more videos (1200 sequences) than any other publicly available database, which allowed us to create a deep, temporally pooled model.

The remainder of this paper is organized as follows. In Sect. 2, we review related research, particularly into NR-VQA algorithms. In Sect. 3, we describe our proposed NR-VQA algorithm. In Sect. 4, we present the experimental results. We give our conclusions in Sect. 5.

2 Related Work

As mentioned earlier, NR methods require only the input signal and no information about the reference signal. Early NR algorithms largely focused on distortion-specific approaches. Thus, Borer [3] developed an algorithm for measuring jerkiness based on the mean squared difference between frames. By contrast, Xue et al. [42] trained a neural network to model the quality impact of jerkiness. An H.264-specific algorithm was introduced by [4], where the error estimate depended on the discrete cosine transform (DCT) coefficient data; perceptual quality scores were subsequently derived from the error estimates and the motion vectors obtained from the bit stream. Similarly, Zhu et al. [47] proposed an H.264-specific method; however, they first extracted frame-level features from the DCT coefficients. Video-level features were then created by averaging the frame-level features (temporal pooling), and a trained neural network predicted the subjective quality scores. Algorithms were also developed to assess blocking artifacts in distorted videos in studies by [21, 33, 38].

Subsequent studies focused on general-purpose algorithms. A successful and widely applied feature extraction method was developed based on natural scene statistics [22], where it was assumed that natural visual signals contain statistical regularities that are altered by distortion. Saad et al. [24] adapted this type of feature from their NR-IQA method, the BLind Image Integrity Notator with DCT Statistics (BLIINDS) [23], to NR-VQA, producing the Video BLIINDS method. Video BLIINDS employs a spatiotemporal model derived from the natural scene statistics of the DCT coefficients, and the extracted features are then used to train an SVR. This method was later extended to the three-dimensional (3D)-DCT domain [14].

In contrast to other methods, the video intrinsic integrity and distortion evaluation oracle (VIIDEO) [19] requires no information regarding the distortion types or human ratings of the video quality. Instead, it is assumed that pristine video sequences contain intrinsic statistical regularities, and deviations from them can be used to predict perceptual quality scores. The main feature of this method is that local statistics related to the frame differences derived using mean removal and divisive contrast normalization should follow a generalized Gaussian distribution if the video is of a good quality. Based on the NR-IQA CORNIA method [45], Xu et al. [40] also proposed an opinion-unaware NR-VQA method called Video CORNIA, where the frame-level features are first extracted via unsupervised feature learning and these features are then used to train an SVR. Finally, the video’s perceptual quality score is derived by temporal pooling of the frame-level features. Similarly, Anegekuh et al. [1] presented an opinion-unaware architecture for HEVC encoded videos, where the quality is predicted based on motion vector extraction and spatial information derived from the video content type.

In contrast to previous studies where training was conducted using artificially distorted videos, the algorithm proposed by Men et al. [18] was trained using the KoNViD-1k database [9], which comprises numerous video sequences with authentic distortion. They combined six spatial features and three temporal features to characterize a video sequence. Subsequently, a trained SVR was used to map these features onto perceptual quality scores.

Another area of research is based on deep learning techniques. Recently, deep learning-based NR-IQA algorithms have increased in popularity [2, 12, 13], although very few NR-VQA methods utilize deep learning. Zhang et al. [46] trained a CNN by weakly supervised learning where the corresponding labels were obtained for the video blocks according to a full reference-VQA metric. Subsequently, the feature vectors were extracted using the trained CNN and mapped onto subjective quality scores. By contrast, Li et al. [15] applied a 3D shearlet transform to video blocks and compiled spatiotemporal feature vectors for each video sequence. CNN and logistic regression were then utilized to map the features onto perceptual quality scores. Torres Vega et al. [36] proposed a restricted Boltzmann machine-based solution, which was trained with lightweight NR metrics, such as the noise ratio, motion intensity, and blockiness. This method was developed for assessing the quality of live video streams.

For further reviews of NR-VQA, we refer the reader to the studies by [29, 37, 41].

3 Methodology

The architecture of our proposed deep feature pooling algorithm is shown in Fig. 1. For a given video sequence that needs to be evaluated, the frame-level deep features are first extracted with the pretrained CNNs. Subsequently, these frame-level feature vectors are temporally pooled to create a video-level feature vector that characterizes the whole video. Finally, the temporally pooled video-level features are mapped onto subjective quality scores with a trained SVR.
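To make the processing order concrete, the following minimal Python/NumPy sketch outlines the three stages under stated assumptions: `cnn_feature_extractor` and `svr` are hypothetical placeholders for the fine-tuned CNN and the trained regressor described later, and the published implementation was actually written in MATLAB.

```python
import numpy as np

def predict_video_quality(frames, cnn_feature_extractor, svr, pool=np.mean):
    """Hypothetical end-to-end sketch of the proposed pipeline."""
    # 1) frame-level deep features extracted with the (fine-tuned) pretrained CNN
    frame_features = np.stack([cnn_feature_extractor(frame) for frame in frames])  # (N_f, M)
    # 2) temporal pooling into a single video-level feature vector
    video_feature = pool(frame_features, axis=0)                                   # (M,)
    # 3) regression onto a perceptual quality score with the trained SVR
    return float(svr.predict(video_feature.reshape(1, -1))[0])
```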

The remainder of this section is organized as follows. In Sect. 3.1, we describe the compilation of the training and test databases. In Sect. 3.2, we present the frame-level feature extraction performed with the pretrained CNNs. Finally, we explain the video-level feature extraction method in Sect. 3.3.

Fig. 1 General structure of the proposed NR-VQA algorithm. The algorithm reads a given video sequence and processes each video frame in turn to extract the frame-level feature vectors with the pretrained CNN. Finally, the extracted frame-level feature vectors are temporally pooled to form a video-level feature vector, which is mapped onto a quality score with a trained SVR

3.1 Database Compilation

Several video quality databases are publicly available, such as LIVE VQA [28], the LIVE mobile video quality database [20], and MCL-V [16]. In this study, we selected the KoNViD-1k [9] natural video quality database to train and test our system. In contrast to most previously published data sets, and similar to LIVE-VQC [30], KoNViD-1k [9] contains natural videos with authentic distortions. The videos were sampled from the Yahoo Flickr Creative Commons 100 Million (YFCC100m) [34] data set, and the subjective quality scores were collected online [25] using the CrowdFlower platform. The videos in this data set have a spatial resolution of \(960\times 540\), a frame rate of 25, 27, or 30 fps, and a length of 7–8 s. The MOS values for the video sequences are on a scale from 1.0 (worst) to 5.0 (best). Furthermore, KoNViD-1k contains more quality-labeled video sequences than any other publicly available data set, which allowed us to directly train a temporal feature pooling model using the features extracted from pretrained CNNs.

KoNViD-1k contains 1200 video sequences. We randomly selected 960 sequences for training, whereas the remaining 240 sequences were retained only for testing and were not utilized in the training process. The videos selected for training were split into frames, and 20% of the frames were then selected randomly. We employed Inception-V3 [32] and Inception-ResNet-V2 [31] as feature extractors because their input receptive fields are significantly larger than those of other pretrained networks (\(299\times 299\) vs. \(224\times 224\) or \(227\times 227\)); therefore, when an input image is resized, the visual cues related to perceptual quality deteriorate less than with other pretrained CNNs. Because of the fixed input size, the selected video frames were resized to \(338\times 338\) and \(299\times 299\) center patches were cropped from the resized frames. The resulting training images retained the MOS values of their source videos; consequently, we assumed that the perceived visual quality of an individual frame is related to that of the complete video sequence. The final image database contained 43,320 images, which were used for transfer learning with the selected pretrained CNNs.
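For illustration, the following sketch shows how such training images could be produced from one video; it assumes OpenCV and plain Python rather than the MATLAB tooling actually used, and `sample_training_frames` is a hypothetical helper name.

```python
import random
import cv2  # assumed library; the original implementation used MATLAB

def sample_training_frames(video_path, mos, sample_ratio=0.2):
    """Randomly keep ~20% of the frames, resize them to 338x338, crop the
    central 299x299 patch, and label every crop with the MOS of the video."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    selected = random.sample(frames, max(1, int(sample_ratio * len(frames))))
    offset = (338 - 299) // 2  # 19-pixel border on two sides (20 on the others)
    patches = [cv2.resize(f, (338, 338))[offset:offset + 299, offset:offset + 299]
               for f in selected]
    return [(patch, mos) for patch in patches]
```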

For completeness, we selected the LIVE VQA [28] database as an additional test set in order to analyze the generalizability of the proposed algorithm. LIVE VQA contains 15 reference videos and 150 artificially distorted video sequences with lengths of 8, 10, or 20 s, obtained using four different types of distortion: simulated transmission of H.264 compressed videos through error-prone wireless networks and through error-prone IP networks, H.264 compression, and MPEG-2 compression. The spatial resolution of the videos in LIVE VQA is \(768\times 432\).

3.2 Frame-Level Feature Extraction

The features were extracted by feeding the CNN the whole image, which therefore had to match the CNN's input size. As mentioned above, both Inception-V3 and Inception-ResNet-V2 accept images measuring \(299\times 299\), which is why the input video frames were resized to \(338\times 338\) and the \(299\times 299\) center patches were then cropped. The CNNs were fine-tuned (so-called transfer learning) on the image database described above.

3.2.1 Transfer Learning

The usual method was employed for transfer learning: we truncated the last 1000-way softmax layer of the Inception-V3 and Inception-ResNet-V2 networks and replaced it with a 5-way softmax layer relevant to the problem addressed. Five classes were defined in our training image database: class A for excellent image quality \((5.0\ge MOS\ge 4.2) \), class B for good image quality \((4.2>MOS\ge 3.4) \), class C for fair image quality \((3.4>MOS\ge 2.6) \), class D for poor image quality \((2.6>MOS\ge 1.8) \), and class E for very poor image quality \((1.8>MOS\ge 1.0) \). During transfer learning, the initial learning rate was set to 0.0001 and divided by 10 when the validation error stopped improving. Moreover, the batch size was set to 32 and the momentum was adjusted to 0.9. The last, new layer was trained from scratch using Xavier initialization [6], the initial weights of the other layers were taken from the corresponding layers of the pretrained networks, and all layers were updated using the back-propagation algorithm [26].
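As a hedged illustration of this setup, the sketch below reproduces the class binning and the replacement of the 1000-way head with a 5-way softmax in Keras; the framework choice, function names, and the omission of the learning-rate schedule are assumptions, since the original model was built with MATLAB's Deep Learning Toolbox.

```python
import tensorflow as tf  # Keras is assumed here; the paper used MATLAB's Deep Learning Toolbox

def mos_to_class(mos):
    """Map a MOS in [1.0, 5.0] to the five quality classes defined above."""
    if mos >= 4.2: return 0   # class A: excellent
    if mos >= 3.4: return 1   # class B: good
    if mos >= 2.6: return 2   # class C: fair
    if mos >= 1.8: return 3   # class D: poor
    return 4                  # class E: very poor

def build_finetune_model():
    """Replace the 1000-way ImageNet head of Inception-V3 with a 5-way softmax."""
    base = tf.keras.applications.InceptionV3(weights='imagenet', include_top=False,
                                             pooling='avg', input_shape=(299, 299, 3))
    head = tf.keras.layers.Dense(5, activation='softmax',
                                 kernel_initializer='glorot_uniform')(base.output)  # Xavier init
    model = tf.keras.Model(base.input, head)
    # Initial learning rate 1e-4 and momentum 0.9 as in the paper; the divide-by-10
    # rule on a validation plateau could be added with a ReduceLROnPlateau callback.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
```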

As shown in Fig. 2, the MOS distribution of the KoNViD-1k natural video quality database is imbalanced, which could cause problems during transfer learning. Thus, we sampled each instance in a batch based on the inverse frequency of its class, so instances from larger classes were selected with lower probabilities. As a result, the batches were approximately equally distributed over the classes despite the differences in the class populations.
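A minimal NumPy sketch of this inverse-frequency sampling (not the authors' code) is shown below; `balanced_batch_indices` is a hypothetical helper that draws one batch.

```python
import numpy as np

def balanced_batch_indices(class_labels, batch_size=32, rng=None):
    """Draw one batch so that instances are sampled with probabilities
    proportional to the inverse frequency of their quality class."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(class_labels)
    counts = np.bincount(labels, minlength=5).astype(float)
    weights = 1.0 / counts[labels]          # inverse class frequency per training image
    probs = weights / weights.sum()
    return rng.choice(len(labels), size=batch_size, replace=False, p=probs)
```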

Figure 3 depicts the training process with Inception-V3 [32] during transfer learning, where the training accuracy, training loss, validation accuracy, and validation loss are plotted.

3.2.2 Feature Extraction

Frame-level feature vectors were extracted by providing the CNN with video frames that matched the CNN's input receptive field. As mentioned above, both Inception-V3 [32] and Inception-ResNet-V2 [31] accept \(299\times 299 \) images, which is why each video frame was resized to \(338\times 338 \) and the \(299\times 299 \) center patch was cropped.

The CNN performs all of its defined operations on an input image. Therefore, it was run on each resized and center-cropped video frame, and the output of the final pooling layer, which is named 'avg_pool' in both Inception-V3 and Inception-ResNet-V2, was saved. As a consequence, the length of the frame-level feature vector was 2048 with Inception-V3 and 1536 with Inception-ResNet-V2.
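The readout of the 'avg_pool' activations can be sketched as follows in Keras (an assumed framework; in the actual pipeline the fine-tuned weights from Sect. 3.2.1 would be loaded instead of the ImageNet weights shown here).

```python
import numpy as np
import tensorflow as tf  # assumed framework

def build_frame_feature_extractor():
    """Expose the 'avg_pool' output: 2048-D for Inception-V3 (a 1536-D vector would
    be obtained analogously from tf.keras.applications.InceptionResNetV2)."""
    base = tf.keras.applications.InceptionV3(weights='imagenet')  # fine-tuned weights in practice
    return tf.keras.Model(base.input, base.get_layer('avg_pool').output)

def extract_frame_features(extractor, frames_299):
    """frames_299: array of shape (N_f, 299, 299, 3) with resized, center-cropped frames."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        np.asarray(frames_299, dtype=np.float32))
    return extractor.predict(x, verbose=0)   # shape (N_f, 2048)
```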

Fig. 2 MOS distribution in the KoNViD-1k [9] natural video quality database. KoNViD-1k contains 1200 real-world video sequences with authentic distortion collected from the YFCC100m data set [34], together with the corresponding MOS values on a scale from 1.0 (worst) to 5.0 (best)

Fig. 3 Training process for Inception-V3 [32] during transfer learning. The smoothed training accuracy is shown by the dark blue line, the training accuracy by the light blue line, the smoothed training loss by the orange line, and the training loss by the light orange line. The validation accuracy and validation loss are depicted with dashed lines. The final checkpoint, determined by early stopping, is denoted by a double circle. (Color figure online)

3.3 Video-Level Feature Extraction

Information fusion was conducted element by element over the frame-level feature vectors to create a single video-level feature vector for each video sequence. Average, median, minimum, and maximum pooling were considered. Let \(f_i^{(j)} \) denote the ith entry of the jth video frame's feature vector. Furthermore, let \(N_f \) be the number of frames in a video sequence and let M be the length of the frame-level feature vectors. The four pooling strategies can be formally expressed as:

$$\begin{aligned} F_{i}^{avg} = \frac{1}{N_f}\sum _{j=1}^{N_f} f_i^{(j)}, \quad i=1,\ldots , M, \end{aligned}$$
(1)
$$\begin{aligned} F_{i}^{median} = \underset{j=1,\ldots ,N_f}{\mathrm {median}}\, f_i^{(j)}, \quad i=1,\ldots , M, \end{aligned}$$
(2)
$$\begin{aligned} F_{i}^{min} = \min _{j=1,\ldots , N_f} f_i^{(j)}, \quad i=1,\ldots , M, \end{aligned}$$
(3)
$$\begin{aligned} F_{i}^{max} = \max _{j=1,\ldots , N_f} f_i^{(j)}, \quad i=1,\ldots , M, \end{aligned}$$
(4)

where \(F_i\) denotes the ith entry of the video-level feature vector. Consequently, the length of the video-level feature vector is equal to the length of the frame-level feature vector.
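Equations (1)–(4) correspond to element-wise reductions over the frame axis; a minimal NumPy sketch is:

```python
import numpy as np

def pool_video_features(frame_features, strategy='avg'):
    """frame_features: array of shape (N_f, M); returns the video-level vector of length M."""
    ops = {'avg': np.mean, 'median': np.median, 'min': np.min, 'max': np.max}
    return ops[strategy](frame_features, axis=0)
```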

4 Experimental Results and Analysis

The proposed NR-VQA algorithms were evaluated based on their performance on benchmark VQA databases, which are labeled with subjective MOS values representing the overall video quality. The Pearson linear correlation coefficient (PLCC) and Spearman rank-order correlation coefficient (SROCC) were computed between the predicted and ground-truth scores; both are widely accepted performance metrics. The PLCC between two data sets, A and B, is defined as:

$$\begin{aligned} PLCC(A,B) = \frac{\sum _{i=1}^{n} (A_i-{\overline{A}})(B_i-{\overline{B}})}{\sqrt{\sum _{i=1}^{n}(A_i-{\overline{A}})^{2}}\sqrt{\sum _{i=1}^{n}(B_i-{\overline{B}})^{2}}}, \end{aligned}$$
(5)

where \({\overline{A}} \) and \({\overline{B}} \) denote the average of sets A and B, and \(A_i \) and \(B_i \) denote the ith elements of sets A and B, respectively. For two ranked sets A and B, SROCC is defined as:

$$\begin{aligned} SROCC(A,B)=\frac{\sum _{i=1}^{n} (A_i-{\hat{A}})(B_i-{\hat{B}})}{\sqrt{\sum _{i=1}^{n}(A_i-{\hat{A}})^{2}}\sqrt{\sum _{i=1}^{n}(B_i-{\hat{B}})^{2}}}, \end{aligned}$$
(6)

where \({\hat{A}} \) and \({\hat{B}} \) denote the mean ranks of sets A and B, respectively.
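In practice, both correlation coefficients can be computed with standard statistical routines; a short SciPy sketch (assumed tooling) is given below.

```python
from scipy.stats import pearsonr, spearmanr

def evaluate(predicted_mos, ground_truth_mos):
    """Return (PLCC, SROCC) between predicted and ground-truth quality scores."""
    plcc, _ = pearsonr(predicted_mos, ground_truth_mos)
    srocc, _ = spearmanr(predicted_mos, ground_truth_mos)
    return plcc, srocc
```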

Fig. 4 Pooling technique and SVR comparison trained and tested on KoNViD-1k [9] using the Inception-V3 [32] base architecture

4.1 Parameter Study

First, we evaluated the design choices for our proposed method, before comparing it with other state-of-the-art NR-VQA techniques. As mentioned above, two different publicly available databases were used for training and testing purposes, i.e., KoNViD-1k [9] for training and testing, and the LIVE VQA database only for testing. To evaluate the performance of our proposed architecture and the effects of the parameters in the algorithm, we used four different pooling strategies (average, median, minimum, and maximum) and SVRs with different kernel functions (linear, Gaussian, 1st-order polynomial, 2nd-order polynomial, and 3rd-order polynomial). The different versions of our algorithm were assessed based on KoNViD-1k [9] by fivefold cross-validation with ten replicates in the same manner as the study by [14].
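A sketch of this evaluation protocol, assuming scikit-learn and SciPy and omitting the SVR hyperparameter settings (which are not reported here), is shown below; `cross_validate_svr` is a hypothetical helper that evaluates one pooling/kernel configuration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def cross_validate_svr(X, y, kernel='rbf', degree=3, folds=5, replicates=10, seed=0):
    """Five-fold cross-validation with several replicates for one video-level
    feature matrix X (videos x features) and MOS vector y."""
    plcc, srocc = [], []
    for r in range(replicates):
        kfold = KFold(n_splits=folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in kfold.split(X):
            svr = SVR(kernel=kernel, degree=degree).fit(X[train_idx], y[train_idx])
            pred = svr.predict(X[test_idx])
            plcc.append(pearsonr(pred, y[test_idx])[0])
            srocc.append(spearmanr(pred, y[test_idx])[0])
    return float(np.mean(plcc)), float(np.mean(srocc))
```

In this sketch, kernel='rbf' corresponds to the Gaussian kernel, while kernel='poly' with degree 1, 2, or 3 corresponds to the polynomial kernels considered in the parameter study.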

Figures 4, 5, 6 and 7 summarize the results obtained with the different design choices. The architectures based on SVRs with Gaussian kernel functions obtained significantly better results than those with other kernel functions. Furthermore, the SVRs with 3rd-order polynomial kernel functions apparently overfitted the training data because they produced zero or negative PLCC and SROCC values on the test set. The difference between the linear and 1st-order polynomial kernel functions was marginal, whereas the 2nd-order polynomial kernel performed slightly worse than both.

We also evaluated the architectures with and without transfer learning. Transfer learning significantly improved the performance, increasing the PLCC and SROCC by at least 0.1 in all cases except the architectures with the 3rd-order polynomial kernel function. Moreover, average pooling was the best choice in most cases, except in one case where max pooling was the best option. Subsequently, we compared the four best methods with state-of-the-art NR-VQA techniques.

Fig. 5 Pooling technique and SVR comparison trained and tested on KoNViD-1k [9] using the Inception-V3 [32] base architecture

Fig. 6 Pooling technique and SVR comparison trained and tested on KoNViD-1k [9] using the Inception-ResNet-V2 [31] base architecture

Fig. 7 Pooling technique and SVR comparison trained and tested on KoNViD-1k [9] using the Inception-ResNet-V2 [31] base architecture

4.2 Comparison with the State-of-the-Art

We compared seven state-of-the-art NR-VQA methods with our architectures. First, the algorithms were assessed based on KoNViD-1k [9] by fivefold cross-validation in a similar manner to the study by [18]. The PLCC and SROCC values for five baseline methods (Video BLIINDS [24], VIIDEO [19], Video CORNIA [40], FC Model [17], and STFC Model [18]) were those measured by Men et al. [9] and [18], while the results of STS-MLP [43] and STS-SVR [43] were taken from their original publication. The proposed architectures were also assessed based on all the videos in LIVE VQA [28] but without cross-validation because they were trained based on KoNViD-1k [9] and we wanted to demonstrate the generalizability of the proposed method. The PLCC and SROCC values for the baseline methods were those reported in the original studies.

Table 1 shows the comparisons with other state-of-the-art algorithms, which demonstrate that our architecture could also achieve state-of-the-art results without transfer learning. In addition, our fine-tuned CNN-based architectures performed significantly better than the state-of-the-art algorithms. In particular, the PLCC and SROCC values both improved by approximately 0.1. The scatter plots showing the ground-truth MOS values versus the predicted MOS values are depicted in Fig. 8.

For completeness, we also performed a comparison based on the widely used LIVE VQA database [28], which, unlike KoNViD-1k [9], contains artificially distorted video sequences. Furthermore, LIVE VQA contains several videos that are 20 s long, whereas KoNViD-1k typically consists of videos that are approximately 8 s long. This difference between the two databases is a serious limiting factor for our temporally pooled video-level feature vectors. In spite of this, the results demonstrated that our architecture could obtain state-of-the-art results on the LIVE VQA database [28], although it was not employed as the training set. As shown in Table 1, Video CORNIA obtained the best performance on LIVE VQA, outperforming our best proposed method by 0.06 in terms of PLCC and 0.045 in terms of SROCC. It should be noted that, except for the FC model [17] and the STFC model [18], the previous methods were trained with or optimized for artificially distorted sequences, which explains why the ranking of the methods differs between KoNViD-1k [9] and LIVE VQA [28]. Nevertheless, our method still obtained state-of-the-art results on LIVE VQA [28]. Therefore, the experimental results confirm the effectiveness and generalizability of the proposed approach for NR-VQA.

Table 1 Comparison with state-of-the-art NR-VQA algorithms on the KoNViD-1k [9] and LIVE VQA [28] databases
Fig. 8 Scatter plots showing the ground-truth MOS values against the predicted MOS values

4.3 Implementation Details

The proposed algorithm was implemented in MATLAB R2018b, relying mainly on the functions of the Deep Learning Toolbox (formerly Neural Network Toolbox) and the Statistics and Machine Learning Toolbox. It was trained and tested on a personal computer with an Intel Core i7-7700K CPU and an NVidia Titan X GPU. In this environment, the evaluation of a video from KoNViD-1k (7 or 8 s long) takes 13.105–13.323 s on average: loading the trained network onto the GPU takes 1.8 s, frame-level feature extraction takes 11.3 s, compiling the video-level feature vector takes 0.003 s, and SVR regression takes 0.002–0.22 s depending on the applied kernel function.

5 Conclusions

In this study, we developed a novel framework for NR-VQA based on features obtained from pretrained CNNs (Inception-V3 [32] and Inception-ResNet-V2 [31]), transfer learning, temporal pooling, and regression. The main novel aspect and contribution of this study is an NR-VQA architecture that relies on temporally pooled frame-level deep feature vectors and does not require manually derived features. Furthermore, we showed that the deep features extracted from a fine-tuned, pretrained CNN provide effective and rich representations for video quality tasks. Thus, our architecture can be considered a proof of concept for the successful application of deep features extracted from pretrained CNNs to NR-VQA. Our approach was trained and tested on KoNViD-1k, a natural video quality database containing 1200 quality-labeled sequences, and it performed better than the best state-of-the-art solution by approximately 0.1 in terms of both the PLCC and SROCC. Our method was also tested on the LIVE VQA database, where it achieved state-of-the-art results, although the best state-of-the-art technique performed slightly better.