Abstract
Blind quality assessment of user-generated content (UGC), or consumer videos, is a challenging problem in computer vision. Two issues remain open: how to effectively extract high-dimensional spatial-temporal features from consumer videos, and how to appropriately model the relationship between these features and user perception within a unified blind video quality assessment (BVQA) framework. To tackle these issues, we propose a novel BVQA model with spatial-temporal perception and fusion. First, we develop two perception modules that extract perceptual-distortion-related features separately from the spatial and temporal domains. In particular, the temporal-domain features are obtained by combining a 3D ConvNet with residual frames, which capture motion-specific temporal features efficiently. Second, we propose a feature fusion module that adaptively combines the spatial and temporal features. Finally, we map the fused features onto perceptual quality. Experimental results demonstrate that our model outperforms other advanced methods in predicting subjective video quality.
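As a rough illustration of two ideas named in the abstract, the sketch below computes residual frames as differences between adjacent frames and combines two feature vectors with softmax weights. It is a minimal sketch under stated assumptions, not the model proposed in the paper: the function names `residual_frames` and `fuse`, the `(T, H, W, C)` clip layout, and the softmax weighting scheme are all hypothetical choices made for this example.

```python
import numpy as np


def residual_frames(clip: np.ndarray) -> np.ndarray:
    """Return differences between adjacent frames.

    clip has the assumed layout (T, H, W, C); the result has shape
    (T - 1, H, W, C). Casting to float32 first avoids wrap-around
    when the input is stored as unsigned 8-bit integers.
    """
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]


def fuse(spatial: np.ndarray, temporal: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Hypothetical adaptive fusion of two feature vectors.

    logits holds two scores (in a trained model they would be learned);
    a softmax turns them into weights that sum to one.
    """
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    return weights[0] * spatial + weights[1] * temporal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.integers(0, 256, size=(8, 16, 16, 3), dtype=np.uint8)
    print(residual_frames(clip).shape)  # one frame fewer than the input clip
    print(fuse(np.ones(4), np.zeros(4), np.array([0.0, 0.0])))
```

In a full pipeline the residual frames would be fed to a 3D ConvNet to obtain temporal features; here they are only computed, and the fusion weights are fixed by hand rather than learned.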
Data Availability
The datasets generated and/or analysed during the current study are available in the following repositories:
- KoNViD-1k http://database.mmsp-kn.de/konvid-1k-database.html
- LIVE VQC https://live.ece.utexas.edu/research/LIVEVQC/index.html
- YouTube-UGC https://media.withyoutube.com/
The project corresponding to this manuscript is available through the link https://github.com/790578527/STFN.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants 62072110, 61972097, and U21A20472, in part by the Major Science and Technology Project of Fujian Province (China) under Grant 2021HZ022007, in part by the Industry-Academy Cooperation Project under Grant 2021H6022, and in part by the Natural Science Foundation of Fujian Province under Grant 2020J01494.
Ethics declarations
Conflict of interest
The authors declare that they have a conflict of interest with all researchers at Fuzhou University, China.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Abbreviations List
Table 7 shows the abbreviation correspondence in the paper.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Niu, Y., Zheng, Y., Wang, Z. et al. Blind consumer video quality assessment with spatial-temporal perception and fusion. Multimed Tools Appl 83, 18969–18986 (2024). https://doi.org/10.1007/s11042-023-16242-8
DOI: https://doi.org/10.1007/s11042-023-16242-8