1 Introduction

Over the last decade, video has become the dominant component of the Web. In today's world, social media, news channels, conventional television broadcasting and most telephony products are all built upon video services [1, 2, 42]. Analysis has shown [1] that whenever a content provider fails to deliver content at the expected time and quality, users tend to cancel their subscriptions, regardless of whether the service is paid or free. According to a recent whitepaper from Akamai [2], with a 5-second delay in starting video playback a publisher may lose a quarter of its audience, and with a 10-second delay nearly half of the audience is ready to leave.

In an ideal world, where each user sends information about their experience, it would be easy to translate this instant feedback on users' feelings into system and network parameters that increase customer satisfaction. However, only a very small percentage of consumers provide instant feedback about their service experience. Even so, this information can be turned into valuable feedback, and many frontrunner companies such as Facebook, WhatsApp and Skype frequently employ these methodologies.

The aim of this paper is to answer the question of how to measure user experience and correlate it to objective video consumption and system capacity parameters. Unlike other research works [5, 9, 28] that use time-invariant models, the real-time Quality of Experience (QoE) of an online video is predicted from the correspondence between quantitative and qualitative observations using machine-learning methods. The capability to measure a single user's QoE leads to a measure of the quality of the whole service. The ability to compare the service QoE delta between two different moments is the key to provisioning the resources that constitute an online video delivery system.

The remainder of the paper is structured as follows: Section 2 gives a brief description of the components of an online video platform, Section 3 outlines state-of-the-art QoE questionnaire implementations and Section 4 provides information on related work on QoE. Section 5 presents the contributions of the project and Section 6 specifies the details of the system implementation for the online video platform. Section 7 clarifies the experiment methodology, postulates the derivation of overall system QoE from single-user QoE and discusses the comparison of the models. Sections 8, 9 and 10 discuss supervised machine learning, model performance comparison and online video platform capacity estimation. Finally, conclusions and future work are presented in Section 11.

2 On-line video platform and QoE

As shown in Fig. 1, the generalized view of an online video platform consists of the following components: consumer end-user device (mobile, tablet or PC), browser or other player software, network layer, content delivery network (CDN), load balancer, web services, and video platform.

Fig. 1 A generalized view of online video platform components

In a procedure as complex as online video delivery, there can be several bottlenecks that might cause deterioration in delivering the content to the consumer. These include player and consumer-device related errors [15], network congestion [26], video encoding and adaptation related quality drops [9, 31], CDN problems [10], and user and social context factors [5, 47].

For any of these scenarios, when users are not satisfied with the service, instant feedback to the actual product owner plays a crucial role: subscribers can report their experience of the service instantaneously through a subjective QoE survey, which can drive the right changes by the system operator. This can save time, profit and, practically, the entire business. However, following a poor experience, only a very small number of users are willing to share their feelings. For instance, users who face a long initial buffering duration while trying to watch a YouTube video [15, 26] are more reluctant to answer a user survey. To overcome these conditions, service providers need a mechanism to estimate what might have gone wrong in the actual workflow by comparison with past well-known conditions collected from trustworthy observers, in order to improve customer satisfaction.

3 State-of-the-art QoE questionnaire implementations

Most online services currently use QoE analysis and base their product quality measurements on the QoE assessments they receive from users. Nowadays, it is fairly common to see a user survey at the end of a Skype call or to come across a Facebook satisfaction questionnaire about the news feed.

Skype uses a one-to-five-star survey to grade overall call quality, and Facebook asks whether the user is satisfied with the news feed content. The popular instant messaging and telephony application WhatsApp follows a similar pattern and frequently asks users about the service quality, with the additional feature of logging personal experience by asking "Tell us more", as shown in Fig. 2. Measuring the overall success of a service on the basis of a 5-minute call or a social media news feed is a challenging task, as these are comprehensive concepts to be covered by a single-value evaluation methodology. The same applies to any online video delivery service; it consists of interactions between many different and complex tiers.

Fig. 2 QoE questionnaires for popular on-line applications (Skype, Facebook and WhatsApp)

Despite the widespread belief to the contrary, the objective of these methodologies is not to understand a single user's perception but, through induction of a real-time model, to evaluate the quality perceived by clusters of users distributed across different geographical regions. Based on this information, service providers can take action and reconsider their resource management mechanisms at different layers of the service, including cloud, network, load balancing, routing and CDN. Ultimately, this enhances the overall success of the online service.

4 Related work

Over-the-top (OTT) technologies bring more content to users than ever before. Still, a higher QoE may matter more than the content itself [10]. In this section, both academic and industrial works on the impact of QoE on OTT are discussed.

M. Knoll et al. have provided a Mean Opinion Score (MOS) model for OTT services [28], given as Eq. 1, where x stands for the number of stalls, t for the time since the last stall and a for the memory parameter (set to 0.14).

$$ \mathrm{MOS}={e}^{-\frac{x}{5}+1.5-a\sqrt[e]{t}} $$
(1)

This equation provides a basic understanding of a single user's perception and relates it mainly to the number of stalls during the watch session. However, by its nature the model cannot reflect a time-varying understanding of the experience, and it clearly reflects a standalone, single-user-centric perception. In the ITU-T P.1203.3 recommendation [26], a media session quality score is formulated based on the number of stalls, total stall duration, buffering duration, media length and compression quality, as given in Eq. 2.

$$ \mathrm{SI}={e}^{-\frac{numStalls}{S1}}.{e}^{-\frac{\left(\frac{totalbuf}{T}\right)}{S2}}.{e}^{-\frac{\left(\frac{bufdur}{T}\right)}{S3}} $$
(2)

Both Eqs. 1 and 2 reflect a single user's QoE by correlating it to video metrics. The environment they reflect is only the user and the single consumer device that has been used; the medium used to transmit the video data is not taken into consideration. From a service provider's perspective, modelling a single user's perception does not induce a valid model for the delivery system. In this work, the primary task is to provide a methodology that relates video metrics to end-to-end system parameters.
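For illustration, both time-invariant models can be written down directly in code. The following is a minimal Python sketch of Eqs. 1 and 2; the constants S1-S3 of Eq. 2 are left as parameters, since the recommendation defines them per playout context, and all argument names are ours.

```python
import math

def mos_knoll(x: float, t: float, a: float = 0.14) -> float:
    """Eq. 1 (Knoll et al. [28]): MOS from the number of stalls x,
    the time t since the last stall and the memory parameter a."""
    return math.exp(-x / 5 + 1.5 - a * t ** (1 / math.e))

def si_p1203(num_stalls: int, total_buf: float, buf_dur: float,
             media_len: float, s1: float, s2: float, s3: float) -> float:
    """Eq. 2 (ITU-T P.1203.3 [26]): stalling indicator from the number of
    stalls, total stall time and buffering duration, the latter two
    normalised by the media length T. S1-S3 are model constants."""
    T = media_len
    return (math.exp(-num_stalls / s1)
            * math.exp(-(total_buf / T) / s2)
            * math.exp(-(buf_dur / T) / s3))
```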

C. Li et al. have presented a QoE-driven mobile edge caching methodology [32] where, for user u ∈ U using server s ∈ S, ΔT is the fraction of a video file that is required to be buffered, as given by Eq. 3. The initial startup delay constraint requires that the waiting interval between submitting a request and the actual video playback must not exceed the maximum tolerable waiting time of that user, which is denoted as \( {\mathrm{d}}_u^s \).

$$ {\mathrm{d}}_u^s=\frac{{\mathrm{R}}_{f,m}.\Delta T}{c\left(s,u\right)},\forall u\in U,\forall s\in S $$
(3)

Rf,m refers to the bitrate of video file f at transcoding rate m. The downlink transmission rate of the wireless link from server s to user u is denoted by c(s,u). This model provides a good understanding of the impact of initial delay and resolution on the user's QoE. Yet, it lacks the ability to consider the stall duration and the total number of stalls that occur throughout the watching experience. This paper provides a broader understanding of user QoE with respect to a wide variety of video metrics, including total stall duration, number of stalls, initial buffering and resolution simultaneously, through machine learning modelling. L. Zhou has published [46] a QoE-oriented analytical delay indicator for video streaming systems based on a fluid framework model. Fluid dynamics can simulate the watching experience well, since video streaming is expected to resemble "a flowing experience" that circumvents holdups and interruptions. However, as the author points out in the conclusion, "a more practical user response should be considered". In comparison to [46], our work provides a practical, applicable, easy-to-integrate methodology for any OTT delivery platform.
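Returning to Eq. 3, the constraint is simple enough to check numerically. The sketch below evaluates the startup delay for an assumed representation bitrate and link rate; the figures are invented for the example and are not taken from [32].

```python
def startup_delay(bitrate_bps: float, delta_t_s: float,
                  link_rate_bps: float) -> float:
    """Eq. 3 (Li et al. [32]): time to buffer the first delta_t seconds of a
    video encoded at R_{f,m} over a link of rate c(s, u)."""
    return bitrate_bps * delta_t_s / link_rate_bps

# Example: buffering 4 s of a 2 Mbit/s representation over a 10 Mbit/s link.
# The result (0.8 s) must stay below the user's maximum tolerable waiting time.
print(f"startup delay: {startup_delay(2e6, 4.0, 10e6):.2f} s")
```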

F. Wamser et al. [42] have provided an extensive collection of objective models that allow network operators to better understand the OTT traffic in their networks, to predict the playback behaviour of the video player, and to assess how efficiently they deliver OTT videos to their customers. In this model, network measurements including bandwidth capacity, download duration of a block and request duration of a block have been considered. Although network parameters are considered, measurements are taken only within the user domain, and a conclusion about the QoE of the whole service is not possible.

In a recent whitepaper from Cloudstreet [10], a connected-city scenario is described in which many users with different quality and service expectations try to access OTT services. The company introduced a solution in which a cloud bandwidth-auctioning algorithm makes intelligent priority determinations in real time and effectively provisions assured QoS/QoE. Gomez et al. [15] presented an Android application that is able to evaluate and analyse the perceived QoE of the YouTube service on wireless terminals. Their application carries out measurements of objective Quality of Service (QoS) parameters, which are then mapped onto subjective QoE (in terms of MOS) by means of a utility function.

The research works [15, 28, 42] have defined and analysed QoE from a content generation and segment size point of view, relating it to picture quality only. In contrast, the QoE definition of this paper is in line with [10, 42]: the concept is analysed from the service provider's perspective, where models are real-time and targeted at clusters of users instead of a single user. Rather than a measure of picture quality only, QoE is used as a quantity that measures the perception of the whole end-to-end service.

5 Contributions

This paper is based on an experimental QoE platform [8]. The main intention of this work is to provide a methodology for measuring the QoE of an online video system and determining QoE capacity from the service provider's point of view. To achieve this, an online platform has been developed to measure single-user QoE with the following properties:

  1. A video service is implemented to provide random movie trailers and can serve multiple users simultaneously.

  2. Different users can watch different content at the same time.

  3. Users can stop watching a content anytime they desire and continue with another random content.

  4. The platform's resources are randomly reconfigured, changing the throughput and latency of the service; this corresponds to changes in the stalling and buffering behaviour of the user's video experience.

  5. Video metrics (active watch duration, number of stalls, total stall duration, initial buffering duration), online video platform resource parameters (goodput, latency) and subjective QoE information (QoEoverall, QoEstalls, QoEinitial) are collected for each session.

A replica of the platform is available through Amazon Web Services (AWS) EC2 and accessible via www.utkubulkan.co.uk/qoe.html. The QoE database for the online video delivery platform is available for public access at www.utkubulkan.co.uk/qoedatabase.php.

The Virtual Machine (VM) instance runs the collection of applications necessary for online streaming: Apache 2 web server, PHP 7.0 interpreter, MySQL 5.7 database and a catalogue of video content, as presented in Fig. 3.

Fig. 3 VM instance application layout

The online video platform workflow is presented in Fig. 4, and the subjective QoE survey used in the online video platform is presented in Fig. 5, where the subjects are queried for their opinions about the overall experience, the stalls and the initial loading time of the watch session.

Fig. 4 Online video platform workflow

Fig. 5 Subjective user survey

The inputs and outputs are used to train, cross-validate and test three different machine learning models, ANN, KNN and SVM, to predict the QoE of a single user. Finally, the single user's QoE is used to evaluate QS, the online video platform's QoE value. QS and its relationship with network parameters, including goodput and latency, will be evaluated. This provides a fundamental understanding of QoE and end-to-end delivery requirements.

6 QoE ecosystem implementation

The proposed QoE ecosystem consists of five main components: Client, Web Services, Video Platform Manager, Video Streaming Service and QoE Database. The workflow diagram illustrated in Fig. 4 shows the interactions between these components and their influence on calculating QoE.

6.1 Client

The client can be either a mobile device or a personal computer that runs a web browser capable of playing MPEG-DASH content over the Hypertext Transfer Protocol (HTTP). Dynamic Adaptive Streaming over HTTP (DASH), also known as MPEG-DASH [20], is an adaptive bitrate streaming technique that enables high-quality streaming of media content over the Internet, delivered from conventional HTTP web servers. MPEG-DASH works by breaking the content into a sequence of small HTTP-based file segments, each containing a short interval of playback time of content that is potentially many hours in duration. MPEG-DASH is the first adaptive bitrate HTTP-based streaming solution that is an international standard.

6.1.1 MPEG-DASH player

Current browsers are not able to handle MPEG-DASH streams by default. They need a JavaScript library such as "dash.js" [29, 34] or Google's "Shaka" player [43] to understand, parse and feed the chunks of video to the HTML5 video player. Without loss of generality, Google's Shaka player has been used here, together with the "video.js" [16] libraries, for MPEG-DASH manifest parsing and stream injection into the browser's player.

6.1.2 Browser support

A mobile or PC client must use a browser with HTML5 capability and Media Source Extensions (MSE) [35] to support MPEG-DASH players for playing the content available on the streaming platform. The minimum browser versions that support MSE are Firefox 42, Google Chrome 33, Microsoft Internet Explorer 11 and Safari 8 [35].

6.1.3 Video metric collection

A player application has been developed using JavaScript and PHP that runs on the client and gathers statistics for the video metrics. These metrics can also be monitored by enabling the statistics debug mode. An example screenshot of the statistics that the video player application shows in debug mode is given in Fig. 6.

Fig. 6 Video player application

6.2 Web services

According to the sequence diagram in Fig. 4, the user requests information about the video services, and an HTTP conversation is initiated from the client to the web server. The web server replies with the location of the MPD (media presentation description) manifest for the MPEG-DASH content. A CDN consists of many different devices, and hence IP addresses, which requires access to many different computers and domains. For this reason, Cross-Origin Resource Sharing (CORS) [18, 30] has been configured to avoid access inconsistencies.

6.3 Video platform manager

6.3.1 Transcoding and MPEG-DASH manifests

Manifest files can be created in many different ways [12, 39, 44, 45]. In this work, FFmpeg has been used to transcode the content on the on-line video platform [45], and MP4Box [39] has been used for the DASH manifests. Major platform suppliers that provide this capability as SaaS include Wowza Streaming Server [44], thePlatform [11] and Kaltura [12]. These platforms act as companion companies for actual content suppliers such as FOX or ESPN and provide solutions to play their content on all screens and devices.

6.3.2 Video catalogue

The video catalogue consists of 10 different film trailers in the following genres: science fiction, drama, comedy, documentary and action. The duration of each trailer ranges from 2 to 3 minutes. Trailers tend to be short and attention-grabbing while providing an exemplification of the entire film. Additionally, their availability for public download makes them appropriate candidates for use in a scientific research environment. In these experiments, the catalogue has been transcoded into 5 different resolutions (180p, 360p, 480p, 720p, 1080p) with H.264 encoding using libx264 with the main profile and adaptive bitrate [14]. All these industry-standard resolutions are explicitly defined in the MPEG-DASH MPD manifest, following a method similar to the one YouTube and Vimeo use to support adaptive bitrate content streaming.
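As an illustrative sketch of how such a ladder can be produced, the Python snippet below drives FFmpeg for the five resolutions and then MP4Box for the DASH manifest. The target bitrates, file names and exact flag set here are assumptions for the example, not the precise commands used on the platform.

```python
import subprocess

# Assumed (height, video bitrate) ladder for the example.
LADDER = [(180, "400k"), (360, "800k"), (480, "1200k"),
          (720, "2500k"), (1080, "4500k")]

def transcode(src: str) -> list[str]:
    """Transcode one trailer into an H.264 ladder with libx264 (main profile)."""
    outputs = []
    for height, bitrate in LADDER:
        out = f"{src.rsplit('.', 1)[0]}_{height}p.mp4"
        subprocess.run(["ffmpeg", "-y", "-i", src,
                        "-c:v", "libx264", "-profile:v", "main",
                        "-b:v", bitrate, "-vf", f"scale=-2:{height}",
                        "-c:a", "aac", out], check=True)
        outputs.append(out)
    return outputs

def make_dash_manifest(renditions: list[str], mpd: str = "trailer.mpd") -> None:
    """Build the MPEG-DASH MPD with MP4Box (4 s segments, random access points)."""
    subprocess.run(["MP4Box", "-dash", "4000", "-rap", "-out", mpd] + renditions,
                   check=True)
```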

6.4 Video streaming server

The Linux-based MPEG-DASH streaming server provides the content to the clients. It interacts with the netem network emulator [27], which limits network throughput and introduces delay in order to simulate real-life scenarios such as mobile or PC applications working over wireless and mobile networks [3]. Changes in network conditions force the DASH players in the client devices to switch to a more suitable bitrate.
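A minimal sketch of this kind of netem reconfiguration is shown below. It assumes a Linux host with root privileges and an interface named eth0, and uses the standard tc/netem delay and rate options; the concrete delay and rate values are illustrative.

```python
import random
import subprocess

IFACE = "eth0"  # assumed egress interface name

def set_network(delay_ms: int, rate_mbit: float, first: bool = False) -> None:
    """(Re)configure netem on the egress interface: fixed delay plus rate limit."""
    action = "add" if first else "change"
    subprocess.run(["tc", "qdisc", action, "dev", IFACE, "root", "netem",
                    "delay", f"{delay_ms}ms", "rate", f"{rate_mbit}mbit"],
                   check=True)

# Randomly reassign latency and goodput at runtime, as in the experiments.
set_network(delay_ms=20, rate_mbit=8, first=True)
set_network(delay_ms=random.choice([20, 100, 300]),
            rate_mbit=random.choice([0.5, 2, 8]))
```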

6.5 QoE database and video metrics collection

After the video ends, the user is queried about subjective metrics related to the user experience. The following subjective QoE values are collected: QoEoverall (overall customer satisfaction with the whole experience), QoEstalls (level of stalls from the user's perspective) and QoEinitial (initial buffering time in the user's perception). Figure 5 shows the subjective user survey presented to the user at the end of each watched content. The database stores the following information for each content: trailer name, watch duration, initial buffering duration, bitrate, number of stalls and total stall duration.
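The per-session record can be summarised as a small data structure. The sketch below is an assumed Python shape for one row of the QoE database; the field names are illustrative rather than the platform's actual column names.

```python
from dataclasses import dataclass

@dataclass
class SessionRecord:
    """One watched session as logged by the platform (illustrative fields)."""
    trailer_name: str
    watch_duration_s: float      # active watch duration W
    initial_buffering_s: float   # initial loading time t_lat
    bitrate_kbps: float          # average stream bitrate B
    num_stalls: int              # St
    total_stall_s: float         # St_dur
    goodput_mbit: float          # service goodput during the session
    latency_ms: float            # service latency during the session
    qoe_overall: int             # subjective scores on a 1..5 scale
    qoe_stalls: int
    qoe_initial: int
```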

7 Experiment methodology

In recent years, QoE-aware service experimentation has diverged into two mainstream techniques: traditional lab testing and crowdsourcing. L. Anegekuh et al. [4] have discussed in their recent paper how crowdsourcing is increasingly preferred to lab testing. Without a doubt, crowdsourcing has emerged as a cheaper and quicker alternative to traditional lab testing, where people are recruited from different geographical regions with ubiquitous Internet access to perform low-cost subjective video quality tests [13]. However, crowdsourcing introduces uncertainty in the network/connectivity parameters and the reliability of the observers, which might introduce a bias, considering the tendency of people to criticize more than praise. T. Hoßfeld et al. have stated [17] that, in general, every crowdsourcing task suffers from bad-quality results. Moreover, even if the task is designed effectively, subjects might still submit unreliable and misleading survey results [41]. Establishing a trustworthy cluster of subjects (either paid or voluntarily registered) across distributed geographical locations who access the service via different network operators would establish a good understanding of the QoE of the service. Decisively, to keep full control over network monitoring capabilities and user consistency, the lab-testing methodology is preferred here, as shown in Fig. 7.

Fig. 7 Lab testing (left) and implementation into a real-life scenario

The focus of this paper is the QoE of an online video system rather than a single user's perception. From the perspective of a service provider, the metrics of a reliable user's opinion are the basic building block for training the model of the video delivery platform's QoE. However, the ultimate goal is not to measure and act upon the satisfaction of each particular customer but upon that of the whole service, from the provider's point of view. Measuring the real-time performance of an online service therefore requires QoE to be modelled as a function of time, considering the number of requests at an instant and their impact on service throughput and latency.

7.1 Test subjects and equipment

The subjects who participated in the experiment are undergraduate and postgraduate students at London South Bank University. A total of 30 users participated in the testing evaluation. The testers used 10 different consumer devices, including a variety of mobile phones (Samsung S3, S4 and Note 3 and Sony Xperia XZ, which have a resolution of 1920 × 1080; HTC 10, 2560 × 1440) and personal computers (Dell Latitude e6410, 1280 × 800; MacBook, 2560 × 1600; HP EliteBook 8460, 1366 × 768; ProBook 430, 1366 × 768), with Firefox or Safari browsers used depending on the OS. All devices in the lab testing are connected to a TP-Link TD-W8961N wireless router with 300 Mbps throughput. The video services and web services run on Ubuntu 16.04 as virtual machines via VirtualBox on an HP EliteBook with 8 GB of RAM and an Intel i5 processor.

The left part of Fig. 7 depicts the lab-testing methodology and its relationship to the real-life online video system scenario. The subjects represent the recruited observers on the map, drawn from the clusters of users receiving the online video service. Each trustworthy user reflects the QoE of the system for a particular network operator and CDN, and the controlled lab environment guarantees valid monitoring of the network metrics.

7.2 Information about movie trailers

The purpose of a movie trailer is to provide an overview of the context of the motion picture using selected shots from the film being advertised [23]. A trailer has to achieve this in less than 2 minutes and 30 seconds, the maximum length allowed by the Motion Picture Association of America (MPAA). Each studio or distributor is allowed to exceed this time limit once a year, if they feel it is necessary for a particular film.

7.3 Test methodology

During the experiments, subjects request content via their MPEG-DASH players, and the video platform provides random movie trailers with durations ranging from 2:18 to 2:54: Theory of Everything (2:29), Thor II (2:27), Star Wars 7 (2:18), Saving Mr. Banks (2:54), Back In Time (2:32), James Bond Spectre (2:31), The Intern (2:30), Independence Day Resurgence (2:31).

The content is streamed asynchronously and unicast to each client, and the test is performed for all participants concurrently, as shown in Fig. 8. Each subject might watch a different content at a time. Users can start watching, stop, or exit in the middle of a session whenever they desire. At runtime, the server's goodput and latency are reassigned randomly, which may cause some users to stall and wait for the service to become available again.

Fig. 8 A diagram of the proposed QoE forecasting methodology [8]

In computer networks, goodput is the application-level throughput, corresponding to the number of useful information bits delivered by the network to a certain destination per unit of time [21]. This capability simulates the service of an actual online video broadcasting system, where the load and the number of requests on the system vary in time.

At any moment, if subjects desire to quit watching (for any reason: the number of stalls, the stall duration, or even unwillingness), they simply press "Exit Watching" and proceed to the QoE survey. Each time a video is watched and the QoE survey is submitted by the user, the video metrics, active watch duration, service goodput and latency for the session are logged.

7.4 Derivation of overall online video system QoE

The ultimate aim of this paper is to bring forward a scientific methodology to evaluate the QoE of the video delivery system from a single subject-oriented QoE. In order to establish this association, a user's QoE has to be defined in terms of video quality metrics. Equation 4 is the abstract representation of the input-output relationship on which the machine learning methods SVM, KNN and ANN of Section 8 are based. For the functions in this section, the definitions of the variables are declared in Table 1 as a list of notations.

Table 1 List of notations

For a user using the service on virtual machine v running on physical machine m, the single user's QoE \( {\mathrm{Q}}_u^{v,m}(t) \) is represented by Eq. 4 as a function of W (total watch duration), B (average bitrate of the stream), St (number of stalls), Stdur (time spent during stalls) and tlat (the amount of time needed to load the content).

$$ {\mathrm{Q}}_u^{v,m}(t)=\mathrm{Q}\left(W,B, St,{St}_{dur},{t}_{lat}\right) $$
(4)

An online video platform consists of several distributed video servers and CDN nodes. This requires identifying, for each particular user, the physical and virtual server that provides the service for that user. The users receive this service from a Virtual Network Function (VNF) υ ∈ V, and this VNF runs on a physical machine m ∈ M at moment t. The QoE for υ can be defined as Eq. 5:

$$ {\mathrm{Q}}_v^m(t)=\frac{1}{U}\sum \limits_{u=1}^U{\mathrm{Q}}_u^{v,m}(t) $$
(5)

Conclusively, the QoE of the service can be expressed through the success of all υ ∈ V that build up the entire system:

$$ {\mathrm{Q}}_S(t)=\frac{1}{V}\sum \limits_{v=1}^V{\mathrm{Q}}_v^m(t) $$
(6)

In order to reflect the local behaviour of QoE, QV and eventually QS are calculated with a central moving average pCMA [22, 40] that spans the QU dataset declared in Table 2, as given by Eqs. 7 and 8.

Table 2 Lab session data collected via online video platform
$$ {\mathrm{p}}_{CMA}=\frac{{\mathrm{p}}_M+{\mathrm{p}}_{M-1}+\dots +{\mathrm{p}}_{M-\left(n-1\right)}}{n} $$
(7)
$$ {\mathrm{p}}_{CMA}={\mathrm{p}}_{CMA, prev}+\frac{p_M}{n}-\frac{p_{M-n}}{n} $$
(8)
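A compact sketch of Eqs. 5-8 follows: the per-VNF QoE is the mean over its users, the system QoE is the mean over the VNFs, and the series is smoothed with a window of n points, including the recursive update of Eq. 8. This is plain Python under the assumption that the QoE values are given as nested lists.

```python
def vnf_qoe(user_qoes: list[float]) -> float:
    """Eq. 5: mean QoE over the U users served by one VNF."""
    return sum(user_qoes) / len(user_qoes)

def system_qoe(per_vnf: list[list[float]]) -> float:
    """Eq. 6: mean of the per-VNF QoE values over all V VNFs."""
    return sum(vnf_qoe(v) for v in per_vnf) / len(per_vnf)

def central_moving_average(series: list[float], n: int) -> list[float]:
    """Eqs. 7-8: sliding mean of the last n points, updated recursively."""
    out = [sum(series[:n]) / n]                                  # Eq. 7
    for m in range(n, len(series)):
        out.append(out[-1] + series[m] / n - series[m - n] / n)  # Eq. 8
    return out
```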

In this context, a single υ has been used, so QS and QV refer to the same entity. As future work, a load balancer mechanism, support for multiple CDNs and edge cache node support will be implemented for the analysis of system quality attributes: scalability, resilience, responsiveness and availability.

7.5 Results and discussions

Six lab-testing sessions have been conducted with the subjects, each lasting about 30 minutes. All participants used the limited resources of the online video platform simultaneously. The collected data have been used for modelling with the K-Nearest Neighbours (KNN), Artificial Neural Network (ANN) and Support Vector Machine (SVM) algorithms on a MacBook Pro running Matlab R17 with an i5 processor and 16 GB of RAM.

8 Supervised machine learning using objective metrics and subjective survey

8.1 SVM, support vector machines

Support Vector Machines (SVM) categorize data by discovering the linear decision boundary (hyperplane) that separates all data points of one class from those of the other class [33]. Once the model parameters are identified, the SVM relies only on a subset of the training cases, termed support vectors, for future estimations [6]. An increase in the weight c (box constraint) causes a stricter separation of the classes; however, this may increase the rate of false assumptions in the classification.

8.2 KNN, k-nearest neighbor classification

The K-Nearest Neighbour (KNN) classification technique categorizes objects according to the classes of their nearest neighbours [33]. KNN forecasts are based on the supposition that objects near each other are similar. During the learning phase, the best number of similar observations has been chosen [36]. In order to ensure that models generated using different values of k do not overfit, separate training and cross-validation sets have been used.

8.3 ANN, artificial neural networks

Inspired by the human brain, a neural network consists of a highly connected network of neurons that relates the inputs to the desired outputs [33]. ANNs are quite efficient for modelling highly nonlinear systems, including those where unexpected changes are anticipated in the input data [37].

8.4 Training methodology

For any attempt to use machine learning to simulate the behaviour of a function, the methodology of training on the available dataset plays a crucial role in the soundness of the mathematical endeavour [38]. In this work, three different machine learning methods have been employed, and this section clarifies the training phase.

In order to train an SVM, a couple of foundational decisions must be taken regarding how to preprocess the data and which kernel to use [7]. In this work, a cubic kernel has been used for the SVM. Three different values of the box constraint have been evaluated, and the results are presented in Table 3.

Table 3 QoE error analysis for machine learning methods for different settings

For the KNN models used in this work, the distance metric has been selected as Euclidean with equal distance weights. The accuracy of three different neighbour settings is presented in the results table.

The ANN models have been built with three different hidden-neuron settings: 8, 10 and 12. The network is trained with the Levenberg-Marquardt training algorithm, modifying the strengths of the connections so that given inputs map to the correct response.

During lab testing, over 400 watched-session records have been collected covering the input and output relations, where the user's QoE \( {\mathrm{Q}}_u^{v,m}(t) \) is modelled with the parameters W, B, St, Stdur and tlat as given in Eq. 4. The data collected from these experiments are used for training the models and for cross validation. A set of these streaming sessions is presented in Table 2. Based on this dataset, the following section presents confusion matrices for the different machine learning models, where predicted values are compared against true values. The classes given in the confusion matrices in Section 8.5 refer to the subjective QoE evaluation labels: 1 - Very Bad, 2 - Bad, 3 - Moderate, 4 - Good and 5 - Very Good.
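A minimal sketch of this training pipeline is given below, using scikit-learn as a stand-in for the Matlab toolchain actually used in the paper. The features are the five inputs of Eq. 4 and the target is the 1-5 QoE class; the CSV file name and layout are assumptions. Note also two approximations: scikit-learn's KNN offers inverse-distance rather than squared-inverse-distance weighting, and its MLP does not provide Levenberg-Marquardt training.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One row per session: W, B, St, St_dur, t_lat, QoE class (1..5).
X = np.loadtxt("qoe_sessions.csv", delimiter=",", usecols=range(5))
y = np.loadtxt("qoe_sessions.csv", delimiter=",", usecols=5, dtype=int)

models = {
    "KNN, 10 neighbours": KNeighborsClassifier(n_neighbors=10,
                                               weights="distance"),
    "SVM, cubic kernel, c=3": make_pipeline(StandardScaler(),
                                            SVC(kernel="poly", degree=3, C=3)),
    "ANN, 10 hidden neurons": make_pipeline(StandardScaler(),
                                            MLPClassifier(hidden_layer_sizes=(10,),
                                                          max_iter=2000)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # k-fold CV with k = 5
    print(f"{name}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```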

8.5 Confusion matrix

The confusion matrix (also known as the error matrix) shows the distribution of correct match rates for predicted versus true classes, as shown in Figs. 9, 10 and 11. The true positive rate reflects the level of correct hits, while the false negative rate gives the miss percentage. The KNN QoE model has shown the best accuracy, with a setting of 10 neighbours.

Fig. 9 Confusion matrix for the KNN QoE model. Weighted KNN has been implemented with a setting of 10 neighbours. The distance metric is Euclidean with squared inverse distance weighting. The accuracy, a true positive rate of 82.6%, is presented via the confusion matrix

Fig. 10 Confusion matrix for the SVM cubic kernel model, with box-constraint value c = 3

Fig. 11 Confusion matrix for the ANN with 10 neurons. The network is trained with the Levenberg-Marquardt training algorithm

This may be due to lazy learning and KNN's capability to distinguish neighbouring class features when strict clustering is not possible across the dataset because of bias. Still, the performance of SVM and ANN is very close, and the choice among these methods should rely on empirical confirmation for each test setup and session.

8.6 Experiment dataset and error analysis

Regarding the calculated and actual qualitative values, the error has been measured with three different methods [25]: Pearson correlation (Eq. 10), root mean square error (Eq. 11) and mean absolute error (Eq. 12). The definitions of the arguments in these equations are explicitly described in Table 1 as a list of notations.

The Pearson correlation measures the linear association between a model's performance and the subjective QoE. It provides a standard scale from −1 to 1: 1 indicates a total positive correlation, 0 no linear correlation and −1 a total negative correlation.

$$ \mathrm{pcc}=\frac{\sum_{i=1}^N\left({X}_i-\overline{X}\right)\ast \left({Y}_i-\overline{Y}\right)}{\sqrt{\sum {\left({X}_i-\overline{X}\right)}^2}\ast \sqrt{\sum {\left({Y}_i-\overline{Y}\right)}^2}} $$
(10)

Root mean square error is the square root of the average of the squared errors. Despite a common misconception, it does not reflect the average error. Because the errors are squared, larger errors have a greater impact on the RMSE. A lower RMSE value indicates better agreement between the model's predictions and the actual values.

$$ \mathrm{rmse}=\sqrt{\frac{1}{N-d}\sum \limits_N{\left({Y}_i-{\overline{Y}}_i\right)}^2} $$
(11)

Mean absolute error provides a simple analysis of the average difference between predicted and real values. Each error contributes in proportion to the absolute difference between the actual and calculated values.

$$ mae=\frac{1}{n}{\sum}_{i=1}^n\mid {Y}_{actual_i}-{Y}_{calculated_i}\mid $$
(12)
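The three error measures of Eqs. 10-12 can be computed directly. Below is a small NumPy sketch; d in the RMSE denominator defaults to 0, i.e. no degrees-of-freedom correction is assumed.

```python
import numpy as np

def pcc(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. 10: Pearson correlation coefficient, in [-1, 1]."""
    xd, yd = x - x.mean(), y - y.mean()
    return float((xd * yd).sum()
                 / (np.sqrt((xd ** 2).sum()) * np.sqrt((yd ** 2).sum())))

def rmse(actual: np.ndarray, predicted: np.ndarray, d: int = 0) -> float:
    """Eq. 11: root mean square error with optional correction term d."""
    return float(np.sqrt(((actual - predicted) ** 2).sum() / (len(actual) - d)))

def mae(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Eq. 12: mean absolute error."""
    return float(np.abs(actual - predicted).mean())
```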

8.7 K-fold cross validation

K-fold cross validation (CV) is a model validation technique that partitions the dataset into equal-sized subsets. A single subset is used as validation data while the remaining k − 1 subsets are used as training data. Repeating this over the whole dataset k times guarantees that each subset serves as validation data exactly once. The results from the folds can be averaged to produce a single estimate [19]. In this work, three different cross-validation training strategies have been conducted, with k-fold values of 3, 5 and 10, and two of them are presented.

9 Model performance comparison

The performance of the SVM QoE model changes considerably with different box-constraint values in Matlab's fitcecoc function [33]. The best SVM results for Pearson correlation have been achieved with c = 3, while the RMSE is better for c = 5. Additionally, k-fold cross validation with k = 5 subsets gives better values than with k = 10.

ANN gives good results for real-time analysis due to its dynamic programming nature and continuous training capability, which makes it a strong candidate for a QoE modelling system implementation. Despite its real-time capabilities, however, the ANN provides the worst QoE modelling performance on this dataset when compared with the other methods. With a setting of 10 hidden neurons, the ANN provides an estimation performance of pcc ≅ 0.89 and rmse ≅ 0.2.

KNN, although a lazy learning methodology, shows the best results of all three methods with the 10-neighbour setting and k-fold = 5 (pcc ≅ 0.91, rmse ≅ 0.19, mae ≅ 0.4622). Commonly, SVM and ANN provide better solutions than KNN for nonlinear variables. However, due to the nature of our methodology, the bias that may be introduced by users' subjective observations might prevent the SVM from clearly separating the input classes, whereas KNN appears to perform better in mimicking the neighbouring classes.

The time-invariant models of ITU-T P.1203.3 [26] and Knoll et al. [28] have shown parallel behaviour and reflect a single user. From an overall perspective, however, the machine learning methods provide a better understanding of QoE trends, owing to their ability to learn and cross-validate directly from the same dataset.

Principally, in subjective tests that require long periods of testing, one of the key factors that must be considered is the exhaustion of the test subjects, which may cause unreliable MOS values. In order to avoid such misleading conclusions, precautions such as considering the user's intention to watch a particular genre, or the user's willingness to participate in the experiment at any point, are taken during the experimentation, as discussed by B. Gardlo et al. [13].

10 Online video platform QoE capacity estimation

The prime intention of this paper is to measure QoE against the capacity parameters of the online video delivery platform. In order to achieve this, the single-user experience is taken as the elementary unit. After training a model for Qu, the system-wide QoE is calculated. The relationship of QS with the online video delivery platform's goodput and latency is shown in Fig. 12. Equation 13 is a cubic polynomial function in its generalized form [24]. In order to fit QS, goodput and latency have been used as the arguments of Eq. 13, and their relationship, Eq. 14, has been obtained. The coefficients are declared in Fig. 12.

$$ \mathrm{P}\left(x,y\right)=\sum \limits_{k=0}^3\sum \limits_{i=0}^1{p}_{ki}\,{x}^k{y}^i,\kern1em k+i\le 3 $$
(13)
Fig. 12 Online video platform QoE "QS" vs goodput and latency. The model given as Eq. 14 relates the system QoE to goodput and latency and has the following polynomial coefficients: p00 = 2.205, p10 = 1.01, p01 = 0.6451, p20 = 0.7613, p11 = −0.1645, p30 = −0.8037, p21 = −0.2947. Goodness of fit, R-square: 0.07697, RMSE: 0.736

$$ {\mathrm{Q}}_{\mathrm{S}}\left({\mathrm{S}}_{gp},{\mathrm{S}}_l\right)={p}_{00}+{p}_{10}.{\mathrm{S}}_{gp}+{p}_{01}.{\mathrm{S}}_l+{p}_{20}.{{\mathrm{S}}_{gp}}^2+{p}_{11}.{\mathrm{S}}_{gp}.{\mathrm{S}}_l+{p}_{30}.{{\mathrm{S}}_{gp}}^3+{p}_{21}.{{\mathrm{S}}_{gp}}^2.{\mathrm{S}}_l $$
(14)
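The fit of Eq. 14 can be reproduced outside Matlab with ordinary least squares over its seven basis terms. The sketch below is an assumed NumPy equivalent of the curve-fitting step; note that Matlab's surface fitting may centre and scale the inputs, so evaluating the Fig. 12 coefficients on raw goodput and latency values is only indicative.

```python
import numpy as np

def basis(gp: np.ndarray, lat: np.ndarray) -> np.ndarray:
    """Design matrix of Eq. 14's terms: 1, x, y, x^2, xy, x^3, x^2*y."""
    return np.column_stack([np.ones_like(gp), gp, lat, gp ** 2,
                            gp * lat, gp ** 3, gp ** 2 * lat])

def fit_qs(goodput: np.ndarray, latency: np.ndarray, qs: np.ndarray) -> np.ndarray:
    """Least-squares estimate of [p00, p10, p01, p20, p11, p30, p21]."""
    coeffs, *_ = np.linalg.lstsq(basis(goodput, latency), qs, rcond=None)
    return coeffs

# Evaluating Eq. 14 with the coefficients reported in Fig. 12.
P = np.array([2.205, 1.01, 0.6451, 0.7613, -0.1645, -0.8037, -0.2947])

def qs_model(gp: float, lat: float) -> float:
    return float(basis(np.array([gp]), np.array([lat])) @ P)
```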

Providing a distinct understanding of system QoE information will help any online video delivery platform and service provider to take appropriate action regarding the orchestration of their system resources.

The proof-of-concept platform discussed in Section 6 consists of a single network device and a single virtual machine running on a physical server. The streaming capacity of the online video platform, which refers to the obtainable bandwidth that can be served with a tolerable latency while providing an adequate perceived quality, can be declared as a function of the network capabilities, goodput and latency. From the perspective of this paper, Eqs. 13 and 14 present an understanding of QoE in terms of system resource metrics, modelled using the Matlab cubic curve fitting tool [24] with a cubic interpolation configuration [48], based on the subjective user experience records and the objective experiment statistics. The data used as input for the curve fitting tool have been collected through the lab sessions, which are available in the publicly accessible database referenced in Section 5. The polynomial coefficients of the fit function are declared in the information section of Fig. 12, and the variables of these equations are defined in the list of notations (Table 1).

Any QoE degradation that manifests as stalls or long initial buffering durations can be prevented by refining existing resources or providing additional capacity to the system. Real-life scenarios rely on many servers running multiple virtual machine instances and several network peripherals. When proceeding with this kind of experimentation, the empirical validation of the test bed and its parallelism to real-life scenarios should always be carefully considered.

For a given goodput and minimum latency requirement, QoE can be estimated with Eq. 14. Whenever there is more demand for content, there is a corresponding probability of QoE degradation as the load increases. Depending on the advertised service quality, such as basic service (YouTube, Vimeo) or advanced and premium service (Amazon Prime, Netflix, YouTube Premium), the intended and expected QoE levels can be adapted. One important point for the operator is to track the QoE changes over time and decide when to act against QoE degradation by comparing the delta between two instants of the serving period.

This work provides a foundation for the scaling strategies of an online video platform. Whenever there is more demand for video, which corresponds to a relative increase in goodput and latency, Eq. 14 provides the QoE value with respect to the system resources.

11 Conclusion & future works

This work has provided an evaluation methodology for the video delivery system QoE, QS, derived from the single-user QoE, QU, and has shown that modelling is possible through objective video metrics and subjective QoE survey analysis. The system performance parameters goodput and latency can be associated with user experience, provided that a controlled testing environment is available to guarantee reliable network performance measurement when network metrics are introduced into the numerical prediction analysis.

The methodology proposed in this paper can provide a fundamental understanding of how to act against QoE degradation on online video platforms. The paper can serve as a guideline for any network operator on how to manage resources, that is, when to instantiate or terminate the VMs responsible for streaming content, saving cloud budgets and deployment costs while considering QoE.

As an extension of this research work, the implementation of a load balancer with multi-CDN support is planned, while considering cloud computing resource constraints, to cover the wide variety of needs of future online video trends.