1 Introduction

In today’s digital era, video streaming services have become an integral part of our daily lives, providing access to a wide array of content across the globe [1]. Recent data reveals that the global video streaming market achieved a valuation of USD 50.11 billion in 2020, with projections soaring to USD 184.27 billion by 2027, representing a remarkable compound annual growth rate (CAGR) of 19.6% during the period from 2020 to 2027 [2]. Despite this exponential growth, ensuring the delivery of high-quality video streaming services remains a formidable challenge due to the substantial network resources required.

To address this challenge, adaptive bit rate (ABR) streaming has emerged as a potent technique for optimizing video quality while simultaneously conserving bandwidth [3]. ABR streaming involves encoding video content at multiple bit rates and dynamically delivering them to users. The client device then autonomously selects the most suitable bit rate based on network conditions and user preferences, aiming to deliver a personalized viewing experience.

Machine learning, particularly leveraging neural networks, has demonstrated immense potential in enhancing video streaming adaptation. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have risen as prominent choices for video quality adaptation, each possessing unique strengths.

CNNs excel in extracting salient features from video frames, enabling precise prediction of the optimal bit rate for each frame [4]. Recent years have witnessed the development of innovative CNN-based architectures that have delivered substantial advancements in image and video recognition tasks. Consequently, these architectures are well-suited for feature extraction in video quality adaptation applications.

In contrast, RNNs are harnessed to capture temporal dependencies, allowing them to model the evolution of network conditions and user preferences over time, providing a foundation for adaptive streaming that is both responsive and individualized.

In response to the evolving needs of video streaming services, our proposed solution operates primarily on the server side, where it has access to comprehensive data streams, including network traffic metrics, user viewing histories, and video content parameters. From these real-time data it constructs a dynamic quality ladder, which the client then uses to select the most appropriate bitrate, ensuring an optimized streaming experience. The solution is tailored to the Dynamic Adaptive Streaming over HTTP (DASH) protocol, making it directly relevant to current streaming practices. By applying a CNN to discern intricate patterns within the video and an RNN to model temporal shifts in network conditions and user preferences, our approach advances adaptive bitrate strategies: the architecture predicts and adjusts the bitrate dynamically, and the resulting quality ladder adapts to real-time network conditions and user preferences so that each user’s viewing experience remains optimized under fluctuating network conditions.

In this manuscript, we unveil a state-of-the-art architecture that amalgamates CNN and RNN models for real-time video quality adaptation. Our proposed architecture integrates a CNN feature extractor to extract essential features from the video stream and an RNN predictor to estimate the optimal bit rate for each frame. This architecture is specifically optimized for the Dynamic Adaptive Streaming over HTTP (DASH) protocol, reflecting its widespread adoption and standardization in the video streaming industry.

Our proposed architecture delivers the following principal contributions:

  1. We introduce an innovative architecture that blends CNN and RNN models to achieve real-time video quality adaptation, harnessing the strengths of both neural network types.

  2. We present a dedicated CNN feature extractor designed to extract pertinent features from the video stream, facilitating the prediction of the optimal bit rate for each frame.

  3. We introduce an RNN predictor capable of capturing temporal dependencies, thereby modeling the dynamic evolution of network conditions and user preferences in a personalized manner.

The structure of this manuscript is as follows: The Introduction provides the motivation for this study and establishes the context for the proposed architecture. Section 2 offers a review of the literature on video quality adaptation, with a primary focus on machine learning techniques. In Section 3, we detail the CNN feature extractor and RNN predictor that constitute our proposed architecture. Section 4 delves into the methodology employed for model training and testing, utilizing synthetic network traces and user profiles. Section 5 critically evaluates the performance of our architecture in comparison to existing state-of-the-art methods. Finally, Section 6 summarizes our key findings and explores potential avenues for future research.

2 Related work

Adaptive bit rate (ABR) streaming has become a popular solution for optimizing video quality while minimizing bandwidth usage [5]. ABR streaming involves encoding multiple versions of a video at different bit rates and dynamically selecting the optimal bit rate based on the network conditions and user preferences.

Machine learning techniques have been widely used in the field of video quality adaptation to improve the effectiveness of ABR streaming. In particular, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been shown to be effective in capturing the relevant features from video frames and modeling the temporal dependencies of network conditions and user preferences.

CNNs have been used for feature extraction in video quality adaptation tasks [6]. CNN-based architectures, such as the ResNet [7], VGGNet [8], and Inception [9], have achieved significant improvements in image and video recognition tasks, making them ideal for feature extraction in video quality adaptation.

RNNs have been used to capture the temporal dependencies and model the evolution of network conditions and user preferences over time in video quality adaptation tasks [10]. Various forms of RNNs, including long short-term memory (LSTM) [11] and gated recurrent unit (GRU) [12], have been used to model long-term dependencies in video quality adaptation tasks.

Recent research has also focused on combining CNNs and RNNs for video quality adaptation. For example, [13] proposed a hybrid CNN-RNN architecture for video quality adaptation that combines the power of both models to extract features from video frames and model the temporal dependencies of network conditions and user preferences.

To mitigate bandwidth constraints and improve Quality of Experience (QoE) in live video streaming, Ma et al. [14] propose QoE-Aware Adaptive Video Bitrate Aggregation for HTTP Live Streaming (QAVA), a smart edge computing-based bitrate aggregation scheme. Leveraging a novel deep reinforcement learning algorithm, QAVA dynamically adapts client bitrates, enhancing QoE and outperforming traditional bitrate algorithms in both average QoE and fairness, as evidenced by experimental results. Adaptive bitrate streaming (ABS), especially when enhanced with reinforcement learning and network assistance, addresses the challenge of delivering high-quality video streaming over fluctuating wireless networks [15]. That study integrates non-orthogonal multiple access (NOMA) to further improve resource efficiency and QoE; the resulting joint optimization of resource allocation and bitrate adaptation is tackled with a novel algorithm that is shown to enhance QoE and reduce energy consumption under resource constraints.

Zuo et al. [16] question the "average user" assumption underlying many adaptive bitrate (ABR) streaming algorithms, demonstrating through a study with 90 subjects that user preferences vary significantly. They introduce Ruyi, a preference-aware video streaming system that adapts QoE models and ABR algorithms to individual user profiles, optimizing streaming quality without re-training the model for each user; simulation and experimental results show substantial improvements in user QoE, with Ruyi receiving the highest user ratings. Li et al. [17] introduce an ABR algorithm, L-ABR, that uses self-play reinforcement learning to optimize video streaming with small buffers, overcoming the challenge of balancing high bitrate and low stalling in fluctuating network environments. This approach deviates from traditional gradient-updating ABR methods by focusing on the core objectives of streaming with minimal buffering, and in trace-driven experiments it reduces buffer-related issues and enhances average QoE by up to 71.88% and 75.25% on the HSDPA and FCC datasets, respectively.

Tashtarian et al. [18] propose an adaptive framework for live HTTP Adaptive Streaming (HAS) that dynamically optimizes bitrate ladders according to content complexity and network diversity, enhancing viewer QoE while reducing encoding and bandwidth costs, independently of the ABR algorithm and codec; their evaluation shows notable savings in computational and bandwidth resources with maintained QoE. Feng et al. [19] introduce variable-length segmentation in HTTP-based adaptive streaming to enable dynamic bitrate adjustments, leveraging real-time network and viewer data, and report significant QoE improvements and bandwidth savings over traditional HAS solutions. Coelho et al. [20] propose a video bitrate adaptation mechanism designed for overloaded channels that improves QoE by balancing image quality, session continuity, and playtime; the mechanism enhances session QoE in both live and on-demand modes, as validated by a multifactor QoE metric. Ozcelik et al. [21] introduce a deep reinforcement learning framework that jointly optimizes adaptive playback speed and video quality for live streaming, maximizing QoE while preserving low latency and content integrity, and outperforming existing algorithms in real-world simulations.

Other approaches to video quality adaptation using machine learning include reinforcement learning [22], transfer learning [23], and multi-objective optimization [24]. These approaches have shown promising results in improving the effectiveness of video quality adaptation in various settings.

3 Proposed architecture

In this section, we detail our architecture, comprising the CNN feature extractor and RNN predictor, as shown in Fig. 1. Our CNN utilizes ResNet, chosen for its deep residual learning capability, enabling effective feature extraction from video frames. The RNN employs LSTM units, selected for their proficiency in capturing temporal dependencies crucial for modeling video bitrate variations.

Fig. 1: Proposed architecture

3.1 CNN feature extractor

Fig. 2: CNN feature extractor

The proposed architecture includes a CNN feature extractor shown in Fig. 2 that extracts relevant features from the video stream. The CNN feature extractor is composed of several convolutional layers that learn filters to extract low-level and high-level features from the video frames. These initial convolutional layers are designed to capture basic visual elements such as edges and textures. Progressing through the network, the data encounters ResNet blocks that use shortcut connections to bypass one or more layers, enhancing the learning of complex features and enabling the construction of deeper network architectures.

The output of the CNN feature extractor is a feature vector that represents the relevant features of the video frame.

The output of the \(l^{th}\) convolutional layer is denoted as \(x_{l} \in \mathbb {R}^{H_{l} \times W_{l} \times C_{l}}\), where \(H_{l}\), \(W_{l}\), and \(C_{l}\) represent the height, width, and number of channels of the feature map, respectively. The weights of the \(l^{th}\) convolutional layer are denoted as \(w_{l} \in \mathbb {R}^{k_{l} \times k_{l} \times C_{l-1} \times C_{l}}\), where \(k_{l}\) is the kernel size.

The output of the \(l^{th}\) convolutional layer is computed as:

$$\begin{aligned} x_{l} = \sigma (\sum _{i=1}^{C_{l-1}} w_{l,i} * x_{l-1,i} + b_{l}), \end{aligned}$$
(1)

where \(\sigma\) is the activation function, \(*\) denotes the convolution operation, and \(b_{l}\) is the bias term.

As the network deepens, pooling layers consolidate the spatial dimensions of the feature maps, reducing computational load and abstracting the input data. Subsequently, the data passes through fully connected layers that finalize the feature representation for the upcoming predictive tasks. The fully connected head acts as a critical intermediary that transforms the high-dimensional feature maps into a compact feature vector, suitable for temporal analysis by the RNN. This ensures that spatial features relevant to video quality are preserved and efficiently presented to the RNN for time-series prediction, which is vital for the next step in our bitrate prediction process. Our architecture employs a ResNet-based CNN feature extractor, known for its efficacy in image and video recognition tasks. ResNet’s innovation lies in its ‘identity shortcut connections’ that provide alternate paths for gradients during backpropagation, thus mitigating the vanishing gradient problem. Our implementation uses ResNet blocks comprising two \(3 \times 3\) convolutions with equal filter counts, batch normalization, and ReLU activation. Each block adds its input to its output (identity connection), followed by another ReLU activation. This mechanism permits the network to learn identity functions as needed and sustains deep network performance without degradation. The extraction and refinement of features for bitrate prediction are thus enhanced, ensuring robustness even with increased network depth.
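To make the residual block described above concrete, the following sketch shows one possible tf.keras implementation, consistent with the Python/TensorFlow stack used in Section 4.10. The filter count and function name are illustrative, and the identity shortcut assumes the block's input already has the same number of channels as its output; this is a minimal sketch rather than the exact implementation used in our experiments.

```python
# Minimal sketch of a ResNet block: two 3x3 convolutions with batch
# normalization and ReLU, plus an identity shortcut connection.
import tensorflow as tf
from tensorflow.keras import layers

def resnet_block(x, filters):
    """Apply one residual block to tensor x (assumes x already has `filters` channels)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])      # identity shortcut connection
    return layers.Activation("relu")(y)  # final ReLU after the addition
```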

3.2 RNN predictor

The output of the CNN feature extractor is passed to an RNN predictor that captures temporal dependencies and models the evolution of network conditions and user preferences over time. Figure 3 illustrates the RNN predictor that takes as input the feature vectors from the CNN feature extractor and the current network conditions and user preferences, outputting the optimal bit rate for each frame.

Fig. 3: RNN predictor

Let \(x_{t}\) denote the feature vector of the \(t^{th}\) frame, \(n_{t}\) denote the network conditions at time t, and \(u_{t}\) denote the user preferences at time t. The RNN predictor is modeled as:

$$\begin{aligned}{} & {} h_{t} = f_{rnn}(h_{t-1}, [x_{t}, n_{t}, u_{t}]), \end{aligned}$$
(2)
$$\begin{aligned}{} & {} y_{t} = f_{out}(h_{t}), \end{aligned}$$
(3)

where \(h_{t}\) is the hidden state of the RNN at time t, \([x_{t}, n_{t}, u_{t}]\) denotes the concatenated input of the RNN, \(f_{rnn}\) is the RNN function, and \(f_{out}\) is the output function that maps the hidden state \(h_{t}\) to the optimal bit rate \(y_{t}\) for the \(t^{th}\) frame.

The RNN predictor can be modeled using various RNN-based architectures, such as LSTM or GRU. The choice of architecture depends on the specific requirements of the video streaming platform and the complexity of the temporal dependencies.

In the proposed architecture, we use an LSTM-based RNN predictor [25, 26], which has shown to be effective in capturing long-term dependencies in video streaming adaptation tasks. LSTMs are adept at handling sequences with varying lengths and are particularly suited for video data, where the temporal relationship between frames is crucial for predicting future states. They have been employed in numerous video processing tasks, such as video classification and event detection, where the ability to remember and utilize past information is essential. The LSTM architecture uses memory cells to store information over time and gates to control the flow of information.

The LSTM-based RNN predictor is modeled as:

$$\begin{aligned}{} & {} i_{t} = \sigma (W_{ix}x_{t} + W_{ih}h_{t-1} + b_{i}), \end{aligned}$$
(4)
$$\begin{aligned}{} & {} f_{t} = \sigma (W_{fx}x_{t} + W_{fh}h_{t-1} + b_{f}), \end{aligned}$$
(5)
$$\begin{aligned}{} & {} o_{t} = \sigma (W_{ox}x_{t} + W_{oh}h_{t-1} + b_{o}), \end{aligned}$$
(6)
$$\begin{aligned}{} & {} c_{t} = f_{t} \cdot c_{t-1} + i_{t} \cdot \tanh (W_{cx}x_{t} + W_{ch}h_{t-1} + b_{c}), \end{aligned}$$
(7)
$$\begin{aligned}{} & {} h_{t} = o_{t} \cdot \tanh (c_{t}), \end{aligned}$$
(8)

where \(i_{t}\), \(f_{t}\), and \(o_{t}\) are the input, forget, and output gates, respectively, \(c_{t}\) is the memory cell, \(\tanh\) is the hyperbolic tangent function, and W and b are the weights and biases of the LSTM.
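For clarity, the following NumPy sketch spells out one LSTM step exactly as written in Eqs. (4)-(8). The dictionary-based weight layout and variable names are illustrative; in practice the predictor is realized with a standard LSTM layer configured as in Table 6.

```python
# NumPy sketch of a single LSTM step following Eqs. (4)-(8).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """x_t is the concatenated input [x_t, n_t, u_t]; W and b hold the gate weights/biases."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])    # input gate, Eq. (4)
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])    # forget gate, Eq. (5)
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])    # output gate, Eq. (6)
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])  # Eq. (7)
    h_t = o_t * np.tanh(c_t)                                    # hidden state, Eq. (8)
    return h_t, c_t
```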

3.3 Real-time video quality adaptation

The proposed architecture is designed for real-time video quality adaptation, where the optimal bit rate for each segment is predicted based on the current network conditions and user preferences. The proposed architecture dynamically adjusts the bit rate of the video stream to provide the best possible quality while minimizing bandwidth usage.

The optimal bit rate for each segment \(y_{t}\) is computed as:

$$\begin{aligned} y_{t} = \mathop {\text {argmax}}\limits _{y \in Y} P(y|x_{t}, n_{t}, u_{t}), \end{aligned}$$
(9)

where Y is the discrete set of pre-encoded segment bitrates available on the server, representing the different quality levels at which the video content is stored. \(P(y|x_{t}, n_{t}, u_{t})\) is the probability of bit rate y given the feature vector \(x_{t}\), network conditions \(n_{t}\), and user preferences \(u_{t}\).
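The selection rule in Eq. (9) reduces to picking the highest-scoring rung of the server-side quality ladder, as the following minimal sketch illustrates. The ladder values and probabilities are illustrative placeholders, not the ladders used in our experiments.

```python
# Minimal sketch of Eq. (9): choose the ladder bitrate with the highest predicted probability.
import numpy as np

QUALITY_LADDER_KBPS = [300, 750, 1200, 2400, 4800, 8000]  # illustrative pre-encoded bitrates Y

def select_bitrate(probabilities):
    """probabilities: model output P(y | x_t, n_t, u_t), one entry per ladder rung."""
    return QUALITY_LADDER_KBPS[int(np.argmax(probabilities))]

# Example output of the RNN predictor for one segment:
p = np.array([0.02, 0.05, 0.10, 0.55, 0.20, 0.08])
print(select_bitrate(p))  # -> 2400
```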

The proposed architecture combines the power of CNN and RNN models for real-time video quality adaptation. The CNN feature extractor extracts relevant features from the video segments, while the RNN predictor captures temporal dependencies and models the evolution of network conditions and user preferences over time.

4 Experiment setup

In this section, we delve deeper into the experiment setup, elaborating on the dataset composition, network conditions, user preferences, evaluation metrics, and benchmark methods.

4.1 Dataset

For our experiments, we utilize two extensive video datasets: VideoSet1 [27] and VideoSet2 [28]. These datasets encompass a wide range of video content types, such as sports events, news broadcasts, movies, and documentaries. They feature various spatial and temporal complexities, resolution levels, and durations, making them ideal for testing adaptive streaming algorithms.

Table 1 presents the dataset statistics and content types for VideoSet1 and VideoSet2. The combined dataset consists of 150 videos, totaling 15 h of content, providing a robust sample for the experiments.

Figures 21 and 22 in Appendix A show histograms for VideoSet1 and VideoSet2, covering video streaming parameters such as bitrate, frame rate, resolution, codec type, buffering ratio, user engagement, audio quality, latency, content type, and content duration, providing a comprehensive view of the distribution and frequency of each parameter’s values.

4.2 Video frame processing for CNN feature extraction in our architecture

In our architecture, the preparation of video frames for Convolutional Neural Network (CNN) feature extraction follows a systematic pipeline. We commence with the collection of diverse video datasets, encompassing a range of content types, such as sports events, news broadcasts, movies, documentaries, animation, concerts, educational content, and TV shows. To ensure uniformity, videos are standardized into consistent formats like MP4 or MKV. Based on our research objectives and computational resources, video clips are sampled and segmented. Frames are then extracted from these clips at a stable frame rate to form an image sequence. These frames undergo resolution normalization, often resized to 720p or 1080p, for consistent CNN input. Data augmentation techniques are applied to augment the training data, enhancing the model’s robustness. In cases involving specific video segments, temporal segmentation is performed. Quality assessment metrics, including bitrate, codec, buffering ratio, and audio quality, are recorded for evaluation. Metadata, containing content type, duration, and user engagement scores, is also collected. Finally, the dataset is systematically split into training, validation, and testing sets. This comprehensive processing pipeline ensures that video frames are optimally prepared for subsequent CNN feature extraction in our research architecture.
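The frame extraction and resolution-normalization steps of this pipeline can be condensed into the short OpenCV sketch below. The file path, target resolution, and sampling interval are illustrative assumptions, not the exact settings used for the datasets.

```python
# Condensed sketch of the frame-preparation steps: decode, sample, resize, normalize.
import cv2

def extract_frames(video_path, target_size=(1280, 720), every_n_frames=30):
    """Decode a video, keep every n-th frame, and resize it for CNN input."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            frame = cv2.resize(frame, target_size)       # resolution normalization (e.g., 720p)
            frame = frame.astype("float32") / 255.0      # scale pixel values to [0, 1]
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```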

Table 1 Dataset statistics and content types

4.3 User preferences

In Table 2, we present values for user preferences, which play a vital role in our video streaming optimization system. These preferences encompass factors such as brightness, volume, resolution, and buffer threshold. For instance, the brightness value, set at 0.75, represents the user’s desired brightness level during video playback. Similarly, the volume preference is set to 0.85, indicating the desired audio volume. Resolution, a crucial factor affecting video quality, is set to 1080p, representing the user’s preference for high-definition content. The buffer threshold of 10 s signifies the user’s tolerance for buffering delays. These values are not derived from real-world user data but are instead based on synthesized user preference models from the studies in [29,30,31]. They serve as input parameters that guide our system in dynamically adapting the video streaming quality to align with these synthesized user preferences.

Table 2 Values for user preferences

4.4 Network conditions

Table 3 provides values representing network conditions, a critical aspect of our video streaming optimization system. These conditions include parameters like bandwidth, latency, jitter, packet loss, and overall network quality. For instance, a bandwidth of 20 Mbps signifies the available network bandwidth for streaming. The latency value of 50 ms represents the round-trip delay in network communication. Jitter, at 5 ms, accounts for the variability in latency. A packet loss rate of 0.2% indicates the likelihood of data loss during transmission. The network quality, labeled as "Excellent," characterizes the overall network performance. It is important to clarify that these values are not extracted from real-world data but are derived from network condition models established in the studies [32,33,34]. They are utilized to create a simulated environment that guides our system in making adaptive streaming decisions that reflect a range of potential real-world network scenarios.

Table 3 Values for network conditions
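As a minimal illustration of how the synthesized profiles in Tables 2 and 3 can be represented and perturbed during simulation, the sketch below draws network states around the baseline values. The distributions and spreads are illustrative assumptions and do not reproduce the exact models of [29-34].

```python
# Sketch of synthesized user-preference and network-condition profiles (illustrative values).
import random

BASE_USER_PREFERENCES = {"brightness": 0.75, "volume": 0.85,
                         "resolution": "1080p", "buffer_threshold_s": 10}

BASE_NETWORK_CONDITIONS = {"bandwidth_mbps": 20.0, "latency_ms": 50.0,
                           "jitter_ms": 5.0, "packet_loss_pct": 0.2}

def sample_network_conditions(base=BASE_NETWORK_CONDITIONS):
    """Draw one simulated network state around the baseline values of Table 3."""
    return {
        "bandwidth_mbps": max(0.5, random.gauss(base["bandwidth_mbps"], 5.0)),
        "latency_ms": max(1.0, random.gauss(base["latency_ms"], 15.0)),
        "jitter_ms": max(0.0, random.gauss(base["jitter_ms"], 2.0)),
        "packet_loss_pct": min(5.0, max(0.0, random.gauss(base["packet_loss_pct"], 0.1))),
    }
```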

In Table 4, we detail the values for the feature maps extracted from our CNN feature extractor. These feature maps represent the intermediate representations of video frames and play a crucial role in capturing spatial information. The table lists the number of feature maps for each convolutional layer in the CNN architecture. For example, the first convolutional layer has 64 feature maps, while subsequent layers exhibit increased complexity with 128, 256, and 256 feature maps, respectively. These feature maps serve as input to our RNN predictor, enabling the model to understand the visual content of video frames and make informed bitrate decisions.

Table 4 Values for feature maps

4.5 CNN details

Table 5 outlines the hyperparameters governing our CNN feature extractor. These hyperparameters dictate the architecture and training of the convolutional neural network responsible for feature extraction. Key hyperparameters include the learning rate (0.001), batch size (32), number of training epochs (10), dropout rate (0.5), kernel size (3×3), and the number of filters for each convolutional layer (64, 128, 256, 256). Additionally, the activation function is set to ReLU, and weight initialization follows the He initialization scheme. These hyperparameters are crucial in ensuring the effective extraction of informative features from input video frames.

Table 5 Hyperparameters for CNN feature extractor
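The sketch below assembles a feature extractor with the Table 5 settings (64/128/256/256 filters, 3x3 kernels, ReLU, He initialization, dropout 0.5). The input resolution, pooling placement, and 256-dimensional output are illustrative assumptions; the learning rate and batch size from Table 5 are applied during training (Section 4.7).

```python
# Sketch of the CNN feature extractor configured with the Table 5 hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_feature_extractor(input_shape=(720, 1280, 3), feature_dim=256):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (64, 128, 256, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          kernel_initializer="he_normal")(x)
        x = layers.MaxPooling2D()(x)               # consolidate spatial dimensions
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(feature_dim, activation="relu")  # compact feature vector for the RNN
    return tf.keras.Model(inputs, outputs(x), name="cnn_feature_extractor")
```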

4.6 RNN details

Table 6 presents the hyperparameters governing our RNN predictor, which models temporal dependencies and bitrate predictions over time. These hyperparameters include the learning rate (0.001), batch size (16), number of training epochs (20), LSTM units (256), dropout rate (0.5), activation function (Sigmoid), and weight initialization (Glorot Initialization). The learning rate controls the step size during optimization, while batch size affects training efficiency. LSTM units determine the complexity of the recurrent layers, and dropout mitigates overfitting. The activation function and weight initialization influence the model’s non-linearity and parameter initialization strategies. These hyperparameters collectively enable the RNN predictor to capture temporal patterns and make bitrate predictions.

Table 6 Hyperparameters for RNN predictor
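Analogously, the RNN predictor can be assembled with the Table 6 settings (256 LSTM units, dropout 0.5, sigmoid output, Glorot initialization, learning rate 0.001). The per-step input dimension and ladder size below are illustrative, assuming a 256-dimensional CNN feature vector concatenated with a handful of network-condition and user-preference features.

```python
# Sketch of the LSTM-based RNN predictor configured with the Table 6 hyperparameters.
import tensorflow as tf
from tensorflow.keras import layers

def build_rnn_predictor(input_dim=262, ladder_size=6):
    """Input per time step: CNN feature vector concatenated with n_t and u_t."""
    inputs = tf.keras.Input(shape=(None, input_dim))        # variable-length sequences
    x = layers.LSTM(256, dropout=0.5,
                    kernel_initializer="glorot_uniform")(inputs)
    outputs = layers.Dense(ladder_size, activation="sigmoid")(x)  # score per ladder rung
    model = tf.keras.Model(inputs, outputs, name="rnn_predictor")
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model
```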

4.7 Training process

The training of our video streaming optimization system involves the fine-tuning of two main components: the Convolutional Neural Network (CNN) feature extractor and the Recurrent Neural Network (RNN) predictor. This process is crucial for enabling the system to adapt video streaming quality in real-time. We follow a series of well-defined steps:

Data Preparation: We assemble a dataset comprising video streams with varying content and quality levels. Each video stream is segmented into frames, and relevant features, such as user preferences and network conditions, are extracted. These features serve as input to our system.

CNN Feature Extractor Training: The CNN is trained to distill spatial features from video frames critical for quality adaptation. Training details, such as the optimizer, learning rate, and architectural specifics, are delineated in Table 5. We highlight the choice of He initialization due to its efficacy in maintaining the variance of activations, ensuring stable gradients for deep architectures.

RNN Predictor Training: The RNN is tasked with temporal modeling and bitrate prediction. The predictor’s training regimen, including the optimizer and LSTM configurations, is concisely summarized in Table 6. The selection of Glorot initialization is particularly suitable for the RNN due to its capacity to handle weight scales in networks with sigmoidal activations.

Fig. 4: Refined MSE and RMSE evolution over 100 epochs

Figure 4 represents the performance of our RNN model with respect to Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). These metrics are vital in our context as they quantify the accuracy of bitrate prediction, a critical factor in video streaming quality. A lower MSE or RMSE indicates closer alignment of predicted bitrates with actual user data, suggesting a potential for higher video quality and fewer streaming artifacts such as buffering and pixelation, which are known to affect user satisfaction. By demonstrating a consistent decrease in MSE and RMSE during training, we provide evidence of the model’s refinement in predicting bitrates that aim to maintain the fidelity of the video while accommodating the dynamic network conditions. This iterative improvement is a proxy for the user’s quality of experience (QoE), as it directly impacts the visual and playback quality perceived by users. Thus, while MSE and RMSE do not measure user satisfaction directly, they are indicative of the model’s capability to deliver a streaming experience that aligns with user expectations and preferences, ultimately contributing to user satisfaction.

Iterative Optimization: The training process involves iterative optimization of both the CNN feature extractor and the RNN predictor. This iterative approach allows the system to refine its feature extraction capabilities and predictive accuracy.

Convergence and Validation: We monitor the convergence of both components using validation datasets. The performance metrics include accuracy and loss for the RNN predictor and feature map sizes for the CNN. This validation ensures that the system’s predictions align with ground truth and that the feature extractor learns meaningful spatial representations.

Testing and Evaluation: After training, the system is evaluated on test datasets that simulate real-world video streaming scenarios. The system’s adaptive streaming decisions are assessed for their alignment with user preferences, network conditions, and content quality.

Fig. 5: Actual vs predicted bitrates for different scenarios

Table 7 User preferences and network conditions for streaming scenarios

Figure 5 presents the temporal progression of both the predicted and actual bit rates over a span of time steps, generally denoting video frames or distinct time intervals. The various streaming scenarios, such as “High-Quality Streaming” and “Mobile Streaming”, each have their unique predicted and actual bit rates. For instance:

The predicted bit rates for “High-Quality Streaming” are displayed using a solid blue line with square markers, whereas the actual rates use a dashed blue line with diamond markers. The “Mobile Streaming” scenario uses red lines, with the predicted bit rates represented by a dashed line with square markers and the actual rates by a solid line with diamond markers. The “Crowded Network Streaming” predictions are visualized using a purple dashed line with square markers, and the actuals with a solid purple line accompanied by diamond markers. Lastly, “Low-Quality Network Streaming” predictions are represented by a green dashed line with square markers, while the actual rates are depicted using a solid green line with diamond markers. These bit rates are forecasted by the RNN predictor that takes into account both the network conditions and user preferences, as detailed in Table 7. The actual bit rates act as the observed or ground truth values. The graph highlights the deviations and oscillations between the predicted and actual curves, offering insight into the RNN predictor’s adeptness in modifying its forecasts in response to evolving conditions.

Fig. 6: Correlation analysis of user preferences, network conditions, and predicted bit rate for video streaming

Fig. 6 illustrates the relationships between input features, encompassing user preferences, network conditions, and the corresponding predicted bit rates in a video streaming scenario. Each row in the heatmap represents a specific input feature, while the color intensity signifies the strength and direction of correlation with the predicted bit rate. The warmer colors (shades of red) indicate positive correlations, implying that as certain input features increase, so does the predicted bit rate. Conversely, cooler colors (shades of blue) signify negative correlations, suggesting that certain input features’ increments lead to a decrease in the predicted bit rate.

In the training process of our video streaming optimization system, one of the primary metrics of consideration is the loss function. It is presented as the Mean Squared Error (MSE) loss, which is defined as follows:

$$\begin{aligned} \text {MSE} = \frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2 \end{aligned}$$
(10)

where n is the number of samples, \(y_i\) is the true bitrate for the \(i^{th}\) video segment, and \(\hat{y}_i\) is the predicted bitrate by our model for the \(i^{th}\) segment.
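The loss in Eq. (10) and the per-epoch curves discussed below can be reproduced with a few lines of code. The array and model names below are illustrative placeholders for the synthetic training data, not fixed identifiers from our codebase.

```python
# Minimal sketch of the MSE loss of Eq. (10) and of tracking training/validation loss.
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted segment bitrates, Eq. (10)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

# With a compiled Keras model (e.g., the predictor sketched in Section 4.6), the
# epoch-wise curves plotted in Fig. 7 correspond to the training history:
# history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
#                     epochs=100, batch_size=16)
# train_loss, val_loss = history.history["loss"], history.history["val_loss"]
```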

The plot in Fig. 7 provides a clear visualization of how the CNN model’s loss decreased over the course of 100 epochs for both training and validation sets. Specifically, the Mean Squared Error (MSE) loss function was employed due to its widespread use in regression tasks and its ability to effectively quantify the differences between the predicted and actual video streaming bitrates. As the figure illustrates, the training loss demonstrates a consistent downward trend, suggesting that the model was learning and improving its predictions with each successive epoch. Simultaneously, the validation loss, which serves as an indication of the model’s generalization capabilities on unseen data, also decreased, but it plateaued toward the latter epochs. This behavior underscores the importance of monitoring both training and validation losses, as solely relying on the training loss could lead to over-optimistic evaluations.

Fig. 7: Training and validation loss evolution over 100 epochs

In our study, we analyze the 704 feature maps generated by our Convolutional Neural Network (CNN) architecture. A sample of these maps is visualized as a grid of 16 subfigures in Fig. 8, offering a granular view of the network’s internal feature representation. With a substantial number of feature maps distributed across the layers of the network, we gain deeper insight into the hierarchical feature learning process. Collectively, the 704 feature maps capture an extensive spectrum of information within the input data, from simple low-level features such as edges, textures, and color variations to the more complex and abstract high-level features that are vital for recognition tasks. By scrutinizing the intensity patterns within these maps, we can observe how the network progressively refines and abstracts features as information flows deeper into the network. The synthetic data used to generate these feature maps keeps the analysis grounded in the scenarios considered in our experiments. This exploration of the feature maps provides a comprehensive picture of our CNN’s feature extraction capabilities.

Fig. 8: Spatial activation patterns across feature maps derived from video frame analysis

In our Convolutional Neural Network (CNN) architecture, the size and number of feature maps play a crucial role in capturing and representing hierarchical features from the input data. The initial input layer, with a spatial resolution of 10×10 pixels and 3 color channels, forms the basis for subsequent layers of feature extraction. Throughout the convolutional layers, we maintain the spatial dimensions, denoted as \(H_l\) and \(W_l\), at 10×10 to preserve the spatial information while increasing the number of channels (\(C_l\)) at each layer. Starting with 16 channels in Convolutional Layer 1 (Conv1), we double the count to 32 in Convolutional Layer 2 (Conv2), then to 64 in Conv3, and finally to 128 channels in Conv4 as summarized in Table 8. This architectural design, characterized by consistent spatial dimensions alongside an increasing number of channels, allows our network to extract detailed and abstract features, enabling it to detect patterns of varying complexity. The deliberate structuring of these feature maps is instrumental to the network’s ability to understand and represent the most salient features of the input data, thereby enhancing the performance of our CNN in deep learning tasks.

Table 8 Feature map sizes in the CNN feature extractor

4.8 Evaluation metrics

To thoroughly assess the performance of the proposed architecture, we employ a set of evaluation metrics that capture different aspects of video streaming quality and user experience. These metrics include:

  1. Average bitrate: This metric represents the average bitrate of the video stream and is a key indicator of video quality. A higher average bitrate generally implies better video quality but may also result in increased bandwidth consumption. The average bitrate is calculated using the following formula:

    $$\begin{aligned} \text {Average bitrate} = \frac{\sum _{t=1}^{T} y_{t}}{T} \end{aligned}$$
    (11)

    where \(y_{t}\) denotes the bitrate for frame t, and T represents the total number of frames. By monitoring the average bitrate, we ensure that our proposed architecture maintains a balance between video quality and bandwidth consumption.

  2. Rebuffering ratio: This metric calculates the ratio of the total rebuffering time to the total playback time, acting as an indicator of playback smoothness and user experience. Frequent rebuffering events negatively impact the user experience by causing interruptions in video playback. The rebuffering ratio is calculated using the following formula:

    $$\begin{aligned} \text {Rebuffering ratio} = \frac{\sum _{t=1}^{T} r_{t}}{\sum _{t=1}^{T} p_{t}} \end{aligned}$$
    (12)

    where \(r_{t}\) signifies the rebuffering time at frame t, and \(p_{t}\) indicates the playback time at frame t. A lower rebuffering ratio indicates a smoother playback experience with fewer interruptions.

  3. Quality of Experience (QoE): This composite metric combines video quality, rebuffering, and latency to provide a comprehensive indicator of overall user satisfaction. The QoE is calculated using the following formula:

    $$\begin{aligned} \text {QoE} = \alpha \cdot \text {Average bitrate} - \beta \cdot \text {Rebuffering ratio} - \gamma \cdot \text {Latency} \end{aligned}$$
    (13)

    In the QoE equation, \(\alpha\), \(\beta\), and \(\gamma\) are weighting factors that are adjusted based on user preferences or the specific requirements of the application. These factors determine the relative importance of average bitrate, rebuffering ratio, and latency in the overall QoE score. Different applications may have different priorities, and adjusting the weighting factors allows the QoE metric to reflect those priorities accurately. A worked sketch combining Eqs. (11)-(13) is given at the end of this subsection.

  4. Switching frequency: This metric represents the number of times the bitrate changes within the video stream. Frequent bitrate switches can be disruptive to the user experience, as the quality may change abruptly. A lower switching frequency is generally preferred to ensure a more consistent video quality. By tracking switching frequency, we assess the stability of our proposed architecture in maintaining consistent video quality.

  5. Latency: Latency measures the time delay between video frames being captured and displayed on the user’s device. High latency can result in a poor user experience, particularly for real-time applications such as live streaming or video conferencing. The proposed architecture should minimize latency while maintaining good video quality and efficient resource utilization.

  6. Scalability test: To evaluate the scalability of the proposed architecture, we conduct experiments with an increasing number of concurrent users or video streams. The test involves the following steps:

    (a) Set the number of concurrent users or video streams to a minimum value (e.g., 1 user or stream).

    (b) Measure the average bitrate, rebuffering ratio, switching frequency, and latency for each method.

    (c) Increase the number of concurrent users or video streams (e.g., by increments of 5 or 10).

    (d) Repeat steps (b) and (c) until a maximum threshold of concurrent users or video streams is reached (e.g., 100 users or streams).

By employing these evaluation metrics, we effectively measure the performance of the proposed architecture in various aspects, such as video quality, playback smoothness, user satisfaction, consistency, and latency. This comprehensive assessment allows us to compare our approach with existing state-of-the-art methods for video quality adaptation and make informed decisions on its effectiveness and potential improvements.
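As promised above, the following minimal sketch brings Eqs. (11)-(13) together. The weighting factors are illustrative placeholders; in practice they are tuned to the priorities of the target application.

```python
# Minimal sketch of the evaluation metrics in Eqs. (11)-(13).
def average_bitrate(bitrates):
    return sum(bitrates) / len(bitrates)                        # Eq. (11)

def rebuffering_ratio(rebuffer_times, playback_times):
    return sum(rebuffer_times) / sum(playback_times)            # Eq. (12)

def qoe(bitrates, rebuffer_times, playback_times, latency_ms,
        alpha=1.0, beta=100.0, gamma=0.1):
    """QoE per Eq. (13); alpha, beta, and gamma are illustrative weighting factors."""
    return (alpha * average_bitrate(bitrates)
            - beta * rebuffering_ratio(rebuffer_times, playback_times)
            - gamma * latency_ms)
```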

4.9 Comparison with existing methods

To evaluate the effectiveness of the proposed architecture, we compare its performance against several existing state-of-the-art methods for video quality adaptation, such as:

  • Method 1 [35]: A popular adaptive bitrate algorithm that optimizes video quality based on network conditions.

  • Method 2 [36]: A machine-learning-based approach that predicts future network conditions to improve video streaming performance.

  • Method 3 [37]: A reinforcement learning-based method that dynamically adapts video quality according to the changing network conditions and user preferences.

  • Method 4 [14]: The method employs a Deep Reinforcement Learning (DRL)-based algorithm within smart edge computing for adaptive video bitrate aggregation in live streaming, enhancing user QoE by dynamically adapting bitrates based on network conditions, client states, and video characteristics.

  • Method 5 [21]: A deep reinforcement learning framework to optimize adaptive playback speed and video quality for live streaming, maximizing QoE while preserving low latency and content integrity.

We present the results of our experiments in the following section, comparing the proposed architecture with these existing methods in terms of the evaluation metrics previously described. This comparison allows us to demonstrate the advantages and potential improvements offered by our architecture over the current state-of-the-art techniques.

4.10 Implementation details

In order to ensure a fair comparison, we implement the proposed architecture and the existing methods using the same simulation environment and software platform. The implementation is carried out in Python, utilizing popular libraries such as NumPy, TensorFlow, and OpenCV for efficient matrix operations, machine learning models, and video processing, respectively.

We perform the experiments on a high-performance computing cluster with multiple GPUs, allowing us to parallelize the processing of video frames and reduce the overall computation time. We also implement a custom video player to simulate the playback and buffer management of streaming videos, enabling the accurate evaluation of rebuffering ratios and user experiences.

For training the machine learning and reinforcement learning components of the proposed architecture and of Methods 2 and 3, we use 80% of the synthetic network traces and user profiles as training data and the remaining 20% for validation.
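The split itself is straightforward; a minimal sketch is shown below. The shuffling seed and variable names are illustrative.

```python
# Sketch of the 80/20 split of synthetic network traces and user profiles.
import random

def train_val_split(traces, train_fraction=0.8, seed=42):
    """Shuffle the synthetic traces and split them into training and validation sets."""
    traces = list(traces)
    random.Random(seed).shuffle(traces)
    cut = int(len(traces) * train_fraction)
    return traces[:cut], traces[cut:]
```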

5 Results

5.1 Average bitrate

Figure 9 shows the average bitrate achieved by our proposed architecture and the existing methods for different user types. It can be observed that the proposed architecture delivers higher average bitrates compared to the existing methods. Notably, the average bitrate is increased significantly by 37.14%, 26.31%, 14.28%, 10.5%, and 9.37% compared to Method 1, Method 2, Method 3, Method 4, and Method 5 respectively for user A. This improvement is attributed to the effective feature extraction, adaptive prediction based on network conditions, user preference integration, and improved accuracy with our proposed architecture. This indicates that our approach is better at adapting to the available network capacity and user preferences to provide higher video quality. The notable increase in average bitrate by our architecture is attributed to the advanced spatial-temporal feature extraction enabled by the synergistic use of CNNs for spatial detail and RNNs for temporal prediction, ensuring optimal utilization of network capacity tailored to content complexity and user preferences.

5.2 Rebuffering ratio

Figure 10 displays the rebuffering ratios obtained by our proposed architecture and the existing methods for different user types. Remarkably, the rebuffering ratio is reduced by 46.6%, 33.33%, 27.27%, 17.0%, and 9% compared to Method 1, Method 2, Method 3, Method 4, and Method 5 respectively for user A. This reduction is attributed to our proposed architecture’s ability to accurately select optimal bit rates, efficiently utilize available bandwidth, and improve prediction accuracy. The results demonstrate its effectiveness in maintaining smooth playback under various network conditions and user preferences. Our architecture’s significant reduction in rebuffering ratio is due to its predictive capabilities that leverage LSTM networks, enabling a forward-looking adjustment of bitrates that anticipates network fluctuations, thereby ensuring uninterrupted streaming.

5.3 Switching frequency

Figure 11 presents a plot of the switching frequency results, with the x-axis representing different methods and the y-axis representing the average number of switches.

Fig. 9: Average bitrate comparison between the proposed architecture and existing methods

Fig. 10: Rebuffering ratio comparison between the proposed architecture and existing methods

Fig. 11: Comparison of switching frequency between the proposed architecture and other state-of-the-art methods

Our proposed architecture demonstrates a lower switching frequency compared to the other methods. Particularly, it shows 37.5%, 58.33%, 50%, 40%, and 28% less switching compared to Method 1, Method 2, Method 3, Method 4, and Method 5 respectively, indicating that it maintains a more consistent video quality during playback. This result suggests that the proposed architecture is effective in achieving adaptive bitrate streaming with minimal quality fluctuations. The lower switching frequency of our proposed architecture demonstrates its ability to maintain a stable video quality, with our CNN-RNN combination effectively predicting the optimal bitrate trajectory that minimizes quality fluctuations.

5.4 Latency

Figure 12 shows a plot of the average latency results, with the x-axis representing different methods and the y-axis representing the average latency (in milliseconds).

Fig. 12: Comparison of latency between the proposed architecture and other state-of-the-art methods

The proposed architecture achieves lower latency than the other methods. Precisely, it reduces latency by 28.57%, 50%, 44.44%, 37.75%, and 26.47% compared to Method 1, Method 2, Method 3, Method 4, and Method 5, respectively, indicating a better user experience, especially for real-time applications. This result supports the effectiveness of the proposed architecture in reducing latency and enhancing the overall quality of experience. The improvement in latency is a direct result of our model’s real-time processing efficiency and its optimized handling of the DASH protocol, ensuring swift adaptation to changing network conditions without compromising the user experience.

5.5 Impact of network conditions

To further investigate the performance of our proposed architecture under varying network conditions, we conducted additional experiments with different levels of bandwidth fluctuations, packet loss, and latency. The results are presented in Figs. 13, 14, and 15, respectively.

5.5.1 Bandwidth fluctuations

Figure 13 shows the QoE scores achieved by our proposed architecture and the existing methods under different levels of bandwidth fluctuations. The results demonstrate that the proposed architecture outperforms the existing methods across all levels of bandwidth fluctuations, by 16.66%, 12.9%, 7.5%, 4.1%, and 5.88% compared to Method 1, Method 2, Method 3, Method 4, and Method 5, respectively. Our approach ensures optimal video quality, minimizing interruptions and providing a smooth viewing experience, ultimately resulting in a high QoE. Under variable network conditions, our architecture consistently outperforms the others due to its robust feature extraction and the LSTM’s adept handling of temporal dependencies, ensuring that fluctuations in bandwidth and packet loss minimally impact the viewing experience.

5.5.2 Packet loss

Figure 14 presents the QoE scores achieved by our proposed architecture and the existing methods under different levels of packet loss. The proposed architecture outperforms the existing methods across all packet loss levels. Under high packet loss, it increases the QoE score by 15.5%, 9.3%, 6.3%, 3.2%, and 6.9% compared to Method 1, Method 2, Method 3, Method 4, and Method 5, respectively, highlighting its ability to maintain video quality and user experience even in the presence of packet loss. The superior performance of our architecture in lossy environments is due to its LSTM-based RNN predictor, which compensates for lost information by exploiting its long-term memory, thereby mitigating the impact of packet loss on video quality and ensuring a consistent QoE.

Fig. 13: QoE scores under different levels of bandwidth fluctuations

Fig. 14: QoE scores under different levels of packet loss

5.5.3 Latency

Figure 15 displays the QoE scores achieved by our proposed architecture and the existing methods under different levels of latency.

Again, the proposed architecture performs well across all latency levels. Under high latency, the QoE score is higher by 15.25%, 11.47%, 6.25%, 3.03%, and 5.8% compared to Method 1, Method 2, Method 3, Method 4, and Method 5, respectively. This result highlights the effectiveness of our approach in adapting to various latency conditions while maintaining high video quality and user experience. The improvement in QoE across different latency levels is directly attributable to our model’s efficient real-time processing and its optimization for the DASH protocol, which together enable prompt adaptation to fluctuating network conditions and thereby enhance the user’s viewing experience.

Fig. 15: QoE scores under different levels of latency

5.6 Impact of user preferences

We also investigated the performance of our proposed architecture in relation to various user preferences. Figure 16 shows the QoE scores achieved by our proposed architecture and the existing methods for different user types. The proposed architecture achieves higher QoE scores than the existing methods for all user types. For User C, our approach increases the QoE score by 12.5%, 9%, 5.8%, 2.9%, and 4.16% compared to Method 1, Method 2, Method 3, Method 4, and Method 5, respectively, demonstrating its ability to cater to diverse user preferences effectively. Our model’s personalization capabilities, which integrate user preferences into the bitrate selection process, contribute to the observed improvements in QoE scores, reflecting a streaming experience that is tailored to individual viewer requirements.

Fig. 16: QoE scores for different user types

5.7 Scalability test

The average bitrate, rebuffering ratio, switching frequency, and latency for varying numbers of concurrent users or video streams are as follows:

Fig. 17: Scalability test results for the proposed architecture in terms of average bitrate

Fig. 18: Scalability test results for the proposed architecture in terms of rebuffering ratio

Fig. 19: Scalability test results for the proposed architecture in terms of switching frequency

Fig. 20: Scalability test results for the proposed architecture in terms of latency

As the number of concurrent users or video streams increases, the average bitrate of the proposed architecture slightly decreases. However, the reduction in bitrate is relatively small, indicating that the proposed architecture is capable of maintaining a high video quality even when the system is under heavy load.

The scalability tests shown in Figs. 17, 18, 19, and 20 demonstrate the performance of the proposed architecture for varying numbers of concurrent users or video streams.

Figure 17 shows how the average bitrate in kbps decreases as the number of concurrent users/streams increases. This demonstrates the effectiveness of the proposed architecture in adapting to changing network conditions and providing a cost-effective and scalable solution for video streaming services. These findings can be used to inform the design and implementation of adaptive bitrate algorithms and video streaming services, and can help to improve the performance and user satisfaction of these systems.

In terms of rebuffering ratio, the results show that the proposed architecture is capable of handling a large number of concurrent users or video streams without a significant increase in rebuffering ratio. As shown in Fig. 18, the rebuffering ratio remains relatively stable and low as the number of concurrent users or video streams increases up to 100.

Similarly, the switching frequency results, as shown in Fig. 19, indicate that the proposed architecture is capable of handling a large number of concurrent users or video streams with a low number of switches. The number of switches increases gradually as the number of concurrent users or video streams increases, but it remains relatively low throughout the scalability tests.

Lastly, the latency results, as shown in Fig. 20, indicate that the proposed architecture provides low latency for a large number of concurrent users or video streams. The latency increases slightly as the number of concurrent users or video streams increases, but it remains within an acceptable range for streaming video content.

These scalability tests demonstrate that the proposed architecture is capable of handling a large number of concurrent users or video streams with low rebuffering ratio, low switching frequency, and low latency. These results are promising for real-world streaming video applications that require high scalability and low latency.

6 Conclusion

In this paper, we presented a novel adaptive video streaming architecture based on advanced machine learning techniques that significantly improves the QoE for users under various network conditions and user preferences. The proposed architecture demonstrated superior performance in comparison to existing state-of-the-art methods. Our experimental results show that the proposed architecture achieves significant improvements in video quality while minimizing bandwidth usage and providing a better user experience. Specifically, our architecture achieved a 37.1% improvement in average bit rate, reduced rebuffering by 87.5%, and improved QoE by 16.6%. These results demonstrate that our proposed architecture optimizes video quality, reduces bandwidth usage, and enhances the user experience.

Future work could focus on further refining the machine learning models employed in the architecture, incorporating additional contextual information, such as device capabilities, screen resolution, and user viewing habits, to further personalize the streaming experience. Additionally, the proposed architecture could be extended to support emerging video formats, such as high dynamic range (HDR) and 360-degree videos, which pose unique challenges and opportunities for adaptive streaming.