1 Introduction

Video streaming has become increasingly popular, especially since the COVID-19 pandemic, due to the rising need for distance learning, working from home, business and government virtual meetings, and one-to-one video calling. Moreover, video surveillance systems nowadays rely mostly on digital communication, and this traffic may share the network with other communication types, as happens in home networks. Video data generally consumes very high network and server bandwidth. The situation becomes worse when multiple parties sharing the same network need to use video communication at the same time, which may over-saturate the network and interrupt communication over it. In addition, some devices are battery powered, which introduces the need for power management to guarantee the longest possible availability of the video streaming system. System availability is very important for smart network services and is one of the three main goals of cyber security [1, 2]. To prevent these problems, next-generation networks and smart video sources that may utilize machine learning are required [3,4,5].

Video compression is generally incorporated with video streaming to reduce the required transmission bandwidth. Video compression relies on similarity (spatial and temporal) in video frames and codes only the differences within a frame (spatial) and/or between consecutive frames (temporal). This encoding process consumes the most energy of all phases of transmission, such as capturing, wired or wireless transmission, decoding, up-scaling, and displaying [6,7,8]. Video compression can ideally provide any needed output video rate. One of the most popular encoders in video communication is H.264, which employs many features for more efficient compression and better flexibility [9, 10]. In this paper, we use H.264 as a case study.

An adaptive video communication system can maintain network availability by adapting each user's video transmission rate [11]. Several components are needed for such a system to work correctly. One is measuring network congestion (network capacity), which helps specify the feasible rate of the video stream being transmitted. Another is forcing the video compressor to produce the suitable video rate to guarantee smooth video transmission and playback. Device energy consumption should also be taken into consideration and reduced to maximize the overall system availability, especially if some devices are battery powered. This can be considered a control problem with two inputs, the current device energy level and the network congestion level, and one output, the video transmission rate. The video transmission rate can then be fed to the video compressor to compress the video at the needed rate. If all video sources apply this optimization continuously, system availability is guaranteed. We can also look at the problem from a different perspective, in which we want to predict the availability of the streaming system. In this perspective, the inputs would be some encoder parameters and the outputs would be the power needed for the encoding process and the resulting rate of the video stream. In this study, we are interested in the latter perspective, and we build the prediction system using deep learning. In previous studies [11,12,13,14], researchers tried to provide solutions for adaptive video streaming, but most solutions were limited to specific types of networks or did not take into consideration all related system parameters. To be more specific, only a few studies [6, 7, 15] tried to model the impact of video encoder parameters on the power consumption and the resulting bit-rate of the encoder, and to the best of our knowledge, none of them used deep learning in the modeling process.

This paper contributes to video communication systems' availability by modeling the relationship between the systems' availability-related parameters and the video encoders' parameters. Specifically, we study the relationship between the video streaming related power consumption, the perceptual video quality at the destination, and the required communication bandwidth on one side, and the encoder resolution and quantization parameters on the other. We study and analyze the predictability of these three quantities from the encoder resolution and quantization parameters using deep learning algorithms. All needed data have been collected in a lab environment as described in Sect. 3. The developed deep learning models are based on more than 1,000 different video encoding experiments with simultaneous measurement of the consumed power; the perceptual video quality and the required bandwidth are measured at the end of the encoding process. The developed models can be used to assess the impacts of various video encoding parameters, and thus can help in the dynamic control of various source settings, including but not limited to resolution and quantization, to achieve the optimal power consumption, bandwidth, video quality, and computer vision accuracy in robotics [11,12,13,14]. Although we consider the popular H.264 standard, this study can help in deriving models for other encoders, such as High Efficiency Video Coding (HEVC) and VP9, by following the same procedure. The results show that the proposed models are very accurate in predicting the video communication related power consumption, the perceptual video quality at the destination, and the required communication bandwidth based on the encoder resolution and quantization parameters.

The rest of the paper is organized as follows. Section 2 discusses background information and related work. Section 3 develops various models and discusses the experimental setups and modeling methodology. Section 4 presents the validation results and provides an overall analysis. Finally, conclusions are drawn in the last section.

2 Background information and related work

A video is a sequence of frames (images) captured at a rate equal to or faster than the rate of eye perception (the frame rate). Consecutive frames are usually very similar to each other, and this similarity is called temporal redundancy. Each frame consists of a number of pixels, and the number of pixels in a frame is known as the frame's spatial resolution; higher resolution indicates higher clarity. A video file in the original format (storing all pixel information) would be tremendous in size, requiring very high bandwidth and storage. Encoding techniques compress these large video files into much smaller files that can be transmitted without noticeable loss or delay. This, along with the availability of high-speed communication networks, established video communication as a new means of communication.

The most popular video encoder nowadays is H.264, which implements both spatial and temporal compression. H.264 has high computational complexity, mainly due to its motion estimation, complex prediction, and rate-distortion optimization [7, 10, 15,16,17].

The video encoding process can generally be divided into three high-level stages: the intra- and inter-prediction (estimation) stage; the transformation, quantization, and their inverse stage; and the entropy coding stage. In the estimation stage, intra-prediction and inter-prediction are used to reduce the spatial and temporal redundancies in the video, respectively [7, 17, 18]. Streaming video quality depends on the video encoding process and on the amount of bandwidth required for the video to be viewed properly. Video data contains spatial and temporal redundancy; similarities can thus be encoded by considering only the differences within a frame (spatial) and/or between frames (temporal). Intra-prediction utilizes the spatial correlation within each frame to reduce the amount of data necessary to represent the image.

Some of the common forms of video communication are video calling and video conferencing. Video conferencing technology has enabled us to seamlessly connect face-to-face with people all over the world, and it has revolutionized education by bringing the best learning opportunities from around the world to classrooms and homes. The video communication process is shown in Fig. 1.

Fig. 1 The process of video communication

Power consumption, bandwidth requirements, and video quality are major concerns in video communication systems, especially in smart systems that utilize mobile devices and wireless communications. In this paper, we develop deep learning-based predictive models for power consumption, required video bandwidth, and perceived video quality in a typical video communication system. The models capture the impacts of various video encoding parameters, and thus can assist in the dynamic control of various camera/sensor settings to achieve the optimal power consumption, bandwidth, and video quality.

3 Building the model and experimental setups

Since 2012, Deep Learning (DL) has witnessed great success in computer vision and other disciplines such as speech recognition, natural language processing, and modeling [19]. The success of DL is based on the recent availability of big data, high computational power, and powerful Artificial Neural Network (ANN) algorithms. A popular example of a deep learning network is the feed-forward deep network, or Multi-Layer Perceptron (MLP).

These models use multiple layers: one input layer that represents the independent input vector, one or more hidden layers, and one output layer that represents the dependent output vector. Nodes in each layer are linked to all nodes in the next layer, and these links hold the intermediate weights. The weight on each link defines how the first layer influences the second layer, and so on. During the training process, the weights are updated in each iteration in order to obtain the lowest error in the output; gradient-descent back-propagation is often used as the learning algorithm to minimize the output error. Calculating the optimal weights is the main task of the deep learning and machine learning training process. Nonlinearities are represented in the neural network by activation functions at each node, such as ReLU and Softmax [20].
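As a concrete illustration (not taken from the paper), the minimal sketch below performs one gradient-descent back-propagation update for a tiny MLP with a single ReLU hidden layer and an MSE loss; all array sizes and the learning rate are illustrative assumptions.

```python
# Minimal numpy sketch of one gradient-descent back-propagation step
# for a one-hidden-layer MLP with ReLU activation and MSE loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 2))                 # 8 samples, 2 input features
y = rng.random((8, 1))                 # 1 target value per sample
W1, b1 = rng.standard_normal((2, 4)) * 0.1, np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)) * 0.1, np.zeros(1)
lr = 0.01                              # learning rate (illustrative)

# forward pass
h_pre = X @ W1 + b1
h = np.maximum(h_pre, 0.0)             # ReLU activation
y_hat = h @ W2 + b2
loss = np.mean((y_hat - y) ** 2)       # MSE between actual and predicted

# backward pass: gradients of the loss with respect to each weight
d_yhat = 2.0 * (y_hat - y) / len(X)
dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
d_h = (d_yhat @ W2.T) * (h_pre > 0)    # ReLU derivative
dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

# gradient-descent weight update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```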

In this paper, a deep learning approach, mainly supervised learning of artificial neural networks, is proposed to develop prediction models for video encoding quality and required resources based on input encoder parameters. Regression by artificial neural networks is chosen for its high accuracy in prediction based on labeled data.

Figure 2 shows the deep learning process we followed in this paper. Raw video sequences are collected and encoded using different resolutions and quantization parameters. During and after encoding, each encoded video sequence is labeled with the required power, the required bandwidth, and the resulting quality of the sequence. After that, the sequences are divided into training and testing subsets. A deep neural network model is then developed and trained on the training subset of the sequences. Upon completing the training, the trained model is tested on the unseen test subset of the sequences to predict the video encoding power consumption, the required bandwidth, and the perceptual video quality for these videos. The predicted data is then compared with the actual data to measure the model accuracy.

Fig. 2 Illustration of the deep learning process

We encode the sequences by changing both the resolution and the quantization parameter, and we measure the encoding power consumption, the resulting bit-rate, and the perceived video quality. The utilized sequences are the popular Akiyo, Silent, SignIrene, and Deadline video sequences downloaded from the VIPs Lab (Footnote 1). More information about these sequences is listed in Table 1. We use these sequences because they are available in raw (uncompressed) format. We change the resolution by down-scaling each video sequence from the original size of 352 × 240 down to \(10\%\) of that, or 35 × 24 (specifically, we consider the original itself as \(100\%\) and also \(90\%\), \(80\%\), ..., and \(10\%\) of the original size). For each of these sizes, we also produce different video quality levels using the encoder quantization parameter (from 1 to 30).
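The paper does not name the encoder implementation used for this sweep; the following is a hedged sketch assuming ffmpeg's libx264 H.264 encoder, with placeholder file names, that enumerates the 10 spatial scales and 30 quantization parameters described above.

```python
# Hedged sketch of the encoding sweep (10 scales x 30 QPs per sequence).
# Assumes ffmpeg/libx264 is available; file names are placeholders.
import itertools
import subprocess

SEQUENCES = ["akiyo.y4m", "silent.y4m", "signirene.y4m", "deadline.y4m"]
SCALES = [round(0.1 * i, 1) for i in range(10, 0, -1)]   # 1.0, 0.9, ..., 0.1
QPS = range(1, 31)                                        # quantization parameter 1..30

for src, scale, qp in itertools.product(SEQUENCES, SCALES, QPS):
    out = f"{src.rsplit('.', 1)[0]}_s{scale}_qp{qp}.mp4"
    # scale filter resizes the frames (rounded to even dimensions for 4:2:0),
    # -qp fixes the H.264 quantization parameter for constant-QP encoding
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=trunc(iw*{scale}/2)*2:trunc(ih*{scale}/2)*2",
        "-c:v", "libx264", "-qp", str(qp),
        out,
    ], check=True)
```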

We measure the power consumption while encoding as explained in [8], and then we find the bit-rate of the encoded video from the size of the encoded video file. We use the decoded video frames to measure the perceived video quality compared with the original video frames. As the metric for perceived video quality, we use the Structural SIMilarity index (SSIM) [21] between the decoded and the original frames.
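The SSIM computation itself could look like the hedged sketch below, which uses scikit-image; the paper does not state which SSIM implementation was used, and the frames are assumed to be grayscale arrays of the same size (decoded frames upscaled back to the original resolution where needed).

```python
# Hedged sketch of per-sequence SSIM between decoded and original frames.
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(original_frames, decoded_frames):
    """Average SSIM over corresponding grayscale frames (same resolution assumed)."""
    scores = [
        structural_similarity(orig, dec, data_range=255)
        for orig, dec in zip(original_frames, decoded_frames)
    ]
    return float(np.mean(scores))
```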

Table 1 Properties of the standard dataset used in the training

We prepare a table that contains two input columns, the frame downsizing fraction (1.0, 0.9, 0.8, ..., 0.2, 0.1) and the quantization parameter (1, 2, 3, ..., 29, 30), and three output columns: the power consumption, the bit-rate, and the SSIM quality. Figure 3a–c plots the collected data.

Figure 3a, b, and c show the impact of varying the video encoding parameters (quantization parameter and spatial resolution) on the encoder power consumption, the required video transmission bandwidth, and the encoded video quality, respectively. By combining the quantization parameter and the frame size in pixels in H.264 encoding, the required bandwidth can be reduced to as low as one percent of the bandwidth at the highest setting, and the encoder power consumption can be decreased to as low as 1/25 of the power consumption at full resolution with a quantization parameter of 1. The figures also show that determining the output combination of power consumption, bit-rate, and SSIM for a specific combination of resolution and quantization parameters is not an easy task and requires a sophisticated prediction algorithm.

Fig. 3 Impact of varying quantization parameter and spatial resolution

To build the ANN model, we use Keras on top of TensorFlow in a Python environment. Figure 4 shows the code for the developed ANN model. Utilizing Keras/TensorFlow, the code defines the network to be sequential and sets the number of nodes in the input, output, and hidden layers. The Adam optimizer is selected to minimize the mean squared error (MSE) between the actual and predicted values. The early-stopping callback ("EarlyStopping" in Keras) is used to stop training automatically: during training, the MSE is measured at the end of each epoch, and if the loss is no longer decreasing by more than \(min\_delta\) for \(patience\) consecutive epochs, training terminates.

With early stopping, the direction (whether to stop when the monitored quantity stops decreasing or stops increasing) is automatically inferred from the name (type) of the monitored quantity: a loss should decrease and an accuracy should increase. Training is invoked by the model.fit() function. We set the maximum number of epochs in the training process to 2000 in case early stopping is not triggered. The fit function splits the training dataset randomly into \(20\%\) for validation and \(80\%\) for training. Finally, we measure the goodness of fit by \(R^2\), where \(R^2=1\) indicates a perfect fit.

Fig. 4 The ANN model code
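Since Fig. 4 appears only as an image, the following is a minimal sketch consistent with the description above; the hidden-layer widths and the \(min\_delta\) and \(patience\) values are assumptions, as the paper does not list them explicitly.

```python
# Hedged reconstruction of the ANN model described in Sect. 3:
# 2 inputs, 4 hidden layers, 3 outputs, Adam optimizer, MSE loss,
# early stopping, up to 2000 epochs, 20% validation split.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(2,)),  # inputs: resolution fraction, QP
    layers.Dense(64, activation="relu"),                     # hidden widths are assumptions
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3),                                         # outputs: power, bit-rate, SSIM
])
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-4, patience=50, mode="auto")  # thresholds assumed

# Placeholder arrays with the shapes described in Sect. 3; replace with the measured dataset.
X_train = np.random.rand(100, 2)
Y_train = np.random.rand(100, 3)
history = model.fit(X_train, Y_train, epochs=2000,
                    validation_split=0.2, callbacks=[early_stop], verbose=0)
```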

Figure 5 shows the neural network model, whose inputs are the video resolution as a percentage of the original frame size and the quantization parameter. The network has four hidden layers and three outputs: the video power consumption, the required bandwidth, and the perceptual video quality (measured by SSIM).

Fig. 5 ANN model structure

Fig. 6 ANN model training monitoring

4 Model validation results and analysis

Figure 6a shows the progress of the cost function (the validation loss) used in training the ANN model. In this paper, we use the Mean Squared Error (MSE), which is the average, over the entire training dataset, of the squared difference between the actual and the predicted values. The figure shows that the MSE is close to zero after training the model for 292 epochs.
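For reference, for \(n\) samples with actual values \(y_i\) and predicted values \(\hat{y}_i\), this loss can be written as \(MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2\).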

Figure 6b shows the progress of the accuracy of the model over the training and the validation data. The figure shows quick convergence of the training process without showing any overfitting.

Fig. 7 Overall actual vs predicted for the testing dataset

Figure 7 shows the overall relation between the actual data and the data predicted by the neural network model. Since the prediction has high accuracy, the relation is linear, with the actual and predicted values very close to each other.

Fig. 8 Actual vs predicted for each output in the testing dataset

To detail the results shown in Fig. 7, Fig. 8 plots the actual data and the predicted data for each point of each output in the testing dataset. Since the prediction has high accuracy, the values are very close to each other for all outputs.

Table 2 \(R^2\) results

To provide numerical values that represent the closeness of the predicted and actual data plotted in the previous figures, we calculate \(R^2\) for each output separately and for all outputs together, for both the training and testing data. The results are shown in Table 2. As shown in the table, \(R^2\) is close to 1 for both the training and the testing data for all outputs.
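For reference, \(R^2\) is computed as \(R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}\), where \(y_i\) are the actual values, \(\hat{y}_i\) the predicted values, and \(\bar{y}\) the mean of the actual values.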

5 Conclusions and future work

We have developed deep learning-based prediction models for the power consumption, required bandwidth, and SSIM quality of a video communication system. The models can help in the automatic control of various camera/sensor parameter settings, including frame size and quantization, to achieve the optimal outcomes in terms of perceived video quality, encoding power consumption, and required bandwidth. By combining the H.264 quantization parameter and frame size settings, the required bandwidth can be reduced to as low as one percent of the original requirements, and the power consumption can be lowered to as low as 1/25 of that at full resolution with a quantization parameter of 1. We observe that the neural network model is very accurate: it achieves predictions with high similarity to the original labeled data as measured by the \(R^2\) goodness-of-fit measure. In future work, we will build a neural network model that predicts the required resolution and quantization parameter (outputs) based on the encoding power consumption, required bandwidth, and video quality (inputs). We will also experiment with higher-resolution videos.