Event-Driven Serverless Pipelines for Video Coding and Quality Metrics

Nowadays, the majority of Internet traffic is multimedia content. Video streaming services are in high demand by end users and use HTTP Adaptive Streaming (HAS) as the transmission protocol. HAS splits the video into non-overlapping chunks, and each video chunk can be encoded independently using different representations. Therefore, these encoding tasks can be parallelized, and Cloud computing can be used for this purpose. However, in the most widespread solutions, the infrastructure must be configured and provisioned in advance. Recently, serverless platforms have made it possible to deploy functions that can scale from zero to a configurable maximum. This work presents and analyzes the behavior of event-driven serverless functions to encode video chunks and to compute, optionally, the quality of the encoded videos. These functions have been implemented using an adapted version of embedded Tomcat to deal with CloudEvents. We have deployed these event-driven serverless pipelines for video coding and quality metrics on an on-premises serverless platform based on Knative, with one master node and eight worker nodes. We have tested the scalability and resource consumption of the proposed solution using two video codecs, x264 and AV1, varying the maximum number of replicas and the resources allocated to them (fat and slim function replicas). We have encoded different 4K videos to generate multiple representations per function call, and we show how it is possible to create pipelines of serverless media functions. The results of the different tests carried out show the good performance of the proposed serverless functions. The system scales the replicas and distributes the jobs evenly across all of them. The overall encoding time is reduced by 18% using slim replicas, but fat replicas are more adequate for live video streaming as the encoding time per chunk is reduced. Finally, the results of the pipeline test show an appropriate distribution and chaining among the available replicas of each function type.

Introduction

Users spend more than $85 billion per year on video-on-demand (VoD) platforms such as Netflix, Amazon Prime Video, Disney+, and others [13].
All these platforms stream their videos over HTTP to offer high compatibility and device-independent access. More specifically, the most commonly used technology is HTTP Adaptive Streaming (HAS). In HAS, a video is divided into smaller temporal chunks (or segments) and each chunk is encoded at several resolutions and bit rates, thus creating several representations. These streaming systems require efficient video encoding systems to deliver content in as many resolutions, bit rates and codecs as possible to give the best quality of experience (QoE) to end-users [29].
All these encoding and transcoding processes require distributed systems to improve encoding times. Today, these tasks are performed in Cloud environments, as they offer the dynamism demanded by streaming platforms [9]. As we will see in Section 2, Cloud-based video encoding solutions have evolved as new paradigms have emerged in the Cloud. There are many Cloud services, all of which aim to offer specific functionality suitable for a purpose. Specifically, they are service implementation models that allow users to choose the level of control over the information and services they will provide (see Fig. 1). The most important services offered by Cloud providers are:

- IaaS (Infrastructure as a Service): In this model, the provider rents the infrastructure and the customer has almost total control over it. The provider is responsible for ensuring long-term reliability and security by maintaining the infrastructure.
- PaaS (Platform as a Service): The vendor provides the hardware system and an application software platform. PaaS allows developers to develop, run and manage their applications without having to design and maintain the infrastructure and platform.
- Serverless: This type of architecture allows the execution of applications through ephemeral containers. These containers are executed on the fly, so developers do not have to worry about managing the infrastructure on which their functions are executed, focusing exclusively on the functionality. Usually, it embodies two complementary service models: FaaS (Function as a Service) and BaaS (Backend as a Service).
- SaaS (Software as a Service): This model offers users a specific application that can be used immediately, without the need to install, deploy or maintain anything. Everything is managed by the service provider.

Fig. 1 Comparison of the most important paradigms used in cloud computing environments
In particular, serverless architectures have the following advantages (+) and disadvantages (−) compared with the remaining Cloud paradigms:

+ Delegated server administration: the execution environment of the program or service is provisioned automatically; it does not require constant verification or plugin installation, as the software is simply deployed automatically.
+ Scaling process: generally, growing performance and memory demands require a scaling process. With serverless computing technology, this process is provided automatically.
+ Application as a sum of functions: this allows users to update, patch, fix or add new features to parts of the application. There is no need to make changes to the entire application; instead, developers can update each function independently.
+ Automation: the process does not require the burden of contingency protocols, since the technology offers tolerance to application errors.
+ Nearby functions: the application is not hosted on an origin server. Functions are deployed in the Cloud when they are required. This task is performed by the provider, which places them as close to the user as possible to reduce latency.
+ Savings: pay only for the time the function is running.
− Loss of control over infrastructure: the Cloud service provider controls the underlying infrastructure, so it cannot be customized or optimized to meet specific needs.
− Security concerns: function developers must take care of company-critical data, leaks of secrets, denial of service, etc.
− Demanding testing and debugging: debugging is more complicated because developers have no visibility into backend processes and because the application is divided into separate, smaller functions.
− Not designed for long-term processes: providers charge for the time the code is running, so it can cost more to run an application with long-running processes on a serverless infrastructure than on a traditional one.
According to these features, serverless architectures have evolved to become a very useful tool for deploying functionality to a production environment, paying only when it is executed and without investing in infrastructure [32]. HAS-based video coding systems for VoD could be a suitable use case to deploy on a serverless architecture. In this type of scenario, the original video is split into several chunks with a duration of 2-4 seconds, and each chunk is usually encoded at different bit rates and/or different resolutions. For each of these encodings, or set of encodings per chunk, the Cloud can execute a code fragment encapsulated in a function, performing dynamic resource allocation. The serverless platform can encode the chunks in parallel, since it automatically scales the number of replicas running the function when demand grows, which can reduce the overall encoding time.
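As an illustration of this chunking step, a hedged sketch with FFmpeg follows (the commands and file names are ours, not part of the proposal; the 2-second chunk length matches the scenario described above):

```bash
# 1) Transcode the source at high quality, forcing a keyframe every 2 seconds
#    so the stream can be cut into exact, independently decodable chunks.
ffmpeg -i source_4k.mp4 -c:v libx264 -crf 10 \
       -force_key_frames "expr:gte(t,n_forced*2)" mezzanine.mp4

# 2) Split into non-overlapping 2-second chunks without re-encoding.
ffmpeg -i mezzanine.mp4 -c copy -map 0 -f segment \
       -segment_time 2 -reset_timestamps 1 chunk_%04d.mp4
```

Each resulting chunk can then be encoded independently by a separate function invocation.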
Given all of the above, it is important to stress the need for Cloud-based coding systems that offer high performance, are highly scalable, and provide automated or quasi-automated deployment. For these reasons, in this work we present a video coding system based on the serverless architecture paradigm over a Cloud infrastructure. Our proposal is based on the Knative open-source serverless platform [16] deployed on a resource-constrained infrastructure.
The implemented system has been tested in several scenarios in order to analyze its scalability and performance when encoding videos and calculating a full-reference quality metric.
The main contributions of this work are:

- Development of event-driven video coding (x264 and AV1) and video quality (VMAF) serverless functions encapsulated in small containers.
- Analysis of the behavior of the developed functions in an on-premises resource-constrained serverless platform, with experiments on scalability and resource consumption.
- Analysis of the proposed solution for video on demand with DASH multiresolution streaming, showing its performance and behavior when varying the number and type of replicas.
- Development and analysis of a pipeline of events to encode video chunks and obtain the encoded video quality from a single event.
This paper is organized as follows. Section 2 reviews previous approaches to distributed encoding on Cloud infrastructures and more recent trends. Our video coding serverless functions based on FaaS are presented in Section 3. Section 4 analyzes the scalability and resource consumption of the proposed functions in a real test bed with 4K videos. A comparison of the behavior of fat (with six vCPUs) and slim (with only one vCPU) replicas is carried out, and the performance of the proposal for multiresolution encoding is assessed. Besides, a pipeline to encode chunks and compute the encoded chunk quality based on CloudEvents is presented and analyzed. Section 5 presents a qualitative comparison of our proposal with previous work on serverless video coding. Finally, conclusions and future work are presented in Section 6.

Related Work
Recently, Cloud solutions for distributed video coding and transcoding services have evolved to adapt to more demanding requirements and improve their performance. The first proposals for distributed video encoding over Cloud environments were mainly based on the MapReduce paradigm. Thus, Pereira et al. [24] propose a Split & Merge architecture for deploying video encoding over Cloud infrastructures using the H.264 standard. This work splits the video into several chunks and distributes them across the Cloud infrastructure to decrease the coding time. The Group of Pictures (GOP) size selected in this approach is fixed at 250 frames. Moreover, the paper neither quantifies the communication overheads that the Cloud execution entails nor measures the output quality.
Jeon et al. [12] improve the performance of video coding with content-aware video segmentation and scheduling over the Cloud. Their segmentation produces chunks of the same size, but the video content is considered when scheduling the encoding. In this approach, each chunk may contain a different type of video content and, for this reason, requires a different encoding time. Chunks with long encoding times are scheduled first, and chunks with short encoding times are gradually scheduled afterwards. The H.264 and HEVC encoding standards are used.
Kim et al. [15] propose a distributed transcoding system based on Apache Hadoop to improve transcoding throughput for H.264 and HEVC encoders, achieving higher throughput for larger batches of videos. Song et al. [31] obtain results similar to those of [15] for their Hadoop-based approach using H.264 encoders. As an evolution of these proposals, Sameti et al. [27] propose a transcoder-agnostic design based on Apache Spark. Their solution speeds up the transcoding of individual videos encoded using the HEVC standard, achieves real-time performance, and reduces transcoding time compared to an implementation without Spark.
However, in these MapReduce- and Spark-based solutions, the number of encoders cannot be modified dynamically once the encoding has started, even if dynamic scheduling is used, as in [27]. This dynamic instantiation of the number of encoders is crucial for video coding services, characterized by high demand and adaptive streaming applications. To adapt the number of encoders to the needs of video streaming, such as high video resolutions with bandwidth constraints and quality requirements, new Cloud-based solutions have been proposed. Gutiérrez-Aguado et al. [8, 9] propose a Cloud-based distributed architecture, tuned for video coding (using HEVC, VP9 and AV1 encoders), that relies on an elastic pool of workers and media servers to provide fault tolerance. This solution allows the adaptation of the resources to the workload, offers high availability, and provides ubiquitous access. A distributed application is deployed on top of the architecture to split the encoding process of a video into several jobs that can be dynamically assigned among the elastic pool of workers.
However, in these Cloud-based solutions, the dynamic encoding workers used for distributed video encoding are virtual machines (VMs) that require a non-negligible time to deploy and destroy. For this reason, Linux containers, and especially Docker containers, have become an alternative to VMs for virtualizing computing resources. As lightweight processes, containers allow the encapsulation of applications with all their dependencies and guarantee their execution across multiple platforms, reducing deployment time and obtaining higher efficiency than traditional VMs [30].
Dong et al. [3] propose a containerized multimedia processing system for video transcoding services (for H.264 and HEVC encoders) on a private Cloud, but this solution has a simple load balancing scheme.
Pääkkönen et al. [25] propose an architecture for predicting live video transcoding performance on a Docker-based platform, focused on Cloud resource management for live video transcoding. The solution uses a live prediction to improve resource allocation in transcoding container scheduling. They use only H.264 encoding.
Sameti et al. [28] propose CONTRAST, a container-based distributed transcoding framework for interactive video streaming. This solution, centered on HEVC encoding, uses an offline profiler to automatically determine the degree of parallelism and launches containers configured with the required number of cores to perform the transcoding. So, one container will be launched for each video chunk on the number of cores determined in the profiling stage. This solution has been implemented and tested on a public Cloud (Amazon EC2), but no implementation details are provided.
The use of containers in a cluster adds complexity due to orchestration and scheduling. For this reason, container orchestration platforms (COPs) such as Docker Swarm, Nomad or Kubernetes are used in Cloud computing environments [21]; Kubernetes is currently the de facto standard COP, followed by OpenShift (which is often referred to as "Enterprise Kubernetes") [23].
Then serverless appeared: a new Cloud computing abstraction that makes it easier to write robust, large-scale web services [11] and that opens a new era in Cloud computing. It is a computing resource management concept in which functions are deployed on the Cloud provider's platform and the user pays only for the running time of the functions when they are invoked; everything else is the responsibility of the Cloud provider. Amazon Web Services introduced the first serverless platform, AWS Lambda, in 2014, and similar abstractions are now available on most Cloud computing platforms (e.g., Google Cloud Functions, IBM OpenWhisk, and Azure Functions) [22]. For example, a recent work by Risco et al. [26] shows the advantages of combining on-premises serverless infrastructures, such as OSCAR (based on OpenFaaS), with public Cloud serverless services such as AWS Lambda. In these cases, the on-premises solution can be preferable when a minimalist Kubernetes distribution and fast event-driven functionality are required, as in edge computing.
Fouladi et al. [5] propose a video encoder (for the VP8 format) intended for fine-grained parallelism, using a functional programming style on AWS Lambda [33] together with their proposed Mu framework. Thus, instead of invoking workers in response to a single event, they invoke them in bulk, thousands at a time. In this approach, the VP8 encoder (the only one used) has to be modified to re-encode consecutive chunks of 6 frames (a particular choice of the authors) in order to perform the "rebase" phase (running serially) that composes the chunks into the final encoded video sequence. This implementation could be improved by increasing the chunk size (e.g., chunks of 2 seconds or 48 frames) to avoid communication among workers (the "rebase" phase), increasing the reported performance significantly (more than 8 times faster).
Sprocket [1] is a serverless system implemented on AWS Lambda. This video processing framework enables developers to program a series of operations over video content in a modular and extensible manner. It handles the underlying access, encoding and decoding, and processing of video and image content through highly parallel operations. This work extends the Mu framework [5] to support the dynamic creation of tasks during processing, instead of batch-style invocation of Lambda, and asynchronous messaging and data transfer among the Lambda workers. No details about the encoders used are provided.
Currently, there are several popular open-source serverless platforms (i.e., Knative, Kubeless, Nuclio and OpenFaaS) [18], which are alternatives to the serverless services offered by public Cloud providers and offer exciting new features. These platforms show performance differences across service exporting and auto-scaling modes. More studies are needed to decide which platform is best, but Knative and OpenFaaS are very promising. In particular, Knative has many useful features, such as scale-to-zero and multiple auto-scaling modes, and an active community that can help users and developers.
As we have seen in this section, and to the best of our knowledge, there are few works [1, 5] in which video coding systems have been developed on serverless Cloud platforms. In the following sections, we present our proposal and evaluate it to validate its correct operation. In Section 5, our proposal is compared with previous works on video encoding on serverless platforms to evaluate their differences and similarities.

Event-Driven Serverless Functions for Media Processing
In this Section, the main features of the Knative serverless platform used in our design approach are described. Moreover, we describe the platform architecture and its main components, as well as the implementation of the event-driven serverless functions.

Serverless Platform: Knative
There are several open-source serverless platforms available: Knative, Nuclio, OpenFaaS, and Kubeless. The features offered by these platforms have been analyzed in [18]. In our deployment, we have selected Knative because the scaling of replicas can be controlled taking concurrency into account (this is important in our application, as we only allow one concurrent execution per replica of the function because the job performed consumes all the assigned resources). Besides, it offers cold/warm start and scale-to-zero, and it has support for CloudEvents [2]. CloudEvents is a specification for describing event data that simplifies event declaration and delivery across services and platforms. The specification is hosted under the Cloud Native Computing Foundation and has been adopted by major Cloud providers and SaaS companies. It describes how to encapsulate CloudEvents in different protocol bindings, such as AMQP, HTTP, Kafka, MQTT, WebSockets, Protobuf, etc. In our application, CloudEvents are encapsulated in HTTP messages: the CloudEvent metadata is added as HTTP headers, while the event data is encapsulated in JSON format in the body of the HTTP message.
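For illustration, a CloudEvent for the x264 function delivered in HTTP binary content mode looks roughly as follows (the Ce-Type value x264 matches the trigger configuration used later; the JSON field names in the body are placeholders of ours, not the paper's exact schema):

```http
POST /default/default HTTP/1.1
Host: broker-ingress.knative-eventing.svc.cluster.local
Ce-Specversion: 1.0
Ce-Type: x264
Ce-Source: encoding-client
Ce-Id: chunk-0001
Content-Type: application/json

{"chunkUrl": "http://download-server/ANC/chunk_0001.mp4",
 "uploadUrl": "http://upload-server:8080/upload"}
```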
When a resource of type Knative service is deployed, the created pod contains the user container (where the function has been encapsulated) and the queue-proxy sidecar, which exposes metrics and is used by Knative to control concurrency, timeouts, retries, etc., as shown in Fig. 2.
Knative supports the use of CloudEvents to deploy event-driven functions. The user must create a broker that receives the events from an event source. The broker can have several triggers that use the CloudEvent type and/or source attributes to determine which consumer must receive the CloudEvent, as shown in Fig. 3.
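A minimal sketch of such a trigger, assuming a broker named default and filtering on the CloudEvent type attribute (the resource names are ours; x264-fn follows the function names used in Section 4):

```yaml
# Route every CloudEvent whose "type" attribute is "x264" to the x264-fn Knative service
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: x264-trigger
spec:
  broker: default
  filter:
    attributes:
      type: x264
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: x264-fn
```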
The sink or consumer of the event can be a standard Kubernetes Deployment resource or a Knative service (Serving). In the latter case, the serverless platform automatically scales the replicas of the consumer.

When installing Knative, a network layer must also be installed; some possibilities are Kourier, Istio, or Contour. We have opted for Contour, as it reads the Knative properties that control the timeouts of the functions. These properties are established in the configmap config-defaults in the knative-serving namespace; in our case, they were adjusted to accommodate the long-running encoding jobs.

Figure 4 shows the deployed infrastructure. Knative has been installed on a cluster of nine virtual machines with Kubernetes, and it is used to deploy the different serverless event-driven functions. Videos are split into chunks and stored in a download server (in this case, an Nginx container running on a server). A private registry storing the container images has been used; all the images have been tagged with this registry, and Kubernetes has been configured with the credentials to pull images from it. When a video chunk is processed in a function, the resulting video is uploaded to a server (as functions are ephemeral) that runs an upload service (developed in Tomcat). When a user wants to encode a video, a CloudEvent must be generated to encode each chunk. The events are sent to the networking layer gateway and are delivered to a replica running the corresponding function.
Listing 2 Skeleton of the function that encodes a video chunk using x264

Function Implementation
The Knative serverless platform does not impose any restrictions on the implementation of the functions. In our case, we opted for Java due to our previous projects. Using this language, the function can be developed using Tomcat, Quarkus, Spring, Micronaut, etc., to expose an HTTP endpoint.
In this work, the functions have been implemented using a modified embedded Tomcat that provides abstractions to work with CloudEvents [7]. The use of this runtime leads to smaller images, as Tomcat has fewer dependencies than the other application runtimes. The modified embedded Tomcat adds an abstract class CloudEventListener, which extends GenericServlet, with the abstract method consumeEvent(...) that is called when an HTTP POST request is received. The first argument of this method is the CloudEvent, the second one is the HttpServletRequest, and the third one is the HttpServletResponse.
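A minimal sketch of this abstraction, based only on the description above (the header-based parsing via the CloudEvents Java SDK is our assumption; the modified Tomcat of [7] may implement it differently):

```java
import java.io.IOException;
import java.net.URI;
import javax.servlet.GenericServlet;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import io.cloudevents.CloudEvent;
import io.cloudevents.core.builder.CloudEventBuilder;

public abstract class CloudEventListener extends GenericServlet {

    @Override
    public void service(ServletRequest req, ServletResponse res)
            throws ServletException, IOException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        if (!"POST".equalsIgnoreCase(request.getMethod())) {
            response.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED);
            return;
        }
        // Rebuild the CloudEvent from the Ce-* headers (binary content mode)
        // and the JSON payload carried in the HTTP body.
        CloudEvent event = CloudEventBuilder.v1()
                .withId(request.getHeader("Ce-Id"))
                .withType(request.getHeader("Ce-Type"))
                .withSource(URI.create(request.getHeader("Ce-Source")))
                .withData(request.getContentType(),
                          request.getInputStream().readAllBytes())
                .build();
        consumeEvent(event, request, response);
    }

    // Implemented by each media function (x264, AV1, VMAF).
    protected abstract void consumeEvent(CloudEvent event,
                                         HttpServletRequest request,
                                         HttpServletResponse response)
            throws IOException;
}
```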
Using this application runtime, we have implemented three event-driven serverless functions:

- A function to encode a video using x264 [6].
- A function to encode a video using AV1 [6].
- A function to compute the VMAF [19] between a reference video and a distorted video.
Sample skeletons of the implemented functions are shown in Listings 2 and 3.
As these functions are executed in a serverless environment, the replicas are ephemeral and can be scaled to zero. Therefore, for each invocation, the video chunk must be downloaded, processed (building a call to ffmpeg), and finally uploaded to an external upload service.
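Building on the CloudEventListener sketch above, the following is a hedged reconstruction of an encoder function in the spirit of Listing 2; the URLs, the JSON handling, and the ffmpeg options are placeholders of ours, and only the download/encode/upload flow is taken from the text:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import io.cloudevents.CloudEvent;

public class FunctionX264 extends CloudEventListener {

    private final HttpClient http = HttpClient.newHttpClient();

    @Override
    protected void consumeEvent(CloudEvent event,
                                HttpServletRequest request,
                                HttpServletResponse response) throws IOException {
        // In the real function these would be parsed from the JSON event data.
        String chunkUrl  = "http://download-server/ANC/chunk_0001.mp4"; // placeholder
        String uploadUrl = "http://upload-server:8080/upload";          // placeholder
        try {
            // 1. Download the chunk: replicas are ephemeral, nothing is cached locally.
            Path in  = Files.createTempFile("chunk", ".mp4");
            Path out = Files.createTempFile("encoded", ".mp4");
            http.send(HttpRequest.newBuilder(URI.create(chunkUrl)).build(),
                      HttpResponse.BodyHandlers.ofFile(in));

            // 2. Encode with ffmpeg, one thread per allocated vCPU (6 in a fat replica).
            new ProcessBuilder("ffmpeg", "-y", "-i", in.toString(),
                               "-c:v", "libx264", "-threads", "6", out.toString())
                    .inheritIO().start().waitFor();

            // 3. Upload the result to the external upload service (multipart POST elided).
            upload(out, uploadUrl);
            response.setStatus(HttpServletResponse.SC_OK);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            response.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
        }
    }

    private void upload(Path file, String uploadUrl) throws IOException {
        // Sketch only: a multipart/form-data POST to the Tomcat upload service.
    }
}
```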
An instance of FunctionX264 can be registered as a Servlet in a Tomcat instance to process HTTP requests.
Once these functions have been deployed, they can be activated with HTTP requests. Listing 4 shows a sample call sent to the broker through Envoy (which provides a high-performance L4/L7 reverse proxy) exposed on port 8080.

As these functions are CPU intensive, as will be shown in Section 4.3, we do not allow concurrency, i.e., only one CloudEvent is dispatched to each replica of the functions to maximize the performance of each replica. When a replica finishes its encoding job, it is ready to receive and process the next event.
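A hedged reconstruction of such a call (the gateway address, namespace, broker name, and JSON field names are assumptions of ours; the Ce-Type value and the extraEncoderParams field are named in the paper):

```bash
# Send a CloudEvent to the broker through the Envoy gateway on port 8080
# (path format: /<namespace>/<broker-name>; "default/default" is assumed here).
curl -v "http://<gateway-address>:8080/default/default" \
  -H "Ce-Specversion: 1.0" \
  -H "Ce-Type: x264" \
  -H "Ce-Source: encoding-client" \
  -H "Ce-Id: chunk-0001" \
  -H "Content-Type: application/json" \
  -d '{"chunkUrl":"http://download-server/ANC/chunk_0001.mp4","uploadUrl":"http://upload-server:8080/upload","extraEncoderParams":""}'
```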

Experiments
In this Section, we describe the experimental setup and the conditions used to validate the proposed architecture. We have carried out experiments to validate the scalability and the behavior, in terms of performance and resource consumption (CPU and memory), of the video encoder (x264 and AV1) and video quality (VMAF) functions. We have also evaluated the use of event-driven processing pipelines.

Test Bed
The infrastructure used to deploy the proposed components, shown in Fig. 4, consists of five physical machines connected through a 1 Gb switch. One machine (Intel(R) Xeon(R) CPU E3-1220 v6 @3.00 GHz, 4 cores, 16 GB RAM) runs, among other services, a containerized private registry with all the developed function images. Another machine (2 Intel(R) Xeon(R) Silver 4210 CPUs @2.10 GHz with 10 cores each and Hyper-Threading, 128 GB RAM) offers, among other services unrelated to this experiment, a containerized download service (Nginx) where all the chunks are stored. There is no overprovisioning in terms of the allocated vCPU cores when running this containerized download service. The third machine is a dedicated old server (Pentium(R) Dual-Core CPU E6600 @3.06 GHz, 2 cores, 4 GB RAM) that runs a containerized upload service (developed using Tomcat) to store the encoded videos and the VMAF quality results. The other two servers (Server1 and Server2), with 104 vCPUs available in total, host the virtual machines of the Kubernetes cluster: Worker1 to Worker4 run on Server1, while Worker5 to Worker8 run on Server2. Each VM has a disk image stored in its dedicated 240 GB Crucial SSD. These VMs were created with virt-install and configured with Ansible. As we can see, the serverless platform is installed on a heterogeneous environment with slightly different hardware features. This does not affect the overall functionality of the application but, as we will discuss in Section 4.3.1, it has an impact on the speedup compared to a homogeneous system.

Listing 3 Skeleton of the function that computes the VMAF quality metric of an encoded video chunk

The 4K videos used in the experiments are listed in Table 1. The videos have been downloaded, transcoded using x264 at high quality (CRF=10, which generates visually lossless encoded videos), and finally split to obtain temporal video chunks of 2 seconds [17]. The sizes shown in Table 1 are the final sizes after transcoding and splitting.
The codecs encapsulated in the functions are H.264 (x264) and AV1 (rav1e), integrated into FFmpeg [4]. The container images of the functions use a Java 11 base image, and a multistage pattern was used to copy the FFmpeg binary (with support for multiple codecs) into the image. Table 2 shows the final container image sizes.
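A hedged sketch of such a multistage build (the base images, tags, and paths are our assumptions, not the authors' exact Dockerfile):

```dockerfile
# Stage 1: any image providing an FFmpeg build with x264/rav1e/libvmaf support (assumed here)
FROM jrottenberg/ffmpeg:4.4-ubuntu AS ffmpeg

# Stage 2: Java 11 base image; only the ffmpeg binary is copied over,
# which keeps the final function image small.
FROM eclipse-temurin:11-jre
COPY --from=ffmpeg /usr/local/bin/ffmpeg /usr/local/bin/ffmpeg
COPY target/x264-fn.jar /app/function.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "/app/function.jar"]
```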
In all the experiments, no other functions were executed in the infrastructure.

Function Cold Start Versus Warm Start
In this subsection, the impact of cold-start function invocation on our deployment is analyzed. For this experiment, we invoked the x264-fn function using eight fat replicas. Each replica is configured to consume 6 vCPUs, so each replica requests all the available vCPUs of the worker node where it has been scheduled (a hedged deployment sketch follows the list below). Two scenarios are considered:

- No replicas running (cold start). In this case, Knative must start all the replicas to deal with the incoming events.
- Eight replicas running (warm start).
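A minimal sketch of how such a fat-replica deployment can be expressed as a Knative Service (the image name is ours; the 6-vCPU request, the single-event concurrency, and the maximum of eight replicas are from the experiment description):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: x264-fn
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "8"   # at most eight replicas; scale-to-zero stays enabled
    spec:
      containerConcurrency: 1                   # one CloudEvent at a time per replica
      containers:
        - image: registry.local:5000/x264-fn:latest
          resources:
            requests:
              cpu: "6"                          # a fat replica takes all vCPUs of a worker node
```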
In both cases, we performed eight simultaneous calls to encode video chunks and measured the time from the creation of the event to the moment the event is received in the function. This has been repeated 50 times (400 invocations in each case) and, in the cold-start scenario, we ensured that between iterations all the functions had finished and scaled to zero. Figure 5 shows the histograms of the distribution of times for warm and cold starts. The mean time in the warm-start case is 17.7 ms, while in the cold-start case it is 3,179.4 ms.

Listing 4 Sample HTTP request to encode a chunk. As the CloudEvent Ce-Type attribute is x264, the trigger will send this event to the x264 function

Function Scalability and Performance
In this subsection, we present several experiments that show the scalability and performance of the proposed infrastructure. For scalability, we evaluate the performance of the x264-fn and av1-fn encoding functions with 1 to 8 replicas that request 6 vCPUs each, referred to as fat replicas. Those results are compared with the use of replicas that request 1 vCPU each, referred to as slim replicas, running the x264-fn function. With this test, we can see the behavior of the proposal for different types of granularity. Finally, a detailed analysis of the memory consumption per replica and of the worker CPU usage of the encoder and quality functions is carried out for the case of 8 fat replicas.

Scalability with Fat Replicas
In this experiment, the scalability and the performance of the encoder functions have been measured. In all the tests, we use cold start, and the maximum number of allowed replicas varies in {1, 2, 4, 8}. All the chunks of the ANC video are encoded using the x264-fn and av1-fn functions.

Table 3 shows the times to encode the 94 chunks of the ANC 4K video as the maximum number of replicas of the function increases. In all cases, the 94 events are sent in parallel. For x264, the time required for one replica to encode all the chunks is 617.4 seconds, while the time to encode all the chunks using 8 replicas is 112.6 seconds (the speedup compared with one replica is 5.5 and the efficiency is 68%). In the case of AV1, the time to encode all the chunks with one replica is 3256.1 seconds, and this time is reduced to 547.3 seconds with 8 replicas (the speedup in this case is 5.9 and the efficiency is 74%). In this experiment, the scheduler has allocated, for both codecs, the first four replicas in Worker5 to Worker8 (in Server2, with slightly better hardware features). Therefore, the base case (the time with one replica) has been executed in a VM in Server2. In the experiment with 8 replicas, four were executed in Worker5 to Worker8 (Server2) and the other four in Worker1 to Worker4 (in Server1, with slightly worse hardware features). The speedup would have been better in a homogeneous system with the features of Server2.

Figure 6 shows the distribution of jobs per replica as the number of replicas increases, for the x264 and AV1 encoders respectively. Jobs have a different duration depending on the spatial and temporal complexity of the chunk content but, in all cases, we can see that the proposed system tries to balance the jobs among the available replicas. Besides, some replicas can complete more jobs than others because the delivery of events is dynamically controlled by the Knative eventing components.
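For reference, the speedup and efficiency figures quoted above follow from the usual definitions:

```latex
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}
% x264: S_8 = 617.4 / 112.6  \approx 5.5,  E_8 = 5.5 / 8 \approx 0.68
% AV1:  S_8 = 3256.1 / 547.3 \approx 5.9,  E_8 = 5.9 / 8 \approx 0.74
```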

Scalability with Slim Replicas
In the last experiment of this section, we tested the behavior using slim replicas that request fewer computational resources. In this experiment, each function requests one vCPU and, inside the function, ffmpeg is launched with one thread. The maximum number of x264-fn function replicas allowed was increased to 48, consuming the same number of vCPUs as in the previous experiment. Through this experiment, we want to know which type of deployment is more efficient: one with many replicas with low processing capacity, or one with few replicas with high processing capacity. Comparing Table 4 with Table 3, we can see that the overall time to encode the 94 chunks is 92.6 seconds, 18% less than the 112.6 seconds obtained with 8 fat replicas (using 6 vCPUs and 6 threads). The mean time to download the chunks has increased, as there are more replicas downloading chunks simultaneously from the download server.

To conclude this subsection: for live video, where chunks must be encoded as fast as possible to be delivered, a setup with fat replicas using all the available resources is more appropriate, as each chunk is encoded faster than with slim replicas. However, for video on demand, slim replicas can be used, as the total time to complete the encoding is reduced.

Video Encoder Functions Memory Consumption
In this experiment, memory consumption using fat replicas was obtained when encoding the 94 chunks of ANC with x264-fn and av1-fn functions. The maximum number of replicas of the function was limited to 8.
As can be seen in Fig. 7a), the replicas performing the x264 encoding show a 2.2 GB peak consumption of RAM. Figure 7b) shows the memory consumption per replica using the av1-fn function. In this case, the av1-fn replicas performing the encoding show a peak consumption of around 5 GB of RAM. This is due to the way the codecs work. x264 is a widely used and very efficient 2nd-generation codec, thanks to its 20 years of continuous development. However, the maximum macroblock size for x264 is 16x16, while for AV1 it is 128x128. Larger macroblocks need more information to be stored in RAM to perform tasks such as motion estimation and compensation, transformation, quantization, etc. [14].
Therefore, the encoder determines the resources required by the function and also the maximum number of replicas that can be executed simultaneously in a given resource-constrained infrastructure.

Video Encoder Functions CPU Usage
In this experiment, CPU consumption using fat replicas was measured when encoding the same chunks of the ANC video. The maximum number of replicas of the function was limited to 8. Figure 8 shows the boxplots of the CPU consumption (in %) per vCPU for each worker node (the scheduler has allocated one replica per worker) when running the x264-fn and av1-fn functions, respectively. As can be seen in Listing 5, the ffmpeg call uses the same number of threads as the number of vCPUs of the worker node. Figure 8a) shows how the x264-fn function keeps all the vCPUs of each replica working between 60% and 80%, reaching 100% on many occasions. This behavior is slightly different for the av1-fn function (see Fig. 8b): in this case, the vCPUs of each replica consume 50% on average, reaching 80% in many cases. Differences in CPU usage for encoding tasks are primarily due to codec implementation. It is also worth noting that the x264 encoder has been improved for over 20 years, while the AV1 codec has only been around for 3 years.

Video Quality Function Memory Consumption
In this experiment, the resource consumption of the VMAF fat replicas is measured for chunks previously encoded using x264 and AV1. When the encoded chunk is downloaded, it has to be decoded to perform the comparison with the high-quality original video. Figure 9 shows the memory consumption per replica (one per node) during the experiment using 8 fat replicas of the vmaf-fn function. For both codecs, the function computing the VMAF shows a peak consumption of 2.85 GB of RAM per chunk.
To obtain a VMAF value, the function decodes the frames of the reference video and the frames of the distorted video to obtain a frame-level measure. These results are pooled to obtain the final quality score. Figure 9 shows that the vmaf-fn function behaves similarly, in terms of required RAM, independently of the codec used to encode the videos. Therefore, the RAM required to compute VMAF is independent of the decoder, but it does depend on the resolution. For an FHD resolution (1920×1080), which has fewer pixels than 4K, the memory consumed will be lower.
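A hedged sketch of this computation with FFmpeg's libvmaf filter (the paper does not show the exact call; the option names below are standard FFmpeg, and the file names are placeholders):

```bash
# First input: the distorted (encoded) chunk; second input: the high-quality reference.
# Frame-level scores are pooled by libvmaf and written as JSON.
ffmpeg -i encoded_chunk.mp4 -i reference_chunk.mp4 \
       -lavfi "[0:v][1:v]libvmaf=log_fmt=json:log_path=vmaf_chunk.json" \
       -f null -
```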

Video Quality Function CPU Usage
Figure 10 shows the boxplots of the CPU consumption (in %) per vCPU for each worker node with a maximum of 8 fat replicas of the vmaf-fn function when it receives x264 and AV1 videos. We can see a behavior similar to the one observed in Fig. 9: there is no relevant dependency on the decoder of the input video. The replicas in all workers consume close to 100% of the vCPUs to decode the videos and compute the VMAF score.
To sum up, when the videos are encoded using x264, the RAM consumption and CPU usage of the VMAF function are very similar to those obtained using AV1. Therefore, resource consumption in this case does not depend on the decoder (x264 or AV1).

Multiresolution Video Coding
In DASH, a chunk is encoded in multiple representations, for instance, at different resolutions and bit rates. The client requests the best representation according to the changing network conditions [29]. In this experiment, we generate multiple representations of each video chunk in each function invocation. Four videos (ANC, ELD, IND and SKA) have been encoded, generating multiple representations for each chunk in a single call. Each function invocation downloads the assigned chunk, encodes the video into 4 representations with different resolutions and bit rates, and then uploads all the generated representations using a single multipart/form-data POST request. In these experiments, we used cold start and the maximum number of replicas allowed was 8.
Listings 6 and 7 show sample calls used inside the x264-fn and av1-fn functions, respectively, to encode a 4K video chunk generating four resolutions with different bit rates. The bit rates per resolution used for x264 are those recommended by [10]; for AV1, the bit rates for the 480p, 720p, and 1080p resolutions are those recommended in the work of [6], while for the 2160p (4K) resolution the bit rates were taken from [20]. The adaptation of the FFmpeg call is performed in the body of the curl request using the extraEncoderParams field shown in Listing 4.
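A hedged sketch in the spirit of Listing 6 (the four output resolutions are those used in the experiment; the concrete bit-rate values and filter syntax here are illustrative, not the exact figures from [10]):

```bash
# Split the decoded 4K chunk into four branches, downscale three of them,
# and encode each branch with x264 at its own bit rate.
ffmpeg -y -i chunk.mp4 -filter_complex \
  "[0:v]split=4[a][b][c][d]; \
   [a]scale=-2:480[a0]; [b]scale=-2:720[b0]; [c]scale=-2:1080[c0]; [d]copy[d0]" \
  -map "[a0]" -c:v libx264 -b:v 1.8M chunk_480p.mp4 \
  -map "[b0]" -c:v libx264 -b:v 4M   chunk_720p.mp4 \
  -map "[c0]" -c:v libx264 -b:v 7.5M chunk_1080p.mp4 \
  -map "[d0]" -c:v libx264 -b:v 20M  chunk_2160p.mp4
```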
Comparing the times per encoder in Table 5, we can see that the mean upload times for AV1 are smaller than those for x264, as the sizes of the coded video chunks are smaller with the former. The mean encode times are bigger than those reported in Table 3 (the row with 8 replicas) because in this experiment we are encoding each video chunk at four resolutions. Besides, the upload times are also bigger, compared to those reported in Table 3, because now four encoded videos are uploaded.

Listing 6 Sample call made to ffmpeg in the x264-fn function to encode a chunk at different resolutions and bit rates

Listing 7 Sample call made to ffmpeg in the av1-fn function to encode a chunk at different resolutions and bit rates

Fig. 11 Sample pipeline to encode a video chunk and obtain the VMAF quality: (1) the event source requests the encoding of a chunk using x264; (2) this event is processed in the x264 function; when the video is uploaded, (3) an event is generated to compute the VMAF; finally, (4) the VMAF function is started to compute the quality metric between the original and the encoded videos

Event-driven Media Processing Pipelines
To demonstrate a simple event-driven media processing pipeline, we have added a flag to the data sent to the encoder functions, x264-fn and av1-fn, to signal whether a CloudEvent must be generated after the encoding process to compute the VMAF in the vmaf-fn function. In this case, x264-fn and av1-fn act as both event sinks and event sources (see Fig. 11).
In this scenario, the encoder functions:

1. receive a CloudEvent;
2. download the chunk;
3. encode the chunk;
4. upload the chunk;
5. if the computeVMAF flag is true, create a CloudEvent with all the information required by the vmaf-fn function to download the reference chunk and the just-encoded and uploaded chunk, and send it to the broker.
We have encoded the 94 ANC video chunks using a maximum of 12 encoder functions requesting 2 vCPUs each and a maximum of 6 VMAF functions requesting 4 vCPUs each (the total number of requested vCPUs is 48). At the beginning of the experiment, no replicas of the functions are running (cold start). In the body of the CloudEvent that triggers the pipeline, we request the computation of VMAF. Therefore, after the video has been encoded and uploaded to a server, a CloudEvent is generated and returned as a response. This event triggers the call to the VMAF function. The body of the generated CloudEvent contains information about the original video (the reference), the encoded video (the distorted), and the upload server that stores the result of the VMAF computation. When the VMAF function is executed, the reference and the distorted videos are downloaded in parallel, then the VMAF is computed, and the JSON file with the results is uploaded to the upload server.

Figure 12 shows the scheduling of jobs in the available replicas. Initially, 94 events are generated to encode all the chunks of ANC. As can be seen, 12 replicas are started to encode the chunks using x264. When a replica finishes an encoding, it fires a CloudEvent to compute the VMAF of the just-encoded chunk and receives another event to encode another chunk. We can observe that, when the first encoding jobs finish, vmaf-fn replicas must be started to compute the VMAF (up to the allowed maximum of 6). In Fig. 12, two jobs have been highlighted. When a replica finishes the encoding of chunk 7, it fires a CloudEvent to compute the VMAF and a replica with vmaf-fn must be started (cold start). On the contrary, when another replica finishes the encoding of chunk 62 and fires the CloudEvent to compute the VMAF, the event is immediately received by a running vmaf-fn replica (warm start).
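The bodies of the two CloudEvents in this pipeline might look as follows (every field name here is an assumption of ours except computeVMAF, which the paper names explicitly):

```
# Event that triggers the pipeline (Ce-Type: x264), sent to the broker:
{
  "chunkUrl":    "http://download-server/ANC/chunk_0007.mp4",
  "uploadUrl":   "http://upload-server:8080/upload",
  "computeVMAF": true
}

# Event generated by x264-fn after uploading (Ce-Type: vmaf):
{
  "referenceUrl": "http://download-server/ANC/chunk_0007.mp4",
  "distortedUrl": "http://upload-server:8080/videos/ANC/chunk_0007_x264.mp4",
  "uploadUrl":    "http://upload-server:8080/upload"
}
```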

Comparison of Serverless Video Coding Proposals
In this section, we compare previous work with our proposal. As we have seen in Section 2, the existing works related to serverless video encoding are very different from the proposed architecture. For this reason, we provide a qualitative comparison of the benefits and limitations of our architecture with respect to previous works, summarized in Table 6. The information shown in Table 6 has been obtained from the papers studied in the related work section. Some cells contain the code N/A because the corresponding information could not be found in the respective papers.
As we have discussed in Section 2, there are few studies on this topic. For this reason, our proposal can only be compared with [5] and [1]. Both previous works use Amazon's serverless platform, AWS Lambda, to execute their proposals, and therefore rely on a platform with virtually unlimited resources. In our proposal, we have used Knative deployed on our own platform, with limited resources, but which allows us to have full control of the system. The Mu framework is used by [5] and [1] for managing parallel executions, while our proposal uses CloudEvents to trigger the execution of functions deployed on the Knative open-source serverless platform. The use of the Mu framework is nowadays very marginal, as the project was abandoned in 2018 and has not been supported since.
Fig. 12 Scheduling of jobs in the x264 encoder functions and in the VMAF functions. The maximum number of replicas is 12 for x264 and 6 for VMAF. The length of each segment is proportional to the duration of the job, and the number under each segment is the chunk number processed in the corresponding replica (indicated on the left)

If we analyze the presented video encoding systems in detail, there are major differences between previous proposals and our system. In particular, the proposal in [5] only supports the VP8 codec, which is rarely used nowadays, mainly due to the compatibility offered by the H.264 (x264) and VP9 (the evolution of VP8) codec families. In addition, this codec does not perform well for ultra-high resolutions such as 4K, for which VP9 or AV1 show better performance. In our proposal, we have used two codecs: x264 and AV1. The first one, x264, was selected because it is the most used codec nowadays, although for 4K or higher resolutions there are codecs that work better. AV1 was selected because it is intended for ultra-high resolutions, such as 4K, 8K and 16K. Our proposal and [5] were tested using 4K videos, while [1] uses videos with resolutions of 1080p and 720p. With respect to the chunk size, [1] and our proposal use 2-second chunks, while [5] uses 4-second chunks divided into micro-chunks of six frames each. In both cases, sizes of around 2 to 4 seconds are used, which is a good compromise between encoding efficiency and flexibility for stream adaptation to bandwidth changes [17]. Table 6, row "Modified codec", shows whether the video codec must be changed for each proposal to work. On the one hand, [5] needs to modify the VP8 codec to merge the micro-chunks, which limits the applicability of the proposal to other codecs. On the other hand, [1] and our proposal do not require any codec adaptation, as the video coding functions are based on the FFmpeg tool.
Another aspect analyzed in Table 6 is whether additional serverless functions, beyond the coding functions, are defined. [1] introduces other functions, such as face detection, into its pipeline. In our proposal, a video quality function has also been included, where the VMAF per chunk is calculated from a reference video and an encoded video. According to the papers, [5] does not support pipelines, while [1] and our proposal show how to build pipelines and measure their performance. The only work designed for an environment with limited resources is the proposal presented in this paper; the previous proposals were developed in environments where the only limitation is the monetary cost. Finally, our proposal presents a detailed study of the scalability and performance of the system. This type of study was also performed in [1], but not in as much depth, while [5] presents a very light study of the performance and scalability of their proposal.

Conclusions and Future Work
As we can see in our daily lives, the consumption of multimedia resources through the Internet carries great weight. Movies, series, and TV shows are consumed through commercial platforms that use HAS as the streaming protocol. Therefore, these platforms must prepare the contents using encodings that are as efficient as possible. This leads to parallelizing jobs and using the Cloud to save costs.
Considering that these streaming platforms are capable of generating thousands of hours of content per month, dynamic and elastic solutions are needed to perform the coding tasks. For that reason, an encoding system based on the FaaS paradigm on a serverless platform can be an optimal solution.
In this work, three event-driven serverless functions (x264, AV1 and VMAF) have been developed and encapsulated in three small containers. Their behavior has been analyzed when deployed in an on-premises resource-constrained serverless platform managed by Knative. The experiments analyze the scalability and resource consumption for various container states (cold start and warm start), varying the maximum number of replicas and the resources allocated to them (fat and slim function replicas). The results of the different tests carried out show the good performance of the proposed serverless functions, scaling the replicas and distributing the jobs evenly among them. Results show that using slim replicas the total encoding time is reduced by 18%, but fat replicas would be more adequate in live video streaming scenarios because the encoding time per chunk is decreased. We have also shown that it is possible to perform a multiresolution encoding of each chunk per function call; this has been assessed by encoding four videos at four resolutions with x264 and AV1.
Finally, we have shown that it is possible to build pipelines to perform several stages in the computation by chaining CloudEvents. In this paper, we have presented a sample pipeline of events to encode video chunks and obtain the encoded video quality from a single event. This pipeline has been tested, and the results show an appropriate distribution and chaining of the jobs among the available replicas of each function type.
As future work, we will explore the development of new functions to support more encoders like VP9, VVC, LCEVC, etc. We are also working on the development of a serverless platform for field-of-view (FoV) based encoding of 360VR videos.