Application Level Resource Scheduling for Deep Learning Acceleration on MPSoC

Deep Neural Networks (DNNs) have been widely used in many applications, such as self-driving cars, natural language processing (NLP), image classification, and visual object recognition. Field-programmable gate array (FPGA) based Multiprocessor Systems on a Chip (MPSoCs) have recently become a popular choice for deploying DNN models. However, the limited resource capacity of an MPSoC poses a challenge for practical implementation. Recent studies have revealed a trade-off between the "resources consumed" and the "performance achieved". Taking a cue from these findings, this paper addresses the efficient implementation of deep learning on a resource-constrained MPSoC, where each deep learning network runs at one of several service levels defined by resource usage (a higher service level implies higher performance with increased resource consumption). To this end, we propose a heuristic-based strategy, Application Wise Level Selector (AWLS), which selects service levels to maximize the overall performance subject to a given resource bound. Across various simulation scenarios, AWLS achieves performance comparable to that of an optimal ILP-based technique within a constrained resource budget. Further, we verify the proposed strategy using an AMD-Xilinx Zynq UltraScale+ XCZU9EG SoC. Using a framework designed to deploy multiple DNNs on multiple DPUs (Deep Learning Processor Units), we show that the combination of service levels selected by the algorithm obtains the highest performance (Frames Per Second, FPS) among the tested combinations using the same resource budget.


Introduction
Deep Neural Networks (DNNs) have been widely used in image classification and Natural Language Processing (NLP) applications in the last decade. As layer interconnections and weights have grown in number and complexity, the accuracy of new DNN models has improved greatly. However, although these models deliver state-of-the-art accuracy, their run-time cost has also increased significantly.
In a number of fields, including computer vision, bioinformatics, NLP, and robotics, to name a few, deep learning has recently become the de facto methodology [1]. Its success can be attributed to its capacity to draw knowledge from vast amounts of data. The Internet of Things (IoT) is another area well known for producing enormous amounts of data. Owing to recent reductions in the size of low-power embedded devices and advances in the optimization of machine learning (ML) algorithms, tiny machine learning (TinyML) is also emerging as a new IoT prospect that calls for placing the ML algorithm within the IoT device [2].
Traditionally, DNN models are deployed on GPUs and CPUs. However, due to resource constraints in many IoT devices, a widespread approach is to implement DNNs on an ASIC (application-specific integrated circuit) or an FPGA (field-programmable gate array). An ASIC usually requires a long development cycle and a high production cost, and it is unsuitable for applications that need flexibility. Therefore, to maximize the flexibility and performance of the application at run-time, FPGAs are usually a better choice due to their reconfigurability. When deploying DNN models on FPGAs, the balance between performance and run-time cost (such as power consumption) should be considered. Although FPGAs provide a swift hardware resource allocation mechanism, the total hardware resources are limited. Usually, models with a similar network structure perform better and consume more energy if they use more computing and memory units. Sometimes the performance of an embedded application has the higher priority; in other cases, the power consumption can be lowered at the cost of some acceptable accuracy loss.
Many researchers have focused on modifying the network to achieve high performance with limited resources. Their main objective was to reduce its size by pruning and quantization. Recently, other approaches have modified the network so that models fit a specific hardware platform. In [3], the researchers propose a framework that trains the network with flexible structure parameters (i.e., kernel size, depth, width, and channel numbers) to obtain a super-network containing 2 × 10^6 sub-networks; using a network search strategy, they can then select the best network for a specific hardware platform. In [4], the researchers modify the search algorithm of OFA (once for all) and propose a dynamic network search strategy that finds a set of networks based on the accuracy and latency of the OFA super-network.
Another research direction focuses on hardware/software co-design, adjusting the hardware/software resources in a customized way with FPGAs at the design stage. For example, in [5][6][7][8], researchers develop efficient design methodologies that consider hardware, software, and DNN structures together in the network design or training stage.
Though FPGAs are becoming a popular choice for DNN tasks, resource constraints are a common bottleneck. In [9,10], the authors assume that the computing server has sufficient FPGA resources to extract intermediary features using deep learning layers. However, these assumptions are violated in many real-life cases. For example, in a resource-constrained IoT environment [11], successful completion of the application is more critical than achieving higher performance [11,12]. Hence, to successfully execute deep learning in a resource-constrained FPGA-based system, we consider each deep learning network to be equipped with multiple distinct implementations represented by "service levels". Each implementation produces the same prediction or classification result but with a different performance level (e.g., Frames Per Second, FPS). A higher service level normally returns higher performance but at the cost of increased resource utilization.
The research findings in [13] support the concept of distinct service levels for deploying deep learning networks. In that work, the authors found that the weight parameters contribute most to the memory footprint. Furthermore, they show that representing 20% of the weight parameters at reduced precision results in a 1% performance loss. Taking a cue from these findings, we assume that, depending on the available resource budget, each deep learning network on an FPGA platform can be executed at a particular service level chosen to achieve higher performance.
In this paper, we propose a strategy for efficiently implementing deep learning in FPGA-based systems, where multiple DPUs are used to execute multiple neural networks at the application level. Further, each DNN can be executed at different service levels to achieve optimal performance.
We specifically address the following question: How can we guarantee that multiple DNNs are effectively executed at specific service levels while maximizing the overall performance (FPS), given the resource constraints of the DPUs in the FPGA? To this end, we propose a heuristic-based strategy, Application Wise Level Selector (AWLS). This scheduling strategy is incorporated into, and further verified using, a physical FPGA-based hardware/software co-design framework. The framework, which is based on a Zynq UltraScale+ XCZU9EG multiprocessor system on a chip (MPSoC), is used to configure the "service level" of different DNN applications; by analyzing the data recorded with it, we can also calculate the overall "performance" and obtain the DPU "resource". Given the "resource" and "performance" of each DNN model at each "service level", the proposed strategy finds an optimal solution for a multi-DNN application. The results obtained from the real framework follow a trend similar to that observed in software simulation.
The contributions of this work are summarized as follows:
• Formulating the problem and developing a heuristic, namely AWLS, for selecting service levels for deep learning applications.
• Evaluating the proposed heuristic strategy with simulation experiments and comparing it with the optimal ILP-based technique. As a result, we found that the performance of the proposed heuristic is comparable to that of ILP.
• Proposing a framework for deploying deep learning in FPGA-based MPSoC systems with multiple service levels.
• Demonstrating the proof-of-concept of the proposed strategy by implementing a multi-DNN application on an MPSoC.

System Model
We assume an FPGA-based system in which each FPGA may contain multiple DPUs. In the given edge computing environment, let $A$ denote the set of $N$ applications (DNNs) executing on the FPGA: $A = \{A_1, A_2, \ldots, A_N\}$. Depending on the amount of resources allocated, each application can execute at different service levels. Each DNN executes at exactly one of the $q$ possible service levels, i.e., $l_i = \{l_i^1, l_i^2, \ldots, l_i^q\}$. Hence, the $j$-th service level of $A_i$ is denoted by $l_i^j$. The service provided by a level is proportional to its level ID; thus, 1 denotes the lowest and $q$ the highest execution level.
The higher the service level, the higher its resource consumption. This resource consumption can be expressed in terms of a hardware resource, e.g., utilization. On the other hand, executing the network at a higher service level yields higher performance. $Res_i^j$ denotes the resource consumed by $A_i$ while it executes at the $j$-th service level [14]. Similarly, we assume that a performance value $Per_i^j$ is credited to $A_i$ if the FPGA successfully executes the deep learning network at the $j$-th service level by fulfilling its resource demand. The overall resource budget $\vec{R}_{total}$ is fixed for the hardware. The detailed calculation is provided in Section 2.2. Given $\vec{R}_{total}$, each application has to complete the execution of its deep learning network by selecting a service level.
Table 1 lists all the acronyms used in this paper and their explanations.

Mathematical Representation of the Problem
In this section, we formalize the problem based on the system model described in the previous section. For this purpose, we define a binary decision variable $Z = \{Z_i^j : i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, q\}$. Here, the indices $i$ and $j$ denote an application and its selected service level ID, respectively. $Z_i^j = 1$ if application $A_i$ executes at the $j$-th service level and obtains the performance value $Per_i^j$; $Z_i^j = 0$ otherwise. We now present the constraints on the decision variable before presenting the overall objective function.
1. Overall resource budget constraint: The resources used to execute the deep learning networks must stay within the total hardware resource budget ($\vec{R}_{total}$). That is, the total amount of resources used by the available accelerators must not exceed the budget allocated for the hardware. The following equation imposes this restriction:
$$\sum_{i=1}^{N} \sum_{j=1}^{q} Res_i^j \cdot Z_i^j \le \vec{R}_{total} \tag{1}$$
where $Res_i^j$ refers to the resource utilization when application $A_i$ executes at the $j$-th service level.
2. Unique service level execution constraint: Each application is allowed to execute its deep learning network at exactly one service level. That is,
$$\sum_{j=1}^{q} Z_i^j = 1, \quad \forall i \in \{1, 2, \ldots, N\} \tag{2}$$

Objective:
The objective of the formulation is to choose a feasible solution that maximizes the overall performance of the prediction/testing process through the appropriate choice of service levels. Hence, the objective can be written as follows:
$$\text{maximize} \quad \sum_{i=1}^{N} \sum_{j=1}^{q} Per_i^j \cdot Z_i^j \tag{3}$$
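As an illustration of the formulation, the following minimal sketch enumerates every assignment of one service level per application, discards those that exceed the budget, and keeps the best-performing one. It is an exhaustive search standing in for an ILP solver, practical only for small instances; the function name `best_assignment` and all numbers are hypothetical and are not taken from the tables of this paper.

```python
from itertools import product


def best_assignment(res, per, budget):
    """Exhaustively search service-level assignments (illustrative only).

    res[i][j] and per[i][j] hold the resource and performance of
    application i at service level j (0-based indices here).
    Returns (best_total_performance, levels), or (None, None) when no
    assignment fits the budget.
    """
    best_perf, best_levels = None, None
    # Picking exactly one level per application enforces the
    # unique-service-level constraint by construction.
    for levels in product(*(range(len(r)) for r in res)):
        used = sum(res[i][j] for i, j in enumerate(levels))
        if used > budget:  # overall resource budget constraint
            continue
        total = sum(per[i][j] for i, j in enumerate(levels))
        if best_perf is None or total > best_perf:
            best_perf, best_levels = total, levels
    return best_perf, best_levels


# Hypothetical example data: three applications, three levels each.
res = [[5, 9, 14], [6, 10, 15], [4, 8, 12]]
per = [[10, 16, 20], [12, 18, 22], [8, 15, 19]]
print(best_assignment(res, per, budget=30))
```

The search space grows as q^N, which is one reason a fast heuristic is attractive when the number of applications or levels increases.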

AWLS: Application Wise Level Selector Heuristic
In this section, we propose a heuristic strategy named the Application Wise Level Selector (AWLS). It is a fast yet efficient greedy strategy that proceeds level by level so that higher overall performance can be achieved. To achieve a good result, AWLS must be aware of the remaining resources at every stage of the algorithm. It must also be aware of the individual performance of each application during service level enhancements, and it needs to account for the incremental resources required by each such enhancement.
To achieve this objective, we transform the parameter "performance" into "PPR" (Performance Per Unit Resource), i.e., $PPR_i^j = Per_i^j / Res_i^j$, and define two new factors. The first factor is termed the IGF (Immediate Gain Factor). IGF is the difference in PPR between the current level ($l$) and the immediate next higher level ($l + 1$). Thus, the Immediate Gain Factor ($IGF_i$) for an application $A_i$ can be calculated as:
$$IGF_i = PPR_i^{l+1} - PPR_i^{l} \tag{4}$$
Similarly, we define another factor called the OGF (Overall Gain Factor). OGF is the difference in PPR between the current level ($l$) and the maximum possible service level ($q$). Thus, the Overall Gain Factor ($OGF_i$) for $A_i$ can be calculated as:
$$OGF_i = PPR_i^{q} - PPR_i^{l} \tag{5}$$
Based on these derived factors, AWLS generates a key term called the "Decision Factor (DF)". While selecting a level, the applications are maintained in a max-heap. An application is selected from the heap for a level upgrade based on its DF, which is defined for each $A_i$ as follows:
$$DF_i = \max(IGF_i, OGF_i) \tag{6}$$
It can be observed that $DF_i$ provides an appropriate balance between IGF and OGF.
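The three factors can be sketched in a few lines. This is a minimal sketch under two assumptions that should be checked against the original equations: PPR is taken as performance divided by resource at the same level, and DF is taken as the larger of IGF and OGF (as the `max( )` in the worked example suggests). The function names and the sample numbers are hypothetical.

```python
def ppr(per, res, level):
    """Performance per unit resource at a given 0-based level (assumed form)."""
    return per[level] / res[level]


def gain_factors(per, res, level):
    """Return (IGF, OGF, DF) for one application at a 0-based `level`.

    per[j] and res[j] are the performance and resource values of the
    application at level j. The application must not already be at its
    highest level, since no upgrade is possible from there.
    """
    top = len(per) - 1
    if level >= top:
        raise ValueError("application is already at its highest level")
    current = ppr(per, res, level)
    igf = ppr(per, res, level + 1) - current  # gain of the next level
    ogf = ppr(per, res, top) - current        # gain of the highest level
    return igf, ogf, max(igf, ogf)            # assumption: DF = max(IGF, OGF)


# Hypothetical values chosen so that the divisions are exact.
per_example = [8, 18, 30]
res_example = [4, 6, 8]
print(gain_factors(per_example, res_example, 0))
```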

Implication of Decision Factor (DF)
Let us consider two applications, $A_i$ and $A_j$, that are currently executing their deep learning networks. $A_i$ is executing at service level $l$ and $A_j$ at level $l'$. Let us assume that, at the selected levels, the immediate gain for $A_i$, i.e., $IGF_i$, is lower than $IGF_j$, the immediate gain for $A_j$. On the other hand, $OGF_i \gg OGF_j$. In this case, if the OGF values are not considered as a part of the Decision Factor (DF), $A_j$ will be selected over $A_i$ for a one-level upgrade, despite the fact that $OGF_i$ is much higher than $OGF_j$. In the worst case, $A_i$ will hardly get an opportunity for a level upgrade in spite of its high $OGF_i$. Hence, in such scenarios, DF plays an important role.

Working Principle of AWLS
The working strategy of AWLS is described as follows. As we can observe, in line 1, AWLS calculates the Decision Factor ($DF_i$) for all applications. Based on the calculated DF values, AWLS constructs a max-heap (ref. line 3). Initially, the service level of every application is set to one. From line 7 to line 11, AWLS iteratively increments (updates) the service levels of the applications until the max-heap ($H$) is empty. In each iteration, AWLS extracts $A_i$ from the root of $H$ (ref. line 8). If the Increment in Resource (IR) is greater than the available resource budget (RES_BGT), then $A_i$ is removed from the max-heap. Otherwise, AWLS increments its service level by 1, updates the remaining resource budget, and recalculates its DF value in line 12. If any $A_i$ reaches its highest possible service level (ref. line 11), then the application is removed from the max-heap.
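The steps described above can be sketched as follows. This is a minimal sketch rather than the algorithm listing itself: it assumes that DF is the larger of the two PPR differences (IGF and OGF), that the lowest-level cost of every application is charged against the budget before any upgrade, and that an application whose next upgrade does not fit is dropped from the heap for good. The names `decision_factor` and `awls` and the sample numbers are hypothetical.

```python
import heapq


def decision_factor(per, res, level):
    """Assumed DF: max of IGF and OGF, both differences in Per/Res."""
    current = per[level] / res[level]
    igf = per[level + 1] / res[level + 1] - current
    ogf = per[-1] / res[-1] - current
    return max(igf, ogf)


def awls(res, per, budget):
    """Greedy level selection; returns (levels, remaining_budget), 0-based levels."""
    n = len(res)
    levels = [0] * n  # every application starts at its lowest level
    # Assumption: the lowest levels are paid for up front.
    remaining = budget - sum(r[0] for r in res)
    if remaining < 0:
        raise ValueError("budget cannot host every application at its lowest level")
    # heapq is a min-heap, so DF is negated to pop the largest DF first.
    heap = [(-decision_factor(per[i], res[i], 0), i)
            for i in range(n) if len(res[i]) > 1]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        nxt = levels[i] + 1
        ir = res[i][nxt] - res[i][levels[i]]  # increment in resource (IR)
        if ir > remaining:
            continue  # upgrade does not fit: application leaves the heap
        levels[i] = nxt
        remaining -= ir
        if nxt < len(res[i]) - 1:  # not yet at the highest level
            heapq.heappush(heap, (-decision_factor(per[i], res[i], nxt), i))
    return levels, remaining


# Hypothetical example data: three applications, three levels each.
res_demo = [[5, 9, 14], [6, 10, 15], [4, 8, 12]]
per_demo = [[10, 16, 20], [12, 18, 22], [8, 15, 19]]
print(awls(res_demo, per_demo, budget=30))
```

Each heap operation costs O(log N) and every application is upgraded at most q − 1 times, so the sketch runs in O(N q log N) time.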

AWLS at Work
In this section, we illustrate the working mechanism of AWLS through an example. Let us assume that there exist three applications, i.e., $A_1$, $A_2$, and $A_3$, inside the FPGA. The resource demand ($Res_i^j$) and the corresponding performance ($Per_i^j$) value for each service level are provided in Table 2. We also assume that the available overall resource budget (RES_BGT) is 35. AWLS begins its operation by calculating the DF values using Eq. 6. The initial DF values can be calculated as follows: … ) = 0.8, $DF_2$ = max( ) = 0.67. The max-heap $H$ is constructed using these DF values. $A_2$ has the highest DF value and is extracted from the heap. RES_BGT is updated as (35 − 14) = 21 and $Curr_2^{lev}$ becomes 2. Now, $DF_2$ is recalculated as $DF_2$ = max( ) = 1.75, and again $A_2$ has the highest DF value. Hence, the IR becomes (18 − 14) = 4; the condition is satisfied, the remaining resource becomes RES_BGT = 17, and $Curr_2^{lev}$ becomes 3. $A_2$ is discarded from further consideration for service level upgrades as it has reached its highest level, 3. Now, $A_1$, which has the highest DF value, is extracted from the heap; after similar iterations, $Curr_1^{lev}$ becomes 3 and the remaining resource is updated as RES_BGT = 17 − 7 = 10. $A_1$ has also reached its highest possible level and is hence discarded from further consideration. The next iterations follow for $A_3$, and the level of $A_3$ is upgraded accordingly. AWLS terminates when the heap becomes empty. The final result is shown in Table 3.

Performance Evaluation of AWLS
The performance of the proposed AWLS has been evaluated using simulation-based experiments. We have also compared the performance of the proposed heuristic with the optimal ILP-based strategy. In this experimental scenario, we consider that the FPGA can execute deep learning networks at 5 distinct service levels and that the FPGA consists of 2 DPUs, as shown in [15]. The area consumption and corresponding performance values have been taken from [16].

Results
Experiments have been conducted to evaluate the performance of the two strategies, i.e., the ILP-based technique and AWLS, using different performance metrics under varying scenarios. The performance metrics considered for the evaluation are:
1. Average service level allocated to each application.
2. Normalized Obtained Performance (NOP): NOP is defined as the ratio between the performance ultimately achieved by all the applications and the maximum achievable performance, obtained by executing each application at its highest service level. Mathematically, NOP can be formulated as:
$$NOP = \frac{\sum_{i=1}^{N} \sum_{j=1}^{q} Per_i^j \cdot Z_i^j}{\sum_{i=1}^{N} Per_i^q} \tag{7}$$
Figure 1 plots the average levels allocated to each application by both strategies, i.e., the ILP-based strategy and AWLS, as the overall resource budget (RES_BGT) varies from 40% to 100% of the total available resource budget. It may be observed from the figure that the average level allocated to each application increases with the available overall resource budget. This is because the average resource that an application may use increases as the total available budget increases.
Although the trends for both allocation strategies in Fig. 1 are mostly similar, AWLS is seen to allocate slightly lower average levels than the ILP-based technique in all scenarios. However, this difference decreases as the resources increase. Hence, the performance of both strategies becomes comparable when an adequate amount of resources is available. This could be attributed to the fact that, as the resource available to each application increases, the difference between the values of IGF and OGF also increases. Hence, DF plays a significant role in level selection, and AWLS makes judicious level selection decisions that are close to the optimal.
Figure 2 depicts the NOP achieved by both strategies as the overall resource budget (RES_BGT) varies from 40% to 100% of the total available resource budget. It may be observed from the figure that the aggregate NOP obtained by both strategies increases with the available RES_BGT. This is because the NOP obtained by a strategy is directly proportional to the service levels achieved by the applications, and the obtained levels increase with the available resource budget (as shown in Fig. 1). Additionally, it may be observed from the figure that, as the available resource budget approaches the total, the performance difference between the two strategies becomes negligible.
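The NOP metric used in this section can be computed directly from a table of performance values and the levels selected by a strategy. The following minimal sketch assumes that the last entry of each application's list is its highest service level; the function name and the numbers are hypothetical and are not the values used in the simulations.

```python
def normalized_obtained_performance(per, levels):
    """Ratio of achieved performance to the all-highest-level performance.

    per[i][j] is the performance of application i at level j (0-based);
    levels[i] is the level selected for application i.
    """
    achieved = sum(per[i][lvl] for i, lvl in enumerate(levels))
    maximum = sum(p[-1] for p in per)  # every application at its top level
    return achieved / maximum


# Hypothetical example data.
per_nop = [[10, 16, 20], [12, 18, 22], [8, 15, 19]]
print(normalized_obtained_performance(per_nop, [1, 1, 1]))
```

By construction the value equals 1 only when every application runs at its highest service level.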

Implementation of the Proposed Framework
To further verify the proposed strategy, we implemented a framework using a ZCU104 development board equipped with a Zynq UltraScale+ XCZU9EG MPSoC. In our previous work [17], a video analysis system was designed using this framework; in this paper, we further explore resource scheduling algorithms to achieve optimal performance. The AMD-Xilinx DPU IP module and the Vitis-AI library [18,19] are used in this framework. Figure 3 shows an overview of the framework. We now discuss the different components of the proposed framework.

Vitis AI Run-time (VART)
Vitis AI Runtime (VART) is the part of the Vitis-AI software that enables applications to interact with the hardware by calling a unified high-level API. VART offers asynchronous submission and collection of jobs to and from the accelerator and supports multi-threaded and multi-process execution [19].

DPU (Deep Learning Processor Unit)
The DPU is an AMD-Xilinx hardware IP core that supports DNN instructions compiled from conventional DNN development frameworks (e.g., PyTorch, TensorFlow, etc.) using the Vitis-AI toolchain.
There are eight different DPU architectures in the Vitis-AI library, each configured according to three dimensions of parallelism: pixel parallelism (PP), input channel parallelism (ICP), and output channel parallelism (OCP). Figure 4 shows an example of the three dimensions, in which the pixel parallelism (PP) is 2 and the input channel parallelism and output channel parallelism are both equal to 3. Due to the nature of the calculation, the input channel parallelism is always equal to the output channel parallelism. In general, the larger DPU architectures achieve better throughput than the smaller ones at the cost of more hardware resources. Table 4 lists all DPU architectures and their parallelism parameter configurations.
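To make the relation between the three parallelism dimensions and throughput concrete, the sketch below computes the peak operations per clock cycle as PP × ICP × OCP × 2, counting each multiply-accumulate as two operations. The architecture names and (PP, ICP, OCP) triples are quoted from the DPU product documentation as recalled here and should be checked against Table 4; the 300 MHz clock in the usage line is a hypothetical value, not a setting reported in this paper.

```python
# (PP, ICP, OCP) per architecture; to be verified against Table 4.
DPU_ARCHS = {
    "B512": (4, 8, 8),
    "B800": (4, 10, 10),
    "B1024": (8, 8, 8),
    "B1152": (4, 12, 12),
    "B1600": (8, 10, 10),
    "B2304": (8, 12, 12),
    "B3136": (8, 14, 14),
    "B4096": (8, 16, 16),
}


def peak_ops_per_cycle(pp, icp, ocp):
    """Peak operations per clock: one multiply and one add per MAC unit."""
    return pp * icp * ocp * 2


def peak_gops(arch, clock_mhz):
    """Theoretical peak throughput in GOP/s at a given DPU clock (MHz)."""
    return peak_ops_per_cycle(*DPU_ARCHS[arch]) * clock_mhz / 1000.0


for name, dims in DPU_ARCHS.items():
    print(name, dims, peak_ops_per_cycle(*dims))
print(peak_gops("B4096", clock_mhz=300))  # hypothetical clock frequency
```

Under this convention the number in each architecture name is its peak operation count per cycle.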

Proposed Framework
The framework has two parts: the hardware designed on the PL (Programmable Logic)/FPGA side and the Linux system designed on the PS (Processing System)/CPU side. A DFX (Dynamic Function eXchange) hardware platform is designed with the DPU IP, and the DPU is placed in a dedicated reconfigurable region to enable real-time partial reconfiguration. The hardware platform is exported to PetaLinux, the official AMD-Xilinx Linux system design tool, with which an embedded Linux system with a configurable hardware setting is created. On this system, we can access the DPUs and send tasks to them via VART, so various DNN model-based applications can be designed and tested on the board. DNN model-based applications run on both the ARM-based CPUs and the DPU accelerators on the FPGA. As can be seen in Fig. 5, the CPU and the FPGA are physically connected through the high-speed AXI (Advanced eXtensible Interface) protocol, which enables high-bandwidth data movement between hardware and software. For example, the instruction fetch unit in the DPU fetches the instructions of the DNN models through VART and XRT (i.e., Xilinx Runtime), and the sequence of instructions is then loaded from DDR memory to the computing unit of the DPUs. A DPU "instruction" is a basic operator for DPU arithmetic calculation; a "convolution operation", for example, is carried out as a sequence of such instructions.
In the proposed framework, 2 DPU IP cores are configured to run different DNN models for image classification applications. The three DNN models are compiled into different arithmetic operations, which the two DPUs then process in order.

Exhibition of the Proof-of-Concept
Before presenting our proof-of-concept study, we revisit the problem description from a physical implementation perspective. We have several different DNN-based models running at various service levels. A single application at a higher service level consumes more DPU utilization, and the individual cost varies with the scale of the deployed DNN model. Our goal is to make full use of the DPU resources to achieve optimal running performance. A number of onboard tests are designed to verify the proposed strategy, and the proposed DNN-based multi-application framework is implemented using an AMD-Xilinx ZCU104 development board. In the tests, we run three different DNN models for image classification applications (e.g., ResNet50, ResNet18, and MobileNet) and record the system metrics in real time, including the FPS of each application, the peak GOP (Giga [billion] Operations Per Second) of the DPU, and the total time consumption. The following sections introduce the details of the experiment and the definitions of "service level", "resource", and "performance" in the experiments carried out on the physical FPGA.

Representation of "Service Levels"
In the proposed experiment, the number of threads is chosen to represent the different service levels. The experiment aims to classify images, and multi-threading enables the application to process multiple images simultaneously. When an application runs with multiple threads, the quad-core CPU (ARM Cortex™-A53) can send more image data to the DPUs simultaneously, and thus more DPU utilization is allocated. To highlight the difference in performance at each service level, the following equation is used to relate the number of threads to the service level:

Representation of "Resource" in MPSoC
DPU utilization is used to represent the notion of resources on the physical MPSoC. First, a DPU performance benchmark is designed to use the DPU resource fully. The benchmark generates synthetic data and keeps the DPUs fully utilized at all times; as shown in Fig. 6, the DPU is processing CONV operations all the time. Second, a DL-based image classification application is tested with different numbers of threads (e.g., 1-8 threads). In this case, the DPUs are idle at times while waiting for new data from the CPU; therefore, there is a gap between two basic DPU operations (CONV), and the total utilization of the DPU is less than 100%. For instance, Fig. 7 compares the DPU utilization of the two-thread and four-thread settings. In the same period of time, the DPU processes only 6 CONV operations when the application runs with 2 threads and 8 when it runs with 4 threads.
The GOP is used to derive a proper "resource" value. We choose the average GOP over time (GOP/s) as the standard unit for DPU utilization and take the average GOP/s in the benchmark as 100% DPU resource usage. To obtain the resource usage at different service levels for the various DNN models, we compare the average GOP/s in each DL image classification task with the benchmark result.
Thus, we obtain the "resource" values of the three DNN models at different service levels, presented in Table 5.
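The normalization described above amounts to a simple ratio, sketched below. The function names are hypothetical and the sample figures are invented for illustration; they are not the measurements behind Table 5.

```python
def average_gops(total_giga_ops, elapsed_seconds):
    """Average throughput in GOP/s over a measurement window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_giga_ops / elapsed_seconds


def resource_percent(app_gops, benchmark_gops):
    """DPU resource of a task, with the benchmark's average GOP/s as 100%."""
    return 100.0 * app_gops / benchmark_gops


# Hypothetical measurements: a saturating benchmark and one application run.
bench = average_gops(total_giga_ops=12000.0, elapsed_seconds=10.0)
app = average_gops(total_giga_ops=3000.0, elapsed_seconds=10.0)
print(resource_percent(app, bench))
```

Repeating the application measurement for each model and thread count yields one resource value per (model, service level) pair.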

Representation of the "Performance" in Experiment
To define the actual onboard performance for each service level, FPS is used as the main performance metric in the proposed experiment. The equation below explains how the FPS is calculated, where $I$ denotes the number of images processed in each thread, $T$ denotes the total number of threads used, and $P$ denotes the total time consumption.
$$FPS = \frac{I \times T}{P} \tag{9}$$

Result and Analysis
After processing the data in the experiment, we calculate the optimal result using both the ILP and AWLS algorithms (RES_BGT = 260), and the result is the same as in the simulation; see Table 7.
An image classification application with three different DNN models is designed in the proposed framework to verify the result given by the proposed scheduling strategy.
In this experiment, we test three different DNN combinations with various service levels at runtime.Then we measure the performance to verify whether the output combination selected by the proposed scheduling strategy can achieve the best result.
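A measurement of this kind can be sketched as follows, using the FPS definition of this paper (images per thread times number of threads, divided by the total time). The sketch is hypothetical: `classify` is a placeholder for one inference call and does not represent the VART API, and Python threads are used only to mirror the thread-based service levels.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fps(images_per_thread, threads, total_seconds):
    """FPS = I * T / P, with I images per thread, T threads, P seconds."""
    return images_per_thread * threads / total_seconds


def measure_fps(classify, images_per_thread, threads):
    """Run `classify` I times in each of T threads and return the FPS."""

    def worker():
        for _ in range(images_per_thread):
            classify()  # placeholder for one image classification

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker) for _ in range(threads)]
        for future in futures:
            future.result()  # re-raises any exception from a worker
    elapsed = time.perf_counter() - start
    return fps(images_per_thread, threads, elapsed)


# Example with a dummy workload standing in for a DNN inference.
print(measure_fps(lambda: time.sleep(0.001), images_per_thread=20, threads=4))
```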
As per Table 6, combination-A, suggested by the strategy, sets D1, D2, and D3 to service levels 3, 1, and 3, respectively. We also set two other combinations according to the resource budget: combination-B sets D1, D2, and D3 to service levels 1, 3, and 3, respectively, and combination-C sets D1, D2, and D3 to service levels 2, 3, and 2, respectively.
Figure 8 illustrates that combination-A (chosen by the proposed strategy) obtains the highest score compared to combination-B and combination-C, while the resource budget (RES_BGT = 260) remains the same. As we can observe, in combination-A the deep learning models run at different service levels. Another interesting observation can be drawn from this figure: although combination-B obtains a lower score, its resource consumption is significantly lower. This demonstrates the efficacy of the proposed idea of "service levels": under stringent resource constraints, our strategy is able to select a different combination of service levels for the deep learning models with only a small difference in score. It is to be noted that the performance on the physical test bed for the multi-application case does not match the "score" obtained by the algorithm (AWLS). This is mainly because FPS is used to describe the performance, and FPS does not follow a simple linear relationship when multiple DNNs are deployed. The algorithm can guide us to the optimal combination for the applications but cannot predict the output FPS (see Table 7).

Conclusion
This work introduces a new concept for the efficient implementation of deep learning with multiple service levels on FPGAs. The problem has been formulated as an optimization problem in which each DNN can be executed at different service levels that exhibit performance vs. resource trade-offs. A heuristic strategy (AWLS) has been proposed to maximize the overall performance without violating the resource constraints. A proof of concept for the proposed strategy is then presented using a Xilinx ZCU104 development board, and a set of tests is designed to examine the DPU resource allocation mechanism and to define the onboard notions of "service level", "resource", and "performance" used in the strategy. Finally, with a framework designed to deploy the multi-DNN application, the proposed solution achieves the highest performance (FPS) using the same resource budget. Our future work will consider using different DPU architectures and frequencies. Our final goal is to establish an adaptive system where DNN models and hardware resource utilization can be reconfigured at runtime using a real-time scheduling strategy.

Figure 4. An example of DPU internal arithmetic operation flow.

Figure 7. DPU utilization in the application tests with different threads.

Table 1. Acronyms and their explanations.

Table 2. Resource and performance values for each DPU.

Table 5. Resource and performance values for each Application.

Table 6. Applications scheduling solution suggested by AWLS.