To cloud or not to cloud: an on-line scheduler for dynamic privacy-protection of deep learning workload on edge devices


Deep learning applications are thriving in edge and mobile computing scenarios, driven by latency constraints, data security and privacy, and other considerations. However, because of the limitations of power delivery, battery lifetime, and computation resources, offering real-time neural network inference requires specialized energy-efficient architectures, and sometimes coordination between the edge devices and powerful cloud or fog facilities. This work investigates a realistic scenario in which an on-line scheduler is needed to meet latency requirements even when the edge computing resources and communication speed fluctuate dynamically, while also protecting the privacy of users. It leverages the approximate-computing nature of neural networks and actively trades off excessive neural network propagation paths for a latency guarantee even when local resource provision is unstable. By combining neural network approximation and dynamic scheduling, the real-time deep learning system can adapt to different latency/accuracy requirements and to the resource fluctuation of mobile-cloud applications. Experimental results also demonstrate that the proposed scheduler significantly improves the energy efficiency of real-time neural networks on edge devices.


Deep neural networks (DNNs) have shown outstanding performance and versatility in areas ranging from computer vision and virtual reality to speech processing and even general-purpose computing. Nowadays, deep learning technology has spread to mobile apparatus such as smartphones, robotics, surveillance systems, and other embedded systems or IoT devices, making them more ‘intelligent’ (Gubbi et al. 2013). However, edge and embedded computing devices are power-constrained when processing complex applications, such as deep convolutional neural network (CNN) models. Besides, edge deep learning applications, such as autonomous driving assistant systems (Redmon et al. 2016; Liu et al. 2016), speech interaction on wearable devices (Lane and Georgiev 2015), and other latency-sensitive tasks, are stressed by the requirement of real-time processing. This means that guaranteeing quality of service (QoS) for users and enhancing power utility are the two critical design objectives for edge deep learning systems.

Therefore, in the scenario of edge computing, some pioneering designs tried to off-load part of the computation from the mobile platform to the cloud for the sake of power saving or performance improvement, such as Xia et al. (2014), Barbera et al. (2013), and Wang et al. (2017). Although these prior designs notably improve the energy efficiency of the mobile platform, their strategy of partitioning and scheduling neural networks between cloud and edge is entirely static, so it cannot handle dynamic mobile-cloud application scenarios. In these situations, the edge devices suffer from many unstable factors that cannot be assumed static as in prior analyses: the provision of computing resources for neural networks can change, processor performance can fluctuate due to task over-commitment or power-induced frequency and voltage adjustment, and even the wireless network speed can be unstable. All these factors undermine the static partitioning and off-loading methods proposed by prior works, and may even cause Quality of Service failures in practice, especially for real-time applications. In many cases of embedded or edge computing, a failure in QoS has the same effect as a failure in the Quality of Result (QoR) for deep learning, which manifests as prediction or classification inaccuracy. In other words, both the energy consumption and the neural network inference success rate, which is influenced by either QoS failure or QoR failure, should be optimized in these scenarios. What is more, the cloud serving time rises sharply with the growth in the number of users.

Besides, for some sensitive applications and data, only on-device processing lets the owners of edge devices handle the data without worrying about data security. Even though the cloud can provide more powerful computing support, data security and privacy concerns often make it difficult for users to delegate tasks directly to the cloud for processing (Dosovitskiy and Brox 2016; Wang et al. 2018; Melis et al. 2019). When the user performs complex processing of the input data to protect privacy, such as encryption, which is unnecessary under the edge processing strategy, the processing pressure on the cloud increases further: the cloud will spend several times more than a scheme that directly processes the original data with the same deep neural network and no privacy-protection design.

Therefore, in this work, we design a dynamic scheduling method for mobile-cloud deep learning systems, the MC scheduling system. As Table 1 shows, unlike prior works, our problem formulation focuses on two factors that none of the earlier works consider: dynamic scheduling and success-rate optimization. The target of our design is to reduce energy consumption and minimize the inference failure rate caused by either QoS violation or neural network inaccuracy, in a dynamic embedded system where the computing resources dedicated to the deep learning application and the communication speed are unstable.

Table 1 The comparison of prior works on mobile-cloud neural network application scheduling and our work

The main contributions of this work include:

  • Dynamic network scheduling algorithm: We design a dynamic scheduling algorithm for edge deep learning systems to tactically schedule the workload of privacy-protection neural networks onto the edge device and the cloud, so that the whole system can adapt to different QoS/QoR requirements and to resource fluctuation.

  • QoS-oriented Neural Network Approximation: We introduce approximate computing into neural network inference. After our modification, a large neural network can ‘early exit’ from its stacked deep layers and still produce a result with acceptable quality loss, to meet the QoS requirement in the case of resource fluctuation.

  • E3 targeted scheduling: A new metric, ‘effective energy efficiency (E3)’, is proposed to jointly evaluate the energy efficiency of a mobile platform and the response success rate of the system, considering both QoS-violation-induced failures and neural-network-inaccuracy-induced failures. With E3 as the target, our scheduling method is optimized to contain the adverse effects brought by approximate computing and to increase the overall system energy efficiency.

The rest of this article is organized as follows. In Sect. 2, we introduce the background of neural networks. Section 3 describes related works of privacy protection neural networks and real-time task scheduling for mobile-cloud systems. The motivation and case study of our design are shown in Sect. 4. Section 5 describes our proposed MC scheduling system. In Sect. 6, we introduce the experimental methodology. After that, Sect. 7 reports the experimental result of our design. Finally, we conclude this article in Sect. 8.

Neural networks background

The structure of a typical deep neural network application is shown in Fig. 1. The data flow diagram is a directed acyclic graph composed of several deep neural network layers. According to their structure, these layers can be divided into several different types, such as convolution layers, pooling layers, fully-connected layers, and activation layers, which bear different functions in the deep neural network structure:

Fig. 1

A typical CNN containing multiple layers (\(X, Y\) is the size of the input maps, \({D}_{in}\) and \({D}_{out}\) are the number of input and output feature maps, and \(K\) is the convolutional kernel size)

Convolution layer The convolution layer is one of the most critical parts of convolutional deep neural networks. Its primary function is to extract information from the input data (image, voice, or other encoded information) using kernels (filters). A kernel is a three-dimensional array of size \(N\times K\times K\). After the kernel processing, the input feature maps represented by matrices can be further abstracted into the output feature map. The output feature map is generated as:

$${f}_{out}\left(x,y\right)=\sigma \left(\sum_{i=0}^{N-1}\sum_{p=0}^{K-1}\sum_{q=0}^{K-1}{f}_{in}\left(i,x+p,y+q\right)\times K\left(i,p,q\right)\right)$$

where \({f}_{out}\left(x,y\right)\) is the value of the output feature map at position \(\left(x,y\right)\), generated from the input feature maps \({f}_{in}\) and the activation function \(\sigma ()\).
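As a minimal illustration of the formula above, the following NumPy sketch computes one output feature map with stride 1 and no padding; the function name, the choice of \(\tanh\) as \(\sigma\), and the test shapes are illustrative assumptions, not part of the original design.

```python
import numpy as np

def conv_single_output(f_in, kernel, activation=np.tanh):
    """Compute one output feature map per the formula above.

    f_in   : (N, X, Y) array -- N input feature maps.
    kernel : (N, K, K) array -- one three-dimensional kernel.
    """
    n, _, _ = f_in.shape
    _, k, _ = kernel.shape
    out_x = f_in.shape[1] - k + 1
    out_y = f_in.shape[2] - k + 1
    out = np.zeros((out_x, out_y))
    for x in range(out_x):
        for y in range(out_y):
            # sum over input channels i and kernel positions (p, q)
            out[x, y] = np.sum(f_in[:, x:x + k, y:y + k] * kernel)
    return activation(out)

# all-ones input (3 maps, 5x5) and all-ones 3x3x3 kernel:
# each pre-activation sum is 3*3*3 = 27
fmap = conv_single_output(np.ones((3, 5, 5)), np.ones((3, 3, 3)))
```

Real frameworks vectorize this computation, but the triple-nested sum maps one-to-one onto the three summations in the equation.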

Pooling layer The role of the pooling layer is to process the feature map output by the convolutional layer and reduce its size. In general, maximum-value or mean-value functions are widely used to realize pooling layers. Once the structure of a deep neural network application is determined, these pooling layers are usually fixed accordingly. When training or inferencing a neural network, the pooling layers consume less computing resource than the convolution layers, and their processing latency is shorter as well, so data can pass through them quickly.
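A minimal sketch of maximum-value pooling over non-overlapping windows (the function name, window size, and stride are illustrative):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """2-D max pooling: shrink a feature map by taking window maxima."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

# a 4x4 map shrinks to 2x2, keeping the maximum of each 2x2 window
pooled = max_pool(np.arange(16.0).reshape(4, 4))
# -> [[ 5.,  7.], [13., 15.]]
```

Mean pooling is obtained by replacing `window.max()` with `window.mean()`, which is why the layer is cheap: no learned parameters, only a fixed reduction.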

Full-connection layer The fully-connected layer’s primary function is to classify the input data according to the feature maps extracted by the preceding convolutional and pooling layers. The specific connection weights of the fully-connected layer are obtained through repeated learning on the training set during the training process. In the inference process, this layer also consumes relatively limited computing resources, and imposes relatively little performance and energy cost on the system as a whole.

In summary, in the processing of deep neural network applications, the layer that incurs the highest cost is the convolutional layer. On the other hand, feature extraction is also done by the convolution layers, so a deep neural network application with more convolutional layers can often extract more information from the input data. This means that a deeper neural network, with more layers and especially more convolutional layers, can often achieve better results on the same input data (Qiu et al. 2016).

However, with the popularization of smart devices, the application scenario of deep neural network applications has grown far beyond the high-performance platforms in their infancy. From computer vision (Redmon et al. 2016) to image processing (Vardhana et al. 2018), from audio analysis (Sehgal and Kehtarnavaz 2018) to natural language processing (Goldberg 2017), various edge portable and low-power embedded platforms represented by smartphones have gradually become the main processing platforms for deep learning applications. The efficient and timely processing of deep learning applications on these embedded platforms has gradually become an increasingly important optimization design problem in deep learning research and practice.

Related work

Some classical works of privacy protection neural networks and real-time task scheduling for mobile-cloud systems are described as follows.

Performance optimization of neural networks for mobile-cloud systems

At first, the constraints of computing resources, memory bandwidth, and battery lifetime left developers with no choice but to off-load complete applications, such as neural networks, from the mobile device to the cloud. For instance, the authors of Barbera et al. (2013) considered the bandwidth and energy efficiency of both mobile computation off-loading and mobile data backup. Xia et al. (2014) proposed delay-optimized off-loading control schemes in a heterogeneous environment, while the work in Wang et al. (2017) considered the tradeoff between the energy consumption of mobile devices and the latency of the applications. However, these cloud-only approaches require moving a large amount of data through the wireless network, which implies a significant communication cost. Moreover, most of the methods assumed unlimited resource provision on the cloud side and tried not to put computation on the edge, which is impractical.

To address these problems, prior works proposed a distributed computing approach that combines a small neural network on the edge devices with a large neural network on the cloud (Skala et al. 2015). All input data are first processed by the edge neural network, which also produces a confidence for the result. A low-latency result is returned if the classification is confident; otherwise, the end device resends the input data to the cloud for the larger neural network. This method is not very efficient, since tightened QoR constraints lead to frequent data re-processing on both the edge and the cloud (Teerapittayanon et al. 2017b). Besides, because the data flow graph of a deep neural network is a directed acyclic graph consisting of neural layers, there is an opportunity for scheduling design for mobile-cloud co-computing (Eshratifar et al. 2019). For example, Kang et al. (2017) developed a system that can automatically partition a DNN between the mobile device and the cloud at the granularity of neural network layers, which attempts to leave the complex computation to the servers while introducing the lowest amount of transmitted data.

Overall, these deep learning scheduling designs mainly focus on improving the energy efficiency of the mobile platform, and they all employ a static scheduling strategy between the edge device and the cloud machines, assuming that the workload assignment cannot be changed after deployment (except for the distributed computing approach).

In addition, there are other methods to optimize the execution performance of deep neural network applications on the edge side in cloud computing scenarios. For example, Park et al. (2017) proposed that weight pruning could be used to reduce the complexity of a deep neural network, so as to improve its real-time performance on mobile platforms. Ma et al. (2020) proposed that weights can be quantized to reduce the computational complexity required for mobile platforms to execute deep neural network applications. What is more, a new deep neural network was proposed in Zhou et al. (2019) specifically for the mobile platform to complete the same task target quickly and accurately.

Although these works show a great improvement in the energy efficiency or performance of neural networks on mobile, they still face difficulties when mapped onto mobile platforms. The reason is that these designs are also static: their optimization strategies are based only on the structure of the neural network, rather than on the computation resources of the mobile platform, which are limited and fluctuating. This mechanism makes it challenging for these designs to respond to the users’ latency requests. In summary, these prior works ignore two changing factors in the mobile-cloud execution environment: (a) the resource available to an application on the mobile SoC usually fluctuates due to other concurrent workloads or dynamic factors such as power- or thermal-induced CPU frequency variation, and (b) the wireless communication is not stable in practice. Disregarding these conditions leads these pre-determined static off-loading schemes to QoS failures in real-time applications.

Privacy protection neural networks

With the continuous popularization of deep learning applications toward the edge, their application scenarios inevitably need to touch, or even directly process, users’ personal information and other privacy-sensitive information. In order to protect users’ data security, researchers have proposed many different privacy protection mechanisms (Du et al. 2004; Graepel et al. 2012; Zhang and Zhu 2016; Osia et al. 2020; Li et al. 2017). Among these mechanisms, the scheme with the most extensive impact is to encrypt the private data, so that the user’s original data is not acquired by any other party, including the cloud, and the security of the data is protected. Specifically, this strategy includes secure multiparty computing methods using differential privacy and methods using homomorphic encryption to directly encode the original data.

For the former method, Orlandi et al. (2007) proposed that a deep learning algorithm protocol could be implemented to protect users’ privacy. Their work prevents the server from snooping on the user’s information by obfuscating the inputs of the computations when the user delegates some nonlinear functions. Based on this work, Abadi et al. (2016) further proposed a differentially private deep model for the training process, which introduces noise into the gradient calculation of stochastic gradient descent (SGD), so as to prevent leakage of user information. In order to protect private information in features, Wang et al. (2018) introduced data nullification and random noise addition.

For the latter method, Gilad-Bachrach et al. (2016) proposed the use of homomorphic encryption to achieve privacy protection for deep learning. They provide a framework for designing neural networks that run on encrypted data, with some approximation processing. Mohassel and Zhang (2017) tried to use more secure algorithms to further improve the data security of the neural network model. For the training process, researchers also proposed that homomorphic encryption could be used to encrypt the private data, and that the encrypted data could be transmitted back to ensure the security of the training process (Zhang et al. 2015).

These methods alleviate the data security concerns of deep neural network applications to some extent and provide some privacy protection measures. However, they all rely on higher computing power: even the work with the lowest computing power requirement at present must pay twice the computing cost of the original deep neural network to realize privacy protection of user data (Xiang et al. 2020). Since deep neural networks already consume a lot of computing power, the application cost after applying these methods may be unacceptable in terms of cloud machine-time occupancy and local real-time constraints.

Case study and motivation

A case study

Suppose that there is a need to detect many objects from a 12 FPS video stream, which means the system has less than 84 ms to process one image. Moreover, the user wants the system to consume as little energy as possible, while the accuracy constraint can be appropriately loosened: the application must cover at least 75% of the objects in the video. The application is submitted from the mobile platform, an NVIDIA Jetson TX1 board, which is connected with a shared fog server (the cloud) through a wireless network. A GoogLeNet is invoked to process the application. Neither the distributed computing method nor a static partition strategy can provide any QoS-guaranteed result when the performance of the wireless network is unstable (the bandwidth is below 200 Mbps), although they may achieve great energy efficiency on deterministic applications and environments. When the user encrypts the input data for privacy protection before inference, the delay cost increases even more.

However, in many cases of real-time embedded computing, a failure in QoS is as critical as a failure in QoR, which in deep learning manifests as inference or prediction inaccuracy. In such a scenario, multiple metrics need to be optimized: the energy consumption, the neural network inference success rate, which is influenced by QoS failures, and the neural network prediction accuracy, which is described by the QoR.


For real-time deep learning applications, the energy consumption, the inference latency, and the classification accuracy are all significant, since failures come from two different sources: failure in QoS, which means no result can be returned before the deadline, and failure in QoR, which means the prediction of the neural network is inaccurate. Thus, in this article, we define a metric, ‘effective energy efficiency’ (E3), to describe the actual efficiency of mobile deep learning systems. It can be calculated as:

$$E^{3} = \frac{{N_{s} }}{N \cdot E}$$

in which \(N\) is the number of applications processed by the system, \(E\) is their energy consumption, and \({N}_{s}\) is the number of successfully processed applications, i.e., those that return a correct result without violating the time constraints.
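As a minimal sketch of the metric (the function name and the units in the example are illustrative), E3 can be computed directly from the three quantities:

```python
def effective_energy_efficiency(n_total, n_success, energy):
    """E3 = N_s / (N * E): successful inferences per task per unit energy.

    n_total   : N, number of applications processed
    n_success : N_s, results that were both on time and correct
    energy    : E, total energy consumed (e.g. joules)
    """
    return n_success / (n_total * energy)

# e.g. 90 of 100 inferences succeed (on time and accurate) using 50 J:
e3 = effective_energy_efficiency(100, 90, 50.0)
# -> 0.018
```

Note that a scheduler can raise E3 either by lowering \(E\) or by avoiding QoS/QoR failures that shrink \({N}_{s}\), which is exactly the tradeoff the rest of the article exploits.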

Besides, a trade-off among energy consumption, QoS, and QoR is available, because most neural network models are used to process Recognition, Mining, and Synthesis (RMS) applications. For these applications, 100% accurate golden results are not expected, since (1) the input data inevitably mixes with some noise and (2) the applications can tolerate some range of errors. In other words, neural network applications have an approximate-computing quality, which suggests a feasible approach: satisfying the real-time demand by providing on-demand neural network accuracy on the appropriate platform, instead of always choosing the same complex model on the same platform under all circumstances.

Therefore, in this paper, we design a scheduler between the mobile platform and the cloud that utilizes this approximable feature of neural networks for the tradeoff. The goal of this article is to optimize this E3 while dynamically responding both to the users’ varying latency and accuracy constraints and to the fluctuating computing resources and wireless network performance.

The architecture of MC scheduling system

Overview of system architecture

Figure 2 illustrates the architecture of the Mobile-Cloud Scheduling System. The most critical component of this work is the MC scheduler. When a task is submitted to the mobile platform, the mobile platform preprocesses the input data, while the scheduler requests the constraints of QoS and QoR from the user as well. Based on these constraints and the computation resource of mobile and cloud, the MC scheduler sends the task to the appropriate platform, so that the accuracy-acceptable result can be submitted in time.

Fig. 2

Overview of MC scheduling system architecture

On the one hand, considering the computing power of the mobile platform, a small neural network is mapped onto it, with which the mobile platform can handle most of the tasks. This means that most of the information can be processed locally, which avoids compromising privacy in the cloud. On the other hand, the deeper neural networks, which can achieve higher model quality, are deployed on the cloud. What is more, to make the system more flexible with respect to time/accuracy bounds, these neural networks are modified by a method called Branch-Insertion (Tang et al. 2019), which helps a large-scale neural network be processed in time.

Additionally, both the mobile side and the cloud side have performance-monitoring processes, the mobile time predictor and the cloud performance model, to protect the task from timing violations. On the mobile device, the mobile time predictor provides information about the computation resources and the neural network on the mobile platform. On the cloud, the cloud performance model chooses an appropriate data path to process the task. What is more, by monitoring the wireless network communication, the cloud performance model can submit the result in time or report failure to the edge side.

MC scheduler

In a Mobile-Cloud Scheduling System, the MC scheduler is located on the CPU side of the Mobile Platform. Algorithm I shows its scheduling procedure.


At the beginning of processing a neural network application, the scheduler sets a basic QoR for users, for example, 75%. After that, when a new neural network application is handed to the system, the scheduler checks service availability, i.e., whether the neural networks on the mobile and the cloud can provide a QoS-guaranteed result (Line 4 to 13). If the constraints given by the user cannot be met by the system, the scheduler refuses to process the task and reports the failure (Line 14).

Then, according to the constraints of the tasks and the status of the hardware, the scheduler sends the task to the appropriate processing platform. Tasks that can only be processed on one specific platform (the mobile or the cloud) have to queue if the target platform is not available (Line 16 to 21). However, for tasks that can be processed by both platforms without timing/accuracy violation, the MC scheduling system provides two modes for different purposes: ‘performance priority mode’ (Line 16 to 17) and ‘energy efficiency priority mode’ (Line 18 to 21). Considering the privacy-protection requirement of the user, the system tends to schedule tasks on the mobile platform by default, since this keeps the data and the weights local, unless the neural network on the edge cannot provide acceptable precision (Line 16 to 18), which also means the task needs to be processed as soon as possible (called ‘performance priority mode’). On the contrary, to minimize the energy consumption of the mobile platform, the scheduler sends tasks to the cloud, alleviating the burden on the mobile platform (called ‘energy efficiency priority mode’, Line 19 to 21).

It is worth noting that contention for computing resources on the cloud or the performance of the wireless network may lead the task to failure in QoS. Thus, when the hardware serviceability and the deadline oblige the other scheme (Line 24 to 27), the MC scheduler sends a ‘give up’ instruction to the cloud and requests the GPU on the mobile platform to process the task.

After scheduling the task successfully, the scheduler monitors both the platform processing the task and the deadline given by the constraints. Whenever the result is submitted to the scheduler, it is output immediately; otherwise, the scheduler reports the failure information to the user. If the user is not satisfied with the accuracy of the result that the system provides, the scheduler can adjust the QoR constraint with a step parameter (Line 34 to 36).
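The decision flow described above can be condensed into the following Python sketch; the `mobile` and `cloud` platform objects and their `can_meet`/`run` methods are hypothetical stand-ins for the feasibility checks and dispatch steps of Algorithm I, not the authors' actual interfaces:

```python
def mc_schedule(task, mobile, cloud, mode="privacy"):
    """Simplified sketch of the MC scheduling decision.

    `mobile` and `cloud` are hypothetical platform objects exposing:
      can_meet(task) -> bool   (QoS/QoR feasibility check)
      run(task)      -> result
    """
    mobile_ok = mobile.can_meet(task)
    cloud_ok = cloud.can_meet(task)
    if not (mobile_ok or cloud_ok):
        return "FAIL"                  # constraints unsatisfiable: report failure
    if mobile_ok != cloud_ok:          # only one platform is feasible
        return mobile.run(task) if mobile_ok else cloud.run(task)
    # both feasible: keep data and weights local by default (privacy),
    # unless the user asked to minimize mobile energy consumption
    return cloud.run(task) if mode == "energy" else mobile.run(task)
```

The sketch omits queuing, the mid-flight 'give up' fallback, and the QoR step adjustment, which the surrounding text describes; it only shows the platform-selection core.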

Mobile time predictor

The mobile time predictor helps the MC scheduler monitor and adjust the processing on the mobile platform. There are three pieces of information that the mobile time predictor needs to keep updating and report to the MC scheduler in time: the hardware availability of the mobile platform’s GPU, the predicted execution time of the small neural network on the mobile, and the estimated remaining time of the currently processing neural network.

Hardware availability The availability of the GPU is easy to access. Based on the GPU usage monitoring instruments provided by the vendor, such as tegrastats on NVIDIA’s Jetson TX1, the status of the GPU can be obtained.

Predicted execution time The small neural network on the mobile platform spends an essentially constant amount of time to produce a result, unless there are other concurrent applications. The reason is that the execution time of a neural network is strongly related to the operation count of its layers (He and Sun 2015). Thus, the predicted time of the small neural network can be obtained during the training phase.

Estimated residual time If the layers before layer \(i\) of an \(n\)-layer small neural network have been processed, its residual execution time can be approximately predicted. It is related to the running time of the GPU for the neural network until this moment, \({T}_{GPU}\), the operation quantity from the beginning to the \(i\)-th layer of the neural network, and the operation quantity that still needs to be processed:

$$T_{cp} = \frac{{\mathop \sum \nolimits_{j=i+1}^{n} V_{j} }}{{\mathop \sum \nolimits_{j=1}^{i} V_{j} }} \cdot T_{GPU}$$

where \({V}_{j}\) is the operation quantity of layer \(j\) in this neural network, which can be calculated as in Han et al. (2016).
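A direct transcription of this estimate in Python might look as follows (the function name and list layout are illustrative); it scales the elapsed GPU time by the ratio of remaining operations to already-executed operations:

```python
def estimated_residual_time(ops, i, t_gpu):
    """Estimate the remaining inference time after layer i has finished.

    ops   : per-layer operation quantities [V_1, ..., V_n]
    i     : number of layers already processed
    t_gpu : GPU time spent on this inference so far (T_GPU)
    """
    done = sum(ops[:i])        # operation quantity already executed
    remaining = sum(ops[i:])   # operation quantity still to run
    return remaining / done * t_gpu

# half of the operations finished in 10 ms -> roughly 10 ms remain
t_res = estimated_residual_time([100, 100, 100, 100], 2, 10.0)
# -> 10.0
```

The estimate implicitly assumes the GPU sustains a constant throughput across layers, which is why it degrades if the platform's clock frequency changes mid-inference.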

The information about GPU availability prevents the task from waiting in the task queue indefinitely, which would lead to stalls and long latency. The predicted time of the small neural network on the mobile platform helps the MC scheduler decide whether the system can satisfy the requirements of the users. Moreover, the estimated residual time keeps the MC scheduler aware of the processing status, so that it can adjust the scheduling scheme in time, in case the mobile platform has difficulty finishing the job without violating its constraints.

Multi-version neural network on cloud

Compared with the power/resource-constrained mobile platform, the power and computing resources of the cloud are comparatively abundant. Thus, the neural networks on the cloud can be relatively deeper, and the results they provide are more precise as well. In this article, ResNet-152 and GoogLeNet are mapped onto the cloud.

However, the massive number of operations in deep neural networks may cause unacceptable latency, even on the cloud. Fortunately, for error-resilient neural network applications, a tradeoff between output quality and performance is feasible. As in the cascade method proposed by Felzenszwalb et al. (2010), instead of executing the whole high-quality neural network, the system can process the input data with only the first few layers, producing a less accurate outcome with less latency.

To enable the ‘early exit’ function of the original neural networks, a branch insertion technique for neural networks is introduced in this work, shown in Fig. 4. First of all, the number of arithmetic operations of the convolution layers in the original neural network is calculated for computational complexity estimation. Then, based on the model accuracy, the original neural network is divided into n blocks of layers from the bottom (input) to the top (output) that have similar computational complexity. After that, for each layer where a branch output is inserted, the group formed by this layer and the bottom layers on which it is stacked can be viewed as a holistic sub-network, and these sub-networks are evaluated during training in case insufficient feature extraction leads to a highly erroneous result. Finally, an additional output layer is added at the insertion point at the end of each group, as the branch output layer for prediction result generation.

Thus, in this work, the original networks are modified into the cascade version to enable the ‘early exit’ function. After this generation, the modified neural network is trained with a multi-round fine-tuning method. First of all, the neural network is traversed and divided into blocks, so that the backbone part of the neural network is separated by the corresponding output layers according to the inserted branches. Then, the groups of layers are trained from bottom to top: the bottom layers and the added output layers (branches) are trained as a whole neural network first, and then the subsequent layers and their branches are merged with the already-trained bottom blocks and trained group by group in forward order.

Finally, tables are created to store the accuracy achieved by each sub-network of the modified neural network, together with its solo execution time. With these two steps, the original deep neural network is transformed into the corresponding modified model, which can be dynamically scheduled and can flexibly provide different QoS-QoR pairs by choosing a proper execution path.
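A minimal sketch of such a table and its lookup follows; the accuracy/latency values are hypothetical, not measured numbers from the paper:

```python
# Hypothetical per-branch profile: (top-5 accuracy, solo execution time in ms),
# ordered from the branch nearest the input to the full network.
BRANCH_TABLE = [(0.68, 25.0), (0.75, 48.0), (0.83, 79.0), (0.89, 120.0)]

def feasible_branches(table, min_qor, max_qos_ms):
    """Indices of branches meeting both the accuracy floor (QoR)
    and the latency cap (QoS)."""
    return [i for i, (acc, t) in enumerate(table)
            if acc >= min_qor and t <= max_qos_ms]
```

The scheduler can then pick from the feasible set, e.g. the earliest branch for low latency or the last one for the best accuracy that still fits the budget.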

Cloud performance model

The cloud scheduler in the mobile-cloud scheduling system plays a role similar to that of the mobile time predictor on the mobile platform. However, since the modified deep neural network on the cloud is flexible and schedulable, the cloud scheduler also needs to choose a reasonable sub-network to satisfy the QoS and/or QoR demand. The algorithm of the cloud scheduler is given in Algorithm II.


When the MC scheduler sends the task and the corresponding user-specified QoS/QoR constraints to the cloud, the cloud scheduler checks whether the constraints are satisfiable. If multiple branches can satisfy the QoS/QoR constraints at the same time, the cloud scheduler selects the branch closest to the input according to the tables mentioned in the previous section, and then keeps monitoring the execution time as well as the wireless network performance.

When the neural network inference procedure approaches a layer on which a particular branch output is stacked, the cloud scheduler has to decide whether to load the output layer of this branch to generate a result, or to keep pursuing a better one. The decision depends on whether the time to reach the next branch output would violate the QoS constraint; here, the time to reach a branch output is the execution time of the corresponding sub-network. Although the model is based on the wireless network status a few cycles earlier, it is accurate enough, since the interval between two adjacent branches is deliberately set narrow so that the wireless performance does not fluctuate heavily in this short period.
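The exit decision at a branch point reduces to a budget check. The sketch below is an illustrative reading of that rule; the safety margin parameter is an assumption, not something specified in the paper:

```python
def should_exit_here(elapsed_ms, time_to_next_branch_ms, qos_budget_ms,
                     margin_ms=5.0):
    """Exit at the current branch output if finishing the sub-network up
    to the next branch is predicted to overrun the remaining QoS budget
    (plus an assumed safety margin); otherwise keep executing toward a
    higher-accuracy branch."""
    remaining = qos_budget_ms - elapsed_ms
    return time_to_next_branch_ms + margin_ms > remaining
```

With 40 ms left and 50 ms predicted to the next branch, the scheduler loads the current branch output; with 80 ms left and 30 ms predicted, it continues.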

Moreover, the cloud scheduler keeps communicating with the MC scheduler: it reports the predicted execution time of the current task and watches for the ‘give up’ instruction, which the MC scheduler issues when the remaining time window is only sufficient for the mobile platform. In other words, the application processed on the cloud will either provide an acceptable result or be killed after receiving the ‘give up’ instruction.
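The MC-scheduler side of this protocol can be sketched as a three-way decision; the function and return labels are illustrative names, not from the paper:

```python
def mc_fallback_decision(remaining_ms, mobile_time_ms, cloud_result_ready):
    """MC-scheduler side of the 'give up' protocol: once the remaining
    QoS window only covers the local (mobile) execution time, either
    accept an already-arrived cloud result or send 'give up' and fall
    back to the neural network on the mobile platform."""
    if cloud_result_ready:
        return "use_cloud_result"
    if remaining_ms <= mobile_time_ms:
        return "give_up_and_run_mobile"
    return "keep_waiting"
```

This mirrors the case study later in the paper, where the fallback fires once the remaining window shrinks to the mobile network's own execution time.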

Case study

Consider the case mentioned in Sect. 4.1, and apply our MC scheduling system to process this application. In this case, we map a CaffeNet on the mobile platform and a GoogLeNet on the cloud for the workload. To protect privacy and security, we adopt the method proposed in Osia et al. (2020) on the cloud.

First, the MC scheduler checks service availability and finds that both the neural network on the cloud and the one on the mobile platform can satisfy the constraints. Based on the strategy of the ‘energy efficiency priority mode’, the task is sent to the cloud; the transmission time of the image is 22.9 ms. Then, since the system first tries to process with a loose QoR constraint of over 75%, three sub-networks are available. Thus, the modified GoogLeNet model is loaded on the GPU of the cloud. The cloud scheduler monitors the performance of the wireless network to predict the execution time of the first sub-network of the modified neural network. In this case, it is 79.8 ms, which is within the QoS constraint, so the system deploys the neural network layers before the first appropriate branch on the cloud.

While the task is processed by the loaded layers, the cloud scheduler keeps checking the performance of the wireless network and calculates the predicted execution time up to the next branch of the modified GoogLeNet. Since the wireless network is unstable and its speed drops when the cloud scheduler makes this prediction, only one sub-network is marked as reachable. Thus, the system loads the layers before the next branch, without loading their branching output layers.

At this moment, suppose the wireless network performance degrades further, to the point that the mobile platform disconnects from the cloud; the system will still ensure that an acceptable result is delivered to the user. When only 36 ms of the QoS time window remains, the MC scheduler checks whether the mobile platform has received a result from the cloud. In this case, no result has arrived, so the MC scheduler sends the task to the GPU of the mobile platform. The CaffeNet on the mobile platform processes the task and submits the result in time. In summary, the user receives a result with an accuracy of 79%, taking 83.6 ms in total, which satisfies both the QoR and QoS constraints. If the user asks for a more precise result, the system runs the flow above again with the constraints adjusted accordingly.

Experimental setup

We choose NVIDIA’s Jetson TX1 Developer Kit as the mobile device of our design, which integrates an NVIDIA Maxwell GPU with 256 CUDA cores and a quad-core ARM A57 CPU. On the other hand, a server with NVIDIA GTX 1080 Ti GPUs and Intel Xeon E5-2630 CPUs serves as the cloud. The two platforms are connected by a wireless router with a maximum bandwidth of 300 Mbps.

On the software side, the MC scheduler and the mobile time predictor run on the CPU of the Jetson TX1 board, while the cloud scheduler is mapped onto the CPU of the cloud server. To illustrate the effect of our design, two neural network portfolios are evaluated in this article. We map a CaffeNet (Krizhevsky et al. 2012) on the mobile platform and a GoogLeNet (Szegedy et al. 2015) on the cloud as one set of solutions (Set A), and a Network-in-Network (Lin et al. 2013) on the mobile platform with a ResNet-152 (He et al. 2016) on the cloud as the other (Set B). Both sets of neural networks are trained for image classification on the ImageNet 2012 dataset.

To evaluate the performance of this work, three baselines are introduced: (i) processing tasks with the original distributed deep neural networks (Teerapittayanon et al. 2017a) (DDNN), (ii) processing tasks only with the neural networks on the cloud (‘server’), i.e., the large neural networks (GoogLeNet for Set A and ResNet-152 for Set B), and (iii) partitioning with the method introduced in Eshratifar et al. (2019) (JDNN).

Both DDNN and JDNN first process the task with the shallow neural network on the mobile platform, and then decide, according to the confidence level of its output, whether to output the result directly or hand it over to the cloud for further analysis. The difference between the two is that in DDNN, the deep neural network deployed on the cloud is a separate network with a more complex structure, so the encoded input data is transmitted to the cloud and re-inferred from scratch by the cloud network. In JDNN, by contrast, the shallow network on the mobile platform and the deep network on the cloud together constitute one deeper neural network; when the mobile-side output lacks confidence, the system encodes the intermediate results and uploads them to the cloud for further inference.

The classification accuracy of these neural networks is used as the measure of their QoR; that is, the recall rate of their top-5 test results is taken as the precision of the neural networks.

Experimental results


The performance/quality variation ranges of the neural networks in the MC scheduling system are shown in Figs. 3, 4, 5, 6, 7 and 8. In these figures, the x-axis shows the adjustable QoR of the neural networks, while the y-axis shows the execution time under the QoR constraint specified on the x-axis. Moreover, the system tries to deliver an acceptable result as soon as possible.

Fig. 3 Set A in Wi-Fi

Fig. 4 Set A in 4G

Fig. 5 Set A in 3G

Fig. 6 Set B in Wi-Fi

Fig. 7 Set B in 4G

Fig. 8 Set B in 3G

We also separate the execution time spent on the cloud from the other components, such as the communication time between the cloud and the mobile platform and the processing latency on the mobile, to evaluate the cloud-serving cost of the different designs. This cloud time is shown in a different color in the bars of these figures and labeled with the suffix ‘_cloud’. In addition, considering that edge devices connect to high-performance cloud servers through different wireless protocols in different application scenarios, we evaluate the performance of our method and the baselines under 3G, 4G and Wi-Fi connections, respectively.

According to the results, the baseline ‘DDNN’ shows low latency under a loose QoR constraint, equal to our design and better than the other two baselines. However, when the accuracy requirement is high, the performance of this method drops off rapidly, becoming the worst of all the baselines in Set B, since it needs to process both the neural networks on the mobile and on the cloud.

The baseline ‘server’ offers the fewest accuracy/latency options, since it is a static scheduling method and the network speed is the only parameter that affects its performance. Thus, this method cannot provide a flexible scheduling scheme for users, and its performance suffers, especially for users who prefer lower latency over a more precise result. Worse, its cloud-serving cost is the highest of all the evaluated designs, which means it relies on the cloud the most and is hard to scale, since a large number of users may congest the server.

Although the baseline ‘JDNN’ is a dynamic scheduling scheme that can adjust the execution path according to the input, has a lower cloud-serving cost than ‘DDNN’ and ‘server’, and transfers less data between the mobile and the cloud, its performance in our experiment is not as good. The reason is that to reach the neural network layer with the smallest amount of intermediate data, the mobile platform must process a large section of the network, which causes a performance loss in return.

In addition, when the user’s accuracy requirement is not high, or the shallow network’s output is confident enough, both our design and DDNN can let the mobile platform produce the output directly without any cloud involvement. The JDNN and server methods, in contrast, must rely on at least part of the layers of the deep neural network on the cloud for inference. This means that, for these methods, the codecs for privacy protection must be used regardless of which task the user initiates, which leads to a further performance loss for these solutions.

In summary, our design has the best performance of all these methods. Our design achieves an average speedup over the other methods of 39.8 to 45.1% as the QoR constraint varies from 75 to 90%. Besides, our design costs the least cloud-serving time on average compared with the baselines (reduced by 5.72–40.97%). Moreover, our design can dynamically respond to the users’ requirements on accuracy and latency, as well as to the performance of the wireless network, thanks to the multi-version neural networks on the cloud. For example, when an application needs to process the neural networks of Set B with a quality constraint of over 85% and a latency of no more than 75 ms, our design is the only method that can meet the requirement even when the mobile platform and the cloud are connected over a 4G network.

Effective energy efficiency

As mentioned before, effective energy efficiency (E3) is significant for real-time embedded deep learning computing. Thus, in this sub-section, we also evaluate the E3 of our method, as well as that of the three baselines. The result is presented in Table 2.

Table 2 E3 under QoR and QoS constraints

The baseline ‘server’ is a static scheduling scheme, so it has the same E3 regardless of the QoR constraint, as does ‘JDNN’. In contrast, the other methods, including ‘DDNN’ and ours, are dynamic strategies, so their E3 varies with the constraints. Overall, our method is more energy efficient than ‘DDNN’ and ‘JDNN’ (by 1.6× and 11.6%, respectively). The baseline ‘server’ spends less energy than ours when the accuracy requirement is not very tight, because the only energy it consumes is the communication energy. However, when the user requires a more precise result, the E3 of our method catches up with that of ‘server’ and may even surpass it under some circumstances, thanks to the multi-version neural networks we designed for the cloud.


In this article, we design a real-time task scheduling system for privacy-protected neural networks in mobile-cloud environments, the MC scheduling system. On the one hand, the small neural network on the mobile platform can process tasks with low accuracy requirements, which avoids the latency of transmitting them to the cloud. On the other hand, for tasks the mobile platform cannot handle, the modified deep neural network on the cloud provides adequate quality/latency combinations. Combining these two parts, our system can flexibly and dynamically schedule neural network applications to satisfy different QoR/QoS constraints.

The experiments show that our system can serve real-time neural network applications well with a high E3 under all evaluated circumstances, even when the constraints are very tight or the performance of the wireless network is unstable. Besides, our method is orthogonal to other methods that optimize the structure of deep neural networks, as mentioned before, which means the performance and effective energy efficiency of the system can be further improved by combining the method in this article with those designs.


  1. Abadi, M., Chu, A., Goodfellow, I., McMahan, H.B., Mironov, I., Talwar, K., Zhang, L.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318 (2016)

  2. Barbera, M.V., Kosta, S., Mei, A., Stefa, J.: To offload or not to offload? the bandwidth and energy costs of mobile cloud computing. In: 2013 Proceedings Ieee Infocom, pp. 1285–1293. IEEE (2013)

  3. Dosovitskiy, A., Brox, T.: Inverting visual representations with convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4829–4837 (2016)

  4. Du, W., Han, Y.S., Chen, S.: Privacy-preserving multivariate statistical analysis: Linear regression and classification. In: Proceedings of the 2004 SIAM international conference on data mining, pp. 222–233. SIAM (2004)

  5. Eshratifar, A.E., Abrishami, M.S., Pedram, M.: JointDNN: an efficient training and inference engine for intelligent mobile cloud computing services. IEEE Transactions on Mobile Computing (2019)

  6. Felzenszwalb, P.F., Girshick, R.B., McAllester, D.: Cascade object detection with deformable part models. In: Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pp. 2241–2248. IEEE (2010)

  7. Gilad-Bachrach, R., Dowlin, N., Laine, K., Lauter, K., Naehrig, M., Wernsing, J.: Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In: International Conference on Machine Learning, pp. 201–210 (2016)

  8. Goldberg, Y.: Neural network methods for natural language processing. Synth. Lect. Hum. Lang. Technol. 10(1), 1–309 (2017)


  9. Graepel, T., Lauter, K., Naehrig, M.: ML confidential: Machine learning on encrypted data. In: International Conference on Information Security and Cryptology, pp. 1–21. Springer (2012)

  10. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of Things (IoT): a vision, architectural elements, and future directions. Future Gener. Comput. Syst. 29(7), 1645–1660 (2013)


  11. Han, S., Shen, H., Philipose, M., Agarwal, S., Wolman, A., Krishnamurthy, A.: Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints. In: Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pp. 123–136 (2016)

  12. He, K., Sun, J.: Convolutional neural networks at constrained time cost. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5353–5360 (2015)

  13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  14. Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., Tang, L.: Neurosurgeon: collaborative intelligence between the cloud and mobile edge. ACM SIGARCH Comput. Archit. News 45(1), 615–629 (2017)


  15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)

  16. Lane, N.D., Georgiev, P.: Can deep learning revolutionize mobile sensing? In: Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, pp. 117–122. ACM (2015)

  17. Li, M., Lai, L., Suda, N., Chandra, V., Pan, D.Z.: Privynet: A flexible framework for privacy-preserving deep neural network training. CoRR, abs/1709.06161 (2017)

  18. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)

  19. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: Ssd: Single shot multibox detector. In: Proceedings of European Conference on Computer Vision (ECCV), pp. 21–37 (2016)

  20. Ma, X., Guo, F. M., Niu, W., Lin, X., Tang, J., Ma, K., & Wang, Y.: PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-Time Execution on Mobile Devices. In 2020 AAAI Conference on Artificial Intelligence (AAAI), pp. 5117–5124 (2020)

  21. Melis, L., Song, C., De Cristofaro, E., Shmatikov, V.: Exploiting unintended feature leakage in collaborative learning. In: 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706. IEEE (2019)

  22. Mohassel, P., Zhang, Y.: Secureml: A system for scalable privacy-preserving machine learning. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 19–38. IEEE (2017)

  23. Orlandi, C., Piva, A., Barni, M.: Oblivious neural network computing via homomorphic encryption. EURASIP J. Inf. Secur. 2007, 1–11 (2007)


  24. Osia, S.A., Shamsabadi, A.S., Sajadmanesh, S., Taheri, A., Katevas, K., Rabiee, H.R., Lane, N.D., Haddadi, H.: A hybrid deep learning architecture for privacy-preserving mobile analytics. IEEE Internet Things J. 7(5), 4505–4518 (2020)


  25. Park, E., Ahn, J., & Yoo, S.: Weighted-entropy-based quantization for deep neural networks. In 2017 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5456–5464 (2017)

  26. Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S.: Going deeper with embedded fpga platform for convolutional neural network. In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35. ACM (2016)

  27. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

  28. Sehgal, A., Kehtarnavaz, N.: A convolutional neural network smartphone app for real-time voice activity detection. IEEE Access 6, 9017–9026 (2018)


  29. Skala, K., Davidovic, D., Afgan, E., Sovic, I., Sojat, Z.: Scalable distributed computing hierarchy: cloud, fog and dew computing. In: IEEE International Conference on Cloud Computing Technology and Science, vol. 1, pp. 16–24 (2015)

  30. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

  31. Tang, Y., Wang, Y., Li, H., Li, X.: MV-Net: toward real-time deep learning on mobile GPGPU systems. ACM J. Emerg. Technol. Comput. Syst. (JETC) 15(4), 35 (2019)


  32. Teerapittayanon, S., McDanel, B., Kung, H.-T.: Distributed deep neural networks over the cloud, the edge and end devices. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 328–339. IEEE (2017a)


  34. Vardhana, M., Arunkumar, N., Lasrado, S., Abdulhay, E., Ramirez-Gonzalez, G.: Convolutional neural network for bio-medical image segmentation with hardware acceleration. Cogn. Syst. Res. 50, 10–14 (2018)


  35. Wang, X., Wang, J., Wang, X., Chen, X.: Energy and delay tradeoff for application offloading in mobile cloud computing. IEEE Syst. J. 11(2), 858–867 (2017)


  36. Wang, J., Zhang, J., Bao, W., Zhu, X., Cao, B., Yu, P.S.: Not just privacy: Improving performance of private deep learning in mobile cloud. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2407–2416 (2018)

  37. Xia, F., Ding, F., Li, J., Kong, X., Yang, L.T., Ma, J.: Phone2Cloud: Exploiting computation offloading for energy saving on smartphones in mobile cloud computing. Inf. Syst. Front. 16(1), 95–111 (2014)


  38. Xiang, L., Ma, H., Zhang, H., Zhang, Y., Ren, J., Zhang, Q.: Interpretable Complex-Valued Neural Networks for Privacy Protection. In International Conference on Learning Representations (2020)

  39. Zhang, T., Zhu, Q.: Dynamic differential privacy for ADMM-based distributed classification learning. IEEE Trans. Inf. Forensics Secur. 12(1), 172–187 (2016)


  40. Zhang, Q., Yang, L.T., Chen, Z.: Privacy preserving deep computation model on cloud for big data feature learning. IEEE Trans. Comput. 65(5), 1351–1362 (2015)


  41. Zhou, J., Dai, H.N., Wang, H.: Lightweight convolution neural networks for mobile edge computing in transportation cyber physical systems. ACM Trans. Intell. Syst. Technol. (TIST) 10(6), 1–20 (2019)




This work was funded in part by the National Key Research and Development Program of China (Grant number 2018AAA0102705), and in part by the National Natural Science Foundation of China (Grant number 61874124, 61876173).

Author information



Corresponding author

Correspondence to Huawei Li.


About this article


Cite this article

Tang, Y., Wang, Y., Li, H. et al. To cloud or not to cloud: an on-line scheduler for dynamic privacy-protection of deep learning workload on edge devices. CCF Trans. HPC (2020).



  • Real-time
  • Deep learning
  • Edge computing
  • Privacy protection