Distributed deep learning platform for pedestrian detection on IT convergence environment

Information technology (IT) has recently been combined with traditional industries, producing IT convergence technologies in various fields. Pedestrian detection in particular, through convergence with the automobile, is used in the autonomous navigation control of autonomous vehicles and is also applied in fields such as intelligent CCTV and robot recognition. Early pedestrian detection relied on hierarchical classification and feature vectors, and deep learning is now being actively adopted. However, deep learning for pedestrian detection must process a large volume of image data, which is time-consuming and requires substantial computing resources, so building such a system is very expensive. In this paper, we therefore present a distributed deep learning platform that makes it easy to build a cluster and to execute the deep learning process in a distributed cloud environment, while improving performance in several ways. The platform provides a multilayered system architecture with a convenient interface for executing the deep learning process easily and efficiently in a distributed environment. It builds and utilizes computing power easily and efficiently by leveraging container technology, also called OS-level virtualization, rather than traditional hypervisor-based virtualization. We improve overall performance by exploiting data and parameter parallelism at the same time, and we reduce synchronization overhead by using asynchronous communication for parameter updates. We also propose an efficient resource allocation scheme for parameter servers and slaves, which our experiments show to improve performance.


Introduction
IT convergence technologies, which combine information technology with traditional industries, have recently been emerging in various fields. Within IT convergence, pedestrian detection is particularly important, since it is used in many fields closely related to human life, such as intelligent CCTV, robot recognition, security, autonomous navigation, crime investigation and entertainment [1]. Pedestrian detection involves finding and recognizing a person in an image or video, together with the person's exact location. This information can be used for various purposes, such as providing services needed by the pedestrian or serving as a means of communication [2]. Pedestrian detection has also converged with the automobile field and is currently used in intelligent services for the autonomous navigation of vehicles.
Early pedestrian detection technology consisted of feature extraction followed by a learning process [3][4][5][6][7][8][9], which has a relatively high false detection rate compared to deep learning models. In recent years, convolutional neural networks (CNNs) [10][11][12] have been widely used; they learn from pedestrian data and detect objects through classification. This approach has a lower false detection rate than the early methods, but building the detector is laborious and training takes a very long time [13]. Therefore, in this paper, we adopt the faster region-based convolutional neural network (Faster R-CNN) [14], which offers excellent detection performance and uses a region proposal network (RPN) for faster detection. However, since deep learning for pedestrian detection must process a large volume of image data, it requires a large number of high-performance computers or several high-performance graphics processing units (GPUs). Building such an environment takes money and time, which places a heavy burden on researchers with limited funding. A cloud for deep learning processing is one solution to this problem. If an enterprise that operates a large data center provides a deep learning cloud service, deep learning researchers can conduct research with far less financial and time burden.
In this paper, we present a distributed deep learning platform that makes it easy to build a cluster and to execute the deep learning process quickly in a distributed cloud environment, while achieving performance improvements. The platform provides a multilayered system architecture with a convenient interface for executing the deep learning process easily and efficiently in a distributed environment. It utilizes computing power easily and efficiently by providing an interface for managing distributed resources and by leveraging container technology, also called OS-level virtualization, rather than traditional hypervisor-based virtualization. A hypervisor-based virtual machine runs on top of an emulator and a guest OS, which degrades performance and makes efficient resource utilization difficult [15]. A container, on the other hand, is an OS-level virtualization method that provides a virtual environment isolated from the physical system, including the CPU, memory and network. By providing deep learning clouds through container-based virtualization, physical resources can be used almost fully, with little virtualization overhead. We achieve performance improvements in several ways. First, we improve overall performance by exploiting data and parameter parallelism at the same time. For data parallelism, the input data are distributed among slave nodes, each of which computes gradients on its own portion; for parameter parallelism, the parameters are distributed among several parameter server nodes. Next, we reduce synchronization overhead by using asynchronous communication for parameter updates. Most distributed learning frameworks have been developed on Apache Spark [16], a high-performance distributed processing framework based on in-memory technology. Spark generally operates in a MapReduce-style batch processing mode and exchanges parameters synchronously.
Our system improves performance by adopting the asynchronous method for parameter exchange, since the synchronous method takes more time than the asynchronous one. We also propose an efficient resource allocation scheme for parameter servers and slaves, which our experiments show to improve performance. When parameter servers and slaves are placed in VMs on different physical nodes, remote communication occurs frequently. We therefore reduce the network overhead by placing a parameter server and a slave in separate VMs on the same physical node.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the system architecture and control flow of our platform, and Sect. 4 explains the distributed deep learning process. Section 5 presents the experimental results, and Sect. 6 concludes the paper.

Related work
This section reviews deep learning models for pedestrian detection and a widely used programming library, and then describes distributed parallel processing techniques for deep learning and the underlying distributed environments.

Deep learning models for pedestrian detection
Convolutional neural networks (CNNs) have been widely used recently because of their excellent performance in object recognition [17]. In pedestrian detection, a CNN classifies whether or not an extracted image region is a pedestrian. An image of a fixed size is given as input, and the final output stage of the CNN indicates whether it contains a pedestrian. Feature extraction and classification are both included in one structure. In contrast, the region-based convolutional neural network (R-CNN) [18] uses selective search, which is based on image segmentation, to extract all candidate object regions and feeds the extracted regions into a trained CNN to classify them. However, R-CNN training is complicated and its test speed is slow. Fast R-CNN was developed to solve this problem. Like R-CNN, Fast R-CNN predicts object regions from the image, but unlike R-CNN it can process all candidate object regions with a single CNN computation per image. The original image is not distorted, since the crop and resize operations of the earlier technique are omitted and processing is instead performed by the ROI pooling layer on the feature map computed by the CNN. However, in Fast R-CNN the selective-search region proposal is still performed outside the neural network and becomes the bottleneck of the overall performance. To solve this problem, Faster R-CNN was proposed, which combines an RPN with Fast R-CNN. Instead of the slow external selective search, Faster R-CNN uses the fast internal RPN. It also has the advantage that the RPN and the detection network share the convolutional feature map and can be trained together through back-propagation. In this paper, the Faster R-CNN model is adopted for pedestrian detection because of its fast object detection and excellent accuracy.

Deeplearning4j framework
Our deep learning programs are based on Deeplearning4j [19], an open-source library for Java and the JVM. It supports various deep learning algorithms, such as the restricted Boltzmann machine, deep belief network, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, as well as word2vec, doc2vec and GloVe. These algorithms can be executed in parallel using Hadoop and Spark. Deeplearning4j is mainly written in Java and includes a Scala application programming interface (API). Numerical computation is performed by ND4J.
Deeplearning4j is compatible with both central processing units (CPUs) and GPUs, and its learning process is structured to take advantage of distributed computing. In other words, training can be performed with an iterative MapReduce model based on Hadoop and Spark. In addition, it supports CUDA kernels, which enable computation on a GPU and distributed learning using multiple GPUs. With the Deeplearning4j framework, any combination of restricted Boltzmann machines, convolutional neural networks, autoencoders and recurrent neural networks can be used.

Distributed parallel deep learning
In this section, we shall describe various techniques for distributed parallel deep learning including parallelism, parameter sharing and communication.

Parallelism
Distributed parallel training is used in deep learning for accelerating training, which can be classified into two ways: data parallelism and model parallelism. Figure 1a outlines the data parallelism method, where the input data set for learning is divided among multiple computers for training. In this method, the entire model is loaded into each distributed computer, and the input data are divided into several parts each of which is distributed among computers for training. Then, the weights computed by each computer are merged to update the whole parameters [20].
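The data parallelism scheme described above can be illustrated with a small, self-contained Python sketch. It is not taken from the platform itself: the one-parameter model, the function names and the strided sharding are assumptions made only for illustration, and the sketch merges per-shard gradients rather than weights.

```python
# Minimal sketch of data parallelism for a one-parameter model y = w * x.
# All names are hypothetical and chosen only for illustration.

def local_gradient(w, shard):
    """Sum of squared-error gradients over one computer's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def data_parallel_step(w, samples, num_workers, lr):
    """Every worker holds the full model (w) and sees only its own shard;
    the per-shard gradients are then merged into one update."""
    shards = [samples[i::num_workers] for i in range(num_workers)]
    merged = sum(local_gradient(w, shard) for shard in shards)
    return w - lr * merged / len(samples)

samples = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with w = 3
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, samples, num_workers=4, lr=0.01)
print(round(w, 6))
```

Because the merged gradient equals the gradient over the whole data set, the number of workers changes only how the work is divided, not the update that is applied.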

Parameter sharing
There are two ways to share parameters: full mesh topology-based sharing and star topology-based sharing [21,22]. Figure 2a outlines the full mesh technique, where each computer delivers its parameters directly to all the other computers. This approach has the advantage of not requiring separate shared storage management, but its scalability is limited, since the number of communications grows on the order of N^2 with the number N of distributed computers. Figure 2b outlines the star approach, which requires concurrency control and synchronization of the shared parameter storage, but scales better, since each computer communicates only with the shared storage and the number of communications does not grow quadratically as computers are added.
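The difference in scalability can be made concrete with a simple count of messages per exchange round. The counting model below is an assumption made for illustration (one message per directed link in the full mesh, and one push plus one pull per computer in the star); the function names are hypothetical.

```python
def full_mesh_messages(n):
    """Each of n computers sends its parameters to the other n - 1."""
    return n * (n - 1)

def star_messages(n):
    """Each of n computers pushes to and pulls from one shared store."""
    return 2 * n

for n in (2, 4, 8, 16):
    print(n, full_mesh_messages(n), star_messages(n))
```

Under this count, the full mesh needs 56 messages per round for 8 computers, while the star needs 16.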

Synchronous and asynchronous communication
In the star topology-based parameter sharing technique, parameter synchronization between the computers is necessary, since the distributed computers need to update their respective parameters on the central computer. With a synchronous update, the overall training speed is bounded by the last computer to send its parameters to the central computer, since the execution time differs for each distributed computer depending on its performance, as shown in Fig. 3. To solve this problem, an asynchronous update approach can be used, as shown in Fig. 4. Here training proceeds without synchronizing the parameters received from each computer. Most distributed frameworks are based on the synchronous technique and provide an additional asynchronous technique to partially improve the training speed [23]. Asynchronous processing is more complicated than synchronous processing and carries the risk of deadlock.
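The effect of a slow computer on the two update schemes can be estimated with a deliberately simplified timing model. The sketch below assumes fixed per-step times, ignores communication cost and the effect of stale parameters, and uses hypothetical names throughout.

```python
def synchronous_time(step_times, rounds):
    """Every round ends only when the slowest computer has reported."""
    return rounds * max(step_times)

def asynchronous_updates(step_times, duration):
    """Each computer keeps updating at its own pace for `duration` seconds."""
    return sum(int(duration // t) for t in step_times)

step_times = [1.0, 1.0, 1.0, 2.0]      # seconds per step; one slow computer
sync_total = synchronous_time(step_times, rounds=10)
sync_updates = 10 * len(step_times)
async_updates = asynchronous_updates(step_times, sync_total)
print(sync_total, sync_updates, async_updates)
```

In this toy setting the synchronous scheme applies 40 updates in 20 seconds, whereas the asynchronous scheme applies 70 updates in the same time, because the three fast computers never wait for the slow one.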

Akka
In our system, Akka [24] plays an important role in driving the system and in communication between nodes. Akka is a free and open-source toolkit and runtime that simplifies the construction of concurrent and distributed applications on the JVM. It supports multiple programming models for concurrency but emphasizes actor-based concurrency, with inspiration drawn from Erlang. The actor model implemented by Akka is based on the mathematical model proposed by Carl Hewitt in 1973. As the number of CPU cores in a single computer increases rapidly and concurrent programs become increasingly difficult to write with multithreading, Akka provides an intuitive and convenient programming model for writing concurrent code. The actor model has several features. An actor's methods cannot be called directly; it can only be sent messages. Message passing is asynchronous and non-blocking, and actors run concurrently. These features are very powerful: with Akka, it is possible to eliminate all, or at least most, of the sequential parts and blocking calls that exist throughout a program. In addition, Akka is designed to make scaling out straightforward.
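The actor features listed above (private state, message-only interaction, asynchronous non-blocking sends) can be illustrated without Akka itself. The following Python sketch is a toy actor built on a thread and a queue; it does not use or imitate the Akka API, and the class and message names are hypothetical.

```python
import queue
import threading

class CounterActor:
    """A toy actor: private state, a mailbox, and one processing thread."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0                  # only the actor's own thread touches this
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message, reply_to=None):
        """Asynchronous, non-blocking send: enqueue and return at once."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, so no lock is needed for _count.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)
            elif message == "stop":
                break

    def join(self):
        self._thread.join()

def send_increments(actor, times):
    for _ in range(times):
        actor.tell("increment")

actor = CounterActor()
senders = [threading.Thread(target=send_increments, args=(actor, 100))
           for _ in range(4)]
for s in senders:
    s.start()
for s in senders:
    s.join()

reply = queue.Queue()
actor.tell("get", reply)      # enqueued after all 400 increments
total = reply.get()
actor.tell("stop")
actor.join()
```

Four threads send messages concurrently, yet the counter needs no explicit locking, since only the actor's own thread ever reads or writes its state.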

HDFS
Hadoop Distributed File System (HDFS) [25] is a block-structured file system based on the Google File System (GFS). Large files of tens of terabytes or more can be stored across distributed servers, and storage can be built from low-end servers, which is an advantage over existing large-scale storage systems (NAS, DAS, SAN, etc.). Network File System (NFS) is a general distributed file system developed so that many users can access data in a network environment, but it has the restriction that only one logical volume stored on a single device can be accessed remotely [26]. HDFS was designed to overcome this limitation of NFS. It can store terabytes or petabytes by distributing data among multiple computers, and it supports much larger files than NFS. Since HDFS stores data with high reliability, the data remain available even if individual computers fail.


Container
A container, also known as OS-level virtualization, is a technique for running isolated execution environments on top of a single host operating system. As shown in Fig. 5, a container shares the kernel of the host OS with other containers and has its own runtime, libraries, etc., to execute application programs. Unlike existing hypervisor-based virtualization, a container can make efficient use of resources, as it requires neither a guest OS nor an emulator.
Linux Containers (LXC) [27] is an OS-level virtualization technique for running multiple isolated Linux systems (containers) on a single control host. LXD [28] is a pure container hypervisor that runs Linux systems in a VM-like manner at remarkable speed and density. Container technologies such as Docker, which emerged as alternatives to kernel-based virtualization such as KVM, have raised security concerns. As a result, despite the speed and density that enable efficient resource utilization, containers have often been deployed on top of traditional hypervisors, without making full use of the available resources. LXD, a new pure container-centric hypervisor, builds on the same underlying technology as Docker containers while aiming to provide the same level of security as traditional hypervisors like KVM.

Distributed deep learning platform
In this section, we describe the architecture of our distributed deep learning system. The architecture consists of three layers, the Distributed Deep Learning Interface Layer (DDLIL), the Distributed Deep Learning Execution Layer (DDLEL) and the Distributed System Infrastructure Layer (DSIL), as shown in Fig. 6. In DDLIL, the user develops deep learning application programs and submits them to DDLEL, which in turn manages and allocates system resources in the distributed environment and then runs the programs. DSIL uses HDFS for storing distributed files and deploys the virtual programming environment and deep learning model on a container-based cloud infrastructure. Each layer is explained in detail in the following sections.

Distributed Deep Learning Interface Layer (DDLIL)
DDLIL is responsible for receiving the user's request to run an application, building a job and delivering it to the system. It consists of a job submitter and a parallel model manager, as shown in Fig. 7. The job submitter consists of a submit module, a request analyzer and a request builder. The submit module receives requests from users for application program execution. The request analyzer extracts the settings in the user request, such as the parallelism, build file path, main class path, and the number of parameter servers and slaves. The request builder then builds the parallel application program by interacting with the parallel model manager, which generates a suitable parallel model using its data and parameter partition modules. The data partition module is responsible for the data parallelism of distributed deep learning: for each application, it allocates data according to the number of slave nodes and the total size of the input data. The parameter partition module is responsible for the parameter parallelism of distributed deep learning: it allocates the parameters of the model according to the number of parameter servers. The generated model is packaged as a job together with the related information and transmitted to DDLEL for deployment and execution in the system.
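One plausible way for the data and parameter partition modules to divide their work is an even, contiguous split. The sketch below is only an illustration of that idea: the function names, the dictionary layout of the job and the even-split policy are assumptions, not the platform's actual interface.

```python
def partition_ranges(total, parts):
    """Split range(total) into `parts` contiguous (start, end) ranges
    whose sizes differ by at most one."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def build_job(num_samples, num_parameters, num_slaves, num_servers):
    """Hypothetical job description carrying both partitions."""
    return {
        "data_partitions": partition_ranges(num_samples, num_slaves),
        "parameter_partitions": partition_ranges(num_parameters, num_servers),
    }

job = build_job(num_samples=7481, num_parameters=10,
                num_slaves=4, num_servers=3)
print(job["data_partitions"])
print(job["parameter_partitions"])
```

With 7481 samples and 4 slaves, the first slave receives 1871 samples and the others 1870 each, so no slave holds more than one sample above its fair share.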

Distributed Deep Learning Execution Layer (DDLEL)
DDLEL manages the clustered resources, distributes the user application to the selected nodes and executes it. It has four roles. The first is to manage the state of the cluster and infrastructure. The second is to manage the resources needed to execute the application. The third is to deploy the job to the allocated resources. The fourth is to run the application. Figure 8 outlines the control flow and the components.
When DDLEL receives a job created from a user request by DDLIL, the stream manager generates a stream controller to handle the job. The stream manager manages the lifecycle of the stream controllers and creates as many stream controllers as there are jobs submitted from DDLIL. Each stream controller corresponds to one job requested by the user and manages several resource agent controllers. The stream manager is maintained until the stream controller is finished. The stream controller requests resources from the resource manager through the resource requester. To communicate with the allocated resources, the stream controller generates, through the resource agent manager, as many resource agent controllers as there are allocated nodes. The requested job is deployed to each resource agent through its resource agent controller. After receiving the job, the resource agent generates a task, which actually executes the job.

Fig. 8 Components of DDLEL
The resource agent communicates directly with the resource agent controller. Once a job is allocated to the resource agent, a task is created to execute the user application program. The resource agent also communicates periodically with the infrastructure information manager of the resource manager and reports the condition of its node. A task consists of an execution module and a communication module for the user application. It manages the lifecycle of the user application program and is maintained until the application program is completed. The execution module runs the actual application program: the user application is loaded from the class in the job's build file using the class loader. The communication module handles the communication between nodes that arises during distributed processing of the user application program. It is used to exchange parameters between the parameter servers and the slaves during distributed deep learning processing.
The resource manager provides as many resources as the user application needs. When the application has finished using the resources, they are reclaimed. If there are not enough resources for the user's request, the process is terminated and a warning is issued. The infrastructure information manager manages the status of resources. If one of the nodes in the cluster goes down, it is removed from the resource pool. The resource negotiator is responsible for determining the appropriate resources according to the status of each resource when the stream controller requests resources through the resource requester. Once the resource negotiator has determined the resources to allocate, the resource provider deploys the allocated resources through DSIL.
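The allocate, reclaim and remove behavior of the resource manager can be summarized in a small bookkeeping sketch. The class below is a hypothetical simplification written for illustration; it selects nodes in name order and does not model the negotiation policy, the node status reports or the deployment through DSIL.

```python
class ResourcePool:
    """Toy bookkeeping for free and allocated nodes (hypothetical names)."""

    def __init__(self, nodes):
        self.free = set(nodes)
        self.allocated = {}              # job id -> set of nodes

    def allocate(self, job_id, count):
        """Hand out `count` free nodes, or fail if too few are available."""
        if count > len(self.free):
            raise RuntimeError("not enough resources for job %s" % job_id)
        chosen = set(sorted(self.free)[:count])
        self.free -= chosen
        self.allocated[job_id] = chosen
        return chosen

    def release(self, job_id):
        """Reclaim the nodes of a finished job."""
        self.free |= self.allocated.pop(job_id)

    def remove_node(self, node):
        """Drop a failed node from the pool and from any allocation."""
        self.free.discard(node)
        for nodes in self.allocated.values():
            nodes.discard(node)

pool = ResourcePool(["n1", "n2", "n3", "n4"])
first = pool.allocate("job-a", 3)
try:
    pool.allocate("job-b", 2)            # only one node is left
    rejected = False
except RuntimeError:
    rejected = True
pool.release("job-a")
pool.remove_node("n4")
```

The second request is rejected because only one node remains free, which mirrors the rule that a request is terminated with a warning when resources are insufficient.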
The stream manager also manages a buffer consisting of several topics, each storing real-time images coming from a source such as CCTV, as shown in Fig. 9. Using Kafka, the real-time images in each topic are distributed over multiple partitions, which reside on the file system of a single broker node or of several distributed broker nodes. Each partition of a topic is processed by a single node, so that multiple partitions can be processed simultaneously as a preprocessing step for deep learning.

Distributed System Infrastructure Layer (DSIL)
DSIL provides a container-based cloud and HDFS for distributed file storage, as shown in Fig. 10. It is the infrastructure on which our system operates and consists of an LXD controller and an HDFS controller. The LXD controller is used to create container-based virtual machines on the physical nodes and to set the amount of each resource they may use. The HDFS controller is used to create the Hadoop Distributed File System (HDFS) that stores the inputs of the deep learning process. Since HDFS stores files in block units on the distributed nodes, it can reduce the network load caused by multiple nodes requesting the files.

System control flow
In this section, we shall describe the control flow as well as a hierarchy of our system as shown in Fig. 11. The control flow is largely divided into two parts: DDLIL and DDLEL. An application written by the user is delivered to DDLIL responsible for building and submitting it to our system. The application delivered to the DDLIL is created as a job containing a variety of information for running applications on our system. One example of the information is about the parallelism of the model. The job is delivered to DDLEL, which in turn requests resources from resource manager to run the application. It also deploys the job to the agent of the allocated distributed resource. Each agent receives a job and generates a task to execute the application and creates a communication module in the distributed deep learning application. Then, the application is executed as a task by agent.
Figure 12 shows the software hierarchy of our system. At the lowest level are the physical nodes, which provide the real resources for running the deep learning application program. Above them is the LXD machine container layer, which provides the resource virtualization service, followed by the HDFS layer for loading the input files of the application program; HDFS provides input data to the distributed nodes for data parallelism. As mentioned previously, resources can be used efficiently when OS-level virtualization is used. Next is the distributed deep learning system software, which supports the distributed and cloud environment. Our deep learning system uses Deeplearning4j and provides a wrapper for deep learning applications.

Distributed deep learning processing
Distributed deep learning processing is executed as follows. First, the input data are distributed among the slaves, and all parameters are distributed among the parameter servers. Each slave receives the parameters from the parameter servers and then computes the gradient by processing a part of its input data, of the specified batch size, using the received parameters. The computed gradients are sent back to the parameter servers, each of which updates its corresponding parameters. This training process is repeated until sufficient accuracy is obtained.
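The steps above can be sketched as a single-process simulation with two parameter servers and two slaves. The linear model, the toy data set, the learning rate and all class and function names are assumptions made for illustration; a real deployment would run the slaves and servers on separate nodes and train a deep network.

```python
class ParameterServer:
    """Owns one contiguous shard of the parameter vector."""

    def __init__(self, values):
        self.values = list(values)

    def pull(self):
        return list(self.values)

    def push(self, gradients, lr):
        self.values = [v - lr * g for v, g in zip(self.values, gradients)]

def pull_all(servers):
    """A slave assembles the full parameter vector from every shard."""
    weights = []
    for server in servers:
        weights.extend(server.pull())
    return weights

def push_all(servers, gradients, lr):
    """A slave sends each server the gradient slice for its shard."""
    start = 0
    for server in servers:
        end = start + len(server.values)
        server.push(gradients[start:end], lr)
        start = end

def batch_gradient(weights, batch):
    """Mean squared-error gradient of a linear model over one batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * error * xj / len(batch)
    return grad

true_w = [1.0, -2.0, 0.5, 3.0]
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
          [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
data = [(x, sum(w * xi for w, xi in zip(true_w, x))) for x in inputs]

slave_data = [data[0::2], data[1::2]]            # data parallelism
servers = [ParameterServer([0.0, 0.0]),          # parameter parallelism
           ParameterServer([0.0, 0.0])]

for _ in range(500):
    for batch in slave_data:                     # each slave in turn
        weights = pull_all(servers)
        push_all(servers, batch_gradient(weights, batch), lr=0.5)

learned = pull_all(servers)
```

After training, the parameters assembled from both servers match the weights used to generate the data, even though no single server ever holds the whole model.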

Data and parameter parallelism
Our system provides two kinds of parallelism in the distributed environment: data parallelism and parameter parallelism. For the former, the whole model is loaded for each slave, and the input data is distributed among slaves for training as shown in Fig. 13. Each time the training process in each slave is repeated, the modified gradients are exchanged with the parameter servers. Since deep learning training processes very large input data, data parallelism can be an efficient method to reduce the overall training time through parallel processing.
The size of overall model parameters is very large when large-scale deep learning processing is performed as shown in Fig. 14. Therefore, it may incur a network bottleneck during parameter exchange in distributed environment. In order to solve this problem, parameter parallelism is used by distributing the entire parameters across several parameter servers, thus reducing the communication overhead arising in the centralized parameter server.

Parameter exchange method
There are two methods for parameter exchange: asynchronous and synchronous. Figure 15 shows parameter exchange method for both methods. Our system adopts the asynchronous method for the parameter exchange method.
In the synchronous method, updates on the parameter servers are performed in step across all nodes. This method has advantages in accuracy and implementation cost. However, it forces all the other slaves to wait while the parameter server performs each update. The overall performance is therefore bounded by that of the slowest node, resulting in performance degradation. In other words, each slave node that has already sent its update must wait until the last gradient update has been applied. In the asynchronous method, on the other hand, each slave communicates asynchronously with the parameter servers to update the parameters. Because updates are performed more concurrently and no slave waits for the others, the asynchronous method can train faster than the synchronous method without sacrificing accuracy.
Many distributed frameworks are based on synchronous methods. In particular, Deeplearning4j, the foundation of our deep learning process, implements distributed processing on Apache Spark, a large-scale distributed processing framework that follows a synchronous, MapReduce-based model. Distributed deep learning based on Apache Spark therefore exchanges parameters synchronously, and its performance is somewhat lower than that of the asynchronous method. Although our system is based on Deeplearning4j, it provides a way to exchange parameters asynchronously without deploying Apache Spark, which improves the overall performance.
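The difference between the two exchange methods comes down to whether a barrier separates training rounds. The following thread-based sketch records the order in which updates reach a shared log standing in for the parameter server; the function and variable names are hypothetical, and the sketch does not reflect how the platform implements its exchange on Akka.

```python
import threading

def run_exchange(num_slaves, rounds, synchronous):
    """Return the order in which (round, slave) updates were applied."""
    log = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_slaves) if synchronous else None

    def slave(slave_id):
        for r in range(rounds):
            with lock:                   # the parameter update itself
                log.append((r, slave_id))
            if barrier is not None:
                barrier.wait()           # wait until every slave has updated

    threads = [threading.Thread(target=slave, args=(i,))
               for i in range(num_slaves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

sync_log = run_exchange(num_slaves=4, rounds=5, synchronous=True)
async_log = run_exchange(num_slaves=4, rounds=5, synchronous=False)
sync_rounds = [r for r, _ in sync_log]
```

In synchronous mode no update of round r + 1 can appear before all updates of round r, so every slave advances at the pace of the slowest one. In asynchronous mode the same number of updates is applied, but their order is unconstrained and no slave waits.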

Performance evaluation
For the experimental evaluation, the Akka and Deeplearning4j libraries are used in our system for distributed deep learning processing. Akka is an open-source toolkit that simplifies the construction of concurrent and distributed programs. It enables us to build our distributed system, including clustering, job deployment, resource management and so on. Our deep learning processing is based on Deeplearning4j, an open-source deep learning library written in Java. It supports a variety of deep learning models, including restricted Boltzmann machines, CNNs, recurrent neural networks and so on. We provide the DistMultiLayerNetwork class, which wraps the MultiLayerNetwork class of Deeplearning4j, to run Deeplearning4j in distributed mode on our system. The application program written by the user is submitted to the system by executing the main submit class, which analyzes the request, creates the task and sends it to the submit module. Since the submit module is an Akka actor, the main submit class creates an Akka actor system before sending the task to the submit module.

System environment
Our system environment for performance evaluation consists of 8 containers created with LXD, an OS-level virtualization technology, on 4 physical nodes, each with 8 CPUs (Intel(R) Xeon(R) CPU E5606 @ 2.13 GHz) and 24 GB of DDR3 main memory. Each container is allocated 4 CPUs with 4 cores and 10 GB of RAM. Performance is evaluated in several experiments. Figure 16 shows our experimental environment.
We run our experiments with the Faster R-CNN deep learning model on the KITTI data set for pedestrian detection. We measure performance while varying the number of parameter servers and slaves as well as their placement. For performance evaluation, we focus on comparing the synchronous and asynchronous modes under various configurations in terms of the overall time taken for the whole training process, since it is hard to separate communication time from computation time in asynchronous mode. We look for the optimal placement from the viewpoint of overall training performance, showing that the asynchronous mode outperforms the synchronous one because computation and communication overlap, rather than analyzing communication and computation costs separately.
Our data set for the pedestrian detection experiment on Faster R-CNN is a set of images from the KITTI data set. It consists of 7481 training images and 7518 test images, each with a size of 1242 × 375 pixels. Figure 17 shows examples from the KITTI data set.

Experimental result
First, we compare performance under various placements of slaves on VMs. Table 1 shows four experimental configurations with a fixed number of slaves and parameter servers: four slaves in one VM; two slaves in each of two VMs on one node; two slaves in each of two VMs on two nodes; and one slave per VM on four nodes. Figure 18 shows the performance of each configuration in both the synchronous and asynchronous cases. The results show that placing multiple slaves in a single VM does not improve performance, since the CPUs of that VM are shared. In other words, assigning a single role, either parameter server or slave, to each VM gives the best performance.
Next, we measure performance while varying the number of parameter servers and slaves. First, we increase the number of slaves, each in its own VM, for different numbers of parameter servers, as shown in Figs. 19, 20 and 21. In both the synchronous and asynchronous cases, performance first increases and then decreases as the number of slaves grows, since network communication becomes more frequent as slaves are added. The asynchronous case performs better than the synchronous one. The best performance is obtained with one parameter server and five slaves in asynchronous mode. This configuration is 1.98 times faster than the non-parallelized case and 1.66 times faster than the corresponding synchronous case. It is also 1.31 times faster than the best configuration with multiple parameter servers, which uses 4 parameter servers and 4 slaves. Next, we increase the number of parameter servers for different numbers of slaves, as shown in Figs. 22, 23 and 24. The results show that performance does not always increase with the number of parameter servers. Rather, the overall performance decreases, since the parameters are relatively small in size. The best performance can be achieved by choosing the number of parameter servers according to the number of parameters and the amount of data distributed to each node and the network speed.

Optimal placement of slave and parameter server
In this section, we suggest a placement strategy for parameter servers and slaves to obtain the best performance in distributed deep learning. It attempts to reduce the frequency of remote communication by placing parameter servers and slaves on the same physical node. Table 2 shows the placement under this strategy. Figure 25 shows that this placement performs better than random placement. With the optimal placement, the asynchronous communication method is 1.55 times faster than training without parallelism and 1.27 times faster than the synchronous communication method, with 4 parameter servers and 4 slaves.
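The benefit of co-locating roles can be estimated by counting how many slave-to-server links cross physical nodes. The two placements below are hypothetical examples for 4 parameter servers and 4 slaves on 4 nodes and are not the placements of Table 2; the sketch also assumes that every slave exchanges parameters with every parameter server.

```python
def count_links(placement, slaves, servers):
    """Return (local, remote) counts of slave-to-server links.
    `placement` maps each role name to the physical node hosting its VM."""
    local = remote = 0
    for s in slaves:
        for p in servers:
            if placement[s] == placement[p]:
                local += 1
            else:
                remote += 1
    return local, remote

slaves = ["slave%d" % i for i in range(4)]
servers = ["ps%d" % i for i in range(4)]

co_located = {}   # one slave VM and one server VM on every node
separated = {}    # slaves on nodes 0-1, servers on nodes 2-3
for i in range(4):
    co_located["slave%d" % i] = "node%d" % i
    co_located["ps%d" % i] = "node%d" % i
    separated["slave%d" % i] = "node%d" % (i % 2)
    separated["ps%d" % i] = "node%d" % (2 + i % 2)

print(count_links(co_located, slaves, servers))
print(count_links(separated, slaves, servers))
```

Under these assumptions, co-location turns 4 of the 16 links into local ones, while the separated placement sends every exchange over the network.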

Conclusion
In this paper, we have presented a distributed deep learning platform that makes it easy to build a cluster and to execute the deep learning process quickly in a distributed cloud environment, while achieving performance improvements. The platform provides a convenient interface for executing the deep learning process easily and efficiently in a distributed environment through a system architecture consisting of three layers: DDLIL, DDLEL and DSIL. In DDLIL, the user develops deep learning application programs and submits them to DDLEL, which in turn allocates system resources in the distributed environment and then runs the programs. DSIL provides HDFS for storing distributed files and deploys the virtual programming environment and deep learning model on a container-based cloud infrastructure. With our system, a cluster can be built and distributed processing can be run in a distributed environment with one simple command.
The features of our platform allow deep learning to be processed efficiently in a distributed environment. Our system utilizes computing power easily and efficiently by leveraging container technology, also called OS-level virtualization, rather than traditional hypervisor-based virtualization. We achieve performance improvements in several ways. First, we have improved overall performance by exploiting data and parameter parallelism at the same time. Next, we have reduced synchronization overhead by using asynchronous communication for parameter updates. We have also proposed an efficient resource allocation scheme for parameter servers and slaves, which our experiments show to improve performance. When parameter servers and slaves are placed in VMs on different physical nodes, remote communication occurs frequently; we have suggested a way to improve performance by placing parameter servers and slaves on the same physical node, which reduces the frequency of remote communication. Experimental results have shown that our scheme for distributed deep learning shortens processing time by adopting the asynchronous communication method: in both experiments, asynchronous communication performed better than synchronous communication. In addition, the strategy of placing parameter servers and slaves in separate VMs on a single physical node further boosts performance.
Fig. 25 Performance with respect to optimal placement
As future work, we plan to further develop the distributed parallel platform for fast deep learning in other computation-intensive applications.