CHAOS: A Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi

Deep learning is an important component of big-data analytic tools and intelligent applications, such as self-driving cars, computer vision, speech recognition, and precision medicine. However, the training process is computationally intensive and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training convolutional neural networks (CNN) named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). Major features of CHAOS include the support for thread and vector parallelism, non-instant updates of weight parameters during back-propagation without a significant delay, and implicit synchronization in arbitrary order. CHAOS is tailored for parallel computing systems that are accelerated with the Intel Xeon Phi. We evaluate our parallelization approach empirically using measurement techniques and performance modeling for various numbers of threads and CNN architectures. Experimental results for the MNIST dataset of handwritten digits using the total number of threads on the Xeon Phi show speedups of up to 103x compared to the execution on one thread of the Xeon Phi, 14x compared to the sequential execution on Intel Xeon E5, and 58x compared to the sequential execution on Intel Core i5.


Introduction
Traditionally, engineers developed applications by specifying computer instructions that determined the application behavior. Nowadays, engineers focus on developing and implementing sophisticated deep learning models that can learn to solve complex problems. Moreover, deep learning algorithms [27] can learn from their own experience rather than that of the engineer.
Many private and public organizations are collecting huge amounts of data that may contain useful information from which valuable knowledge may be derived. With the pervasiveness of the Internet of Things, the amount of available data is growing considerably [20]. Deep learning is a useful tool for analyzing and learning from massive amounts of data (also known as Big Data) that may be unlabeled and unstructured [47,44,36]. Deep learning algorithms can be found in many modern applications [54,50,19,56,48,17,21,59], such as voice recognition, face recognition, autonomous cars, classification of liver diseases and breast cancer, computer vision, and social media.
A Convolutional Neural Network (CNN) is a variant of a Deep Neural Network (DNN) [14]. Inspired by the visual cortex of animals, CNNs are applied to state-of-the-art applications, including computer vision and speech recognition [15]. However, supervised training of CNNs is computationally demanding and time consuming, and in many cases, several weeks are required to complete a training session. Often applications are tested with different parameters, and each test requires a full session of training.
Multi-core processors [55] and in particular many-core [5] processing architectures, such as the NVIDIA Graphical Processing Unit (GPU) [37] or the Intel Xeon Phi [8] co-processor, provide processing capabilities that may be used to significantly speed up the training of CNNs. While existing research [12,53,48,57,41] has extensively addressed the training of CNNs using GPUs, so far not much attention has been given to the Intel Xeon Phi co-processor. Besides its performance capabilities, the Xeon Phi deserves our attention because of its programmability [38] and portability [23].
In this paper, we present our parallelization scheme for training convolutional neural networks, named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). CHAOS is tailored for the Intel Xeon Phi co-processor and exploits both thread- and SIMD-level parallelism. The thread-level parallelism is used to distribute the work across the available threads, whereas SIMD parallelism is used to compute the partial derivatives and weight gradients in the convolutional layers. Empirical evaluation of CHAOS is performed on an Intel Xeon Phi 7120 co-processor. For experimentation, we use various numbers of threads, different CNN architectures, and the MNIST dataset of handwritten digits [29]. Experimental evaluation results show that using the total number of available threads on the Intel Xeon Phi we can achieve speedups of up to 103× compared to the execution on one thread of the Xeon Phi, 14× compared to the sequential execution on Intel Xeon E5, and 58× compared to the sequential execution on Intel Core i5. The error rates of the parallel execution are comparable to the sequential one. Furthermore, we use performance prediction to study the performance behavior of our parallel solution for training CNNs for numbers of cores that go beyond the generation of the Intel Xeon Phi that was used in this paper. The main contributions of this paper include: (1) design and implementation of the CHAOS parallelization scheme for training CNNs on the Intel Xeon Phi; (2) performance modeling of our parallel solution for training CNNs on the Intel Xeon Phi; (3) measurement-based empirical evaluation of the CHAOS parallelization scheme; and (4) model-based performance evaluation for future architectures of the Intel Xeon Phi.
The rest of the paper is organized as follows. We discuss the related work in Section 2. Section 3 provides background information on CNNs and the Intel Xeon Phi many-core architecture. Section 4 discusses the design and implementation aspects of our parallelization scheme. The experimental evaluation of our approach is presented in Section 5. We summarize the paper in Section 6.

Related Work
In comparison to related work that targets GPUs, work on machine learning for the Intel Xeon Phi is sparse. In this section, we describe machine learning approaches that target the Intel Xeon Phi co-processor, and thereafter we discuss CNN solutions for GPUs and contrast them with our CHAOS implementation.

Machine Learning targeting Intel Xeon Phi
In this section, we discuss existing work for Support Vector Machines (SVMs), Restricted Boltzmann Machines (RBMs), sparse auto encoders and the Brain-State-in-a-Box (BSB) model.
You et al. [58] present a library for parallel Support Vector Machines, MIC-SVM, which facilitates the use of SVMs on many- and multi-core architectures including the Intel Xeon Phi. Experiments performed on several known datasets showed up to 84x speedup on the Intel Xeon Phi compared to the sequential execution of LIBSVM [6]. In comparison to their work, we target deep learning.
Jin et al. [22] perform the training of sparse auto encoders and restricted Boltzmann machines on the Intel Xeon Phi 5110p. The authors reported a speedup factor of 7-10× compared to the Xeon E5620 CPU and more than 300× compared to the un-optimized version executed on one thread on the co-processor. Their work targets unsupervised deep learning of restricted Boltzmann machines and sparse auto encoders, whereas we target supervised deep learning of CNNs.
The performance gain on the Intel Xeon Phi 7110p for a model called Brain-State-in-a-Box (BSB), used for text recognition, is studied by Ahmed et al. in [2]. The authors report a roughly two-fold speedup for the co-processor compared to a CPU with 16 cores when parallelizing the algorithm. While both approaches target the Intel Xeon Phi, our work addresses training of CNNs on the MNIST dataset.

Related Work Targeting CNNs
In this section, we discuss CNN solutions for GPUs in the context of computer vision (image classification). Work related to the MNIST dataset [29] is of most interest; work on the NORB [30] and CIFAR-10 [25] datasets is also considered. Additionally, work done in speech recognition and document processing is briefly addressed. We conclude this section by contrasting the presented related work with our CHAOS parallelization scheme.
The work presented by Cireşan et al. [12] targets a CNN implementation that raised the bar for the CIFAR-10 (19.51% error rate), NORB (2.53% error rate) and MNIST (0.35% error rate) datasets. The training was performed on GPUs (Nvidia GTX 480 and GTX 580), where the authors managed to decrease the training time substantially (up to 60× compared to sequential execution on a CPU) and to decrease the error rates to what was, at the time, a state-of-the-art accuracy level.
Later, Cireşan et al. [11] presented their multi-column deep neural network for classification of traffic signs. The results show that the model performed almost human-like (humans' error rate is about 0.20%) on the MNIST dataset, achieving a best error rate of 0.23%. The authors trained the network on a GPU.
Vrtanoski et al. [53] use OpenCL for parallelization of the back-propagation algorithm for pattern recognition. They showed a significant cost reduction: a maximum speedup of 25.8× was achieved on an ATI 5870 GPU compared to a Xeon W3530 CPU when training the model on the MNIST dataset.
The ImageNet challenge aims to evaluate algorithms for large-scale object detection and image classification based on the ImageNet dataset. Krizhevsky et al. [26] joined the challenge and reduced the error rate of the test set to 15.3% from the second best 26.2% using a CNN with 5 convolutional layers. For the experiments, two GPUs (Nvidia GTX 580) were used only communicating in certain layers. The training lasted for 5 to 6 days.
In a later challenge, ILSVRC 2014, a team from Google entered the competition with GoogleNet, a 22-layer deep CNN, and won the classification challenge with a 6.67% error rate. The training was carried out on CPUs. The authors state that the network could be trained on GPUs within a week, highlighting the limited amount of memory as one of the major concerns [48].
Yadan et al. [57] used multiple GPUs to train CNNs on the ImageNet dataset using both data- and model-parallelism, i.e. either the input space is divided into mini-batches where each GPU trains its own batch (data parallelism) or the GPUs train one sample together (model parallelism). There is no direct comparison with the training time on a CPU; however, using 4 GPUs (Nvidia Titan) and model- and data-parallelism, the network was trained for 4.8 days.
Song et al. [46] constructed a CNN to recognize facial expressions and developed a smart-phone app in which the user can capture a picture and send it to a server hosting the network. The network predicts a facial expression and sends the result back to the user. With the help of GPUs (Nvidia Titan), the network was trained in a couple of hours on the ImageNet dataset.
Scherer et al. [42] accelerated the large-scale neural networks with parallel GPUs. Experiments with the NORB dataset on an Nvidia GTX 285 GPU showed a maximal speedup of 115× compared to a CPU implementation (Core i7 940). After training the network for 360 epochs, an error rate of 8.6% was achieved.
Cireşan et al. [10] combined multiple CNNs to classify German traffic signs and achieved a 99.15% recognition rate (0.85 % error rate). The training was performed using an Intel Core i7 and 4 GPUs (2 x GTX 480 and 2 x GTX 580).
More recently Abadi et al. [1] presented TensorFlow, a system for expressing and executing machine learning algorithms including training deep neural network models.
Researchers have also found CNNs successful for speech tasks. Large vocabulary continuous speech recognition deals with translation of continuous speech for languages with large vocabularies. Sainath et al. [41] investigated the advantages of CNNs performing speech recognition tasks and compared the results with previous DNN approaches. Results indicated a 12-14% relative improvement of word error rates compared to a DNN trained on GPUs.
Chellapilla et al. [7] investigated GPUs (Nvidia Geforce 7800 Ultra) for document processing on the MNIST dataset and achieved a 4.11× speedup compared to the sequential execution on an Intel Pentium 4 CPU running at 2.5 GHz clock frequency.
These studies target training of CNNs using GPUs, whereas CHAOS addresses training of CNNs on the MNIST dataset using the Intel Xeon Phi co-processor. While there are several review papers (such as [4,45,49]) and on-line articles (such as [35]) that compare existing frameworks for parallelization of training CNN architectures, we focus on detailed analysis of our proposed parallelization approach using measurement techniques and performance modeling. We compare the performance improvement achieved with the CHAOS parallelization scheme to the sequential version executed on the Intel Xeon Phi, Intel Xeon E5 and Intel Core i5 processors.

Background
In this section, we first provide some background information related to the neural networks focusing on convolutional neural networks, and thereafter we provide some information about the architecture of the Intel Xeon Phi.

Neural Networks
A Convolutional Neural Network is a variant of a Deep Neural Network, which introduces two additional layer types: convolutional layers and pooling layers. The mammal visual processing system is hierarchical (deep) in nature. Higher level features are abstractions of lower level ones. E.g. to understand speech, waveforms are translated through several layers until reaching a linguistic level. A similar analogy can be drawn for images, where edges and corners are lower level abstractions translated into more spatial patterns on higher levels. Moreover, it is also known that the animal cortex consists of both simple and complex cells firing on certain visual inputs in their receptive fields. Simple cells detect edge-like patterns whereas complex cells are locally invariant, spanning larger receptive fields. These are the very fundamental properties of the animal brain inspiring DNNs and CNNs.
In this section, we first describe DNNs and forward and back-propagation; thereafter we introduce CNNs.

Deep Neural Networks
The architecture of a DNN consists of multiple layers of neurons. Neurons are connected to each other through edges (weights). The network can simply be thought of as a weighted graph; a directed acyclic graph represents a feedforward network. The depth and breadth of the network may differ, as may the layer types. Regardless of the depth, a network has at least one input and one output layer. A neuron has a set of incoming weights, which correspond to outgoing edges of neurons in the previous layer. Also, a bias term is used at each layer as an intercept term. The goal of the learning process is to adjust the network weights and find a global minimum by reducing the overall error, i.e. the deviation between the predicted and the desired outcome of all the samples. The resulting weight parameters can thereafter be used to make predictions for unseen inputs [3].

Forward Propagation
DNNs can make predictions by forward propagating an input through the network. Forward propagation proceeds by performing calculations at each layer until reaching the output layer, which contains a vector representing the prediction. For example, in image classification problems, the output layer contains the prediction score that indicates the likelihood that an image belongs to a category [18,3].
The forward propagation starts from a given input layer; then, at each layer, the activation of a neuron is computed using the equation $y_i^l = \sigma(x_i^l) + I_i^l$, where $y_i^l$ is the output value of neuron $i$ at layer $l$, $x_i^l$ is the input value of the same neuron, and $\sigma$ (sigmoid) is the activation function. $I_i^l$ is used for the input layer, where there is no previous layer. The goal of the activation function is to return a normalized value (sigmoid returns values in [0,1], and tanh is used in cases where the desired return values are in [-1,1]). The input $x_i^l$ can be calculated as $x_i^l = \sum_j (w_{ji}^l y_j^{l-1})$, where $w_{ji}^l$ denotes the weight between neuron $i$ in the current layer $l$ and neuron $j$ in the previous layer, and $y_j^{l-1}$ is the output of the $j$th neuron at the previous layer. This process is repeated until reaching the output layer. At the output layer, it is common to apply a softmax function, or similar, to squash the output vector and hence derive the prediction.
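The two equations above can be sketched for a single fully connected layer as follows. This is a minimal illustration rather than the implementation used in the paper: the function names `sigmoid` and `forward_layer`, the nested-vector weight layout, and the omission of the bias and of the input term are assumptions made for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid: squashes any real input into the range (0, 1).
inline double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Forward-propagates one fully connected layer (hypothetical helper).
// weights[i][j] is the weight between neuron i of the current layer and
// neuron j of the previous layer; prev holds the previous layer's outputs.
std::vector<double> forward_layer(const std::vector<std::vector<double>>& weights,
                                  const std::vector<double>& prev) {
    std::vector<double> out(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        assert(weights[i].size() == prev.size());
        double x = 0.0;  // weighted input: sum over j of w_ji * y_j (previous layer)
        for (std::size_t j = 0; j < prev.size(); ++j) {
            x += weights[i][j] * prev[j];
        }
        out[i] = sigmoid(x);  // activation of neuron i
    }
    return out;
}
```

Calling `forward_layer` once per layer, feeding each result into the next call, propagates an input through the whole network.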

Back-Propagation
Back-propagation is the process of propagating errors, i.e. the loss calculated as the deviation between the predicted and the desired output, backward in the network, by adjusting the weights at each layer. The error and partial derivatives $\delta_i^l$ are calculated at the output layer based on the predicted values from forward propagation and the labeled value (the correct value). At each layer, the relative error of each neuron is calculated and the weight parameters are updated based on how much the neuron participated in the faulty prediction. The equation $\delta_i^l = \sum_j (\delta_j^{l+1} w_{ij}^{l+1})$ denotes that the partial derivative of neuron $i$ at the current layer $l$ is the sum of the derivatives of connected neurons at the next layer multiplied with the weights, assuming $w^{l+1}$ denotes the weights between the maps of layers $l$ and $l+1$. Additionally, a decay is commonly used to control the impact of the updates, which is omitted in the above calculations. More concretely, the algorithm can be thought of as updating the layer's weights based on "how much it was responsible for the errors in the output" [18,3].
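As a sketch of the backward step described above, the following computes the deltas of one layer as the sum of the next layer's deltas multiplied by the connecting weights, and then adjusts one neuron's incoming weights. The names `backpropagate_deltas` and `update_weights`, the `learning_rate` parameter, and the plain gradient-descent update rule are assumptions for illustration; the derivative of the activation function and the decay term are left out, as in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes the deltas of layer l from the deltas of layer l+1 (hypothetical helper).
// next_weights[j][i] is the weight between neuron j of layer l+1 and
// neuron i of layer l; next_delta[j] is the delta of neuron j at layer l+1.
std::vector<double> backpropagate_deltas(
        const std::vector<std::vector<double>>& next_weights,
        const std::vector<double>& next_delta,
        std::size_t layer_size) {
    assert(next_weights.size() == next_delta.size());
    std::vector<double> delta(layer_size, 0.0);
    for (std::size_t j = 0; j < next_delta.size(); ++j) {
        assert(next_weights[j].size() == layer_size);
        for (std::size_t i = 0; i < layer_size; ++i) {
            delta[i] += next_delta[j] * next_weights[j][i];
        }
    }
    return delta;
}

// Assumed update rule: each incoming weight of a neuron moves against its
// gradient (delta * output of the connected neuron), scaled by learning_rate.
void update_weights(std::vector<double>& weights,
                    const std::vector<double>& prev_out,
                    double delta, double learning_rate) {
    assert(weights.size() == prev_out.size());
    for (std::size_t j = 0; j < weights.size(); ++j) {
        weights[j] -= learning_rate * delta * prev_out[j];
    }
}
```

A neuron whose connected input was zero leaves the corresponding weight untouched, which matches the intuition that a weight is adjusted in proportion to how much it took part in the faulty prediction.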

Convolutional Neural Networks
A Convolutional Neural Network is a multi-layer model constructed to learn various levels of representations where higher level representations are described based on the lower level ones [43]. It is a variant of deep neural network that introduces two new layer types: convolutional and pooling layers.
The convolutional layer consists of several feature maps where neurons in each map connect to a grid of neurons in maps in the previous layer through overlapping kernels. The kernels are tiled to cover the whole input space. The approach is inspired by the receptive fields of the mammal visual cortex. All neurons of a map extract the same features from a map in the previous layer as they share the same set of weights.
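The weight sharing described above can be illustrated with a small sketch in which one kernel is slid over an input map, so that every output neuron reuses the same weights. The `Map` alias and the function name `convolve_valid` are hypothetical; stride 1, no padding, and the omission of bias and activation are simplifying assumptions. As is common in CNN code, the kernel is applied without flipping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Map = std::vector<std::vector<double>>;  // one 2D feature map

// Slides a single shared kernel over the input map (stride 1, no padding).
// Every output neuron uses the same kernel weights on a different,
// overlapping window of the input.
Map convolve_valid(const Map& input, const Map& kernel) {
    assert(!input.empty() && !kernel.empty());
    const std::size_t kh = kernel.size();
    const std::size_t kw = kernel[0].size();
    assert(input.size() >= kh && input[0].size() >= kw);
    const std::size_t oh = input.size() - kh + 1;
    const std::size_t ow = input[0].size() - kw + 1;
    Map out(oh, std::vector<double>(ow, 0.0));
    for (std::size_t r = 0; r < oh; ++r) {
        for (std::size_t c = 0; c < ow; ++c) {
            for (std::size_t kr = 0; kr < kh; ++kr) {
                for (std::size_t kc = 0; kc < kw; ++kc) {
                    out[r][c] += kernel[kr][kc] * input[r + kr][c + kc];
                }
            }
        }
    }
    return out;
}
```

For a 3×3 input and a 2×2 kernel the result is a 2×2 map, since the kernel fits in two positions along each dimension.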
Pooling layers are interleaved with convolutional layers and have been shown to lead to faster convergence. Each neuron in a pooling layer outputs the (maximum/average) value of a partition of neurons in the previous layer, and hence only activates if the underlying grid contains the sought feature. Besides lowering the computational load, pooling also enables position invariance and down-samples the input by a factor relative to the kernel size [28]. Figure 1 shows LeNet-5, which is an example of a Convolutional Neural Network. Each layer of convolution and pooling (which is a specific method of sub-sampling used in LeNet) comprises several feature maps. Neurons in the feature map cover different sub-fields of the neurons from the previous layer. All neurons in a map share the same weight parameters, and therefore they extract the same features from different parts of the input from the previous layers.
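The pooling operation can be sketched as follows for the maximum variant. The function name `max_pool` and the assumption of non-overlapping k×k partitions that divide the input evenly are illustrative choices, not details taken from the implementation used in this paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Map = std::vector<std::vector<double>>;  // one 2D feature map

// Max pooling over non-overlapping k x k partitions (hypothetical helper).
// The output is smaller than the input by a factor of k in each dimension.
Map max_pool(const Map& input, std::size_t k) {
    assert(k > 0 && !input.empty());
    assert(input.size() % k == 0 && input[0].size() % k == 0);
    const std::size_t oh = input.size() / k;
    const std::size_t ow = input[0].size() / k;
    Map out(oh, std::vector<double>(ow, 0.0));
    for (std::size_t r = 0; r < oh; ++r) {
        for (std::size_t c = 0; c < ow; ++c) {
            double best = input[r * k][c * k];
            for (std::size_t dr = 0; dr < k; ++dr) {
                for (std::size_t dc = 0; dc < k; ++dc) {
                    best = std::max(best, input[r * k + dr][c * k + dc]);
                }
            }
            out[r][c] = best;
        }
    }
    return out;
}
```

With k = 2, a 4×4 map is reduced to 2×2, which illustrates the down-sampling factor mentioned above.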
CNNs are commonly constructed similarly to the LeNet-5, beginning with an input layer, followed by several convolutional/pooling combinations, ending with a fully connected layer and an output layer [28]. Recent networks are much deeper and/or wider, for instance, the GoogleNet [48] consists of 22 layers.
Various implementations of Convolutional Neural Networks exist, such as EbLearn at New York University and Caffe at Berkeley. As a basis for our work we selected a project developed by Cireşan [9]. This implementation targets the MNIST dataset of handwritten digits, and makes it possible to dynamically configure the definition of layers, the activation function and the connection types using a configuration file.

Parallel Systems accelerated with Intel® Xeon Phi™
Figure 2 depicts an overview of the Intel Xeon Phi (codenamed Knights Corner) architecture. It is a many-core shared-memory co-processor, which runs a lightweight Linux operating system that offers the possibility to communicate with it over ssh. The Xeon Phi offers two programming models:
1. offload - parts of the applications running on the host are offloaded to the co-processor
2. native - the code is compiled specifically for running natively on the co-processor. The code and all the required libraries should be transferred to the device.
In this paper, we focus on the native mode.
The Intel Xeon Phi (type 7120P used in this paper) comprises 61 x86 cores; each core runs at 1.2 GHz base frequency, and up to 1.3 GHz at max turbo frequency [8]. Each core can switch between four hardware threads in a round-robin manner, which amounts to a total of 244 threads per co-processor. Theoretically, the co-processor can deliver up to one teraFLOP/s of double-precision performance, or two teraFLOP/s of single-precision performance. Each core has its own L1 (32 KB) and L2 (512 KB) cache. The L2 cache is kept fully coherent by a global distributed tag-directory (TD). The cores are connected through a bidirectional ring bus interconnect, which forms a unified shared L2 cache of 30.5 MB. In addition to the cores, there are 16 memory channels that in theory offer a maximum memory bandwidth of 352 GB/s. The GDDR memory controllers provide a direct interface to the GDDR5 memory, and the PCIe Client Logic provides a direct interface to the PCIe bus. Efficient usage of the available vector processing units of the Intel Xeon Phi is essential to fully utilize the performance of the co-processor [52]. Through the 512-bit wide SIMD registers it can perform 16 (16 wide × 32 bit) single-precision or 8 (8 wide × 64 bit) double-precision operations per cycle.
The performance capabilities of the Intel Xeon Phi are discussed and investigated empirically by different researchers within several application domains [16,32,34,51,31,33].
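The derived figures in this paragraph follow from simple arithmetic, which the sketch below spells out. The constant names are hypothetical, and the peak estimate assumes one fused multiply-add (counted as two floating-point operations) per SIMD lane per cycle at the base frequency; that assumption is not stated in the text and only serves to show that the quoted peak rates are of the expected order.

```cpp
// Figures quoted in the text for the Intel Xeon Phi 7120P.
constexpr int kCores = 61;
constexpr int kThreadsPerCore = 4;
constexpr double kL2PerCoreKB = 512.0;
constexpr int kSimdBits = 512;

// Derived quantities.
constexpr int kTotalThreads = kCores * kThreadsPerCore;        // hardware threads
constexpr double kTotalL2MB = kCores * kL2PerCoreKB / 1024.0;  // aggregate L2 in MB
constexpr int kFloatLanes = kSimdBits / 32;                    // single-precision lanes
constexpr int kDoubleLanes = kSimdBits / 64;                   // double-precision lanes

// Rough peak estimate in GFLOP/s. ASSUMPTION: one fused multiply-add
// (2 floating-point operations) per lane per cycle on every core.
constexpr double peak_gflops(double ghz, int lanes) {
    return kCores * ghz * lanes * 2.0;
}

static_assert(kTotalThreads == 244, "61 cores x 4 threads per core");
static_assert(kFloatLanes == 16 && kDoubleLanes == 8, "512-bit SIMD registers");
```

Under these assumptions, `peak_gflops(1.2, kDoubleLanes)` is slightly above one teraFLOP/s and `peak_gflops(1.2, kFloatLanes)` slightly above two, in line with the theoretical figures quoted above.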

Our Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi
The parallelism can be either divided data-wise, i.e. threads process several inputs concurrently, or model-wise, i.e. several threads share the computational burden of one input. Whether one approach can be advantageous over the other mainly depends on the synchronization overhead of the weight vectors and how well it scales with the number of processing units.
In this section, we first discuss the design aspects of our parallelization scheme for training convolutional neural networks. Thereafter, we discuss the implementation aspects that allow full utilization of the Intel Xeon Phi coprocessor.

Design Aspects
On-line stochastic gradient descent has the advantage of instant updates of weights for each sample. However, the sequential nature of the algorithm becomes an impediment as multi- and many-core platforms emerge. We consider different existing parallelization strategies for stochastic gradient descent:
Strategy A: Hybrid uses both data- and model parallelism, such that data parallelism is applied in convolutional layers, and model parallelism is applied in fully connected layers [24].
Strategy B: Averaged Stochastic Gradient divides the input into batches and feeds each batch to a node. This strategy proceeds as follows: (1) initialize the weights of the learner by randomization; (2) split the training data into n equal chunks and send them to the learners; (3) each learner processes the data and calculates the weight gradients for its batch; (4) send the calculated gradients back to the master; (5) the master computes and updates the new weights; and (6) the master sends the new weights to the nodes and a new iteration begins [13]. The convergence speed is slightly worse than for the sequential approach; however, the training time is heavily reduced.
Strategy C: Delayed Stochastic Gradient suggests updating the weight parameters in a round-robin fashion by the workers. One solution is splitting the samples by the number of threads, and letting each thread work on its own distinct chunk of samples, only sharing a common weight vector. Threads are only allowed to update the weight vector in a round-robin fashion, and hence each update will be delayed [60].
Strategy D: HogWild! is a stochastic gradient descent without locks. The approach is applicable for sparse optimization problems (threads/core updates do not conflict much) [40].
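To make the contrast between the strategies more concrete, the master step of Strategy B (steps 4 and 5 above) can be sketched as follows. The function name `apply_averaged_gradients` and the `learning_rate` parameter are assumptions for illustration, and the distribution of data and weights between master and learners is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Master step of averaged stochastic gradient (hypothetical helper):
// averages the gradients reported by all learners and applies the
// averaged gradient to the weights.
void apply_averaged_gradients(std::vector<double>& weights,
                              const std::vector<std::vector<double>>& learner_gradients,
                              double learning_rate) {
    assert(!learner_gradients.empty());
    const double n = static_cast<double>(learner_gradients.size());
    for (std::size_t k = 0; k < weights.size(); ++k) {
        double sum = 0.0;
        for (const auto& gradient : learner_gradients) {
            assert(gradient.size() == weights.size());
            sum += gradient[k];
        }
        weights[k] -= learning_rate * (sum / n);
    }
}
```

The master can only run this step once every learner has reported its gradients, which is the synchronization point that Strategies C and D relax in different ways.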
In this paper, we introduce Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS), a parallelization scheme that can exploit both thread- and SIMD-level parallelism available on the Intel Xeon Phi. CHAOS is a data-parallel controlled version of HogWild! with delayed updates, which combines parts of strategies A-D. The key aspects of CHAOS are:
- Thread parallelism - The overview of our parallelization scheme is depicted in Figure 3. Initially, as many network instances are created as there are available threads. The instances share the weight parameters, whereas some variables are private to each thread to support concurrent processing of images. After the initialization of the CNNs and images is done, the training process starts. The major steps of an epoch are Training, Validation and Testing. The first step, Training, proceeds with each worker picking an image, forward propagating it through the network, calculating the error, and back-propagating the partial derivatives, adjusting the weight parameters. Since each worker picks a new image from the set, other workers do not have to wait for significantly slower workers. After Training, each worker participates in Validation and Testing, evaluating the prediction accuracy of the network by predicting images in the validation and test sets, respectively. The adoption of data parallelism was inspired by Krizhevsky [24], who promotes data parallelism for convolutional layers as they are computationally intensive.
- Controlled HogWild - During back-propagation, the shared weights are updated after each layer's computations (a technique inspired by [60]), whereas the local weight parameters are updated instantly (a technique inspired by [40]), which means that the gradients are calculated locally first and then shared with other workers. However, the update to the global gradients can be performed at any time, which means that there is no need to wait for other workers to finish their updates. This technique, which we refer to as non-instant updates of weight parameters without significant delay, allows us to avoid unnecessary cache line invalidation and memory writes.
- Arbitrary Order of Synchronization - There is no need for explicit synchronization, because all workers share the weight parameters. However, an implicit synchronization is performed in an arbitrary order, because writes are controlled by a first-come-first-served schedule and reads are performed on demand.
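The interplay of the three aspects can be sketched with OpenMP as shown below. This is a heavily simplified illustration and not the CHAOS implementation: a linear model stands in for the CNN, the names `Sample` and `train_epoch` are hypothetical, and each worker copies the shared weights inside a critical section to keep the sketch free of data races, whereas the scheme described above reads the shared weights on demand. What the sketch retains is that workers pick samples dynamically, compute their gradient locally, and then contribute to the shared weights in whatever order they arrive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sample {
    std::vector<float> x;  // input features
    float target;          // desired output
};

// One training epoch over all samples (illustrative stand-in for a CNN).
void train_epoch(std::vector<float>& shared_weights,
                 const std::vector<Sample>& samples,
                 float learning_rate, int num_workers) {
    assert(num_workers > 0);
    const long n = static_cast<long>(samples.size());
    // schedule(dynamic, 1): each worker picks the next unprocessed sample.
    #pragma omp parallel for schedule(dynamic, 1) num_threads(num_workers)
    for (long s = 0; s < n; ++s) {
        const Sample& sample = samples[s];
        assert(sample.x.size() == shared_weights.size());

        // Simplification: read a consistent copy of the shared weights.
        std::vector<float> snapshot;
        #pragma omp critical(shared_weights_access)
        snapshot = shared_weights;

        // Local work: prediction, error and gradient are thread private.
        float prediction = 0.0f;
        for (std::size_t k = 0; k < snapshot.size(); ++k) {
            prediction += snapshot[k] * sample.x[k];
        }
        const float error = prediction - sample.target;
        std::vector<float> local_gradient(snapshot.size());
        for (std::size_t k = 0; k < snapshot.size(); ++k) {
            local_gradient[k] = error * sample.x[k];
        }

        // Controlled update: one worker at a time, in arrival order.
        #pragma omp critical(shared_weights_access)
        for (std::size_t k = 0; k < shared_weights.size(); ++k) {
            shared_weights[k] -= learning_rate * local_gradient[k];
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the loop degenerates to sequential on-line stochastic gradient descent, which makes the sketch convenient for comparing both behaviors.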
The main goal of CHAOS is to minimize the time spent in the convolutional layers, which can be done through data parallelism, adapting the knowledge presented in strategy A. In strategy B, synchronization is required because the workers' gradient calculations are averaged. Since work is distributed, computations are performed on stale parameters. The strategy can be applied in distributed and non-distributed settings. The division of work over several distributed workers was adopted in CHAOS. In strategy C, the updates are postponed in a round-robin fashion, where each thread gets to update when it is its turn. The difference compared to strategy B is that instances train on the same set of weights and no averaging is performed. The advantage is that all instances train on the same weights. The disadvantage of this approach is the delayed updates of the weight parameters, as they are performed on stale data. Training on shared weights and delaying the updates are adopted in CHAOS. Strategy D presents a lock-free approach to updating the weight parameters: updates are performed instantly without any locks. Our updates are not instant; however, after computing the gradients there is nothing prohibiting a worker from contributing to the shared weights. The notion of instant updates inspired CHAOS.

Implementation Aspects
The main goal is to utilize the many cores of the Intel Xeon Phi co-processor efficiently to lower the training time (execution time) of the selected CNN algorithm, at the same time maintaining low deviation in error rates, especially on the test set. Moreover, the quality of the implementation is verified using errors and error rates on the validation and test set.
In the sequential version, only minor modifications of the original version were performed. Mainly, we added a Reporter class to serialize execution results. The instrumentation should not add any time penalties in practice. However, if these penalties occur in the sequential version they are likely to imply corresponding penalties in the parallel version, therefore it should not impact the results.
The main goal of the parallel version is to lower the execution time of the sequential implementation and to scale well with the number of processing units on the co-processor. To facilitate this, it is essential to fully consider the characteristics of the underlying hardware. From results derived in the sequential execution we found the hotspots of the application to be predominantly the convolutional layers. The time spent in both forward- and back-propagation in these layers is about 94% of the total time of all layers (up to 99% for the larger network), as shown in Table 1.
In our proposed strategy, a set of N network instances are created and assigned to T threads. We assume T == N , i.e. one thread per network instance. T threads are spawned, each responsible for its own instance.
The overview of the algorithm is shown in Fig. 3. In Fig. 4 the training, testing and back-propagation phases are shown in detail. Training (see Fig. 4a) picks an image, forward propagates it, determines the loss and back-propagates the partial derivatives (deltas) in the network. This process is done simultaneously by all workers, each worker processing one image. Each worker participating in testing (see Fig. 4b) picks an image, forward propagates it and then collects errors and error rates. The results are accumulated for all threads. Perhaps the most interesting part is the back-propagation (see Fig. 4c). The shared weights are used when propagating the deltas; however, before updating the weight gradients, the pointers are set to the local weights. Thereafter the algorithm proceeds by updating the local weights first. When a worker has contributions to the global weights it can update in a controlled manner, avoiding data races. Updates immediately affect other workers in their training process. Hence the update is delayed slightly, to decrease the invalidation of cache lines, yet it is almost instant, and workers do not have to wait for a longer period before contributing their knowledge.
To see why delays are important, consider the following scenario: if several network instances are trained concurrently, they share the same weight vectors, while other variables are thread private. The major consideration lies in the weight updates. Let $W_j^l$ be the $j$-th weight on the $l$-th layer. In accordance with the current implementation, a weight is updated several times since neurons in a map (on the same layer) share the same weights, and the kernel is shifted over the neurons. Further assume that several threads work on the same weight $W_j^l$ at some point in time. Even if other threads only read the weights, their local data, as saved in the Level 2 cache, will be invalidated and a re-fetch is required to assert their integrity. This happens because cache lines are shared between cores. The approach of slightly delaying the updates and forcing one thread at a time to update atomically leads to fewer invalidations. Still, a major disadvantage is that the shared weights do not provide any data locality (data cannot be retained completely in the Level 2 cache for a longer period).
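The testing phase, in which workers pick images and the results are combined over all threads, can be sketched with an OpenMP reduction. The names `LabeledImage` and `error_rate`, as well as the `predict` callback that stands in for forward propagation through a network instance, are assumptions for illustration; the callback must be safe to call from several threads at once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct LabeledImage {
    std::vector<float> pixels;
    int label;  // correct class of the image
};

// Fraction of misclassified images. Workers pick images dynamically and
// their private error counts are summed by the reduction clause.
double error_rate(const std::vector<LabeledImage>& images,
                  const std::function<int(const std::vector<float>&)>& predict) {
    assert(!images.empty());
    long errors = 0;
    const long n = static_cast<long>(images.size());
    #pragma omp parallel for schedule(dynamic, 1) reduction(+ : errors)
    for (long i = 0; i < n; ++i) {
        if (predict(images[i].pixels) != images[i].label) {
            ++errors;
        }
    }
    return static_cast<double>(errors) / static_cast<double>(n);
}
```

Because no weights are modified during testing, the only shared state is the error counter, and the reduction keeps each worker's count private until the loop ends.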
Listing 1: An extract from the vectorization report for the partial derivative updates in the convolutional layer. To further decrease the time spent in convolutional layers, loops were vectorized to facilitate the vector processing unit of the co-processor. Data was allocated using mm malloc() with 64 byte alignment increasing the accuracy of memory requests. The vectorization was achieved by adding #pragma omp simd instructions and explicitly informing the compiler of the memory alignment using assume aligned(). Some unnecessary overhead is added through the lack of data alignment of the deltas and weights. The computations of partial derivatives and weight gradients in the convolutional layers are performed in a SIMD way, which allows efficient utiliziation of the 512 bit wide vector processing units of the Intel Xeon Phi. An extract from the vectorization report (see Listing 1), for the updates of partial derivatives in the convolutional layer shows an estimated potential speed up of 3.98× compared to the scalar loop.
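A vectorized gradient accumulation loop in the spirit of this description might look as follows. The function name `accumulate_gradients` and the loop body are assumptions for illustration, and the sketch swaps in the GCC/Clang builtin `__builtin_assume_aligned` for the Intel-specific alignment hint so that it compiles with other compilers as well; callers are expected to pass 64-byte aligned buffers, for example obtained from an aligned allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Accumulates weight gradients: grad[k] += delta * input[k].
// Both arrays must be 64-byte aligned and must not overlap.
void accumulate_gradients(float* grad, const float* input,
                          float delta, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(grad) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(input) % 64 == 0);
#if defined(__GNUC__) || defined(__clang__)
    // Alignment hint for the vectorizer (stand-in for the Intel compiler hint).
    grad = static_cast<float*>(__builtin_assume_aligned(grad, 64));
    input = static_cast<const float*>(__builtin_assume_aligned(input, 64));
#endif
    // Ask the compiler to vectorize the loop; 16 floats fit in a 512-bit register.
    #pragma omp simd
    for (std::size_t k = 0; k < n; ++k) {
        grad[k] += delta * input[k];
    }
}
```

Whether the loop is actually vectorized, and with which estimated speedup, depends on the compiler and target and is best checked in the compiler's vectorization report.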
Further algorithmic optimizations were performed. For example: (1) The images are loaded into pre-allocated memory instead of allocating new memory when an image is requested; (2) Hardware pre-fetching was applied to mitigate the shortcomings of the in-order execution scheme. Pre-fetching loads data into the L2 cache to make it available for future computations; (3) Letting workers pick images, instead of assigning images to workers, allows for a smaller overhead at the end of a work-sharing construct; (4) The number of locks is minimized as far as possible; (5) We made most of the variables thread private to achieve data locality.
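Optimization (3), workers picking images themselves, can be sketched with a shared counter that each worker increments atomically. This is an assumed realization for illustration only; the paper does not show how picking is implemented, and process_images_dynamically is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Each worker repeatedly claims the next unprocessed image index.
// Returns how often each image was processed (expected: exactly once).
std::vector<int> process_images_dynamically(int num_images) {
    assert(num_images >= 0);
    std::vector<int> times_processed(num_images, 0);
    int next_image = 0;  // shared counter, advanced atomically

    #pragma omp parallel shared(next_image, times_processed)
    {
        for (;;) {
            int idx;
            #pragma omp atomic capture
            idx = next_image++;
            if (idx >= num_images) break;  // no work left for this worker
            times_processed[idx] += 1;     // stand-in for training on image idx
        }
    }
    return times_processed;
}
```

Because a fast worker simply claims more images, no worker waits long for the slowest one at the end of the epoch.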
The training phase was distributed through thread parallelism, dividing the input space over available workers. CHAOS uses the vector processing units to improve performance and tries to retain local variables in local cache as far as possible. The delayed updates decrease the invalidation of cache lines. Since weight parameters are shared among threads, there is a possibility that data can be fetched from another core's cache instead of main memory, reducing the wait times. Also, the memory was aligned to 64 bytes and unnecessary system calls were removed from the parallel work.

Evaluation
In this section, we first describe the experimentation environment used for evaluation of our CHAOS parallelization scheme. Thereafter, we describe the development of a performance model for CHAOS. Finally we discuss the obtained results with respect to scalability, speedup, and prediction accuracy.

Experimental Setup
In this study, OpenMP was selected to facilitate the utilization of thread- and SIMD-parallelism available in the Intel Xeon Phi co-processor. The C++ programming language is used for the algorithm implementation. The Intel Compiler 15.0.0 was used for native compilation of the application for the co-processor, with the O3 optimization level.
System Configuration - To evaluate our approach we use an Intel Xeon Phi accelerator that comprises 61 cores that run at 1.2 GHz. For evaluation, 1, 15, 30, 60, 120, 180, 240, and 244 threads of the co-processor were used. Each thread was responsible for one network instance. For comparison, we use two general-purpose CPUs: the Intel Xeon E5-2695v2, which runs at a 2.4 GHz clock frequency, and the Intel Core i5 661, which runs at a 3.33 GHz clock frequency.
Data Set - To evaluate our approach, the MNIST [29] dataset of handwritten digits is used. In total the MNIST dataset comprises 70000 images, 60000 of which are used for training/validation and the rest for testing.
CNN Architectures - Three different CNN architectures were used for evaluation: small, medium, and large. The small and medium architectures were trained for 70 epochs, and the large one for 15 epochs, using a starting decay (eta) of 0.001 and a factor of 0.9. The small and medium networks consist of seven layers in total (one input layer, two convolutional layers, two max-pooling layers, one fully connected layer, and the output layer). The difference between these two networks is in the number of feature maps per layer and the number of neurons per map. For example, the first convolutional layer of the small network has five feature maps and 3380 neurons, whereas the first convolutional layer of the medium network has 20 feature maps and 13520 neurons. The large network differs from the small and the medium network in the number of layers as well. In total, it has nine layers: one input layer, three convolutional layers, three max-pooling layers, one fully connected layer, and the output layer. Detailed information about the considered architectures (including the number and the size of feature maps, neurons, the size of the kernels, and the weights) is listed in Table 2.
To address the variability in performance measurements we repeated the execution of each parallel configuration three times.
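How a starting decay of 0.001 and a factor of 0.9 might interact can be sketched as below. The text does not state when the factor is applied, so the once-per-epoch multiplicative schedule, as well as the name learning_rate, are assumptions.

```cpp
#include <cassert>

// Assumed schedule (not confirmed by the paper):
// eta(epoch) = eta0 * factor^epoch, i.e. the factor is applied once per epoch.
double learning_rate(double eta0, double factor, int epoch) {
    assert(epoch >= 0);
    double eta = eta0;
    for (int e = 0; e < epoch; ++e)
        eta *= factor;
    return eta;
}
```

Under this assumption, learning_rate(0.001, 0.9, 2) is about 0.00081.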

Performance Model
A performance model [39] enables us to reason about the behavior of an implementation in future execution contexts. Our performance model for the CHAOS implementation can predict the performance for numbers of threads that go beyond the number of hardware threads supported by the Intel Xeon Phi model that we used for evaluation. Additionally, it can predict the performance of different CNN architectures with various numbers of images and epochs.
The goal is to construct a parametrized model with the parameters ep, i, it, and p, where ep stands for the number of epochs, i indicates the number of images in the training/validation set, it stands for the number of images in the test set, and p is the number of processing units. Table 3 lists the full set of variables used in our performance model, some of which are hardware dependent and some of which are independent of the underlying hardware. Each variable is either measured, calculated, constant, or a parameter in the model. Listing 2 shows the formula used for our performance prediction model. The total execution time ($T$) is the sum of the computation time ($T_{comp}$) and the time for memory operations ($T_{mem}$). $T$ depends on several factors, including speed, number of processing units, communication costs (such as network latency), and memory contention. $T_{comp}$ is the sum of sequential work, training, validation, and testing. Most interesting are the contentions that cause wait times, including memory latencies and synchronization overhead. $T_{mem}$ adds memory and synchronization overheads. The contention is measured experimentally by executing a small script on the co-processor for different thread counts, weights, and layers.

Listing 2: The formula for our performance prediction model.
We define $T_{mem}(ep, i, p) = MemoryContention \cdot \frac{ep \cdot i}{p}$, where $MemoryContention$ is the measured memory contention when p threads are competing for the I/O weights concurrently. Table 4 depicts the measured and predicted memory contentions for the Intel Xeon Phi. Our performance prediction model does not rely on any practical measurements except for $T_{mem}$. Together with the CPI and OperationFactor, it is possible to derive the (theoretical) number of instructions per cycle that each thread can perform.
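The memory term of the model can be written directly as a function. This is a sketch of the formula above only; the function name t_mem is hypothetical, and the contention value must come from measurements such as those in Table 4 (any number passed in an example is a placeholder, not a measured value).

```cpp
#include <cassert>

// T_mem(ep, i, p) = MemoryContention * ep * i / p
// memory_contention: measured contention for p concurrent threads
// epochs (ep), images (i): training workload; p: number of processing units
double t_mem(double memory_contention, long epochs, long images, long p) {
    assert(p > 0);
    return memory_contention * static_cast<double>(epochs) *
           static_cast<double>(images) / static_cast<double>(p);
}
```

For a fixed contention value the term shrinks linearly with p; in the model, the measured contention itself also varies with the thread count.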
We use a different Prep for each CNN architecture ($10^9$, $10^{10}$, and $10^{11}$ for the small, medium, and large architecture, respectively). The OperationFactor is adjusted to closely match the measured value for 15 threads, and mitigate [...] When one hardware thread is present per core, one instruction per cycle can be assumed. For 4 threads per core, only 0.5 instructions per cycle can be assumed, which means that each thread gets to execute two instructions every fourth cycle (CPI of 2); hence we use the CPI factor to control the best theoretical number of instructions a thread can retire. The speed s is defined in Table 3. FProp and BProp are placeholders for the actual number of operations.

Results
In this section, we analyze the collected data with regard to the execution time and speedup for varying numbers of threads and CNN architectures. The errors and error rates (incorrect predictions) are used to validate our implementation. Furthermore, we discuss the deviation in the number of incorrectly predicted images. The execution time is the total time the algorithm executes, excluding the time required to initialize the network instances and images (for both the sequential and the parallel version). The speedup is measured as the ratio of two execution times, with the sequential execution times of Intel Xeon E5, Intel Core i5, and Xeon Phi as the base. The error rate is the fraction of images the network was unable to predict correctly, and the error is the cumulative loss from the loss function.
In the figures and tables in this section, we use the following notation: Par refers to the parallel version, Seq is the sequential version, and T denotes threads; e.g., Phi Par. 1 T is the parallel version with one thread on the Xeon Phi.
Result 1: The CHAOS parallelization scheme scales gracefully to large numbers of threads. Figure 5 depicts the total execution time of the parallel version of the implementation running on the Xeon Phi and the sequential version running on the Xeon E5 CPU. We vary the number of threads on the Xeon Phi between 1, 15, 30, 60, 120, 180, 240, and 244, and the CNN architectures between small, medium, and large. We elide the results of Xeon E5 Seq. and Phi Par. 1 T for simplicity and clarity. The large CNN architecture requires 31.1 hours to be completed sequentially on the Xeon E5 CPU, whereas using one thread on the Xeon Phi requires 295.5 hours. By increasing the number of threads to 15, 30, and 60, the execution time decreases to 19.7, 9.9, and 5.0 hours, respectively. Using the total number of threads (that is, 244) on the Xeon Phi, the training may be completed in only 2.9 hours. We observe promising scalability while increasing the number of threads. Similar results may be observed for the small and medium architectures.
It should be considered that the selected CNN architectures were trained for different numbers of epochs, and that larger networks tend to produce better predictions (lower error rates). A fairer comparison would be to compare the execution times until reaching a specific error rate on the test set. Fig. 6 shows the total execution times for the different CNN architectures and thread counts on the Xeon Phi. We have set the stopping criterion to an error rate ≤ 1.54%, which is the ending error rate on the test set for the small architecture. We observe that the large network executes for a longer period even though it converges in fewer epochs, and that the medium network needs less time to reach an equal (or better) ending error rate than the small and large networks. Note that several other factors impact training, including the starting decay, the factor by which the decay is decreased, the dataset, the loss function, the preparation of images, and the initial weight values. Therefore, several combinations of parameters need to be tested before finding a balance. In this study, we focus on the number of epochs as the stopping criterion and draw conclusions from this, considering the deviation of the error and error rates.
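Speedup, as used throughout this section, is the ratio of a baseline execution time to the parallel execution time. A minimal helper (the name speedup is ours, introduced for illustration) makes the arithmetic for the large architecture explicit.

```cpp
#include <cassert>

// Speedup = baseline execution time / parallel execution time.
// Both times must use the same unit (for example, hours).
double speedup(double t_base, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_base / t_parallel;
}
```

With the times reported above for the large architecture, speedup(295.5, 2.9) is roughly 102 relative to Phi Par. 1 T, and speedup(31.1, 2.9) is roughly 10.7 relative to Xeon E5 Seq.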
Result 2: The total execution time is strongly influenced by the forward-propagation and back-propagation in the network. The convolutional layers are the most computationally expensive (see Table 5). We have observed that the more threads are involved in training, the larger the percentage of the total time each thread spends in the back-propagation of the convolutional layer, and the less time it spends in the others. Overall, the time spent at each layer per thread decreases when increasing the number of threads. Therefore, there is an interesting relationship between the layer times and the speedup of the algorithm. Table 6 presents the speedup relative to Phi Par. 1 T for the different architectures on the convolutional layer. The times are collected by each network instance (through instrumentation of the forward- and back-propagate functions) and averaged over the number of network instances and epochs. As can be seen, in almost all cases the speedup increases with the network size; more importantly, the speedup does not decrease. Perhaps the most interesting phenomenon is that the speedup per layer has an almost direct relationship to the speedup of the algorithm, especially when compared to the back-propagation part. This emphasizes the importance of reducing the time spent in the convolutional layers.
Result 3: Using the CHAOS parallel implementation for training of CNNs on Intel Xeon Phi we achieved speedups of up to 103×, 14×, and 58× compared to the single-thread performance on Intel Xeon Phi, Intel Xeon E5 CPU, and Intel Core i5, respectively. Figure 7 depicts the speedup compared to the sequential execution on Xeon E5 (Xeon E5 Seq.) for various numbers of threads and CNN architectures. As can be seen, adding more threads increases the speedup in all cases. Using 240 threads on the Xeon Phi yields a 13.26× speedup for the small CNN architecture. Utilizing the last core of the Xeon Phi, which is used by the OS, shows an even higher speedup (14.07×). We may observe that doubling the number of threads from 15 to 30, and from 30 to 60, almost doubles the speedup (2.03, 4.03, and 7.78). Increasing the number of threads further still results in a significant speedup, but the trend of doubling breaks. Figure 8 shows the speedup compared to the execution running in one thread of the Xeon Phi (Phi Par. 1 T) while varying the number of threads and the CNN architectures. We may observe that the speedup is close to linear for up to 60 threads for all CNN architectures. Increasing the number of threads further results in a significant speedup. Moreover, it can be seen that when keeping the number of threads fixed and increasing the architecture size, the speedup increases by a small factor as well, except for 244 threads. It seems that larger architectures are beneficial. However, it could also be the case that Phi Par. 1 T executes relatively slower than Xeon E5 Seq. for larger architectures than for smaller ones. Figure 9 shows the speedup compared to the sequential version executed on Intel Core i5 (Core i5 Seq.) while varying the number of threads and the CNN architectures. We may observe that using 15 threads we gain a 10× speedup. Doubling the number of threads to 30, and then to 60, results in close to double the speedup (19.8 and 38.3). By using 120 threads (that is, two threads per core) the trend of doubling the speedup breaks (55.6×). Increasing the number of threads per core to three and four results in a modest speedup increase (62× and 65.3×).

Fig. 9: Speedup of the three CNN architectures by varying the number of threads compared to one thread on Intel Core i5.

Result 4: The image classification accuracy of the parallel implementation using CHAOS is comparable to the one running sequentially. The deviation in error and the number of incorrectly predicted images are small.
We validate the implementation by comparing the error and error rates for each epoch and configuration. Figure 10 depicts the ending errors for the three considered CNN architectures for both validation and test set. The black dashed line delineates the base line (that is, a ratio of 1). Values below the line are considered better, whereas those above the line are worse than for Xeon E5 Seq. As a base line we use the Xeon E5; however, identical results are derived when executing the sequential version on any platform. As can be seen in Fig. 10, the largest difference is encountered by Phi Par. 244 T, about 22 units (0.05%) worse than the base line. In contrast, Phi Par. 15 T has a 9 units lower error compared to the base line for the large test set. The validation sets are rather stable, whereas the test set fluctuates more heavily. Although the deviation in error should be taken seriously, it is small in this case. Please note that the diagram has a high zoom factor, hence the differences are magnified.

Fig. 10: The relative cumulative error (loss) for the three considered CNN architectures (small, medium, and large) for both validation and test set.

Table 7 lists the number of incorrectly classified images for each CNN architecture. For each architecture, the total (Tot) number of images and the difference (Diff) compared to the optimal numbers of Xeon E5 Seq. are shown. Negative values indicate that the ending error rate was better than optimal (fewer images were incorrectly predicted), whereas positive values indicate that more images than for Xeon E5 Seq. were incorrectly predicted. For each column in the table, best and worst values are annotated with underline and bold fonts, respectively. No obvious pattern can be found; however, increasing the number of threads does not lead to worse predictions in general. Phi Par. 180 T stands out as it was 17 images better than Xeon E5 Seq. for the small architecture on the validation set. Phi Par. 15 T performs worst on the small architecture on the validation set. The overall worst performance is achieved by Phi Par. 120 T on the test set for the small CNN architecture. Please note that the total number of images is 60,000 in the validation set and 10,000 in the test set.
Overall, the number of incorrectly predicted images and the deviation from the base line are small.

Result 5: The predicted execution times obtained from the performance model closely match the measured execution times. Figures 11, 12, and 13 depict the predicted and measured execution times for the small, medium, and large CNN architecture. For the small network (see Fig. 11), the predictions are close to the measured values, with a slight deviation at the end. The prediction model seems to over-estimate the execution time by a small factor.
For the medium architecture (see Fig. 12) the predictions follow the measured values closely, although they underestimate the execution time slightly. At 120 threads, the measured and predicted values start to deviate, and they converge again at 240 threads. The large architecture yields similar results to the medium one. The measured values are slightly higher than the predictions; however, the predictions follow the measured values. Again, there is a deviation at 120 threads that is recovered at 240 threads. Also, the predictions increase between 120 and 180, and between 180 and 240 threads for both architectures, whereas the actual execution time decreases. This is most probably due to the CPI factor that is added when 3 or more threads are present on the same core. We use the expression $x = \frac{|m - p|}{p}$ to calculate the deviation in predictions for our prediction model and all considered architectures, where m is the measured and p is the predicted value. The average deviations over all measured thread counts are as follows: 14.57% for the small CNN, 14.76% for the medium, and 15.36% for the large CNN.
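The deviation measure can be computed per thread count and then averaged, as in the following sketch. The function names are hypothetical, and any input values used with it are placeholders rather than the paper's measurements.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// x = |m - p| / p for one measured (m) and predicted (p) execution time.
// Note: p here is the predicted value, not the number of threads.
double deviation(double measured, double predicted) {
    assert(predicted != 0.0);
    return std::fabs(measured - predicted) / predicted;
}

// Mean deviation over all thread counts for one architecture.
double average_deviation(const std::vector<double>& measured,
                         const std::vector<double>& predicted) {
    assert(measured.size() == predicted.size() && !measured.empty());
    double sum = 0.0;
    for (std::size_t k = 0; k < measured.size(); ++k)
        sum += deviation(measured[k], predicted[k]);
    return sum / static_cast<double>(measured.size());
}
```

Because the denominator is the predicted value, an under-estimate yields a larger deviation than an over-estimate of the same absolute size.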
Result 6: Predictions of the execution time for thread counts beyond the 240 hardware threads of the Intel Xeon Phi model used in this paper show that CHAOS scales well up to several thousands of threads.
We used the prediction model to predict the execution times for 480, 960, 1920, and 3840 threads for different CNN architectures, using the same parameters. The results in Table 8 show that if 3,840 threads were available, the small network should take about 4.6 minutes to train, the medium 14.5 minutes and the large 36.8 minutes. The predictions for the large CNN architecture are not as well aligned when increasing to larger thread counts as for small and medium.
Additionally, we evaluated the execution time for varying image counts, and epochs, for 240 and 480 threads for the small CNN architecture. As can be seen in Table 9 doubling the number of images or epochs, approximately

Summary and Future Work
Deep learning is important for many modern applications, such as voice recognition, face recognition, autonomous cars, precision medicine, or computer vision. We have presented CHAOS, a parallelization scheme to speed up the training process of Convolutional Neural Networks. CHAOS can exploit both thread- and SIMD-parallelism of the Intel Xeon Phi co-processor. Moreover, we have described our performance prediction model, which we use to evaluate our parallelization solution and infer the performance on future architectures of the Intel Xeon Phi. Major observations include:
- the CHAOS parallel implementation scales well with the increase of the number of threads;
- convolutional layers are the most computationally expensive part of the CNN training effort; for instance, for 240 threads, 88% of the time is spent on the back-propagation of convolutional layers;
- using CHAOS for training CNNs on Intel Xeon Phi we achieved up to 103×, 14×, and 58× speedup compared to the single-thread performance on Intel Xeon Phi, Intel Xeon E5 CPU, and Intel Core i5, respectively;
- the image classification accuracy of the CHAOS parallel implementation is comparable to the one running sequentially;
- the predicted execution times obtained from our performance model closely match the measured execution times;
- results of the performance model indicate that CHAOS scales well beyond the 240 hardware threads of the Intel Xeon Phi that is used in this paper for experimentation.
Future work will extend CHAOS to enable the use of all cores of host CPUs and the co-processor(s).