Automated CNN back-propagation pipeline generation for FPGA online training

Training convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become essential for the future deployment of CNNs on autonomous systems. In this work, we present an automated CNN training pipeline compilation tool for Xilinx FPGAs. We automatically generate multiple hardware designs from high-level CNN descriptions using a multi-objective optimization algorithm that explores the design space by exploiting CNN parallelism. These designs trade resources for throughput, allowing users to tailor implementations to their hardware and applications. The training pipeline is generated from the backpropagation (BP) equations of convolution, which highlight an overlap in computation. We translate this overlap into hardware by reusing most of the forward pass (FP) pipeline, which reduces the resource overhead. The implementation uses a streaming interface that lends itself well to data streams and live feeds rather than static data reads from memory. We therefore do not use the standard array of processing elements (PEs) approach, which is efficient for offline inference; instead, we translate the architecture into a pipeline through which data is streamed, allowing new samples to be read as they become available. We validate the results using the Zynq-7100 on three datasets and architectures of varying size against CPU and GPU implementations. GPUs consistently outperform FPGAs in training time in batch processing scenarios, but in data stream scenarios FPGA designs achieve a significant speedup over GPU and CPU when enough resources are dedicated to the learning task. Speedups of 2.8×, 5.8×, and 3× over GPU were achieved on three architectures trained on MNIST, SVHN, and CIFAR-10, respectively.


Introduction
Recent literature shows a clear demand for embedded deep learning solutions for hardware-constrained designs and novel compression techniques. CPUs and GPUs have been prominent for executing CNNs in offline training settings; however, their poor energy efficiency and low throughput have made them less attractive for embedded use. With their power-efficient performance and highly parallelised, flexible architecture, FPGAs have presented themselves as a viable option for hard real-time computation of heavy deep learning applications [1]. FPGAs also allow designers to develop modular IP cores, which makes prototyping easier and offers the option to selectively deploy design areas at runtime without risk to the overall application. This is crucial given the rapid changes in CNN architectures. Until recently, most works only investigated the hardware implementation of forward pass CNNs as inference engines and accelerators. There is plenty of research on mapping the CNN forward pass onto FPGAs for embedded inference [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]; in contrast, there is a clear lack of work in the areas of online deployment and training on FPGAs. With the recent breakthroughs in the new field of Continuous Learning [19][20][21][22][23], however, online training on embedded platforms has attracted more research. In [20], the authors test continuous learning scenarios on benchmark datasets and investigate the parameters and conditions required for continuous learning to be effective. Their main findings show that it is possible and effective if the models account for the catastrophic forgetting problem by retaining previously learned information. In [23], the authors propose an adaptive hierarchical network structure composed of CNNs that can grow to accommodate new classes and tasks when new data becomes available. Similarly, in [21], Yoon et al. propose an expandable architecture capable of selective retraining when presented with new tasks and data.
The continuous learning research community is mainly interested in scenarios [24] where data is not available as neatly organized, fully tagged datasets, but arrives in an evolving application where new samples become available at different times after the models are trained on the main dataset. An example would be a self-driving car trained on an object detection dataset that receives new data samples during deployment. Space agents such as satellites and rovers [25,26] can benefit from such capabilities as well, since changes to the mission parameters or visual environment can necessitate online learning to account remotely for these unexpected changes. Such new online data could be vital for the system's real-world performance, as opposed to the simulated offline performance benchmarked on an organized dataset. For these reasons, training on FPGA platforms becomes even more appealing.
In this work, we identify the significant computational overlap between the inference task and the training task in CNNs and translate it into a new FPGA pipeline. This allows for the design of CNN hardware models capable of both inference and online training whilst minimizing the resource overhead. Despite the overlap, the training process of CNNs is still computationally complicated and requires significant changes to the data path of the hardware design: managing the data from the forward pass, updating parameters, and calculating gradients. Compared to inference, the training phase involves a much higher number of operations (> 3×) with increased mathematical complexity [27], and a backward pass can take twice the time of a forward pass [28].
The training phase also involves large intermediate datasets, necessitating high memory bandwidth and large storage. We acknowledge that GPUs have been the de facto standard for training tasks because they meet the immense computation requirements. However, compared to FPGAs, the energy efficiency of typical GPUs is poor [1]. Nvidia's Jetson family provides an alternative low-power approach with a focus on embedded AI, operating at 5-30 W. However, the Jetson GPUs were designed as low-power inference engines and are not optimized for training, unlike our proposed combined real-time training and inference FPGA solution.

Related work
In both academic and professional fields, GPUs are still the platform of choice for CNN training, and companies such as Facebook and Google maintain datacentres with GPU clusters. Such clusters are employed for training on large amounts of data and are expensive to maintain operationally because of the GPUs' high power consumption and the energy overhead needed for cooling. Google sought a solution by researching the possibility of using ASICs and FPGAs within their servers so that the workload of training and running deep learning models is offloaded. This resulted in the successful design and wide-scale deployment of the Tensor Processing Unit (TPU) [29] for training and inference. Other companies such as Microsoft [30,31] and Amazon, with its "AWS EC2 F1" instance, followed suit in using FPGA clusters within their data centres and servers for back-end training and inference at a lower power cost, highlighting the trend towards low-power solutions utilising FPGAs. CNN training on FPGA platforms has not been investigated thoroughly, with only two exceptions that focus on batch training and use FPGA platforms as replacements for GPU clusters in offline training [28,32]. In [27], Wenlai et al. presented F-CNN, the first CPU/FPGA hybrid design for deploying and training CNN networks. The CPU is used as a controller and the FPGA as an accelerator.
They employ an analytical model to generate a backpropagation model. Their implementation of CNN training involves a direct translation of the backpropagation equations for error calculation and parameter updates. This introduces significant resource overheads since it does not fully consider the overlap in calculations with the forward pass. In [32], Venkataramanaiah et al. extend the work from [28] and introduce a hardware CNN training RTL compiler. Their work is purely FPGA-based and relies on static processing element arrays for convolutional calculations. It uses pre-optimized and precompiled Verilog CNN hardware modules but, unlike [27], has no analytical model to inform the compilation and provide design space exploration. Neither work presents extensive experimental results, and both cover only LeNet-5 [33] and custom architectures rather than modern CNNs as benchmarks. In addition, they deploy different data representation schemes, 32-bit and 16-bit respectively, and use different development boards operating at different frequencies, which makes it challenging to compare the implementations in terms of resource and latency overheads.
One area of research that remains unexplored is continuous learning on embedded platforms, mainly FPGAs. This requires on-board implementations of training algorithms for CNN models which are flexible and allow users to easily explore the design space for suitable designs based on the hardware and its applications. In our previous work [18], we proposed an FPGA compiler for CNN inference. The compiler generates FPGA pipelines from high-level CNN input architectures: it generates a Simulink hardware model, which is validated and then translated into RTL/VHDL. For this research, we expand the same compiler to include an improved design space exploration step and a training pipeline, with the aim of reusing most of the original forward pass pipeline resources.
The main contributions of this work are:
• Providing deep learning engineers with an automated CNN compiler tool for FPGAs which supports inference and online training.
• Using analytical equations to model the compiled design's expected performance trade-space using a multi-objective optimization genetic algorithm, allowing the user to automate the design space exploration for trade-offs in accuracy, latency, and resources.
This paper covers the fundamentals of CNN backpropagation in Sect. 2 and highlights where the overlap with the forward pass can be exploited. Section 3 presents the proposed research, including the hardware models of these functionalities with the analytical equations for latency and resources, and covers design space exploration using the multi-objective optimization algorithm. Section 4 is dedicated to the experiments and results: first, MOGA results are presented on three different datasets and four architectures. Finally, the section covers the implementation of the models obtained from the previous sections on a Xilinx FPGA, giving a breakdown of latency and resource results followed by a discussion and comparison with related works.

Convolution
A convolutional layer, and a convolutional network as a whole, can be treated as a computational graph. Consider a gate f in a computational graph with inputs x and y and output z. For such a simple function, the local gradients ∂z/∂x and ∂z/∂y can be computed by differentiating z with respect to x and y. In a convolutional forward pass, the inputs are processed through the CNN layers, and at the output the loss is obtained using a loss function which quantifies the error between the ground truth and the actual output. When we process the loss backward, layer by layer, we obtain the gradient of the loss from the previous layer as ∂L/∂z. For the loss to be propagated to the other gates, we need to find ∂L/∂x and ∂L/∂y. Using the chain rule, we can calculate ∂L/∂x and ∂L/∂y, which then propagate to the other layers. Now assume the function is a convolutional operation between an input matrix X and a filter matrix F. The convolution between input X and filter F gives an output matrix O. This process describes the convolutional Forward Pass (FP), and the output matrix O is forwarded to the next layer as its input matrix X. For the Backward Pass (BP), we receive the loss gradient with respect to the output O from the next layer as ∂L/∂O. As seen in Fig. 1, we can find the local gradients ∂O/∂X and ∂O/∂F with respect to the output O. Using the chain rule and the loss gradient from the previous layer, ∂L/∂O, we can calculate ∂L/∂X and ∂L/∂F. In the following subsections, we briefly show the process of calculating both gradients to help determine the critical overlapping elements.

Finding filter local gradient ∂L/∂F
The first step is to calculate the local gradient ∂L/∂F, which is used to update the weight matrix. Finding the derivatives with respect to F gives Eq. (1) for every element of F, where M and N are the dimensions of matrix O:
$$\frac{\partial L}{\partial F_{i,j}} = \sum_{m=1}^{M} \sum_{n=1}^{N} \frac{\partial L}{\partial O_{m,n}} \, \frac{\partial O_{m,n}}{\partial F_{i,j}} \tag{1}$$
Expanding and substituting the values of ∂O/∂F with X gives a set of equations which represent a convolutional operation between the input X and the loss gradient ∂L/∂O, as shown in Fig. 2.
Fig. 1 Function F during a backward pass to calculate input local gradients
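The relationship described above, where ∂L/∂F is a convolution between the input X and the loss gradient ∂L/∂O, can be illustrated with a small sketch. It is written in plain Python and assumes the sliding-window (cross-correlation) convention with stride 1 and no padding; the function names and the example values are illustrative and are not taken from the tool.

```python
def correlate2d_valid(x, k):
    """Slide k over x without padding (stride 1) and return the output map."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def filter_gradient(x, d_out):
    """dL/dF: slide the loss gradient dL/dO over the saved input X."""
    return correlate2d_valid(x, d_out)


# Illustrative values (assumptions, not data from the paper).
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
F = [[1, 0], [0, -1]]
O = correlate2d_valid(X, F)      # forward pass output
dO = [[1, 1], [1, 1]]            # loss gradient arriving from the next layer
dF = filter_gradient(X, dO)      # same shape as F
```

The same sliding-window routine serves both the forward pass and the filter gradient, which is the overlap that the hardware pipeline reuses.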

Finding previous layer local gradient ∂L/∂X
The local gradient ∂L/∂X, which is backpropagated to the previous layer, is found using the chain rule as in Eq. (2) for every element of X, where M and N are the dimensions of matrix O:
$$\frac{\partial L}{\partial X_{i,j}} = \sum_{m=1}^{M} \sum_{n=1}^{N} \frac{\partial L}{\partial O_{m,n}} \, \frac{\partial O_{m,n}}{\partial X_{i,j}} \tag{2}$$
Expanding Eq. (2) and substituting ∂O/∂X with F, which is equal to the derivative of O with respect to X, gives a full set of equations which show that ∂L/∂X can be represented as a full convolution between a 180-degree rotated filter F and the loss gradient ∂L/∂O, as seen in Fig. 3. Full convolution differs from regular (valid) convolution, in which the filter only scans elements within the feature map without going outside it. In a full convolution, the filter is applied starting from the top-left element, and zero padding supplies the elements that fall outside the feature map. Equations (3) and (4) highlight an overlap in computation between the forward and backward pass: convolutional computations are required for both feature extraction and gradient calculations. An efficient design should take advantage of this by reusing the computational resources dedicated to the forward pass during the backward pass; we describe this in more detail in the following sections. We also use dF and dX to refer to ∂L/∂F and ∂L/∂X in the following sections.
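As a minimal sketch of the full convolution described above, the following plain-Python example zero-pads the loss gradient by K − 1 on every side and slides the 180-degree rotated filter over it. Stride 1 is assumed, and the helper names and example values are illustrative rather than part of the tool.

```python
def rotate180(k):
    """Rotate a 2D filter by 180 degrees."""
    return [row[::-1] for row in k[::-1]]


def zero_pad(m, pad_h, pad_w):
    """Surround matrix m with pad_h rows and pad_w columns of zeros."""
    width = len(m[0]) + 2 * pad_w
    top = [[0] * width for _ in range(pad_h)]
    body = [[0] * pad_w + list(row) + [0] * pad_w for row in m]
    bottom = [[0] * width for _ in range(pad_h)]
    return top + body + bottom


def correlate2d_valid(x, k):
    """Slide k over x without padding (stride 1)."""
    kh, kw = len(k), len(k[0])
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(len(x[0]) - kw + 1)
        ]
        for i in range(len(x) - kh + 1)
    ]


def input_gradient(d_out, f):
    """dL/dX: full convolution of the rotated filter with dL/dO."""
    padded = zero_pad(d_out, len(f) - 1, len(f[0]) - 1)
    return correlate2d_valid(padded, rotate180(f))


# Illustrative values (assumptions): a 2x2 filter applied to a 3x3 input.
F = [[1, 2], [3, 4]]
dO = [[1, 0], [0, 1]]
dX = input_gradient(dO, F)   # 3x3, the shape of the forward-pass input
```

Each non-zero entry of dO stamps a copy of F into dX at the position of the window that produced it, which is why the result has the size of the original input.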

Pooling
No learning takes place in the pooling layers. The function of a pooling layer is to progressively reduce the spatial size of the convolutional feature maps, and with it the number of parameters and the amount of computation in the network. This also controls overfitting by reducing the feature space available during training. There are two common types of pooling: average and maximum.

Average pooling
During the FP, average pooling outputs the average of the input elements using a scanning window, where a K × K window scans through the input matrix from the top left to the bottom right. During the BP, the error ∂L/∂O is multiplied by 1/K², where K is the dimension of the scanning window, and the result is assigned to the whole pooling block (all units get the same value).
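The forward and backward behaviour of average pooling can be sketched as follows. The example assumes non-overlapping windows (stride equal to K) and input dimensions divisible by K; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def avg_pool_forward(x, k):
    """Average each non-overlapping k x k block of x."""
    return [
        [
            sum(x[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
            for j in range(len(x[0]) // k)
        ]
        for i in range(len(x) // k)
    ]


def avg_pool_backward(d_out, k):
    """Scale each gradient by 1/k^2 and assign it to its whole pooling block."""
    return [
        [d_out[i // k][j // k] / (k * k) for j in range(len(d_out[0]) * k)]
        for i in range(len(d_out) * k)
    ]


# Illustrative values (assumptions).
X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = avg_pool_forward(X, 2)
dX = avg_pool_backward([[4, 8], [12, 16]], 2)
```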

Max pooling
During the FP, max pooling outputs the maximum of the input elements (the winning element) using a scanning window. During the BP, the error is simply assigned to the winning element because the other units in the previous layer's pooling block did not contribute to the output; all the others are therefore assigned a value of zero.
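A short sketch of this routing is given below. It assumes non-overlapping K × K windows and records the index of each winning element during the forward pass so that the backward pass can route the gradient to it; the names and values are illustrative only.

```python
def max_pool_forward(x, k):
    """Return the pooled map and the (row, col) index of each winning element."""
    out, winners = [], []
    for i in range(0, len(x) - k + 1, k):
        out_row, win_row = [], []
        for j in range(0, len(x[0]) - k + 1, k):
            cells = [(x[i + a][j + b], (i + a, j + b))
                     for a in range(k) for b in range(k)]
            value, index = max(cells, key=lambda cell: cell[0])
            out_row.append(value)
            win_row.append(index)
        out.append(out_row)
        winners.append(win_row)
    return out, winners


def max_pool_backward(d_out, winners, in_h, in_w):
    """Assign each gradient to its winning element; every other unit gets zero."""
    dx = [[0] * in_w for _ in range(in_h)]
    for i, row in enumerate(winners):
        for j, (r, c) in enumerate(row):
            dx[r][c] += d_out[i][j]
    return dx


# Illustrative values (assumptions).
X = [[1, 3, 2, 0], [4, 2, 1, 5], [0, 1, 9, 2], [7, 3, 4, 6]]
Y, idx = max_pool_forward(X, 2)
dX = max_pool_backward([[10, 20], [30, 40]], idx, 4, 4)
```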

Rectified linear unit activation (ReLU)
During the FP, the ReLU layer changes all negative elements to zero while retaining the value of the positive elements. No learning takes place and no spatial or depth information is changed. During the BP, gradients of the positive elements retain their value while the rest become zero.
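For completeness, the ReLU forward and backward rules can be written as a two-function sketch on a flattened vector. The function names and values are illustrative.

```python
def relu_forward(x):
    """Replace negative elements with zero and keep positive elements."""
    return [max(0, v) for v in x]


def relu_backward(x, d_out):
    """Pass the gradient where the forward input was positive, zero elsewhere."""
    return [g if v > 0 else 0 for v, g in zip(x, d_out)]


# Illustrative values (assumptions).
X = [-2, 0, 3, 5]
Y = relu_forward(X)
dX = relu_backward(X, [7, 8, 9, 10])
```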

Fully connected
During the FP, the fully connected layer calculates the dot product of the vectorized input X = [X_1, X_2, …, X_n] and the weight matrix W and adds the bias vector B, as described in Eq. (5).
Equations (6) and (7) describe the backward pass, which calculates the local gradients dW and dX using the loss gradient from the previous layer (softmax) together with the input vector X and the weight matrix W.
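The fully connected forward and backward passes can be sketched as below. The sketch uses the standard textbook forms (output W·X + B, weight gradient as the outer product of the incoming gradient with X, input gradient as the transposed weights applied to the incoming gradient), which are assumed here to correspond to the paper's equations; the names and values are illustrative.

```python
def fc_forward(w, x, b):
    """One output per row of W: dot(row, x) + bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def fc_backward(w, x, d_out):
    """Return (dW, dX) for incoming gradient d_out."""
    d_w = [[g * xi for xi in x] for g in d_out]                 # outer product
    d_x = [sum(w[o][i] * d_out[o] for o in range(len(w)))       # W^T applied to d_out
           for i in range(len(x))]
    return d_w, d_x


# Illustrative values (assumptions): 2 inputs, 3 output heads.
W = [[1, 2], [3, 4], [5, 6]]
X = [1, -1]
B = [0, 1, 2]
Y = fc_forward(W, X, B)
dW, dX = fc_backward(W, X, [1, 0, 2])
```

Both gradients are built from multiply-accumulate operations, the same primitive used in the forward pass.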

Softmax loss function
The softmax function takes an N-dimensional vector of real numbers x from the fully connected layer and transforms it into a vector of real-valued probabilities p of size k, each in the range (0, 1) and adding up to 1, as described in Eq. (8).
Cross entropy indicates the distance between what the model believes the output distribution should be and what the true distribution is. It is defined in Eq. (9). Equation (10) describes the final derivative of the cross-entropy loss function with a softmax, with p being the output vector and y being the ground truth target. This is the global gradient for the softmax layer, which is backpropagated to the previous layers.
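A minimal sketch of the softmax, the cross-entropy loss, and their combined gradient is shown below. It assumes a one-hot target and uses the well-known result that the gradient with respect to the softmax input is p − y; subtracting the maximum before exponentiation is a numerical-stability choice of this sketch, not something stated in the text.

```python
import math


def softmax(x):
    """Map real-valued scores to probabilities that sum to 1."""
    m = max(x)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, y):
    """Cross entropy between prediction p and target distribution y."""
    return -sum(t * math.log(q) for t, q in zip(y, p) if t > 0)


def softmax_ce_gradient(p, y):
    """Gradient of the cross-entropy loss with respect to the softmax input."""
    return [q - t for q, t in zip(p, y)]


# Illustrative values (assumptions).
scores = [1.0, 2.0, 3.0]
target = [0.0, 0.0, 1.0]
p = softmax(scores)
loss = cross_entropy(p, target)
grad = softmax_ce_gradient(p, target)
```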

Automated hardware design generation
The model-based hardware design workflow shown in Fig. 4 automatically generates target-agnostic RTL for CNN architectures, covering series networks and directed acyclic graphs. Our tool first parses a pre-trained input graph provided by the user for network information and parameters. This information is used to initialize and generate PEs for the different functionalities; these PEs are then fetched to populate the design space, creating a hardware model representation of the input graph in Simulink. This hardware model is explored using a Multi-Objective Genetic Algorithm (MOGA) by taking advantage of the inter-layer parallelism in Conv layers and dedicating more or fewer PEs to convolutional operations. Once the hardware model is verified, RTL code is generated for the whole design. The tool uses MATLAB scripts, the Simulink environment, and Vivado for a complete compilation of a pre-trained CNN.
Fig. 4 Overall system diagram from CNN input to generated and verified HDL design
The layers presented in Sect. 2 are supported by the compiler and are translated into hardware for FPGA use within a streaming interface, where data inputs are streamed through the design pipeline with control signals. The control signal is generated alongside the pixel stream to schedule and control the process; it specifies the beginning and end of rows and columns plus the validity of each pixel using a 5-bit binary vector, as seen in Fig. 5. The different layers are converted into separate processing units, and each unit contains a number of Processing Elements (PEs) for operations such as convolution, pooling, non-linearity, and fully connected computation. The PEs for the FP are described first; changes to the design are then introduced to accommodate backpropagation and learning.
The verification environment, seen in Fig. 6, is a co-simulation environment between the hardware Simulink model and an HDL simulation tool, Vivado in our case. The tool compiles an HDL project and generates a Simulink co-simulation block. The block is used to communicate between the Simulink environment and the HDL code. The client-side Simulink environment allows the user to provide input stimulus in the form of a MATLAB classification dataset or test images. The server reads the dataset samples and the corresponding ground truth information, generates a data stream and a control signal from the images, and feeds them into the HDL code block. The HDL code block communicates with the HDL simulation tool by sending this data and reading its outputs. This is achieved using Tool Command Language (TCL) scripts. The output data is sent back to the Simulink environment, where it can be displayed, saved, visualized, and analysed.
Our implementation uses a streaming interface that lends itself well to real-time data streams instead of static data reads from memory. This means we do not use the standard array of PEs approach, which is typically efficient for offline inference setups. Instead, we translate the architecture into a pipeline through which data is streamed, allowing new samples to be read as they become available. The blocks using this interface do not need a configuration option for the exact image size or the size of the inactive regions. In addition, if image parameters are changed, there is no need to update each block. Instead, a single update to the image parameters at the serialization step is sufficient, and the compiler changes the subsequent tuneable parameters of the other blocks, such as the size of the line buffers, as a function of the new input data size. Once these parameters are fixed and a hardware model is generated, it is not possible to modify them at runtime. By using a streaming pixel interface with control signals, each block or object starts computation on a new segment of pixels at the start-of-line or start-of-frame signal. The computation occurs whether or not the block or object receives the end signal for the previous segment. The work on the automated generation of the FP pipeline is described at length in our previous publication [18]. To develop our new online system, we revisit and update critical information for the scientific community.
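To illustrate how a frame can be serialized into a pixel stream accompanied by a 5-bit control vector, a small sketch follows. The field names (h_start, h_end, v_start, v_end, valid), their ordering, and the convention that the frame-start and frame-end flags are raised on the first and last pixel of the frame are assumptions made for this example; blanking intervals and inactive pixels are omitted.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Ctrl(NamedTuple):
    """Hypothetical 5-bit control vector accompanying each streamed pixel."""
    h_start: bool   # first pixel of a row
    h_end: bool     # last pixel of a row
    v_start: bool   # first pixel of the frame
    v_end: bool     # last pixel of the frame
    valid: bool     # pixel carries active data


def stream_frame(frame: List[List[int]]) -> Iterator[Tuple[int, Ctrl]]:
    """Serialize a frame row by row into (pixel, control) pairs."""
    rows, cols = len(frame), len(frame[0])
    for r, row in enumerate(frame):
        for c, pixel in enumerate(row):
            h_start = c == 0
            h_end = c == cols - 1
            yield pixel, Ctrl(h_start, h_end,
                              h_start and r == 0,
                              h_end and r == rows - 1,
                              True)


# Illustrative 2x2 frame (assumption).
stream = list(stream_frame([[1, 2], [3, 4]]))
```

Because every pixel carries its own framing information, a downstream block in this sketch needs no knowledge of the image size, which mirrors the property described above.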

Proposed convolutional layer design
The convolutional PE (C_PE) requires two main processing blocks: a line buffer and a Multiply and Accumulate (MAC) block. The line buffer uses FIFOs to buffer the input feature maps as they are streamed into the PE, then taps out the elements required for the currently active window. The active window is always of size K × K, which is the size of the applied filter. For instance, a 3 × 3 filter requires 9 elements from different rows, synchronized using a control signal. The MAC core receives the tapped-out elements, multiplies them by their corresponding filter elements, and accumulates them into an element of the output feature map using an adder tree. The output feature map is streamed further into the pipeline for subsequent processing as new elements become available each cycle [18]. Equations (11), (12), and (13) show the resource modelling equations, where K is the filter dimension, N_mult is the number of multipliers, N_add is the number of adders, and N_add_stages is the number of stages needed in the adder tree. Equation (14) shows the main latency modelling equation for the convolutional pipeline.
Fig. 5 Control signal for a 3 × 3 window with 1-pixel padding interval [17]
Fig. 6 Co-simulation and verification environment using Simulink as a client and Vivado as server, allowing for sending stimuli and visualizing responses
The variables for the latency modelling equation are:
• BP, FP: Back Porch and Front Porch values for vertical blanking. The total of Back Porch + Front Porch must be at least 2 times the largest filter size.
• FMi_W, FMi_H: Feature Map width and height in pixels.
For the backward pass, two more C_PEs are required to calculate dF and dX as described in Sect. 3.1 and Eqs. (3) and (4). These PEs require weights and input data from the last forward pass; thus every training cycle requires saving the input feature maps and the last updated weights for the FP C_PE into an internal memory block. Weights for the C_PE are K² elements saved in a FixP = 16-bit fixed-point format and thus require K² × FixP/8 bytes of memory. A Conv layer with N Conv PEs therefore requires N × K² × FixP/8 bytes of internal memory to buffer the weights for a fully parallelised convolutional layer. The feature maps require buffering of N × (FM_w × FM_h × FixP)/8 bytes, bearing in mind that feature map dimensions are spatially reduced deeper in the network by pooling and convolution. The dimensionality is described by FM_w,out = (FM_w − K + 2P)/S + 1, where P is padding and S is stride. During backpropagation, weights and input feature maps are read by the dF and dX units, requiring additional registers to save the intermediate results and two more C_PEs to calculate new local gradients and update the weights. The results are saved into BRAM. dF is used to update the weights, and the new weights are read again in the next FP cycle; dX is read by the next layer. This process needs to be synchronized so that the required inputs can be read from and written to memory when needed. This is achieved using a new scheduler and a memory-control block added to each convolutional unit as needed. The C_PE used for the FP convolution can be reused during the BP to perform the 2D convolution required to calculate dF and update the weights. The same C_PE can be adjusted to calculate dX as well if latency is not a priority. Otherwise, parallelising the dF and dX calculations requires an additional C_PE. Figure 7 shows a backpropagation unit: input X from the previous FP is saved and read from the memory block to be used in calculating the gradients, and two C_PEs are dedicated to calculating dF and dX. dX is calculated using the rotated F matrix and dY of the next layer, and is passed to the next layer as dY. dF from the current cycle is used to update the weights F for the next FP using the learning rate α. The new weights F are used for the next FP, and the outputs are again saved in memory for the next BP cycle. The generated Simulink hardware model of the same process can be seen in Fig. 8; its inputs include a BP enable signal, the previous layer's dL stream, and the learning factor.
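The buffering requirements quoted above can be turned into a small bookkeeping sketch. The byte counts follow the expressions in the text (N × K² × FixP/8 for weights and N × FM_w × FM_h × FixP/8 for feature maps); the output-size helper uses the standard convolution arithmetic with symmetric padding, which is an assumption of this sketch, and the layer sizes in the example are illustrative.

```python
def weight_bytes(n_pe, k, fixp_bits=16):
    """Bytes needed to buffer the K x K weights of n_pe convolutional PEs."""
    return n_pe * k * k * fixp_bits // 8


def feature_map_bytes(n_maps, fm_w, fm_h, fixp_bits=16):
    """Bytes needed to buffer n_maps feature maps of size fm_w x fm_h."""
    return n_maps * fm_w * fm_h * fixp_bits // 8


def output_dim(fm, k, pad=0, stride=1):
    """Feature-map size after a K x K window (assumes symmetric padding)."""
    return (fm - k + 2 * pad) // stride + 1


# Illustrative layer (assumptions): 8 PEs, 3x3 filters, 28x28 inputs, 16-bit data.
w_bytes = weight_bytes(8, 3)
fm_bytes = feature_map_bytes(8, 28, 28)
next_w = output_dim(28, 3)
```

In this example the saved feature maps dominate the footprint by roughly two orders of magnitude, which is consistent with the emphasis on feature-map buffering in the text.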

Proposed pooling layer design
For our work, we consider both maximum and average pooling implementations.

Average pooling
A stripped-down version of the convolutional PE (C_PE) is used for average pooling. The same memory control and kernel core structure is used to apply weights that result in averaged values of the scanning windows; no registers are needed to save the kernel coefficients, and no weights or biases are read from memory. For the BP, the same line buffer FIFOs are used to tap a zero mask for every element from dX and add the value of that element to the zero mask.

Maximum pooling
Maximum pooling uses the same memory control structure but replaces the MAC core with a comparator tree. For the backward pass, there is no gradient with respect to non-maximum values. Thus, the gradient dX from the next layer is passed back only to the winning (maximum) element. All other neurons get zero gradients. To implement this, the same line buffers are used to store elements from dX, and the input feature map elements X from the FP are saved into on-chip FIFO memory (FIFO18E1 or FIFO36E1) alongside their indices. These elements are used to upsample the input feature map using demultiplexers.

Proposed fully connected layer design
The fully connected PE (FC_PE) is implemented as parallelized MAC units. These units multiply the feature-map elements streamed in from the last layer by their corresponding weights and accumulate them into an output register. This is done for each FC output head. The process can be further parallelized to alleviate the processing bottleneck of streamed data by using multiple FC_PEs in parallel to process the data channels from the last layer concurrently. Equations (15), (16), and (17) show the resource modelling equations, while Eq. (18) shows the latency modelling equation.
In these equations, FC_out is the number of output heads in the FC layer, N is the number of FC_PEs dedicated to every output head, N_add is the number of adders needed for the adder tree, and N_reg is the number of registers needed to accumulate the outputs of the different PEs.
P is the parallelism coefficient, a new term equal to the ratio of the number of input data channels to the number of FC_PEs. Ch_D/FC_PE therefore equals 1 when the number of FC_PEs is equal to the number of input data channels, fully parallelizing the computation at the expense of chip resources.
For the BP, the FP PEs can also be reused for the MAC calculations needed to compute dF and dX, which require matrix multiplications. The dX calculations are parallelized to improve throughput and thus require additional resources. A scheduler and a memory control block are also required, both to synchronize the process and to save intermediate results from the forward pass. Figure 9 shows how the FC_PEs are used to calculate the FP output FC_Out, which is saved in memory for the BP, the gradients dF used to update the weights, and dX, which is backpropagated to the next layer.

Resource estimation for fully connected units and FC PEs for FP
The generated hardware model uses generic PEs for the FP pipelines, which allows us to estimate the resources needed for the generated pipeline.
• DSP slices: 1 DSP slice is mapped to every FC unit and is dedicated to multiplying and accumulating the weights assigned to the unit with incoming feature map elements.
• LUT slices: Upward of ~20 slices are dedicated to every FC_PE as logic.
• Block RAM: No BRAM is required for the FC units and PEs.

Resource estimation for fully connected units and FC PEs for BP
The FPGA hardware representation of the BP pipeline of the fully connected PEs follows these estimations.
• DSP slices: 1 DSP slice is mapped to every FC unit and is dedicated to multiplying and accumulating the weights assigned to the unit with incoming feature map elements. For an FC layer with FC_Out output heads, FC_Out DSP slices are required; an additional FC_Out slices are needed to calculate dX, assuming the FP slices are reused to calculate dF.
• LUT slices: Upward of ~400 slices are dedicated to every FC head as logic: 10 LUT slices for dX, ~200 for dF, and ~100 for the memory control and scheduler. Upward of ~20 slices are dedicated to every FC_PE as logic.
• Block RAM: One 18 K BRAM tile is required to save one feature map vector of 32 × 32 (one data channel) at 16 bits.
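The Block RAM figure above can be checked with simple arithmetic. The sketch below assumes each 18 K tile is configured as 1024 words of up to 18 bits, so that one 16-bit element occupies one word; this configuration and the helper name are assumptions made for illustration.

```python
import math

# Assumption: each 18 K BRAM tile is used as 1024 words of up to 18 bits.
BRAM18_WORDS = 1024
BRAM18_MAX_WIDTH = 18


def bram18_tiles(n_words, word_bits=16):
    """Number of 18 K tiles needed to store n_words of word_bits each."""
    if word_bits > BRAM18_MAX_WIDTH:
        raise ValueError("word does not fit the assumed 1K x 18 configuration")
    return math.ceil(n_words / BRAM18_WORDS)


one_channel = bram18_tiles(32 * 32)         # one 32 x 32 feature map at 16 bits
three_channels = bram18_tiles(3 * 32 * 32)  # e.g. three data channels
```

A 32 × 32 map holds exactly 1024 elements, which fills one tile under this assumption and agrees with the estimate of one tile per data channel.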

Proposed softmax and loss function design
The softmax layer is optional. It is implemented using MathWorks' Native Floating Point library [35] and requires 80 DSP slices, 20 K LUT slices, and no BRAM.

Scheduler
The new scheduler is implemented as an automated block using Simulink. It uses conditional switches to route input data streams between the FP and BP pipelines. The streaming architecture allows for variable stream sizes, as the data is always accompanied by a control signal. The MAC core size, however, is fixed and presents a challenge since the BP path convolutions have "filters" of a different size. This is circumvented by buffering the results and reusing the existing MAC core as needed. It is possible to compile the MAC core to support BP convolutions and reduce the latency of that stage; for our experiments, we use MAC cores that only support the original FP size. A BP_En signal acts as the main synchronizing signal for switching between the two modes. This signal enables reshaping the data, rotating the F(t − 1) weight matrix, buffering the dL gradients for the MAC operations, and updating the weight and bias matrices with the new gradients. The scheduler block receives the BP_En signal, the learning rate variable, dL, dF, and dX, plus their control signals when needed.
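The idea of buffering partial results so that a fixed-size MAC core can serve a larger BP window can be sketched as follows. This is an illustration of the principle only and not the scheduler's implementation: the core size of 9 (a 3 × 3 FP filter), the zero-filling of the last chunk, and all names are assumptions.

```python
def mac_core(window, coeffs):
    """One pass of a fixed-size MAC core: multiply pairwise, then sum."""
    if len(window) != len(coeffs):
        raise ValueError("window and coefficient lengths must match")
    return sum(w * c for w, c in zip(window, coeffs))


def chunked_mac(values, coeffs, core_size):
    """Accumulate a long dot product over several passes of a small MAC core."""
    acc, passes = 0, 0
    for start in range(0, len(values), core_size):
        v = values[start:start + core_size]
        c = coeffs[start:start + core_size]
        fill = core_size - len(v)            # zero-fill the final, shorter chunk
        acc += mac_core(v + [0] * fill, c + [0] * fill)
        passes += 1
    return acc, passes


# Illustrative case (assumption): a 5 x 5 BP window on a 3 x 3 (9-input) core.
values = list(range(25))
coeffs = [1] * 25
result, passes = chunked_mac(values, coeffs, core_size=9)
```

The number of passes grows with the size of the BP window, which is the latency cost that a MAC core compiled for the BP size would avoid.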

Automated design generation using MOGA
When all processing elements are generated from the high-level architecture data (number of layers, network graph, layer-specific parameters), they are used to populate a design space. Generating a design space for an architecture requires three steps:
• Parse an input architecture for parameters and connections in MATLAB.
• Programmatically generate generic PEs based on the data in MATLAB/Simulink.
• Use the PEs, the latency and resource estimation equations, and the architecture data to model, explore, and generate a design space.
Each architecture has a high-level description which includes the number of layers, the connections between them, and PEs initialized using filter size, stride, and padding parameters. The high-level description also allows for optionally including the softmax layer and training pipeline, or just implementing an inference pipeline. Lastly, the FPGA board's working frequency and its resources in terms of available DSP slices, BRAM, and logic are considered. The design space exploration process is treated as a multi-objective optimization problem. Our goal is to find different configurations of the high-level architecture using intra-layer parallelism. Multi-objective optimization involves minimizing multiple objective functions subject to a set of constraints; in our case, the objectives are the latency and the FPGA resources. The constraints depend on the architecture and user input. Figure 10 shows how the convolutional PEs are mapped to the pipeline; pooling and non-linearity follow each layer but are not included in the figure, and the number of FC PEs can be adjusted to parallelize the FC operations. PEs are dedicated to each active data channel streaming a feature map from the previous layer. The number of active PEs tasked with applying the layer's filters is adjusted to reduce the resource overhead. For instance, a direct mapping means every filter has its own PEs. The number of PEs is what we aim to optimize, and this is done by exploring the design space. Our optimization algorithm of choice is a genetic multi-objective optimization algorithm [36]. We provide the algorithm with the latency estimation equations per layer and PE, the resource estimation equations per PE, the number of convolutional layers, the number of filters per layer, the size of the filters, and the number of DSP slices available on the chip. The optimization goal is to minimize the number of PEs needed to implement an architecture while minimizing latency.
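Because latency and resources are minimized together, no single configuration is best in both; the useful candidates are those that no other candidate beats on every objective. The following sketch shows this non-dominated filtering for (latency, DSP) pairs. It illustrates the selection criterion only and is not the genetic algorithm itself; the candidate values are invented for the example.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep only the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (latency, DSP slices) candidates, not measured results.
candidates = [(10, 900), (12, 600), (11, 950), (20, 300), (20, 310)]
front = pareto_front(candidates)
```

In this example, (11, 950) is discarded because (10, 900) is both faster and smaller, while the three remaining points represent genuine trade-offs between latency and DSP usage.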
We provide the algorithm with three input arguments: a fitness function, an input vector whose size equals the number of variables to be optimized in the problem, and bounding constraints. The MOGA returns the points on the Pareto front and the objective function values at those points. In each iteration, the population of the current generation is the set of samples the MOGA picks for optimization, and a fitness score is returned for each of them; these scores inform the sampling of the next generation's population. With a large population size, the genetic algorithm searches the solution space more thoroughly, reducing the chance that it fails to find a global minimum. However, a large population size also makes the algorithm run more slowly. In our tests, we use larger population sizes for the deeper networks. The number of variables in the optimization is set to the number of convolutional layers n, with an input vector P of n elements, each constrained by a lower bound of 1 and an upper bound of ub, where ub is the number of convolutional filters in that layer. The optimization objectives are described by y, a vector of size four, where latency, DSP slices, LUT slices, and BRAM are optimized. The DSP objective is described by Eqs. (19) and (20). k is a vector of size n where each element is the dimension of the filters in that layer (3 × 3, for example). L is a vector of size n that holds the maximum number of PEs needed to fully parallelize the computations, as calculated in Eq. (19); L(i) therefore holds the number of PEs required for layer i in the current configuration. For instance, if the number of input channels to L(2) is P(1) = 3, and P(2) is randomly set by the optimizer to 3, then L(2) is 9, meaning 9 PEs are required for the fully parallelized computations in that layer.
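The worked example for L(2) can be reproduced with a short sketch. It assumes that L(i) is the product of the number of active channels feeding layer i and P(i), which is what the example above implies. The input channel count of the first layer, the helper names, and the DSP estimate (one DSP slice per multiplier, k × k multipliers per PE) are assumptions made for illustration and are not Eqs. (19) and (20).

```python
from typing import List


def pes_per_layer(p: List[int], in_channels: int) -> List[int]:
    """L(i) = (active channels feeding layer i) * P(i)."""
    counts = []
    prev = in_channels  # assumption: layer 1 is fed by the image channels
    for active in p:
        counts.append(prev * active)
        prev = active
    return counts


def within_bounds(p: List[int], ub: List[int]) -> bool:
    """Check 1 <= P(i) <= ub(i), with ub(i) the filter count of layer i."""
    return len(p) == len(ub) and all(1 <= a <= u for a, u in zip(p, ub))


def dsp_estimate(p: List[int], k: List[int], in_channels: int) -> int:
    """ASSUMPTION: one DSP slice per multiplier and k*k multipliers per PE."""
    return sum(c * kk * kk for c, kk in zip(pes_per_layer(p, in_channels), k))


p = [3, 3]          # P(1) = 3, P(2) = 3, as in the text
k = [3, 3]          # 3 x 3 filters in both layers
ub = [8, 8]         # hypothetical filter counts per layer
layers = pes_per_layer(p, in_channels=1)  # in_channels=1 is an assumption
dsp = dsp_estimate(p, k, in_channels=1)
print(layers, dsp, within_bounds(p, ub))
```

With these inputs the second entry of `layers` is 9, matching the value of L(2) given above.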

MOGA results
The architectures used for all experiments in this section and in Sects. 5.2 and 5.3 are listed in Tables 1 and 2. Four custom architectures with differing depths and numbers of operations are used to simulate possible implementations of different complexity and computational load. For instance, a custom network of the shape a1 − a2 − a3 − a4 represents a network with four convolutional layers with a1, a2, a3, and a4 filters, respectively; each of these layers is followed by a ReLU non-linearity and a pooling layer (average pooling was used for all the architectures). Y(DSP), Y(LUT), and Y(BRAM) use the resource estimation data from the previous section, and the latency equations, Eqs. (14) and (18), are used to model the active configuration's latency. This is based on the number of C PEs used per convolution layer and the number of FC PEs mapped to parallelize the FC layer. We use DSP slices as an optimizable objective along with latency in the examples below because we have noticed that the DSP slice estimation is the most accurate, unlike the LUT slice and BRAM estimations. DSP slices are easier to predict because they are directly mapped to multiplications, whose number can be estimated with high accuracy. When DSP slices are used for optimizing resources against latency, the other metrics consistently satisfy the constraints and conditions set by the user as well. Figure 11 shows the MOGA results; a clear trade-off between latency and DSP slices can be seen in all architectures. For MNIST 8-8, the optimizer found fewer configurations, bounded between a minimal mapping and a direct mapping, which is possible for such a small architecture. The other three architectures were all constrained by the maximum number of DSP slices available on the Zynq-7100, which is 2020.

FPGA experimental setup
In Sects. 5.2 and 5.3, we perform place-and-route implementations to assess the performance of the proposed pipelines. Latency and resources are used to explore the performance trade-offs between the different designs and to estimate the resource overhead of including the learning pipeline. Multiple designs are generated for each architecture using the MOGA; these designs provide a trade-off between throughput and resources by reducing the number of active data channels per layer. Four designs for each architecture are used in the experimental setup. Training is performed using stochastic gradient descent. We specify a mini-batch size (B_s), which sets the size of the subset of the available data used for training in one iteration. We update the weights after passing the data samples in each mini-batch, meaning the gradients are calculated once for each mini-batch. We stream the mini-batch through the network one sample at a time; this does not affect the final training results. We choose to do this for two reasons. First, streaming mini-batch samples lends itself well to our main design objective, which is generating CNN inference and online training designs. Continuous learning inherently requires processing small sample sizes for training because of the limited availability of new data; this stands in contrast to having access to a full dataset off-chip and reading a mini-batch to train offline. Second, smaller mini-batch sample sizes, starting from one, allow for successful training on smaller designs that use only on-chip memory and can require less than 2% of the chip's DSP slices, as our results show in Sect. 5.3; smaller input sample sizes allow for on-board training with limited resources. Larger sample sizes can be used depending on a design's requirements, but for our experiments all sample sizes were set to one to simulate a continuous learning scenario where the design is deployed in an embedded setting and new data is
available in smaller samples for fine-tuning an already-trained network. Training a model from scratch using an offline dataset is not the objective of our work; for such scenarios, discrete GPU setups remain more suitable. It is worth noting that the new data made available is of known classes. Continuous learning of new classes is not possible using our setup since it would require an online change to the architecture's output layers; this is possible using the FPGA's online reconfiguration capabilities but is not supported in this work. Tables 3, 4, 5, and 6 show different implementations of the designs generated from the architectures listed in Table 1 and trained on the datasets in Table 2. The generated designs can support FP inference alone or include backpropagation. Multiple designs with differing resource requirements and throughput performance were used for each architecture to highlight the design options a user can pick from based on hardware limitations and task-specific performance objectives. All simulations and implementations were carried out on the Xilinx Zynq-7100 board. The post-place-and-route implementations of the designs highlight the overhead of including backpropagation support. The overhead is consistent with the resource estimation details given in the previous section. Additional DSP slices are mapped to account for the additional parallelized processing needed to calculate local gradients and update weights. Additional BRAM blocks are dedicated to buffering intermediate data from the FP required for BP calculations, and additional LUT slices are needed for the local schedulers and the overall new BP glue logic. Input data is sent to the testbench using the Vivado/Simulink co-simulation environment described in Sect. 4.1. The architectures were also trained on a single GeForce GTX 1050 Ti GPU and an i7-6700 3.40 GHz × 8 CPU; the hyperparameters used for each training session were replicated for the FPGA training, and the results are
reported in Tables 3, 4, 5, and 6. GPU performance is optimal on batches because of the GPU's built-in parallel processing capabilities. A different approach to comparing performance, based on our practical objectives, is to measure one-sample latency, which simulates data streams. Accuracies achieved using these hyperparameters specified in Tables 3
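The statement that streaming a mini-batch one sample at a time does not change the training result can be checked on a toy problem. The sketch below is not the CNN pipeline: it uses a hypothetical linear model with a squared-error loss and compares a weight update computed from the whole mini-batch at once with one computed by accumulating per-sample gradients while the weights stay fixed. All names and values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (input vector x, target y)


def sample_gradient(w: Sequence[float], x: Sequence[float], y: float) -> List[float]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one sample."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def batch_update(w: Sequence[float], batch: List[Sample], lr: float) -> List[float]:
    """Reference: average the gradient over the whole mini-batch at once."""
    grads = [sample_gradient(w, x, y) for x, y in batch]
    mean = [sum(g[j] for g in grads) / len(batch) for j in range(len(w))]
    return [wi - lr * gi for wi, gi in zip(w, mean)]


def streamed_update(w: Sequence[float], batch: List[Sample], lr: float) -> List[float]:
    """Stream samples one at a time, accumulate gradients, update once at the end."""
    acc = [0.0] * len(w)
    seen = 0
    for x, y in batch:  # weights are not changed inside the mini-batch
        g = sample_gradient(w, x, y)
        acc = [a + gi for a, gi in zip(acc, g)]
        seen += 1
    return [wi - lr * (a / seen) for wi, a in zip(w, acc)]


# Hypothetical mini-batch of three samples.
batch = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 1.0)]
w0 = [0.1, -0.2]
w_batch = batch_update(w0, batch, lr=0.05)
w_stream = streamed_update(w0, batch, lr=0.05)
print(w_batch, w_stream)
```

The two updates agree up to floating-point rounding because the weights are only changed once per mini-batch in both cases; only the order in which samples are presented to the hardware differs.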

FPGA experimental results and analysis
All tables show the resources of the different designs for FP and BP, highlighting the overhead of including a training pipeline in the design, which pushes some implementations beyond the available on-chip resources (highlighted in red in Tables 4 and 5). These designs were included to highlight the overhead and the possibility of implementation when a larger board is available; the latency results included for them, however, are estimates from pre-place-and-route simulations. The different designs generated based on the number of dedicated PEs also highlight a clear trade-off between latency and resources. For instance, in Table 3, designs 1 and 4 show a 200× speed-up for 9× more DSP slices, allowing the user to tailor the design process to their specific application. In training times, the GPU consistently performs better than the FPGA except for designs 1 and 2 in Tables 3 and 4; this is because the FPGA processes data as a stream, one sample at a time, while the GPU is built to process batches in parallel. A more accurate comparison can be seen in the stream-mode simulation on the GPU, which shows that the FPGA consistently outperforms the GPU in single-sample scenarios like online continuous learning. Latency results highlighted in green indicate that the FPGA outperformed the GPU for that specific design configuration and dataset; orange indicates the opposite. The results highlight FPGA platforms as strong candidates for future online learning applications: their per-sample latency and energy overhead consistently outperform GPUs in online training scenarios with limited input data rates. Table 7 shows comparisons with related works. There is a clear lack of research in the area of online training of CNNs on FPGAs; to the best of our knowledge, [27] and [30] are the only works that have implemented training architectures on FPGAs. In [27] and [30], Stratix-V and Stratix-10 boards were used with 150 MHz and 240 MHz frequencies, 32-bit and 16-bit data representations, and 1963
and 5760 DSP slices to implement LeNet-5 on MNIST and a custom CIFAR-10 architecture, respectively. Our training pipeline allows new data to be processed as soon as it becomes available while previous samples are still being processed. However, the results we reported were for single samples, meaning the resources in the already-processed areas of the pipeline become idle. When data is provided as soon as an area of the pipeline becomes idle, we achieved an effective improvement in latency of upward of 4× for deeper architectures. It is also worth mentioning that [30] used 5760 DSP slices at 240 MHz against our 2020 DSP slices at 200 MHz.
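The benefit of feeding a new sample as soon as a pipeline area becomes idle can be illustrated with the standard idealized pipeline timing model. This is a sketch under stated assumptions, not the analytical latency model of this work: the stage latencies are hypothetical, stalls and memory effects are ignored, and a new sample is assumed to enter once the slowest stage is free.

```python
from typing import List


def serial_cycles(stages: List[int], n_samples: int) -> int:
    """Each sample waits until the previous one has left the whole pipeline."""
    return n_samples * sum(stages)


def pipelined_cycles(stages: List[int], n_samples: int) -> int:
    """Samples overlap: after the first one, the slowest stage sets the pace."""
    return sum(stages) + (n_samples - 1) * max(stages)


# Hypothetical per-stage latencies in clock cycles, NOT measured values.
stages = [100, 200, 100]
n_samples = 10

serial = serial_cycles(stages, n_samples)
overlapped = pipelined_cycles(stages, n_samples)
print(serial, overlapped, serial / overlapped)
```

In this model the gain approaches the ratio of the total pipeline latency to the slowest stage latency as more samples are streamed, which is consistent with deeper architectures benefiting more from overlapped processing.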

Conclusion
In this work, we have presented a fully automated backpropagation pipeline design tool for CNNs with a focus on online training. The pipeline uses a streaming interface and a modular design approach: generic PEs are generated and used to populate a design space based on the user's specific requirements, including the option to compile a training pipeline. We explore the design space using a multi-objective genetic algorithm and an analytical model of the network's latency and estimated resources; these allow the hardware model to be optimized by exploiting the CNN's intra-layer parallelism. Latency trade-offs of 95× for MNIST, 71× for CIFAR-10, and 18× for SVHN were achieved. Trade-offs in resource utilization in terms of DSP slices were 44× for MNIST, 52× for SVHN, and 24× for CIFAR-10. The training pipeline is generated based on an overlap in computation between the forward and backward passes. We translate the overlap into hardware by reusing most of the forward pass pipeline, reducing the resource overhead. The design also minimizes the need for off-chip memory by using BRAM to buffer parameters and intermediate feature maps when needed; feature maps are streamed during the FP. We minimize data movement by placing BRAM near the PEs. This BRAM can also be used to buffer data when external memory is required for larger designs, which was not covered in this work as we only used small- to medium-scale architectures. The results show that the FPGA implementations of these architectures outperform their GPU and CPU counterparts in most online learning scenarios where samples are streamed at a limited rate. Speed-ups of 2.8×, 5.8×, and 3× over the GPU were achieved on three deeper architectures trained on MNIST, SVHN, and CIFAR-10, respectively. Using batch processing, GPUs consistently outperform the FPGA for larger designs. Our main objective in developing these tools is to support the on-board deployment of deep learning models for both inference and online learning. This tool can provide deep
learning engineers with an easy and accessible design cycle. Scaling these tools efficiently to account for the rapid progress in modern CNNs, in terms of both the shape and the size of new architectures, is a difficult challenge. Future works in the field of on-board learning should tackle the limitations

Fig. 2 Demonstration of how ∂L/∂F = Convolution of input matrix X and loss gradient ∂L/∂O

• Padding: Time needed to implement vertical and horizontal padding on input data and move it to the output.
• Tap_Out: Time needed to move data from the line buffer into registers, which tap out the relevant elements needed for the multiplier; K Clk cycles.
• Multiply_Out: Time needed to move data from the Tap_Out registers to the multipliers; K Clk cycles.
• Add-Tree_Out: Time needed to move the Multiply_Out results from registers to the adder tree. Takes (N_add_stages × Clk) + 2 delay cycles to put intermediate data in registers.
• In_Delay, Out_Delay: Time needed to move data into registers and give the system enough time for processing. Set to 4 Clk each; In_Delay is only needed for the first layer, and Out_Delay is included for every convolutional PE.
• ReLU: Time needed to apply the rectified linear unit to elements; a conditional switch is used, and it takes 1 clock cycle per element.
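The fixed cycle counts listed above can be added up in a short sketch. This is only a tally of the listed components, not the latency equation of the convolutional PE: it assumes that K is the filter width in elements, takes the number of adder-tree stages as an input rather than deriving it, and leaves out the padding and ReLU terms because they depend on the amount of data. The function and parameter names are hypothetical.

```python
def conv_pe_fixed_cycles(k: int, n_add_stages: int, first_layer: bool) -> int:
    """Sum the data-independent cycle counts of one convolutional PE."""
    tap_out = k                      # line buffer -> tap registers, K cycles
    multiply_out = k                 # tap registers -> multipliers, K cycles
    add_tree_out = n_add_stages + 2  # adder-tree stages plus 2 delay cycles
    in_delay = 4 if first_layer else 0  # In_Delay applies to the first layer only
    out_delay = 4                    # Out_Delay applies to every convolutional PE
    return tap_out + multiply_out + add_tree_out + in_delay + out_delay


# Hypothetical example: 3 x 3 filter with a 4-stage adder tree (assumed value).
first = conv_pe_fixed_cycles(k=3, n_add_stages=4, first_layer=True)
later = conv_pe_fixed_cycles(k=3, n_add_stages=4, first_layer=False)
print(first, later)
```

The difference between the two results is the 4-cycle In_Delay, which is paid once at the input of the pipeline.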

Fig. 7 Backward pass diagram for the Convolutional Unit, T for the FP cycle, T + 1 for backpropagation, T − 1 for the input data saved from the previous cycle

Fig. 9 Backward pass diagram for the Fully Connected Unit

Table 1
Architectures used for validation

Table 7
of this work and the ones that preceded it. This includes support for deeper architectures and for the different layers that are present in modern CNNs; automating the design and implementation of external memory protocols is, therefore, necessary as well.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.