Recursive SQL and GPU-support for in-database machine learning

In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary, as database systems have rarely been used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline consisting of data preprocessing, model training and model validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra by an operator for automatic differentiation for use within recursive tables: with the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at the hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units. In our evaluation, automatic differentiation accelerated the runtime in proportion to the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.


Introduction
Typically, the steps of machine learning pipelines, which consist of data preprocessing [1,2], model training/validation [3] and finally deployment on unlabelled data [4], are embedded in Python scripts that call specialised tools. In this work, we instead express such pipelines in SQL and measure the performance of generated GPU kernels. In particular, this work's contributions are
• machine learning pipelines expressed in pure SQL,
• automatic differentiation in SQL that uses a lambda function [52][53][54] to derive the gradient and generates LLVM code,
• the integration of gradient descent as a database operator,
• fine-tuned GPU kernels that maximise the GPU-specific throughput even for small and medium batch sizes,
• and an evaluation of strategies for synchronising gradient descent on processing units with different throughput.
The paper first summarises subsidiary work on machine learning pipelines, GPU co-processing and in-database machine learning (Sect. 2), before it proceeds with the integration of gradient descent inside a database system. In detail, we focus on data preprocessing for machine learning pipelines and recursive computation of gradient descent within the code-generating database system Umbra (Sect. 3). During code generation, an operator for automatic differentiation compiles the gradient from a lambda expression (Sect. 4). This allows expressing arbitrary loss functions, including matrix operations as used for a neural network (Sect. 5). Based on automatic differentiation and a dedicated operator for gradient descent, we compile LLVM code directly for GPUs. The generated code processes mini-batches on GPUs and synchronises parallel workers on multiple devices as well as multiple learners on a single GPU (Sect. 6). We evaluate CPU and GPU-only approaches in terms of performance and accuracy using a NUMA-server cluster with multiple GPUs (Sect. 7).

Related work
This work incorporates past research on deploying continuous machine learning pipelines, GPU co-processing and in-database machine learning, which is introduced here. Gradient descent minimises a loss function l_{X,y}(w) between an approximated value m_w(x) and the given label y, for example, mean squared error:

l_{X,y}(w) = 1/|X| · Σ_{(x,y)} (m_w(x) − y)²  (2)

To minimise l_{X,y}(w), gradient descent updates the weights per iteration by subtracting the loss function's gradient times the learning rate γ until the optimal weights w_∞ are approximated (Eq. 6). Stochastic gradient descent takes one tuple (x, y) for each step (Eq. 4), whereas batch gradient descent considers all tuples per iteration and averages the loss (Eq. 5):

w_{t+1} = w_t − γ · ∇l_{x,y}(w_t)  (4)
w_{t+1} = w_t − γ · ∇l_{X,y}(w_t)  (5)

Smaller batch sizes, mini-batches, are mandatory when the entire input does not fit into memory, and they allow parallelism later on. Therefore, mini-batch gradient descent splits an input dataset X into disjoint mini-batches X = X_0 ⊎ ⋯ ⊎ X_o. Using a recursive table, gradient descent can be expressed in SQL: Listing 1 shows five iterations based on an exemplary loss function with two weights (Eq. 7). First, the weights get initialised (line 2), then each iteration updates the weights (Eq. 5, line 3) based on manually derived gradients (Eq. 8) and γ = 0.05. To avoid overfitting, one mini-batch or an explicitly given input will serve as the validation dataset for accuracy measurements. The remaining mini-batches are used as input for every iteration of parallel and distributed mini-batch gradient descent.
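As a concrete illustration of these update rules (not the paper's SQL implementation), the following Python sketch performs mini-batch gradient descent for a linear model m_w(x) = a·x + b under mean squared error; the data, learning rate and iteration count are arbitrary example values:

```python
def gradient(batch, a, b):
    """Average gradient of the mean squared error for m(x) = a*x + b."""
    da = db = 0.0
    for x, y in batch:
        err = (a * x + b) - y
        da += 2.0 * err * x       # d/da of (a*x + b - y)^2
        db += 2.0 * err           # d/db of (a*x + b - y)^2
    return da / len(batch), db / len(batch)

def minibatch_gd(data, batch_size=2, lr=0.1, iterations=300):
    a = b = 0.0
    for _ in range(iterations):
        # split the input into disjoint mini-batches X = X_0 ⊎ ... ⊎ X_o
        for i in range(0, len(data), batch_size):
            da, db = gradient(data[i:i + batch_size], a, b)
            a -= lr * da          # w_{t+1} = w_t - γ·∇l(w_t)
            b -= lr * db
    return a, b

# toy data sampled from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
a, b = minibatch_gd(data)
```

Because the toy data lies exactly on a line, the cyclic mini-batch updates converge to the interpolating weights.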
To avoid underestimation of the predicted values, Derakhshan et al. [30] propose root mean squared logarithmic error (RMSLE) on the validation dataset:

RMSLE = √( 1/|X| · Σ (log(m_w(x) + 1) − log(y + 1))² )


Machine learning pipeline in SQL

We argue that SQL offers all components needed for data preprocessing, and recursive tables allow gradient descent to be performed. Thus, we reimplemented the components of a machine learning pipeline (see Fig. 2) proposed by Derakhshan et al. [30] in SQL (see Fig. 3).

Fig. 2
Components of a machine learning pipeline: chunked input will be processed independently (potentially off-loaded to GPUs). After every iteration, the weights (blue) are synchronised.
• The Input parser parses input CSV files and stores the data in chunks using row-major format to allow batched processing for mini-batch gradient descent. In SQL, this corresponds to a simple table scan. In Umbra, we can also use a foreign table as input for continuous views (table taxidata).
• The Feature extractor extracts features from data chunks, which is a simple projection in SQL. For example, day and hour are extracted from timestamps, and distance metrics from given coordinates (view processed).
• The Anomaly detector deletes tuples of a chunk on anomalies. An anomaly occurs when at least one attribute in a tuple passes over or under a predefined threshold. We filter for these user-defined limits in a selection (view normalised).
• The Standard scaler scales all attributes into the range [0, 1] to equalise each attribute's impact on the model; this corresponds again to a projection, with nested subqueries to extract the attributes' extrema (view normalised).
• The Scheduler manages gradient descent iterations until the weights converge. This can be done either with recursive tables or with an operator that off-loads work to GPUs.
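The preprocessing steps are plain projections and selections; the following Python sketch mirrors them outside SQL. The column names (pickup, dist) and the threshold are illustrative, not the paper's schema:

```python
def preprocess(rows, limits):
    """Sketch of feature extraction, anomaly filtering and scaling
    on a list of dicts; names and limits are invented examples."""
    # Feature extractor: a projection deriving new attributes
    extracted = [{"hour": r["pickup"] // 3600 % 24, "dist": r["dist"]}
                 for r in rows]
    # Anomaly detector: drop tuples passing over a predefined threshold
    filtered = [r for r in extracted
                if all(r[k] < limits[k] for k in limits)]
    # Standard scaler: rescale every attribute into [0, 1]
    for key in ("hour", "dist"):
        lo = min(r[key] for r in filtered)
        hi = max(r[key] for r in filtered)
        for r in filtered:
            r[key] = (r[key] - lo) / (hi - lo) if hi > lo else 0.0
    return filtered

rows = [{"pickup": 7200, "dist": 1.0},
        {"pickup": 10800, "dist": 3.0},
        {"pickup": 14400, "dist": 99.0}]   # anomaly: dist over limit
clean = preprocess(rows, {"dist": 50.0})
```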
Listing 2 shows the resulting SQL queries using a taxi dataset as exemplary input and a linear function to predict a trip's duration based on its day, hour, distance and bearing. In this example, we perform 50 iterations of mini-batch gradient descent based on a sample size of ten tuples (tablesample reservoir (10)) and a learning rate of 0.001. In every iteration, we subtract the average gradient from the weights (lines 7-14), which we finally use to compute the loss (lines 15/16). As computing each partial derivative manually can be bothersome and error-prone for complex loss functions, we proceed with an operator for automatic differentiation in the next section.

Fig. 3
Operator plan inside of a database system with linear regression on the New York taxi dataset in relational algebra: a projection extracts features such as the haversine (hav) distance or the bearing (bear); anomalies are deleted using predefined thresholds (denoted as limit).

Database operators for machine learning
This section describes the operators we created in Umbra to facilitate machine learning in SQL. Modern database systems like Umbra generate code for processing chunks of tuples in parallel pipelines, so we first explain code generation before presenting the operators for automatic differentiation and gradient descent. The algorithm for automatic differentiation shows how an expression is parsed to calculate the derivatives; it is extended to matrix operations in Sect. 5.

Code generation
With Umbra as the integration platform, an operator follows the concept of a code-generating database system. It achieves parallelism by starting as many pipelines as threads are available and expects each operator in a query plan to generate code for processing chunks of tuples. Unary operators can process tuples within a pipeline, whereas binary operators have to materialise at least the result of one incoming child node before pipelined processing begins. Each operator of Umbra, similar to HyPer [18], provides two functions, produce() and consume(), to generate code. On the topmost operator of an operator tree, produce() is called, which recursively calls the same method on its child operators. Arriving at a leaf operator, it registers pipelines for parallel execution and calls consume() on the parent node. Within these pipelines, the generated code processes data inside registers without overhead. An operator for gradient descent is a pipeline breaker, as it accesses batches of tuples multiple times until the weights converge, whereas an operator for automatic differentiation is part of a pipeline as it just adds the partial derivatives per tuple.

Automatic differentiation
Reverse mode automatic differentiation first evaluates an expression, then it propagates back the partial derivative in reverse order by applying the chain rule (see Fig. 4). Each partial derivative is the product of the parent one (or 1 for the topmost node) and the derived function with its original arguments as input (see Fig. 5). This allows computing each expression's derivative in one pass by reusing each subexpression. Algorithm 1 parses an arithmetic expression to perform reverse mode automatic differentiation. It uses the evaluation of the original arguments to calculate the partial derivatives. Arriving at a variable, the propagated value is added to a hashtable with the variable as key. The hashtable allows retrieving the derivatives, which is the sum of propagated values per variable, afterwards.
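Algorithm 1's idea can be sketched as follows, assuming expressions are given as nested tuples; the hashtable derivs accumulates the propagated values per variable, and the variable bindings in ENV are example values:

```python
ENV = {"x": 3.0, "w": 2.0}   # example variable bindings

def evaluate(expr):
    """Evaluate a nested-tuple expression such as ("+", ("*", "w", "x"), "x")."""
    if isinstance(expr, str):
        return ENV[expr]
    if isinstance(expr, (int, float)):
        return float(expr)
    op, left, right = expr
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

def backprop(expr, seed, derivs):
    """Reverse mode: propagate the parent derivative 'seed' (1 at the root)
    and sum the propagated values per variable in a hashtable."""
    if isinstance(expr, str):                  # a variable / column reference
        derivs[expr] = derivs.get(expr, 0.0) + seed
        return
    if isinstance(expr, (int, float)):         # constants contribute nothing
        return
    op, left, right = expr
    if op == "+":                              # d(l + r) propagates seed twice
        backprop(left, seed, derivs)
        backprop(right, seed, derivs)
    else:                                      # "*": chain rule with the
        backprop(left, seed * evaluate(right), derivs)   # original arguments
        backprop(right, seed * evaluate(left), derivs)

derivs = {}
backprop(("+", ("*", "w", "x"), "x"), 1.0, derivs)   # derive w*x + x
```

At w = 2 and x = 3, the hashtable afterwards holds d/dw = 3 and d/dx = 2 + 1 = 3, the sum of the values propagated to x.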
As Umbra compiles arithmetic expressions to machine code as well, it is perfectly suited for automatic differentiation. Similar to how an arithmetic SQL expression is compiled during code generation, we created a function that generates the expression's partial derivatives: once a partial derivative has been compiled, its subexpressions are cached inside LLVM registers and can be reused to generate the remaining partial derivatives, which accelerates the runtime during execution. To specify the expression, we integrated lambda functions, as introduced in HyPer, into Umbra. Lambda functions are used to inject user-defined SQL expressions into table operators. Originally developed to parametrise distance metrics in clustering algorithms or edges for the PageRank algorithm, lambda functions are expressed inside SQL queries and allow "variation points" [67] in otherwise inflexible operators. In this way, lambda expressions broaden the application of the default algorithms without the need to modify the database system's core. Furthermore, SQL with lambda functions eliminates the need for a new query language, offers the flexibility and variety of algorithms needed by data scientists, and ensures usability for non-expert database users. In addition, lambda functions allow user-friendly function specification, as the database system automatically deduces the lambda expressions' input and output data types from the previously defined table's attributes. Lambda functions consist of arguments that define names for the tuples (whose scope is operator-specific) and the expression itself. All provided operations on SQL types, even on arrays, are allowed. We expose automatic differentiation as a unary table operator called derivation that derives an SQL expression with respect to every affected column reference and adds its value as a further column to the tuple (see Listing 3).
We can use the operator within a recursive table to perform gradient descent (see Listing 4,5). This eliminates the need to derive complex functions manually and accelerates the computation with a rising number of attributes, as each subexpression is evaluated only once.

Gradient descent operator
Our operator for gradient descent materialises incoming tuples, performs gradient descent and produces the optimal weights for labelling unlabelled data. The proposed operator is considered a pipeline breaker as it needs to materialise all input tuples beforehand to perform multiple training iterations. This section focuses on the operator characteristics, the design with its input queries and the actual implementation, with linear regression as an example.

Operator design
We design an operator for gradient descent, which requires one input for the training data, one for the initial weights and optionally one for the validation set, and which returns the optimal weights. If no input is given as validation set, a fraction of the training set will be used for validation. The user can set the parameters for the batch size, the number of iterations and the learning rate as arguments inside the operator call (see Listing 6). Figure 6 shows gradient descent inside of an operator tree: it expects the training data set as parallel input pipelines and returns the optimal weights. These might serve as input for a query that labels a test data set. In addition, SQL lambda functions, which allow users to inject arbitrary code into operators, specify the loss function to be used for gradient descent. Gradient descent benefits from generated code as it allows user-defined model functions to be derived at compile time to compute the gradient without impairing query runtime.
This implies three parts for the integration of gradient descent: consuming all input tuples in parallel pipelines, performing gradient descent with a call to the GPU kernels and producing the weights in a new pipeline. This first separation is necessary, as we need to know the number of tuples in advance to determine when one training epoch ends. Specific to Umbra, we cannot assume the same number of available threads for training as for the parallel pipelines; we have to merge all materialised tuples before we start new parallel threads for the training iterations afterwards.

Implementation
The generated code runs gradient descent iterations in parallel. Devoted to batched processing on GPUs, we deduce a parallel mini-batch gradient descent operator. First, it materialises the input tuples thread-locally (generated by consume()) and merges them globally. Afterwards, each thread picks one mini-batch and maintains a local copy of the global weights.

Fig. 6
Operator plan inside of a database system with one operator for training and a query for predicting labels. Dashed lines illustrate code generation, solid lines compiled code. The gradient descent operator materialises input from parallel pipelines within local threads, performs iterations and returns the optimal weights.
Algorithm 2 depicts the training procedure without GPU support. Again, for simplicity, the validation phase with the corresponding validation input is omitted. Of the two loops (lines 5-9), one is unrolled during compile time in order to dispatch tasks to parallel threads, and one is executed at runtime to manage gradient descent iterations; inside such a code fragment, we can later off-load work to GPUs. In that case, we start as many CPU threads as GPU units are available, with one CPU thread associated to each GPU.

Neural network
As the previous sections presented gradient descent and automatic differentiation on scalar values, this section expands their use to matrix operations for training a neural network. Mini-batch gradient descent can be used to train a neural network with an adjusted model function. We assume every input with m attributes serialised as one input vector x ∈ ℝ 1×m together with a categorical label y ∈ L . This corresponds to a database tuple as we later store one image as one tuple with one attribute per pixel and colour.
We consider a fully connected neural network with one hidden layer of size h, so we obtain two weight matrices w_xh ∈ ℝ^{m×h} and w_ho ∈ ℝ^{h×|L|}. With sigmoid (Eq. 10) as activation function (applied elementwise), we obtain a model function m_{w_xh,w_ho}(x) ∈ ℝ^{1×|L|} that produces an output vector of probabilities, also known as the forward pass. The output vector is compared to the one-hot-encoded categorical label (y_ones): the index of the entry being one should be equal to the index of the highest probability.
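A minimal Python sketch of this forward pass (toy sizes m = h = |L| = 2; the weight values are arbitrary examples):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def sigmoid(M):
    # activation function applied elementwise
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in M]

def forward(x, w_xh, w_ho):
    """Forward pass: x is 1×m, w_xh is m×h, w_ho is h×|L|,
    the result is a 1×|L| vector of probabilities."""
    hidden = sigmoid(matmul(x, w_xh))
    return sigmoid(matmul(hidden, w_ho))

x = [[0.5, -1.0]]                       # one input tuple, m = 2
w_xh = [[0.1, 0.4], [-0.2, 0.3]]        # m×h
w_ho = [[0.7, -0.5], [0.2, 0.6]]        # h×|L|
probs = forward(x, w_xh, w_ho)
predicted = max(range(len(probs[0])), key=lambda i: probs[0][i])
```

The predicted category is the index of the highest probability, matching the one-hot comparison described above.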
Although neural networks can be specified in SQL-92, the corresponding query consists of nested subqueries that are not intuitive to create. As we have created an array data type in Umbra [54], these nested subqueries can be avoided by using this data type extended by matrix algebra.

Neural network in SQL-92
Expressing neural networks in SQL-92 is possible with one relation for the weights and one for the input tuples (Listing 7). The weights relation contains the values in normal form as a coordinate list. If one relation contains all weight matrices, it also contains an attribute (id) to identify the matrix.
We implement the forward pass in SQL as a query on the validation dataset that returns the probabilities for each category. It uses nested queries to extract the weights by an index and arithmetic expressions for the activation function. Listing 7 shows the forward pass with one single layer and two attributes ( m = 2 ) as input. For simplicity, we use SQL functions for nested subqueries and for the sigmoid function.

Neural network with an array data type
Expressing matrix operations in SQL-92 has the downside of manually specifying each elementwise multiplication. For this reason, Umbra and HyPer provide an array data type that is similar to the one in PostgreSQL and allows algebraic expressions as matrix operations.
In Listing 8, we first construct the weight matrices from their relational representation and apply the sigmoid function on arrays as a user-defined function. Hence, the forward pass for a single layer consists of the matrix multiplication and the sigmoid function on arrays.
An array data type allows transforming a categorical data type L into a one-hot-encoded vector even if the number of categories |L| is not known at compile time. In Listing 9, we first create a dictionary to assign a consistent number to each category (line 3). Afterwards, we can create an array that corresponds to a one-hot-encoded vector, whose entry is one at the corresponding index with leading and subsequent zeros (line 4).
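The same dictionary-then-encode idea, sketched in Python (the labels are invented examples):

```python
def one_hot_dictionary(labels):
    """Assign a consistent index to each category, as the dictionary in
    Listing 9 does, without knowing |L| in advance."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

def one_hot(label, dictionary):
    """Vector whose entry is one at the category's index,
    with leading and subsequent zeros."""
    vec = [0] * len(dictionary)
    vec[dictionary[label]] = 1
    return vec

labels = ["car", "bike", "bus", "bike"]
d = one_hot_dictionary(labels)       # {'bike': 0, 'bus': 1, 'car': 2}
encoded = one_hot("bus", d)          # [0, 1, 0]
```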

Training a neural network
Automatic differentiation allows training a neural network once the derivatives of matrix multiplication [68] (see Algorithm 3) and of the activation functions (see Table 1) are implemented. To integrate the derivatives of matrix multiplication into our existing implementation, we need to extend the derivation rule for multiplications (Z′ ⋅ Yᵀ instead of z′ ⋅ y and Z′ᵀ ⋅ X instead of z′ ⋅ x) and overload the transpose operator internally, so that transpose is ignored when called on a scalar such as a floating-point value. Furthermore, we need the Hadamard product X • Y (elementwise matrix multiplication) for the derivatives of the activation functions and the Hadamard power function X^•Y (elementwise exponentiation) for mean squared error.

Table 1
Implemented activation functions and their derivatives (e.g., the hyperbolic tangent tanh).
These adaptations allow our operator for automatic differentiation to train the weights for a neural network (m = 4, h = 20, |L| = 3) within a recursive table (see Listing 10): we first initialise the weight matrices with random values (line 2) and then update the weights in each iterative step (line 4) using mean squared error. Instead of relying on an operator for automatic differentiation, we can also train a neural network by hand by applying the rules of automatic differentiation. With mean squared error, the loss is equal to the difference of labels and predicted probabilities (Eq. 12); the factor 2 can be omitted when the learning rate is doubled. To train the neural network for a given input vector, we have to backpropagate the loss and update the weights accordingly. Figure 7 shows the corresponding computational graph and Listing 11 the corresponding code in SQL when the Hadamard product (•) is exposed as the SQL operation (**).
To process a batch of input tuples instead of a single one, the multiplications are applied to matrices instead of vectors. The multiplications of the last steps, the weight updates (Eqs. 16, 17), then sum up the delta for every tuple, so a simple division by the number of tuples is needed to construct the average gradient. This is helpful when vectorising mini-batch gradient descent in Sect. 6.1.2.
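The backward pass described above can be sketched in Python for a single tuple. This is an illustration of the update rules, not the generated code; matrix sizes and weight values are toy examples:

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    # Hadamard product: elementwise matrix multiplication
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sigmoid(M):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in M]

def train_step(x, y, w_xh, w_ho, lr):
    """One forward and backward pass for a single tuple: the loss is the
    difference of prediction and one-hot label (factor 2 folded into lr),
    and the sigmoid derivative enters via the Hadamard product."""
    h = sigmoid(matmul(x, w_xh))                       # hidden layer, 1×h
    o = sigmoid(matmul(h, w_ho))                       # output layer, 1×|L|
    delta_o = hadamard([[oi - yi for oi, yi in zip(o[0], y[0])]],
                       [[oi * (1.0 - oi) for oi in o[0]]])
    grad_ho = matmul(transpose(h), delta_o)            # gradient w.r.t. w_ho
    delta_h = hadamard(matmul(delta_o, transpose(w_ho)),   # back through w_ho
                       [[hi * (1.0 - hi) for hi in h[0]]])
    grad_xh = matmul(transpose(x), delta_h)            # gradient w.r.t. w_xh
    w_xh = [[w - lr * g for w, g in zip(rw, rg)]
            for rw, rg in zip(w_xh, grad_xh)]
    w_ho = [[w - lr * g for w, g in zip(rw, rg)]
            for rw, rg in zip(w_ho, grad_ho)]
    return w_xh, w_ho

x, y = [[1.0, 0.5]], [[1.0, 0.0]]          # one tuple with its one-hot label
w_xh = [[0.1, 0.2], [0.3, 0.4]]
w_ho = [[0.5, -0.4], [-0.3, 0.2]]
new_xh, new_ho = train_step(x, y, w_xh, w_ho, lr=0.5)
```

A single step with a moderate learning rate decreases the squared error for this tuple.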

Multi-GPU gradient descent
This section explains our CUDA kernels for linear regression and neural networks, which one CPU worker thread starts once per GPU. We describe blocking and non-blocking algorithms so as not to hinder faster GPUs from continuing their computation while waiting for slower ones to finish. To synchronise multiple workers, we either average the gradient after each iteration or maintain local models as proposed by Crossbow [69]. We adapt this synchronisation technique to maximise the throughput of a single GPU as well. As a novelty, we implement learners at hardware level, each associated with one CUDA block, to maximise the throughput on a single GPU. Finally, we generate the kernels directly with LLVM to support lambda functions for model specification.

Kernel implementations
Developing code for NVIDIA GPUs requires another programming paradigm, as computation is vectorised over parallel threads that perform the same instructions simultaneously. Each GPU device owns one global memory (device memory) and an L2 cache. Core components are streaming multiprocessors with attached shared memory (see Fig. 8) that execute specialised programs for CUDA devices (kernels). In these kernels, every thread receives a unique identifier, which is used to determine the data to process. 32 threads in a bundle are called a warp; multiple warps form a block, and threads inside a block can communicate through shared memory and thus interact with each other. To interact with other threads within a warp, shuffle instructions can be used to share or broadcast values to one or more threads.
To off-load gradient descent iterations to NVIDIA GPUs, we generate specialised kernels. In detail, we have to map batches to blocks; we can vary the number of threads per block and explicitly cache values in shared memory. In the following sections, we describe our kernel implementations for gradient descent with linear regression and a fully-connected neural network, and we will introduce independent learners at block-size granularity.

Linear regression
As linear regression is not a compute-bound but a memory-intensive application, we initially transfer as much training data into device memory as possible. If data exceeds the memory and more GPUs are available for training, we will partition the data proportionally to multiple devices. Otherwise, we reload the mini-batches on demand.
Each thread handles one input tuple and stores the resulting gradient after each iteration in shared memory. Each iteration utilises all available GPU threads, which is why the size of a mini-batch must be greater than or equal to the number of threads per block to ensure that compute resources are not wasted. When the batch size is larger than a block, each thread processes multiple tuples and maintains a thread-local intermediate result, which does not require any synchronisation with other threads. After a mini-batch is processed, shuffle operations summarise the gradient to compute the average for updating the weights (tree reduction).
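The shuffle-based tree reduction can be illustrated in Python, simulating one warp whose lanes act in lockstep (assuming a power-of-two lane count):

```python
def tree_reduce(values):
    """Tree reduction as done with warp shuffles: in each step, lane i adds
    the value held by lane i + offset; the offset halves until lane 0 holds
    the total, taking log2(n) steps instead of n - 1 sequential additions."""
    lanes = list(values)
    offset = len(lanes) // 2
    while offset > 0:
        for i in range(offset):       # all lanes execute in lockstep
            lanes[i] += lanes[i + offset]
        offset //= 2
    return lanes[0]

# 32 threads of one warp, each holding a thread-local partial gradient
partials = [float(i) for i in range(32)]
total = tree_reduce(partials)
average_gradient = total / len(partials)
```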

Neural network
Our initial approach was to adapt the gradient descent kernel for linear regression to train a neural network and to spread each tuple of a batch to one thread. As training neural networks is based on matrix operations, we rely on libraries for basic linear algebra subprograms for CUDA devices (cuBLAS²), which provide highly optimised implementations. Our implementation uses the cuBLAS API for all operations on matrices or vectors. For example, the forward pass in a neural network uses matrix-vector multiplications (cublasDger()) for a single input tuple and matrix-matrix multiplications (cublasDgemm()) when a mini-batch is processed. To apply and derive the activation function, handwritten kernels are used that vectorise over the number of attributes. These kernels, together with the library calls and handwritten code, form the foundation for parallelising to multiple GPUs.

Multiple learners per GPU
To utilise all GPU threads even with small batch sizes, we implement multiple workers on a single GPU. These are called learners [69] and ensure a higher throughput. Crossbow offers a coarse-grained approach in which every learner launches multiple kernels, which limits their overall number. By contrast, our light-weight approach launches only one instance of a fine-grained kernel for the entire GPU. This enables full utilisation of the GPU, as the number of learners can be much higher.
In our implementation (see Fig. 9), each learner corresponds to one GPU block. We can set the block size adaptively, which determines the number of learners. Consequently, one learner works on batch sizes of at least one warp, that is, the minimum block size of 32 threads, or multiple warps. Hence, the maximum number of learners equals the number of warps that can be processed per GPU.
After each learner has finished its assigned batches, the first block synchronises with the other ones to update the global weights. For multi-GPU processing as well as for multiple learners per GPU, we need to synchronise the units.

Synchronisation methods
As we intend to run gradient descent in parallel on heterogeneous hardware, we have to synchronise parallel gradient descent iterations. Based on a single-threaded naive gradient descent implementation, we propose novel synchronisation methods to compare their performance to existing ones and benchmark different hardware.
The naive implementation uses a constant fraction of its input data for validation and the rest for training. The training dataset is split into fixed-size mini-batches. After a specified number of mini-batches, but no later than after one epoch (when the whole training set has been processed once), the loss function is evaluated on the validation set with the current weights. The minimal loss is updated and the next epoch starts. We terminate when the loss falls below a threshold l_stop or after a maximum number of processed batches ctr_max. We also terminate if the loss has not changed within the last 10 iterations.

Fig. 9
Multiple learners per GPU: each block corresponds to one learner; each learner maintains local weights w_local and the difference to the global weights w. Each input tuple is stored in device memory and is scheduled to one GPU thread.

Based on the naive implementation, this section presents three parallelisation techniques: a blocking but synchronised one, and two using worker threads with multiple local models or only one global model.
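A sketch of the naive implementation's control loop in Python; step and validate are hypothetical callbacks standing in for one gradient descent update and the validation pass:

```python
def naive_gradient_descent(batches, step, validate, l_stop=1e-3, ctr_max=1000):
    """Control loop of the naive implementation: evaluate the loss once per
    epoch; stop when it falls below l_stop, when ctr_max batches have been
    processed, or when the loss has not changed for the last 10 evaluations."""
    history, ctr = [], 0
    min_loss = float("inf")
    while ctr < ctr_max:
        for batch in batches:         # one epoch over all mini-batches
            step(batch)               # one gradient descent update
            ctr += 1
        loss = validate()
        min_loss = min(min_loss, loss)
        history.append(loss)
        if loss < l_stop:
            break
        if len(history) >= 10 and len(set(history[-10:])) == 1:
            break                     # loss unchanged for 10 evaluations
    return min_loss, ctr

# toy model: a single weight w pulled towards 3 by its gradient
state = {"w": 0.0}
def step(batch): state["w"] -= 0.2 * 2.0 * (state["w"] - 3.0)
def validate(): return (state["w"] - 3.0) ** 2
loss, ctr = naive_gradient_descent([None] * 4, step, validate)
```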

Synchronised iterations
At the beginning of each synchronised iteration, we propagate the same weights with an individual mini-batch to the processing unit. After running gradient descent, the main worker collects the calculated gradients and takes their average to update the weights.
Algorithm 4 shows this gradient descent function, taking as input the labelled dataset X, a learning rate γ, the batch size n and the parameter ctr_max that limits the number of iterations. In each iteration, multiple parallel workers pick a mini-batch and return the locally computed gradient; afterwards, the weights are updated. For simplicity, the validation pass is not displayed: when the calculated loss has improved, the minimal weights together with the minimal loss are stored, and the computation terminates when a minimal loss l_min has been reached.
When synchronising a global model after each iteration, workers that have finished their mini-batches earlier are idle and wait for input (see Fig. 10a). To maximise the throughput, independent workers have to fetch their mini-batches on their own. These independent workers either require local weights to be synchronised frequently (see Fig. 10c) or update global weights centrally (see Fig. 10b).

Worker threads with global updates (bulk synchronous parallel)
In Algorithm 5, worker threads fetch the next batch independently and update a global model. Each worker increments a global atomic counter as a batch identifier and selects the corresponding batch consisting of the attributes and the labels. The current weights are used to compute the gradient; afterwards, the weights are updated globally. Besides, the first thread is responsible for managing the minimal weights. Assuming a low learning rate, we suppose the weights change only marginally and omit locks, similar to HogWild [70]. Otherwise, the critical section (line 5), gradient calculation and weight update, would have to be locked, which would result in a single-threaded process as in Algorithm 4.
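Algorithm 5's lock-free scheme can be sketched with Python threads; itertools.count stands in for the global atomic batch counter, and the linear model and data are toy values. In CPython the unsynchronised updates interleave without corrupting individual operations, which loosely mirrors the low-learning-rate assumption of HogWild:

```python
import itertools
import threading

def hogwild_train(batches, weights, lr=0.05, workers=4, passes=500):
    """Worker threads fetch batches via a shared counter (standing in for
    the global atomic batch identifier) and update the global weights
    without locks, assuming a low learning rate keeps conflicts marginal."""
    counter = itertools.count()
    total = len(batches) * passes

    def worker():
        while True:
            i = next(counter)                 # pick the next batch id
            if i >= total:
                return
            x, y = batches[i % len(batches)]
            grad = 2.0 * ((weights[0] * x + weights[1]) - y)
            weights[0] -= lr * grad * x       # unsynchronised global update
            weights[1] -= lr * grad

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights

# learn y = 2x + 1 from four single-tuple batches
batches = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0), (2.0, 5.0)]
w = hogwild_train(batches, [0.0, 0.0])
```

Despite the races, the weights converge close to the exact solution because conflicting updates vanish as the gradient approaches zero.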

Worker threads with local models (model average)
To overcome race conditions when updating the weights, we adapt the local models known from Crossbow [69] to work with worker threads. Crossbow adjusts the number of parallel so-called learners adaptively to fully utilise the throughput on different GPUs. Each learner maintains its local weights and the difference from the global weights. A global vector variable for every learner t, called corrections_t, stores this difference divided by the number of all learners. In each iteration, the weights are updated locally and these corrections are subtracted. After each iteration, the corrections of all learners are summed up to form the global weights. Algorithm 6 shows the adaption for worker threads. The main thread manages the update of the global model (line 11), that is, the summation of all corrections. The critical section now consists of the computation of the corrections only (lines 7-9), so the gradient can be computed on multiple units in parallel.

Fig. 10
Scheduling mini-batches on four different workers: (a) shows worker threads whose weights are synchronised globally after each iteration and whose averaged gradient is used to update the global weights w; workers are idle when others have not yet finished. (b) shows worker threads that update weights globally without any synchronisation; each worker is responsible for fetching the next batch on its own. To overcome race conditions, the worker threads in (c) maintain their local models, which are synchronised lazily when every worker is done with at least one iteration.
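A sequential Python sketch of this local-model scheme (the real implementation runs the learners in parallel); the linear model, data and rates are toy values:

```python
def train_with_local_models(batches, k=4, lr=0.05, iterations=1000):
    """Model averaging sketch: each of k learners keeps local weights; a
    correction (local - global)/k is subtracted during the local update,
    and all corrections are summed into the global weights afterwards."""
    w_global = [0.0, 0.0]
    w_local = [[0.0, 0.0] for _ in range(k)]
    for it in range(iterations):
        corrections = []
        for t in range(k):                     # each learner: own mini-batch
            x, y = batches[(it + t) % len(batches)]
            grad = 2.0 * ((w_local[t][0] * x + w_local[t][1]) - y)
            g = [grad * x, grad]
            c = [(w_local[t][j] - w_global[j]) / k for j in range(2)]
            for j in range(2):                 # local update minus correction
                w_local[t][j] -= lr * g[j] + c[j]
            corrections.append(c)
        for j in range(2):                     # main thread sums corrections
            w_global[j] += sum(c[j] for c in corrections)
    return w_global

# learn y = 2x + 1; the learners agree on the interpolating weights
batches = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0), (2.0, 5.0)]
w = train_with_local_models(batches)
```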

JIT compiling to GPU
The usual way to use the CUDA interface is to write CUDA code, that is, C++ with additional language elements to support kernel declarations. The compiled CUDA code can be invoked from the host as a special function invocation through the CUDA API. With a just-in-time architecture that compiles the GPU code, one can keep the advantages of modularisation while also allowing more optimisations to take place at compile time. Similar to gradient computation on the CPU, the lambda function can be passed directly to customised model-specific kernels, as the gradient of a user-defined model function is generated at compile time without impairing query runtime.

LLVM [71] is a framework that allows for just-in-time compilation via an intermediate representation in Static Single Assignment (SSA) form also called LLVM IR.
There are two ways of using LLVM to generate code for GPUs (see Fig. 11). One way is to use the API that NVIDIA provides, called libNVVM.3 libNVVM takes NVVM IR,4 which is based on LLVM IR, and compiles it to PTX,5 an assembly language for NVIDIA GPUs, which can subsequently be run on the GPU. A disadvantage of this approach is that the LLVM version NVVM IR is based on is older than the official one: the current NVVM IR is based on LLVM 7.0.1, while the latest LLVM version as of now is 12.0.1.6 The alternative is using LLVM directly to generate LLVM IR. Using the NVPTX7 backend, we can generate PTX, which can be run on the GPU. This is the approach we take, as it lets us reuse the Umbra IR to LLVM IR backend as a high-level API for our code generation.

Fig. 11 Different existing paths of code generation for NVIDIA GPUs and how the translation from one layer to the next happens. cubin is the binary format loadable by the NVIDIA driver

3 https://docs.nvidia.com/cuda/libnvvm-api/index.html
4 https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html
5 https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
6 https://lists.llvm.org/pipermail/llvm-announce/2021-July/000093.html
7 https://llvm.org/docs/NVPTXUsage.html

LLVM NVIDIA specifics
Using LLVM for NVIDIA GPU code generation introduces some specifics. In C++ we have a special keyword __shared__ to mark memory as shared, while the NVPTX backend uses a separate address space, as can be seen in Fig. 12.
Furthermore, there are functions in LLVM IR specific to CUDA. For example, in C++ the thread id in one dimension is available as a special built-in variable called threadIdx.x, while in LLVM IR the thread id can be retrieved via a function called @llvm.nvvm.read.ptx.sreg.tid.x, which is subsequently mapped onto a special register called %tid.x in PTX.

CUDA fast math optimizations
Previously, we only looked at optimisations that happen at the LLVM IR level. In a second step, we can additionally consider how to optimise the PTX generated from our LLVM IR. Normally, the optimiser may neither reorder floating-point operations nor replace them with different ones. However, we can relax these rules and thus allow the optimiser to reorder and replace floating-point operations.
To allow this relaxation, LLVM IR provides fast math flags that specify how the optimiser may modify the order of floating-point operations. The two most interesting flags are reassoc and contract. reassoc means that floating-point operations can be treated as associative; this is already part of LLVM IR, since operations are reordered by the optimiser. contract allows combining multiple floating-point operations. There are no combined floating-point operations in LLVM IR, so setting this flag merely informs the specific backend behind LLVM (in our case, NVPTX) that this optimisation is allowed. Figure 14a shows the PTX code generated from the LLVM IR shown in Fig. 13 without any fast math flags enabled. Highlighted in blue are all the floating-point calculations that roughly match their LLVM IR equivalents. For comparison, Fig. 14b shows the resulting PTX from Fig. 13 when the contract fast math flag is set. We see that four floating-point operations, two multiplies (mul.rn.f64) and two adds (add.rn.f64), have been fused into two fused-multiply-and-add (fma.rn.f64) operations.

Fig. 12 Comparison of the declaration of shared memory in C++ and LLVM IR. addrspace(3) indicates the pointer as pointing to memory in address space 3, which the NVPTX backend reserves for shared memory
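Why reassociation is opt-in can be demonstrated in a few lines: floating-point addition is not associative, so reordering a sum can change its result. The following sketch (in Python, but the effect is identical for IEEE 754 doubles in any language) exploits the fact that at 1e16 the spacing between adjacent doubles is 2:

```python
# Floating-point addition is not associative, which is why an optimiser may
# only reassociate it when a flag such as LLVM's `reassoc` permits it.
# At 1e16 the distance between adjacent doubles is 2, so each +1.0 rounds
# away in one evaluation order but survives in the other.
left_to_right = (1e16 + 1.0) + 1.0   # each +1.0 is rounded away
reassociated = 1e16 + (1.0 + 1.0)    # +2.0 is exactly representable
```

The two expressions differ by a full unit in the last place, so a backend must not perform this rewrite unless explicitly allowed to.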

CUB
Another feature used by our kernels is the library CUB. CUB provides primitives to simplify block-wide synchronisations, i.e., to synchronise all threads currently running in a kernel. This can be used to efficiently execute tasks using all currently available threads; CUB facilitates this by offering primitives that utilise some or all available threads. CUB is a header-only library, so there is no code to link. To use the features of CUB, for example, to sum up all the corrections in our sum_corrections kernel, we create a pre-compiled LLVM module that makes this primitive available to our kernel.

WMMA
Modern NVIDIA GPUs provide specialised hardware for matrix multiplication in the form of dedicated tensor cores. We can use tensor cores for array operations within gradient descent while still allowing for just-in-time compilation. WMMA is a CUDA API for accessing the tensor cores, available since the Volta architecture. Instead of allowing arbitrary matrix computations, WMMA only supports matrix-matrix multiplication of specific sizes depending on the type. Table 2 shows some of the supported dimensions for the different floating-point types available in CUDA. The API works by loading data into matrix fragments. These fragments can be seen as special registers that hold the data for the tensor cores. After loading data into those fragments, we can execute the multiplication and store the result from a fragment back into memory. The WMMA instructions block all the threads in a warp, so we can have one matrix multiplication per warp at a time, and hence one matrix multiplication per block, as we use the warp size as block size.
For our simple example by Derakhshan et al. [30] with four weights, we can calculate the model or its derivative for more than one tuple by loading the data in the appropriate shape into a fragment. Examples with bigger dimensions can also be mapped onto WMMA because matrix multiplication is generally decomposable, for example, via Strassen's algorithm [72].
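The packing idea can be sketched with NumPy: a linear model with four weights is zero-padded into one fixed WMMA tile shape so a single tile multiplication scores a whole batch of tuples. The tile shape and toy values are illustrative assumptions; the actual WMMA call (wmma::mma_sync) operates on fragments rather than arrays.

```python
import numpy as np

# Hedged sketch: pack a 4-weight linear model into a 16x16x16 tile (one of
# the WMMA-supported shapes) so one tensor-core multiplication evaluates
# the model for 16 tuples at once.
M = K = N = 16                            # WMMA tile dimensions
weights = np.arange(1.0, 5.0)             # 4 toy model weights
tuples = np.ones((M, 4))                  # 16 input tuples, 4 features each

a = np.zeros((M, K)); a[:, :4] = tuples   # features, zero-padded to width 16
b = np.zeros((K, N)); b[:4, 0] = weights  # weights in the first column
c = a @ b                                 # stands in for one WMMA tile op
predictions = c[:, 0]                     # one model evaluation per row
```

The zero padding wastes arithmetic but keeps the computation in the only shapes the tensor cores accept, which is exactly the rearrangement discussed for the taxi model in Sect. 6.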

Evaluation
This section presents the evaluation of the operators in Umbra, in particular the performance increase through automatic differentiation and code generation for GPU, and the performance of our fine-grained learners at hardware level.

Automatic differentiation in umbra
Using synthetic data, we first compare three CPU-only approaches to computing batch gradient descent (the batch size corresponds to the number of tuples) on a linear model within SQL: recursive tables with either manually or automatically derived gradients, and a dedicated (single-threaded) operator. All experiments were run multi-threaded on a 20-core Ubuntu 20.04.01 machine (Intel Xeon E5-2660 v2 CPU) with hyper-threading, running at 2.20 GHz with 256 GB DDR4 RAM. Figure 15 shows the compilation and execution time depending on the number of involved attributes. As we see, automatically deriving the partial derivatives speeds up compilation time, as fewer expressions have to be compiled, as well as execution time, as subexpressions are cached in registers for reuse. This performance benefit is also visible when the batch size, the number of iterations or the number of threads is varied (see Figs. 16, 17). Furthermore, we observe that the approach using recursive tables computes aggregations in parallel, which accelerates computation on many input tuples with each additional thread.
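For reference, the computation being benchmarked here, batch gradient descent on a linear model with mean squared error, can be sketched as follows. The synthetic data, learning rate and iteration count are our own toy choices, not the paper's experimental parameters:

```python
import numpy as np

# Minimal sketch of batch gradient descent on a linear model with MSE loss;
# the batch is the full synthetic dataset, as in the benchmarked setting.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))            # 1,000 tuples, 2 attributes
true_w = np.array([2.0, -1.0])
y = X @ true_w                            # noise-free labels for the toy run

w = np.zeros(2)
for _ in range(100):
    grad = 2.0 / len(X) * X.T @ (X @ w - y)   # gradient of mean((Xw - y)^2)
    w -= 0.1 * grad                           # constant learning rate
```

In the SQL formulation, each iteration of this loop corresponds to one step of the recursive table, with the gradient expressed as an aggregation over all tuples.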
The implemented derivation rules for automatic differentiation also allow deriving the gradient of arbitrary loss functions. We therefore tested two different loss functions for linear regression, namely mean squared error (MSE, Eq. 2) and mean squared logarithmic error (RMSLE, Eq. 9). Figure 18 shows that the choice of the loss function does not significantly influence the runtime.
Using a recursive table, we can also solve logistic regression in SQL (see Listing 12). For the sigmoid function (Eq. 18), we know its derivative (see Table 1), which we can use for numeric differentiation. The evaluation in Fig. 19 shows similar results to linear regression, although function calls for the sigmoid function slowed down the runtime, especially when the gradient was derived manually.
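The closed-form derivative referred to here, sigma'(x) = sigma(x) * (1 - sigma(x)), can be checked against a central finite difference, the textbook form of numeric differentiation:

```python
import math

# The sigmoid and its closed-form derivative (see Table 1),
# validated against a central finite difference.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # sigma'(x) = sigma(x) * (1 - sigma(x))

h = 1e-6
numeric = (sigmoid(0.5 + h) - sigmoid(0.5 - h)) / (2 * h)
```

Reusing the already-computed sigmoid value is what makes this derivative cheap; the slowdown observed for the manually derived gradient stems from evaluating the sigmoid repeatedly instead.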
To benchmark training a neural network, we use Fisher's Iris flower dataset [73] and a fully connected, two-layered neural network. In this experiment, we focus on the performance characteristics of the operator for automatic differentiation embedded in relational algebra rather than on the quality of different models. Figure 20 shows the compilation and execution time when training the neural network depending on the size of the hidden layer. As the weight matrices are created at runtime, the size of the hidden layer does not affect the compile time. The runtime, in contrast, directly depends on the number of operations involved in the matrix multiplications and thus increases with the size of the matrices. For the manually derived gradients we compare against, we applied the backpropagation rules by hand, so fewer operations were executed than with automatic differentiation, which currently computes the derivative for every variable involved. Nevertheless, both approaches show similar performance when varying the size of the hidden layer, the batch size, the number of iterations and threads (see Fig. 21), and automatic differentiation eliminates the need for manually derived gradients and nested subqueries. Figure 22 displays further experiments, such as varying the number of hidden layers and the training time per epoch (i.e., until all tuples have been processed once) depending on the batch size, using the MNIST dataset. As expected, the runtime increases with additional layers and the training time for one epoch decreases with a higher batch size. We discuss the influence of the batch size on statistical efficiency in Sect. 7.

Linear regression (GPU)
We benchmark linear regression with synthetic data and the New York taxi dataset, and a feed-forward neural network with a single hidden layer for image recognition (see Table 3). We take 2.65 GiB of the New York taxi dataset (January 2015), on which we perform linear regression to forecast the taxi trip duration from the trip's distance and bearing, and the day and hour of the ride's beginning (four attributes). We tested on servers with four Intel Xeon Gold 5120 processors, each with 14 CPUs (2.20 GHz), running Ubuntu 20.04.01 LTS with 256 GiB RAM. Each server is connected to either four GPUs (NVIDIA GeForce GTX 1080 Ti/RTX 2080 Ti) or one NVIDIA Tesla V100-PCIE-32GB.
We measure the performance and the quality of the different parallelisation models on the CPU as well as the GPU according to the following metrics:
1. Throughput measures the volume of processed tuples per unit of time. It includes tuples used for training as well as for validation.
2. Time-to-loss, similar to time-to-accuracy [74] for classification problems, measures the minimal loss reached on the validation dataset depending on the computation time.
3. Tuples-to-loss describes how many tuples are needed to reach a certain minimal loss. In comparison to time-to-loss, it is agnostic to the hardware throughput and measures the quality of parallelisation and synchronisation methods.
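The three metrics can be computed from a training trace of (elapsed seconds, tuples processed, validation loss) records. The trace values and helper names below are illustrative, not measurements from the paper:

```python
# Hypothetical training trace: (elapsed_seconds, tuples_processed, val_loss)
trace = [(1, 10_000, 0.90), (2, 20_000, 0.40), (3, 30_000, 0.12), (4, 40_000, 0.05)]

def throughput(trace):
    # tuples processed per second over the whole run
    t, n, _ = trace[-1]
    return n / t

def time_to_loss(trace, target):
    # seconds until the validation loss first drops to the target
    return min(t for t, _, loss in trace if loss <= target)

def tuples_to_loss(trace, target):
    # tuples consumed until the validation loss first drops to the target
    return min(n for _, n, loss in trace if loss <= target)
```

Tuples-to-loss deliberately ignores wall-clock time, so two configurations with different hardware throughput can still be compared on statistical efficiency alone.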
We perform gradient descent with a constant learning rate of 0.5 to obtain the optimal weights. At a predefined validation frequency, every 3,000 batches, the current loss is computed on a constant validation set of 20 % of the size of the original one. We vary the hyper-parameters of our implementation, i.e., the batch size and the number of workers. A thread records the current state every second to gather loss metrics.

Throughput vs. statistical efficiency
To measure the throughput of linear regression on different hardware, we consider batch sizes of up to 100 MiB. We compare the performance of our kernels to that of stochastic gradient descent called from the TensorFlow (version 1.15.0) library (see Fig. 23).
The higher the batch size, the better the throughput when running gradient descent on GPUs, as all concurrent threads can be utilised. Depending on the hardware, the maximal throughput converges to either 150 GiB/s (GeForce GTX 1080 Ti), 250 GiB/s (GeForce RTX 2080 Ti) or more than 300 GiB/s (Tesla V100). As they were developed for batched processing, our dedicated kernels (see Fig. 23a) can exploit the available hardware more effectively than the Python implementation (see Fig. 23b). As the latter calls stochastic gradient descent, it excels on better hardware only when a larger model has to be trained.
Nevertheless, training with large batch sizes does not imply statistical efficiency in terms of the volume of processed data that is needed for convergence (see Fig. 24a) and the lowest reachable minimal loss (see Fig. 24b). For that reason, to allow the highest throughput even for small batch sizes, we implement multiple learners per GPU.

Multiple learners per GPU
As the GPU is only fully utilised when the number of concurrently processed tuples is greater than or equal to the number of parallel GPU threads, we benchmark multiple learners per GPU. As each learner corresponds to one GPU block consisting of a multiple of 32 threads, our implementation allows the highest throughput for every batch size that is a multiple of the block size. We therefore vary the number of threads per block (equal to one learner) between 32 and 1,024 and measure the throughput depending on the batch size in multiples of 32 tuples.
The observation in Fig. 25 matches the expectation that a small number of threads per learner allows a higher throughput for small batch sizes. When the batch size is a multiple of the chosen number of threads, the throughput reaches a local maximum; otherwise, the GPU is underutilised. As we increase the batch size by 32 tuples, these local maxima are visible as spikes in all curves except for 32 threads per block. Nevertheless, on all devices, the throughput soon converges to the possible maximum, which shows the efficiency of learners at the granularity of GPU warps.
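The spike pattern follows from a simple utilisation model: a batch occupies ceil(batch / threads-per-block) blocks, so some threads idle unless the batch size is an exact multiple of the block size. A sketch of this model (our simplification, ignoring scheduling and memory effects):

```python
import math

# Toy utilisation model for one learner with `tpb` threads per block:
# a batch of `batch_size` tuples occupies ceil(batch_size / tpb) blocks,
# and only batch sizes that are multiples of tpb keep every thread busy.
def utilisation(batch_size, tpb):
    blocks = math.ceil(batch_size / tpb)
    return batch_size / (blocks * tpb)
```

For 32 threads per block every batch size that is a multiple of 32 yields full utilisation, whereas for 1,024 threads per block small batches leave most of the block idle, which is why small learners win at small batch sizes.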

Scalability
When running gradient descent in parallel, we benchmark four implementations for synchronising the weights: no synchronisation with global updates (global updates), maintaining local models either with locking of the critical section (local models (locks)) or without locking (local models (dirty)), and synchronised updates that block until every worker has finished (synchronised (blocking)). We ran the experiments on the CPU as well as the GPU.
When parallelising on the CPU, each additional thread allows a linear speed-up when no synchronisation takes place (see Fig. 26a). Maintaining local models costs additional synchronisation time. Parallelising for GPUs behaves differently (see Fig. 26b/c): the larger the batch size, the higher the scale-up. This is expected, as less synchronisation is necessary for larger batch sizes and the parallel workers can compute the gradients independently. On GPUs as well, the implementation without any synchronisation and with global updates scales best, even though not as linearly as on CPUs. In all implementations, one additional GPU allows a noticeably higher throughput. Maintaining local models requires inter-GPU communication of the local corrections to form the global weights, which decreases the performance significantly from the third additional device on. To minimise this effect, the weight computation could be split up hierarchically.

Neural network (GPU)
To benchmark the feed-forward neural network, we perform image classification using the MNIST and Fashion-MNIST [75] datasets. We train the neural network with one hidden layer of size 200 to recognise a written digit (Fashion-MNIST: a piece of clothing) given as a single tuple representing an image with 784 pixels. We take 0.025 as the learning rate, perform a validation pass every epoch and measure the throughput and the time to reach a certain accuracy (with the loss defined as the number of incorrectly classified tuples).

Throughput vs. statistical efficiency
Even though stochastic gradient descent using Keras (version 2.2.4) with TensorFlow achieves a higher bandwidth than for linear regression due to more attributes per tuple (see Fig. 27b), our implementations, which call the cuBLAS library, process tuples batch-wise, which results in an even higher bandwidth. As training a neural network is compute-bound, involving multiple matrix multiplications, the throughput is significantly lower than for linear regression (see Fig. 27a), but increases with the batch size.
As is the case for linear regression, training models with small batch sizes results in a higher accuracy (see Fig. 28b). This once again makes the case for multiple learners per single GPU. Nevertheless, the larger the chosen batch size is, the faster training iterations converge (see Fig. 28a).

Scalability
The scalability of parallel workers computing backpropagation resembles that of training linear regression on GPUs: one additional worker increases the throughput, but for any further workers, inter-GPU communication decreases the throughput (see Fig. 29). For small batch sizes, training on two GPU devices performs best, while for larger batch sizes, every additional device allows a higher throughput.

Time/tuples-to-loss
Regarding the time to reach a certain accuracy (see Fig. 30), all implementations perform similarly when running on a single worker. As training on the MNIST dataset converges fast, adding a GPU device for computation has no significant impact. As the Fashion-MNIST dataset converges more slowly, the higher throughput when training with an additional worker results in the minimal loss being reached faster. We train with a small batch size as it allows faster convergence. Hereby, a scale-up is only measurable when training with up to two devices.

Integration into umbra
We evaluate the integration of GPU code generation into Umbra in two ways: code generation with multiple learners, and the performance increase for matrix multiplication using WMMA.

Code-generation: multiple learners
To investigate multiple learners per GPU, we compare the implementation generated from Umbra IR against a CUDA C++ implementation that trains the taxi model. In this way, we confirm that we reach similar performance when using code generation with Umbra IR. Figure 31 compares three different versions of the Crossbow implementation, with 3,328 workers, 32 threads per block, one batch per worker and 25,199,507 tuples on a single NVIDIA GTX 970. The CUDA C++ implementation corresponds to the Crossbow approach but is prebuilt in C++ via CUDA. The naive implementation is a one-to-one translation of the CUDA C++ kernels into Umbra IR. Finally, the optimised version builds upon the naive version by integrating optimisations such as loop unrolling. For testing, we focus on the training part of Crossbow. As we can see, our optimised solution comes very close to the CUDA C++ implementation.

WMMA
We want to evaluate whether the matrix multiplication of WMMA is more suitable than cuBLAS for training. Figure 32 shows a comparison of the throughput for the taxi model matrix multiplication using WMMA and cuBLAS with __half as the data type on an NVIDIA GeForce RTX 3050 Ti. The cuBLAS performance is worse than with doubles, which might be explained by cuBLAS not working well with the __half type in this case. For the evaluation of the taxi model, we had to arrange the matrix in a specific way to maximise the number of tuples evaluated per tensor core invocation. Despite this rearrangement, we achieve a much higher throughput using WMMA.

Finally, we compare an end-to-end machine learning pipeline using Keras to a corresponding operator tree within the database system Umbra, either using a recursive table with a manually or automatically derived gradient or using an operator that off-loads to GPU. The pipeline consists of data loading from CSV, feature extraction (only for the New York taxi data) and normalisation, either with NumPy or SQL-92 queries, and training. Measurements using the MNIST dataset were performed on an Ubuntu 20.04 LTS machine with four cores of an Intel i7-7700HQ CPU, each running at a 2.80 GHz clock frequency, and 16 GiB RAM. For the taxi data, a machine with Intel Xeon Gold 5120 processors, each with 14 CPUs (2.20 GHz), and 256 GiB RAM was used.

End-to-end analysis
We observe that much time is spent on data loading and preprocessing. These tasks are either no longer required if the data is already stored inside the database system, or can easily be processed in parallel pipelines. Furthermore, gradient descent using recursive tables showed performance comparable to the library functions used, though both are outperformed by our operator that off-loads training to the GPU.

Conclusion
This paper has created an in-database machine learning pipeline expressed in pure SQL, based on sampling, continuous views and recursive tables. To facilitate gradient descent, we proposed one operator for automatic differentiation and one for gradient descent. The derivation rules allow deriving arbitrary models needed for training, for example, deep neural networks with multiple hidden layers. We used the operator for automatic differentiation for training within a recursive table in SQL. SQL provided flexibility, as the iteration variable allows varying the learning rate or accessing further tables for user-specific parameters. To off-load training to GPU units, we implemented training algorithms as GPU kernels and fine-tuned learners at hardware level to increase the learning throughput. These kernels were integrated into the code-generating database system Umbra. In comparison to handwritten derivatives, automatic differentiation as a database operator accelerated both the compile time and the execution time by the number of cached expressions. Furthermore, our evaluation benchmarked GPU kernels on different hardware, as well as parallelisation techniques with multiple GPUs. The evaluation has shown that GPUs traditionally excel at larger batch sizes, which was only worthwhile when a slow-converging model was being trained. In addition, larger batch sizes interfered with statistical efficiency. For that reason, our learners fine-tuned at hardware level allowed the highest possible throughput for small batch sizes equal to a multiple of a GPU warp, so at least 32 threads. Our synchronisation techniques scaled up learning with every additional worker, even though not as linearly for multiple GPU devices as for parallel CPU threads. Finally, our end-to-end machine learning pipeline in SQL showed performance comparable to traditional machine learning frameworks.
In the future, translation from Python into SQL could increase the acceptance of in-database machine learning.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.